CN111582044B - Face recognition method based on convolutional neural network and attention model

Info

Publication number: CN111582044B
Application number: CN202010295613.8A
Authority: CN
Inventors: 贺前华; 杨泽睿; 庞文丰
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2020-04-15
Filing date: 2020-04-15
Publication date: 2023-06-20
Anticipated expiration: 2040-04-15
Also published as: CN111582044A

Abstract

The invention relates to face recognition technology, and in particular to a face recognition method based on a convolutional neural network and an attention model, which comprises the following steps: preprocessing a face image; inputting the preprocessed image data into a convolutional neural network to extract high-dimensional features F_c; inputting the high-dimensional features into an attention model and calculating an attention mask through neural network training to obtain the attention distribution features M(F_c) of the spatial domain and the channel domain; inputting the attention distribution features M(F_c) into a Bottleneck module to obtain features H(M(F_c)); and inputting the features H(M(F_c)) into a fully convolutional network classification structure, which uses a Dropout strategy and a softmax function to obtain the final face recognition result. The invention uses the Attention model and the Bottleneck module to replace the middle and high layers of the VGG model, which greatly reduces the number of parameters required by the model and reduces the video memory and time consumed during training; at the same time, it improves the network's capacity to learn from the data and the stability and practicability of the system.

Description

Face recognition method based on convolutional neural network and attention model

Technical Field

The invention relates to a face recognition technology, in particular to a face recognition method based on a convolutional neural network and an attention model.

Background

Face recognition refers to a technique of identifying or verifying one or more faces by extracting face information from a still image or video sequence and comparing it with information in an existing face database. Compared with biometric recognition means such as signatures and fingerprints, information extraction in an image-based identity recognition system is more convenient. Owing to its wide range of applications and its commercial value, face recognition is one of the most active research directions in the fields of computer vision and pattern recognition. In recent years, face recognition networks based on deep learning have achieved remarkable results. Meanwhile, attention models are widely used in various fields of deep learning. The visual attention mechanism imitates a brain signal processing mechanism peculiar to human vision: the target area that needs attention is obtained by rapidly scanning the global image, and more attention resources are then devoted to this area to obtain more detailed information about the target and to suppress other useless information. This greatly improves the efficiency and accuracy of visual information processing.

In recent years, research work that incorporates visual attention mechanisms has mostly focused on using masks to form the attention mechanism. The principle of the mask is that key features in the picture data are identified through an additional layer of weights; through training, the deep neural network learns which regions of each new picture need attention, and attention is thereby formed. There are two major categories of attention: soft attention and hard attention. Hard attention is a stochastic prediction process that places more emphasis on dynamic changes; it is not differentiable, so training generally requires reinforcement learning. The key property of soft attention is that it is differentiable, i.e., the gradient can be calculated, so it can be trained with ordinary neural network methods. In CNN-based attention models, attention is mainly divided into two domains: the spatial domain and the channel domain. Spatial-domain attention is distributed over space and appears on an image as different degrees of attention to different regions. Mathematically, for a feature map of size C×W×H, an effective spatial attention corresponds to a matrix of size W×H, in which each position is a weight for the pixel at the corresponding position of the original feature map. The multi-channel values of the same pixel are integrated and compressed, and the derivative is taken with respect to the pixel when the attention weight is calculated, so that the spatial attention of the feature is obtained. Channel-domain attention is distributed over the channels and appears on an image as different degrees of attention to different image channels. Mathematically, for a feature map of size C×W×H, channel attention corresponds to a matrix of size C×1, in which each position is a weight applied to all pixels of the corresponding channel of the original feature map. The multiple pixel values of the same channel are integrated and compressed, and the derivative is taken with respect to the channel when the attention weight is calculated, so that the channel attention of the feature is obtained.
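By way of illustration only, the two attention domains described above can be sketched in NumPy. The compression by averaging, the sigmoid squashing, and the sizes C=4, W=8, H=8 are assumptions chosen for the sketch; the background text does not prescribe how the weights are computed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feature_map):
    """Compress the channel axis, giving one weight per pixel: shape (W, H)."""
    return sigmoid(feature_map.mean(axis=0))

def channel_attention(feature_map):
    """Compress the pixel axes, giving one weight per channel: shape (C, 1)."""
    return sigmoid(feature_map.mean(axis=(1, 2))).reshape(-1, 1)

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 8, 8))  # C=4, W=8, H=8 (illustrative sizes)

spatial_mask = spatial_attention(feature_map)  # (8, 8): one weight per pixel
channel_mask = channel_attention(feature_map)  # (4, 1): one weight per channel

# Broadcasting applies each mask to the original feature map.
spatially_weighted = feature_map * spatial_mask[None, :, :]
channel_weighted = feature_map * channel_mask[:, :, None]
```

In both cases the weighted result keeps the original C×W×H shape; only the axis along which the weights vary differs.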

The existing face recognition methods based on deep learning and attention models use training samples to learn face features automatically and can extract highly discriminative face features. However, an effective neural network often requires a large number of parameters and consumes substantial computing resources in training and testing, which also makes migration, reloading, and embedded design of the network difficult. It is therefore necessary to find a design for a miniaturized network.

Disclosure of Invention

The invention provides a face recognition method based on a convolutional neural network and an attention model, which aims to solve the problem that existing face recognition neural networks generally consume large computing resources, thereby lowering the hardware requirements of the network model and increasing the computing speed. By using the attention model and the Bottleneck structure, the method further reduces the network parameters while overcoming the difficulty that a shallow network has in extracting features rich in semantic information; the recognition task is completed by combining the semantic information of each scale, so that a highly accurate recognition result is obtained.

The invention is realized by the following technical scheme: the face recognition method based on the convolutional neural network and the attention model comprises the following steps:

S1, preprocessing a face image;

S2, inputting the preprocessed image data into a convolutional neural network to extract high-dimensional features, obtaining high-dimensional features F_c;

S3, inputting the high-dimensional features F_c into an attention model, and calculating an attention mask through neural network training to obtain the attention distribution features M(F_c) of the spatial domain and the channel domain;

S4, inputting the attention distribution features M(F_c) into the Bottleneck module to obtain features H(M(F_c));

S5, inputting the features H(M(F_c)) into a fully convolutional network classification structure, and obtaining the final face recognition result using a Dropout strategy and a softmax function.

According to the technical scheme, the Attention model (Attention model) and the Bottleneck module are adopted to replace a middle-high layer network in the VGG model, so that the number of parameters required by the model is greatly reduced, and the video memory and time consumption during training are effectively reduced; meanwhile, the learning capacity of the network to the data is effectively improved, and the stability and the practicability of the system are improved. Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The overall neural network of the invention retains the simple structure and good generalization performance of VGG while significantly reducing the number of parameters, which greatly reduces the video memory and memory occupied during training; the very small network model is also well suited to embedded design.

2. The attention module in the invention is based on one-dimensional convolution design, and on the basis of further reducing the parameter quantity, the full use of the pixel domain and channel domain information is realized.

3. According to the invention, the Bottleneck module is based on one-dimensional convolution and the design of a single Batchnorm layer, so that the calculation time is reduced on the basis of further reducing the number of parameters, and the capture of multidimensional information is completed.

4. The invention learns quickly and efficiently on local data sets and still performs well when processing data with illumination differences and pose changes.
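As a rough illustration of the parameter saving attributed to the Bottleneck design, the weight counts can be worked out with the channel sizes used in the embodiment (128 reduced to 32 and restored to 128). The full-width baseline below is a hypothetical comparison introduced for this sketch, not a layer taken from the invention or from VGG, and bias terms are ignored.

```python
def conv1d_weights(c_in, c_out, kernel):
    """Number of weights in a 1-D convolution layer, ignoring bias terms."""
    return c_in * c_out * kernel

# Bottleneck path with the channel sizes of the embodiment: 128 -> 32 -> 32 -> 128.
bottleneck_weights = (
    conv1d_weights(128, 32, 1)    # 1x1 reduction
    + conv1d_weights(32, 32, 3)   # kernel-3 convolution
    + conv1d_weights(32, 128, 1)  # 1x1 restoration
)

# Hypothetical baseline: one kernel-3 convolution kept at the full 128 channels.
full_width_weights = conv1d_weights(128, 128, 3)

print(bottleneck_weights, full_width_weights)  # 11264 49152
```

Under these assumptions the bottleneck path needs roughly a quarter of the weights of the full-width layer.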

Drawings

Fig. 1 is a schematic flow chart of a face recognition method based on a convolutional neural network and an attention model according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a convolutional neural network feature extraction module according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of an attention model and a Bottleneck module according to an embodiment of the present invention;

Fig. 4 is an application schematic diagram of a face recognition neural network module based on a convolutional neural network and an attention model according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment provides a face recognition method based on a convolutional neural network and an Attention model, which adopts an Attention model and a Bottleneck module to process the output characteristics of the shallow convolutional neural network and establish high-dimensional semantic characteristics comprising pixel domain and channel domain information; as shown in fig. 1, the method specifically includes the following steps:

1) Face image preprocessing: the preprocessing comprises converting the RGB face image into a gray image, randomly flipping the image, normalizing the image tensor, and normalizing the image to a fixed pixel size (256×256 in this embodiment);

2) Convolutional neural network (CNN) feature extraction: the image data preprocessed in step 1) is input into a convolutional neural network to extract high-dimensional features F_c, wherein the convolutional neural network structure is an improved design based on the VGG network;

3) Attention model: the high-dimensional features F_c output from step 2) are input into a soft attention model, and an attention mask is calculated through neural network training to obtain the attention distribution features M(F_c) of the spatial domain and the channel domain, wherein the attention model is based on a CNN network design;

4) Bottleneck module: the attention distribution features M(F_c) output from step 3) are input into the Bottleneck module to obtain features H(M(F_c)), wherein the Bottleneck module is based on one-dimensional convolution;

5) Fully convolutional network (FCN) classification: the features H(M(F_c)) output from step 4) are input into the FCN classification structure to obtain the final face recognition result. The FCN classification structure uses a Dropout strategy and a softmax function.
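As a reading aid, the tensor shapes passed between the five steps above can be traced with a few lines of Python. The defaults follow the sizes stated for this embodiment (batch size 32, 256×256 input, 3 gray channels, 128 feature channels, 8×8 feature size); the halving at each of five pooling stages is inferred from 256 shrinking to 8, and `num_classes` is a hypothetical value, since the number of identities is not stated.

```python
def trace_shapes(batch=32, image_size=256, gray_channels=3,
                 feature_channels=128, num_pools=5, num_classes=10):
    """Return the tensor shape after each stage of the pipeline (shapes only)."""
    side = image_size // (2 ** num_pools)  # assumed: each pooling halves the side
    pixels = side * side
    return {
        "input": (batch, gray_channels, image_size, image_size),  # step 1)
        "F_c": (batch, feature_channels, side, side),             # step 2)
        "V_c": (batch, feature_channels * pixels),                # step 3), spliced vector
        "mask": (batch, 1, pixels),                               # step 3), attention mask
        "M(F_c)": (batch, feature_channels, pixels),              # step 3), masked features
        "H(M(F_c))": (batch, feature_channels, pixels),           # step 4)
        "softmax": (batch, num_classes),                          # step 5)
    }

shapes = trace_shapes()
for name, shape in shapes.items():
    print(f"{name:>10}: {shape}")
```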

Further, in the step 1) of face image preprocessing, the RGB image is converted into a gray image, specifically:

After the RGB image is converted into a gray image, the single-channel gray image is expanded into an N-channel gray image. The channel expansion formula is as follows:

$$\mathrm{Gray}_i = \lambda_i \,\mathrm{Gray} + \mathrm{bias}_i$$

where Gray is the gray value obtained from R, G, and B, the pixel values of the three RGB channels of the original image; i denotes the i-th gray channel; bias_i is the bias term; and Gray_i ∈ [0, 255]. In this embodiment, the number of channels N = 3; λ_1, λ_2, and λ_3 are 0.37, 0.30, and 0.33, respectively; and bias_i is a randomly generated value.
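The channel expansion can be sketched as follows. The RGB-to-gray weights used here are the common ITU-R BT.601 luminance coefficients, which is an assumption, because the conversion coefficients of the embodiment are not reproduced in the text; the range of the random bias (`bias_scale`) is likewise a hypothetical choice.

```python
import numpy as np

# Assumption: ITU-R BT.601 luminance weights stand in for the RGB-to-gray conversion.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def expand_gray_channels(rgb, lambdas=(0.37, 0.30, 0.33), bias_scale=1.0, seed=0):
    """Convert an (H, W, 3) RGB image to gray, then build N gray channels
    Gray_i = lambda_i * Gray + bias_i, clipped to [0, 255]."""
    rng = np.random.default_rng(seed)
    gray = rgb.astype(np.float64) @ LUMA_WEIGHTS                 # (H, W)
    lambdas = np.asarray(lambdas, dtype=np.float64)
    biases = rng.uniform(0.0, bias_scale, size=lambdas.shape)    # random bias_i (range assumed)
    channels = lambdas[:, None, None] * gray[None, :, :] + biases[:, None, None]
    return np.clip(channels, 0.0, 255.0)                         # (N, H, W)

rgb = np.full((256, 256, 3), 200, dtype=np.uint8)  # uniform test image
gray_channels = expand_gray_channels(rgb)          # shape (3, 256, 256)
```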

Further, the image tensor in step 1) is normalized as follows:

$$\hat{X}_i = \frac{X_i - E(X_i)}{\sigma(X_i)}$$

where X_i is the input data of channel i, $\hat{X}_i$ is the normalized data, E(X_i) is the mean, and σ(X_i) is the standard deviation. In this embodiment, since the picture is mapped to the (0, 1) space during tensor conversion, E(X_i) = 0.5 and σ(X_i) = 0.5 are used in order to reduce the amount of calculation and obtain more discriminative input data; X_i is thereby mapped to the (-1, 1) space, which allows negative sample values to be generated.
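A minimal numerical check of this normalization, using the fixed mean 0.5 and standard deviation 0.5 of the embodiment (the function name `standardize` is illustrative only):

```python
import numpy as np

def standardize(x, mean=0.5, std=0.5):
    """Subtract the mean and divide by the standard deviation, element-wise."""
    return (x - mean) / std

pixels = np.array([0.0, 0.25, 0.5, 1.0])  # values after tensor conversion
standardized = standardize(pixels)
print(standardized)  # [-1.  -0.5  0.   1. ]
```

Values below 0.5 become negative and values above 0.5 become positive, so the input is centered on zero.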

As shown in Fig. 2, the normalized face image data is input into the convolutional neural network to extract the target-dimension features. In step 2), the input image data is batch-normalized after passing through the first max pooling layer, and the picture is downsampled by max pooling after each convolutional layer. The number of convolutional layers is 5, and the contracted feature size W×H is a fixed fraction of the original picture size (8×8 in this embodiment, from the 256×256 input).

The output high-dimensional feature F_c has the form (C_c, W_c, H_c). The mean and variance of the batch normalization are as follows:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} W_h x_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(W_h x_i - \mu\right)^2$$

where W_h denotes the convolutional neural network parameters, x_i is the convolutional neural network input, and m is the batch size. In this embodiment, the batch size is 32 and the output feature F_c has the form (32, 128, 8, 8); the bias term is discarded when the batch normalization is calculated.
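The downsampling arithmetic and the batch statistics can be illustrated with a NumPy sketch. The convolutions themselves are omitted, 2×2 pooling with stride 2 is assumed because 256 shrinks to 8 over five stages, and the per-channel statistics are taken over the batch and spatial axes, which is the usual convention for convolutional features rather than something stated in the text. The small batch of 2 images with 3 channels is used only to keep the example light.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an array of shape (..., W, H)."""
    *lead, w, h = x.shape
    return x.reshape(*lead, w // 2, 2, h // 2, 2).max(axis=(-3, -1))

def batch_norm(z, eps=1e-5):
    """Normalize z = W_h x (no bias term) per channel; z has shape (B, C, W, H)."""
    mean = z.mean(axis=(0, 2, 3), keepdims=True)
    var = z.var(axis=(0, 2, 3), keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 256, 256))    # stand-in for preprocessed images

after_first = batch_norm(max_pool_2x2(x))    # batch normalization after the first pooling
features = after_first
for _ in range(4):                           # four further pooling stages: 128 -> 8
    features = max_pool_2x2(features)

print(after_first.shape, features.shape)     # (2, 3, 128, 128) (2, 3, 8, 8)
```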

As shown in Fig. 3, the Attention model adds attention information to the different channels and pixel points of the output high-dimensional features, and the Bottleneck module further captures information of different dimensions, thereby improving the performance of the convolutional neural network. Specifically:

1. The three-dimensional feature of size (C_c, W_c, H_c) output from step 2) is spliced into an input vector V_c of length C_c×W_c×H_c; in this embodiment, the three-dimensional feature has size (128, 8, 8) and the vector V_c has length 128×8×8 = 8192;

2. A fully connected network structure associates the input vector V_c with an output vector V_p of length W_c×H_c, establishing the relation between each high-dimensional feature pixel point and the other pixel points and channels; in this embodiment, the output vector V_p has length 8×8 = 64;

3. The attention feature map σ is generated from the convolution output through a sigmoid activation function; the attention feature map has size 8×8 = 64, and the vector is reshaped into 8×8 image data when the mask is displayed;

4. The obtained attention feature map and the output feature F_c of step 2) are multiplied as tensors to form attention enhancement over the spatial domain and the channel domain, giving the attention-enhanced features M(F_c):

$$M(F_c) = \psi\left(f^{C \times W \times H}(\mathrm{Contact}(F_c))\right) = \psi\left(f^{C \times W \times H}(V_c)\right)$$

In the formula, Contact() is matrix splicing (concatenation), f^{C×W×H}() is the network computation performed on the feature vector V_c, and ψ() is the tensor multiplication (mask) computation with the attention feature map. In this embodiment, the tensor multiplication (mask) is computed using the broadcasting mechanism; specifically, the high-dimensional feature has scale (32, 128, 64), and the mask of scale (32, 1, 64) is multiplied element-wise with the (32, 64) slice of each of the 128 channels of the high-dimensional feature, giving a feature M(F_c) of scale (32, 128, 64).
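The four attention steps above can be sketched in NumPy with the shapes of this embodiment. The random weights stand in for trained parameters, the bias term of the fully connected layer is an assumption, and the output of that layer is taken directly as the input of the sigmoid; `attention_mask_features` is an illustrative name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_mask_features(f_c, weight, bias):
    """f_c: (B, C, W, H); weight: (C*W*H, W*H); bias: (W*H,)."""
    b, c, w, h = f_c.shape
    v_c = f_c.reshape(b, c * w * h)            # splice into V_c, length C*W*H
    v_p = v_c @ weight + bias                  # fully connected layer -> V_p, length W*H
    mask = sigmoid(v_p).reshape(b, 1, w * h)   # attention feature map, (B, 1, W*H)
    features = f_c.reshape(b, c, w * h)        # (B, C, W*H)
    return features * mask, mask               # mask is broadcast over the C axis

rng = np.random.default_rng(0)
f_c = rng.standard_normal((32, 128, 8, 8))          # F_c of the embodiment
weight = rng.standard_normal((8192, 64)) * 0.01     # random stand-in for trained weights
bias = np.zeros(64)

m_fc, mask = attention_mask_features(f_c, weight, bias)
print(m_fc.shape, mask.shape)  # (32, 128, 64) (32, 1, 64)
```

Because the mask has a singleton channel axis, the same 64 per-pixel weights are applied to every one of the 128 channels of a given sample.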

Further, the Bottleneck module in step 4) obtains the features using a Shortcut mechanism, specifically as follows:

1. A 1×1 one-dimensional convolution is applied to the feature M(F_c) output from step 3) to reduce the dimension and fuse information among different channels; in this embodiment, the dimension reduction reduces the number of channels from 128 to 32;

2. Both ends of the dimension-reduced feature vector are zero-padded, and a one-dimensional convolution with a kernel size of 3, namely the 3×3 convolution, is then performed; the resulting feature has scale (32, 32, 64);

3. A 1×1 convolution is performed again to restore the dimension, giving a feature F_B(M(F_c)) of scale (32, 128, 64);

4. The feature H_B(M(F_c)) is output, where

$$H_B(M(F_c)) = F_B(M(F_c)) + M(F_c)$$

In this embodiment, the addition is element-wise over aligned matrices, and the feature H_B(M(F_c)) has scale (32, 128, 64).

That is, in the Bottleneck structure, the feature M(F_c) is batch-normalized after the 1×1 convolution, and then the one-dimensional convolution with a kernel size of 3, the 1×1 convolution, and the ReLU activation function are applied in succession to obtain the feature F_B(M(F_c)).
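A compact NumPy sketch of this Bottleneck path follows. The weights are random stand-ins, biases are omitted, and placing the ReLU after the final 1×1 convolution is one reading of the description above; `conv1d` and `bottleneck` are illustrative helper names.

```python
import numpy as np

def conv1d(x, weight):
    """1-D convolution (cross-correlation) with zero padding that keeps the length.
    x: (B, C_in, L); weight: (C_out, C_in, K) with odd K."""
    k = weight.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)))   # zero padding at both ends
    length = x.shape[-1]
    out = np.zeros((x.shape[0], weight.shape[0], length))
    for j in range(k):                              # accumulate one kernel tap at a time
        out += np.einsum("oc,bcl->bol", weight[:, :, j], xp[:, :, j:j + length])
    return out

def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=(0, 2), keepdims=True)
    var = x.var(axis=(0, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def bottleneck(m, w_reduce, w_mid, w_restore):
    y = batch_norm(conv1d(m, w_reduce))        # 1x1 conv: 128 -> 32 channels, then BatchNorm
    y = conv1d(y, w_mid)                       # kernel-3 conv with zero padding
    y = np.maximum(conv1d(y, w_restore), 0.0)  # 1x1 conv: 32 -> 128 channels, then ReLU
    return y + m                               # shortcut: H_B = F_B + M(F_c)

rng = np.random.default_rng(0)
m = rng.standard_normal((32, 128, 64))               # M(F_c) of the embodiment
w_reduce = rng.standard_normal((32, 128, 1)) * 0.1
w_mid = rng.standard_normal((32, 32, 3)) * 0.1
w_restore = rng.standard_normal((128, 32, 1)) * 0.1

h = bottleneck(m, w_reduce, w_mid, w_restore)
print(h.shape)  # (32, 128, 64)
```

The shortcut addition requires the restored feature to have the same scale as the input, which is why the final 1×1 convolution returns to 128 channels.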

Referring to Fig. 4, the FCN classifies the data output by the Bottleneck module using a Dropout strategy and a softmax function to obtain the final recognition result. In this embodiment, two fully connected layers are adopted, the number of neurons is 512, and the dropout coefficient is 0.2.
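The classification stage can be sketched as follows. The ReLU between layers, the separate output projection, and the 10 output classes are assumptions made for illustration; only the two 512-neuron layers, the dropout coefficient 0.2, and the softmax come from the embodiment. A batch of 4 samples is used to keep the example small.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def classify(features, params, rng, training=False, rate=0.2):
    x = features.reshape(features.shape[0], -1)   # flatten (B, 128, 64) -> (B, 8192)
    for w, b in params[:-1]:                      # the two 512-neuron layers
        x = np.maximum(x @ w + b, 0.0)            # ReLU (assumed)
        x = dropout(x, rate, rng, training)
    w_out, b_out = params[-1]                     # output projection (assumed)
    return softmax(x @ w_out + b_out)

rng = np.random.default_rng(0)
sizes = [128 * 64, 512, 512, 10]                  # 10 classes is a hypothetical value
params = [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]

features = rng.standard_normal((4, 128, 64))      # stand-in for H(M(F_c))
probs = classify(features, params, rng)           # inference mode: dropout disabled
print(probs.shape)  # (4, 10)
```

Each row of `probs` sums to 1, and the predicted identity is the index of the largest entry.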

The above-mentioned embodiments are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto, and those skilled in the art can substitute or change the technical solution and the inventive conception of the present invention equally within the scope of the disclosure of the present invention.

Claims

1. The face recognition method based on the convolutional neural network and the attention model is characterized by comprising the following steps of:

S1, preprocessing a face image;

S5, inputting the features H(M(F_c)) into a fully convolutional network classification structure, and obtaining the final face recognition result using a Dropout strategy and a softmax function;

Step S3 comprises the following steps:

S31, splicing the output three-dimensional feature of size (C_c, W_c, H_c) into an input vector V_c of length C_c×W_c×H_c;

S32, associating the input vector V_c with an output vector V_p of length W_c×H_c using a fully connected network structure, and establishing the relation between each high-dimensional feature pixel point and the other pixel points and channels;

S33, generating an attention feature map from the convolution output through a sigmoid activation function;

S34, performing tensor multiplication of the attention feature map and the high-dimensional feature F_c output in step S2 to form attention enhancement over the spatial domain and the channel domain, obtaining the attention-enhanced features M(F_c):

$$M(F_c) = \psi\left(f^{C \times W \times H}(\mathrm{Contact}(F_c))\right) = \psi\left(f^{C \times W \times H}(V_c)\right)$$

In the formula, Contact() is matrix splicing, f^{C×W×H}() is the network computation performed on the feature vector V_c, and ψ() is the tensor multiplication computation with the attention feature map;

In step S4, the Bottleneck module, which is based on one-dimensional convolution, uses a Shortcut mechanism to obtain the features, comprising:

S41, performing dimension reduction on the feature M(F_c);

S42, zero-padding both ends of the dimension-reduced feature vector, and then performing a one-dimensional convolution with a kernel size of 3;

S43, performing a 1×1 convolution again to restore the dimension, obtaining the feature F_B(M(F_c));

S44, outputting the feature H_B(M(F_c)), where H_B(M(F_c)) = F_B(M(F_c)) + M(F_c).

2. The face recognition method based on the convolutional neural network and the attention model according to claim 1, wherein the preprocessing of step S1 comprises: normalizing the face image size; randomly flipping the face image left and right; converting the RGB image into a gray image; and normalizing the face image tensor.

3. The face recognition method based on the convolutional neural network and the attention model according to claim 2, wherein the RGB image is converted into a gray image, specifically:

$$\mathrm{Gray}_i = \lambda_i \,\mathrm{Gray} + \mathrm{bias}_i$$

where Gray is the gray value obtained from R, G, and B, the pixel values of the three RGB channels of the original image; i denotes the i-th gray channel; bias_i is the bias term; and Gray_i ∈ [0, 255].

4. The face recognition method based on the convolutional neural network and the attention model according to claim 3, wherein the number of channels N = 3; λ_1, λ_2, and λ_3 are 0.37, 0.30, and 0.33, respectively; and bias_i is a randomly generated value.

5. The face recognition method based on the convolutional neural network and the attention model according to claim 2, wherein the face image tensor is normalized as follows:

$$\hat{X}_i = \frac{X_i - E(X_i)}{\sigma(X_i)}$$

where X_i is the input data of channel i, $\hat{X}_i$ is the normalized data, E(X_i) is the mean, and σ(X_i) is the standard deviation.

6. The face recognition method based on the convolutional neural network and the attention model according to claim 1, wherein the image data input in step S2 is batch-normalized after passing through the first max pooling layer; the picture is downsampled by max pooling after each convolutional layer; the number of convolutional layers is 5; and the contracted feature size W×H is a fixed fraction of the original picture size;

wherein, in the batch normalization, W_h denotes the convolutional neural network parameters, x_i is the convolutional neural network input, and m is the batch size.