CN114360030A - Face recognition method based on convolutional neural network - Google Patents


Info

Publication number
CN114360030A
Authority
CN
China
Prior art keywords
layer
spatial attention
face recognition
pooling
attention module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210049539.0A
Other languages
Chinese (zh)
Inventor
李琦 (Li Qi)
赖艳 (Lai Yan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Ruiyun Technology Co ltd
Original Assignee
Chongqing Ruiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Ruiyun Technology Co., Ltd.
Priority to CN202210049539.0A
Publication of CN114360030A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face recognition method based on a convolutional neural network. The method trains a face recognition network, comprising DSAG units, GM pooling layers, a GAP layer, a full connection layer and a softmax layer, on a training data set; inputs a face image into the trained network; performs a global average pooling operation on the intermediate feature map using the GAP layer; and passes the resulting primary feature vector through the full connection layer and the softmax layer in turn to output the classification result. By placing a global median pooling operation layer in each of the two spatial attention modules, and by feeding part of the calibration map of the first spatial attention module into the second, the second calibration map calibrates feature information at different scales, and the DSAG unit extracts useful feature information from low-resolution face images more fully.

Description

Face recognition method based on convolutional neural network
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a face recognition method based on a convolutional neural network.
Background
With the emergence of various deep learning models, the accuracy of convolutional neural networks on face recognition tasks has steadily improved, and judging from current deployed applications (face-scan gate passage, face-scan ticket checking, access control systems based on face recognition, and the like), the existing technology can well meet the identity recognition requirements of urban public transport scenes. However, in some tourist attractions the crowd moves over a large and scattered area, infrastructure construction costs are high, and a large number of cameras cannot be deployed to collect face images as in a city. In such application scenarios, monitoring equipment can only be installed at key positions; the face occupies only a small part of the captured image, the resolution of the face image is low, and the recognition accuracy of existing models under these conditions is not high.
Disclosure of Invention
In view of the above deficiencies in the prior art, the present invention provides a face recognition method based on a convolutional neural network, so as to recognize low-resolution face images more accurately.
In order to achieve the above purpose, the solution adopted by the invention is as follows: a face recognition method based on a convolutional neural network comprises the following steps:
s10, building a face recognition network, and training the face recognition network by using a training data set;
the face recognition network comprises DSAG units, GM pooling layers, a GAP layer, a full connection layer and a softmax layer, wherein the DSAG units are used for extracting feature information from the image; there are a plurality of DSAG units and GM pooling layers, and they are arranged alternately along the depth direction of the face recognition network;
S20, acquiring a face image to be recognized and inputting it into the trained face recognition network, the face image passing sequentially through the DSAG units and GM pooling layers to obtain an intermediate feature map;
S30, performing a global average pooling operation on the intermediate feature map by using the GAP layer to obtain a primary feature vector;
S40, passing the primary feature vector sequentially through the full connection layer and the softmax layer, and outputting the classification result to complete face recognition;
wherein the DSAG unit can be represented by the following mathematical model:
T1 = f_RC1(Y_{n-1})
T2 = f_SA1(T1) * T1
T3 = f_RC2(T2)
Y_n = f_SA2(T3, U) * T3
wherein Y_{n-1} and Y_n respectively represent the input and output of the DSAG unit, f_RC1() represents the first feature extraction component, f_RC2() represents the second feature extraction component, f_SA1() represents the first spatial attention module, f_SA2() represents the second spatial attention module, U represents the calibration map passed from the first spatial attention module into the second spatial attention module, and T3 and U together serve as the input of the second spatial attention module.
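By way of illustration, the data flow of the DSAG unit can be sketched in PyTorch as follows. This is a minimal wiring sketch, not part of the disclosure: the submodule names rc1, sa1, rc2 and sa2 are assumptions standing in for f_RC1, f_SA1, f_RC2 and f_SA2, whose internals are given by the formulas that follow, and in this sketch the calibration map U is recovered inside the second attention module by re-pooling T1.

```python
import torch.nn as nn

class DSAGUnit(nn.Module):
    """Wiring of one DSAG unit following the four formulas above."""

    def __init__(self, rc1, sa1, rc2, sa2):
        super().__init__()
        self.rc1, self.sa1 = rc1, sa1    # f_RC1, f_SA1
        self.rc2, self.sa2 = rc2, sa2    # f_RC2, f_SA2

    def forward(self, y_prev):
        t1 = self.rc1(y_prev)            # T1 = f_RC1(Y_{n-1})
        t2 = self.sa1(t1) * t1           # T2 = f_SA1(T1) * T1
        t3 = self.rc2(t2)                # T3 = f_RC2(T2)
        # U (the pooled calibration maps of T1) is recomputed from T1
        # inside the second attention module in this sketch.
        return self.sa2(t1, t3) * t3     # Y_n = f_SA2(T3, U) * T3
```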
Further, the first feature extraction component and the second feature extraction component each comprise a plurality of convolution residual blocks connected in sequence, and the convolution residual block can be expressed by the following formula:
M_n = λ_2(f_2(λ_1(f_1(M_{n-1})))) + M_{n-1}
wherein M_{n-1} and M_n respectively represent the input and output of the convolution residual block, f_1 and f_2 both represent convolution layers with a convolution kernel size of 3×3, and λ_1 and λ_2 both represent the ReLU activation function.
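A hedged PyTorch sketch of the convolution residual block is given below. The padding of 1, chosen so that the 3×3 convolutions preserve height and width and the skip connection shapes match, is an assumption, as is keeping the channel count fixed (the first block of some components in the embodiment changes the channel count instead).

```python
import torch.nn as nn

class ConvResidualBlock(nn.Module):
    """M_n = λ2(f2(λ1(f1(M_{n-1})))) + M_{n-1}, for a fixed channel count."""

    def __init__(self, channels):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, 3, padding=1)  # f1: 3x3 conv
        self.f2 = nn.Conv2d(channels, channels, 3, padding=1)  # f2: 3x3 conv
        self.relu = nn.ReLU()                                  # λ1 = λ2 = ReLU

    def forward(self, m_prev):
        return self.relu(self.f2(self.relu(self.f1(m_prev)))) + m_prev
```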
Further, the first spatial attention module can be represented by the following formula:
Z1 = θ1(f_CS1(<AP1(T1), MP1(T1), DP1(T1)>))
wherein T1 is the feature map input into the first spatial attention module, Z1 represents the first calibration map output by the first spatial attention module, f_CS1 represents a convolution operation with a convolution kernel size of 1×1, θ1 represents the sigmoid activation function, <·> represents a splicing operation, AP1() represents the average pooling operation layer, MP1() represents the maximum pooling operation layer, and DP1() represents the median pooling operation layer.
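The channel-direction pooling and the first spatial attention module can be sketched as follows; this is a sketch under the formula above, assuming the usual (B, C, H, W) tensor layout.

```python
import torch
import torch.nn as nn

def channel_pool(x):
    """AP / MP / DP along the channel direction: each result has one
    channel and the same height and width as the input."""
    return (x.mean(dim=1, keepdim=True),            # AP: average pooling
            x.max(dim=1, keepdim=True).values,      # MP: maximum pooling
            x.median(dim=1, keepdim=True).values)   # DP: median pooling

class FirstSpatialAttention(nn.Module):
    """Z1 = θ1(f_CS1(<AP1(T1), MP1(T1), DP1(T1)>))."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=1)  # f_CS1: 1x1 conv on 3 maps

    def forward(self, t1):
        ap, mp, dp = channel_pool(t1)
        # splice the three one-channel maps, convolve, then apply sigmoid
        return torch.sigmoid(self.conv(torch.cat([ap, mp, dp], dim=1)))
```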
Further, the second spatial attention module can be represented by the following mathematical model:
X1 = AP1(T1) + MP1(T1) - AP2(T3)
X2 = MP1(T1) + MP2(T3)
X3 = θ2(f_CS2(<X1, X2, AP2(T3), DP1(T1), DP2(T3)>))
wherein T1 is the feature map input into the first spatial attention module, T3 is the feature map input into the second spatial attention module, X3 represents the second calibration map output by the second spatial attention module, AP1() and AP2() respectively represent the average pooling operation layers in the first and second spatial attention modules, MP1() and MP2() respectively represent the maximum pooling operation layers in the first and second spatial attention modules, DP1() and DP2() respectively represent the median pooling operation layers in the first and second spatial attention modules, f_CS2 represents a convolution operation with a convolution kernel size of 1×1, θ2 represents the sigmoid activation function, and <·> represents a splicing operation. The maximum pooling, average pooling and median pooling operation layers all operate along the channel direction of the feature map; the output of each is a calibration map with a channel number of 1 and the same length and width as the input.
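A corresponding sketch of the second spatial attention module follows, reusing channel_pool() from the previous sketch. Passing T1 in and re-pooling it here is this sketch's stand-in for the calibration map U received from the first module; the patent does not fix how U is transmitted.

```python
import torch
import torch.nn as nn

class SecondSpatialAttention(nn.Module):
    """X3 = θ2(f_CS2(<X1, X2, AP2(T3), DP1(T1), DP2(T3)>)), where
    X1 = AP1(T1) + MP1(T1) - AP2(T3) and X2 = MP1(T1) + MP2(T3)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(5, 1, kernel_size=1)  # f_CS2: 1x1 conv on 5 maps

    def forward(self, t1, t3):
        ap1, mp1, dp1 = channel_pool(t1)            # pooled maps of T1 (U)
        ap2, mp2, dp2 = channel_pool(t3)            # pooled maps of T3
        x1 = ap1 + mp1 - ap2
        x2 = mp1 + mp2
        feats = torch.cat([x1, x2, ap2, dp1, dp2], dim=1)
        return torch.sigmoid(self.conv(feats))      # second calibration map X3
```

Note that within one DSAG unit T1 and T3 share the same spatial size (downsampling happens only in the GM pooling layers between units), so the element-wise sums above are well defined.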
Further, the pooling window size of the GM pooling layer is 3×3 with a step size of 2, and the operation of the GM pooling layer can be represented by the following mathematical model:
K1 = sort(P)
K2 = Avg(max1(K1) + max2(K1) + max3(K1)) + max1(K1)
wherein P is a 3×3 matrix input into the GM pooling layer, sort(P) represents sorting the elements of the matrix P from large to small, max1(K1) represents taking the value of the element at the first position of the sorted sequence K1 (i.e., the maximum value), max2(K1) represents taking the value of the element at the second position of K1, max3(K1) represents taking the value of the element at the third position of K1, and Avg() represents an averaging operation. By filling the edges of the feature map, the length and width of the feature map become half of their original size after passing through the GM pooling layer.
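A sketch of the GM pooling operation is shown below. The replicate padding of one row and one column on the bottom and right is an assumption chosen so that an even-sized input is halved; the patent only states that the edges of the feature map are filled.

```python
import torch
import torch.nn.functional as F

def gm_pool(x):
    """GM pooling: 3x3 window, stride 2; per window compute
    K2 = Avg(max1 + max2 + max3) + max1 on the sorted window values."""
    b, c, h, w = x.shape
    x = F.pad(x, (0, 1, 0, 1), mode='replicate')    # fill right/bottom edges
    patches = F.unfold(x, kernel_size=3, stride=2)  # (B, C*9, L) windows
    patches = patches.view(b, c, 9, -1)
    top3 = patches.topk(3, dim=2).values            # max1, max2, max3
    out = top3.mean(dim=2) + top3[:, :, 0, :]       # Avg(top 3) + max1
    return out.view(b, c, h // 2, w // 2)

# e.g. gm_pool(torch.randn(1, 64, 112, 112)).shape == (1, 64, 56, 56)
```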
The invention has the beneficial effects that:
(1) In current convolutional-neural-network face recognition models, the attention mechanism uses only average pooling and maximum pooling. Considering the characteristics of face images, fully and accurately extracting the edge information in a face image plays an important role in improving face recognition accuracy. The invention therefore places a global median pooling operation layer in each of the two spatial attention modules, so that more edge feature information is captured when the spatial calibration map calibrates the feature map, thereby improving the accuracy of low-resolution face recognition;
(2) As the network depth increases, the receptive field of the convolution operations grows gradually. By feeding part of the calibration map of the first spatial attention module into the second spatial attention module, the invention enlarges the receptive range of the calibration map generated by the second spatial attention module, so that the second calibration map calibrates feature information at different scales rather than only feature information at the same scale as the T3 feature map. In this way the DSAG unit extracts useful feature information from the low-resolution face image more fully;
(3) In a conventional classification network, the feature map is downsampled with a maximum pooling operation; although this operation is simple, its utilization of the features is low, and especially when the image resolution is low, the limited effective information that is available is easily lost. The GM pooling layer of the invention instead combines the several largest responses in each window, so that more of this effective information is retained.
Drawings
FIG. 1 is a schematic diagram of a face recognition network according to an embodiment;
fig. 2 is a schematic diagram of an internal structure of a DSAG unit in the face recognition network shown in fig. 1;
FIG. 3 is a diagram illustrating the structure of the convolution residual block in the DSAG unit shown in FIG. 2;
FIG. 4 is a schematic diagram of the first spatial attention module and the second spatial attention module of the DSAG unit of FIG. 2;
in the drawings:
1-DSAG unit, 11-first feature extraction component, 12-second feature extraction component, 13-first spatial attention module, 14-second spatial attention module, 15-convolution residual block, 2-GM pooling layer, 3-GAP layer, 4-full connection layer, 5-softmax layer, 6-face image to be recognized.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
example (b):
Fig. 1 is a schematic diagram of the overall structure of the face recognition network in this embodiment, in which the DSAG units 1 and the GM pooling layers 2 are arranged in correspondence, four of each. The specific structure of the DSAG unit 1 is shown in fig. 2: in each DSAG unit 1, four sequentially connected convolution residual blocks 15 are arranged in each of the first feature extraction component 11 and the second feature extraction component 12. The structure of the convolution residual block 15 is shown in fig. 3, and the structures of the first spatial attention module 13 and the second spatial attention module 14 are shown in fig. 4. When the model is trained, cross entropy is adopted as the loss function, epoch is set to 1500, and batch-size is set to 16.
Taking the length, width and channel number of the input face image 6 to be recognized as 112 × 112 × 3: in the network, the first convolution operation of the first convolution residual block 15 in the first DSAG unit 1 is used to increase the number of channels of the feature map, and the size of its output feature map is 112 × 112 × 64. In each DSAG unit 1, the first convolution operation of the first convolution residual block 15 in the second feature extraction component 12 likewise increases the number of feature map channels, the number of output channels being twice the number of input channels. For the other convolution operations within the convolution residual blocks 15 in the network, the length, width and channel size of the feature map are unchanged between the input and output of the convolution. The sizes of the feature maps at different positions in the network of this embodiment are shown in the following table:
[Table: feature map sizes at different positions in the network; reproduced only as an image in the original publication]
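Under the sizes just described, the overall network of the embodiment can be sketched as the following skeleton. The factory make_dsag and the per-stage channel widths 128/256/512/1024 are assumptions inferred from the text above; gm_pool is the sketch given earlier.

```python
import torch.nn as nn

class FaceRecNet(nn.Module):
    """Skeleton of the embodiment: four DSAG stages alternating with GM
    pooling, then global average pooling, a 1024-input full connection
    layer and softmax."""

    def __init__(self, make_dsag, num_identities):
        super().__init__()
        widths = [(3, 128), (128, 256), (256, 512), (512, 1024)]
        self.stages = nn.ModuleList([make_dsag(ci, co) for ci, co in widths])
        self.fc = nn.Linear(1024, num_identities)

    def forward(self, x):                    # x: (B, 3, 112, 112)
        for stage in self.stages:
            x = gm_pool(stage(x))            # DSAG unit, then GM pooling
        x = x.mean(dim=(2, 3))               # GAP -> (B, 1024)
        return self.fc(x).softmax(dim=1)     # classification result
```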
the GAP layer 3 is used for carrying out global average pooling operation on the intermediate feature graph, for the full connection layer 4, the number of input nodes is 1024, and the number of output nodes is set according to the total number of identities needing to be identified actually. It should be noted that, according to the application scenario, the softmax layer 5 may be removed, and the open set identification is realized by calculating the distance between the output feature vector of the full connection layer 4 and the preset sample feature vector.
VGG19, ResNet101 and the face recognition network provided by the invention were each trained on the same training set and then tested on the same test set; the results are shown in the following table:
[Table: test accuracy of VGG19, ResNet101 and the proposed network; reproduced only as an image in the original publication]
From the above results it can be seen that, compared with the prior art, the recognition accuracy of the face recognition network provided by the invention on low-resolution face images is greatly improved, which constitutes a substantive advance.
On the basis of the present embodiment, the GM pooling layer 2 was replaced by an ordinary maximum pooling layer (pooling window size 3 × 3, step size 2) with the rest of the network unchanged, yielding comparison network A. Also on the basis of the present embodiment, the connection between the first spatial attention module 13 and the second spatial attention module 14 was removed, so that the calibration map of the first spatial attention module 13 is no longer input into the second spatial attention module 14, with the other parts of the network unchanged, yielding comparison network B. Exactly the same training and testing procedure was used, and the test results are shown in the following table:
[Table: ablation results for comparison networks A and B versus the full network; reproduced only as an image in the original publication]
From the above results, the GM pooling layer 2 and the DSAG unit 1 provided by the invention each have a clear positive effect on improving the network's recognition accuracy on low-resolution face images.
The above embodiment expresses only a specific implementation of the present invention, and its description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these all fall within the scope of protection of the present invention.

Claims (5)

1. A face recognition method based on a convolutional neural network, characterized by comprising the following steps:
s10, building a face recognition network, and training the face recognition network by using a training data set;
the face recognition network comprises DSAG units, GM pooling layers, a GAP layer, a full connection layer and a softmax layer, wherein the DSAG units are used for extracting feature information from the image; there are a plurality of DSAG units and GM pooling layers, and they are arranged alternately along the depth direction of the face recognition network;
S20, acquiring a face image to be recognized and inputting it into the trained face recognition network, the face image passing sequentially through the DSAG units and GM pooling layers to obtain an intermediate feature map;
S30, performing a global average pooling operation on the intermediate feature map by using the GAP layer to obtain a primary feature vector;
S40, passing the primary feature vector sequentially through the full connection layer and the softmax layer, and outputting the classification result to complete face recognition;
wherein the DSAG unit can be represented by the following mathematical model:
T1 = f_RC1(Y_{n-1})
T2 = f_SA1(T1) * T1
T3 = f_RC2(T2)
Y_n = f_SA2(T3, U) * T3
wherein Y_{n-1} and Y_n respectively represent the input and output of the DSAG unit, f_RC1() represents the first feature extraction component, f_RC2() represents the second feature extraction component, f_SA1() represents the first spatial attention module, f_SA2() represents the second spatial attention module, U represents the calibration map passed from the first spatial attention module into the second spatial attention module, and T3 and U are both inputs of the second spatial attention module.
2. The face recognition method based on the convolutional neural network as claimed in claim 1, wherein: the first feature extraction component and the second feature extraction component each comprise a plurality of convolution residual blocks connected in sequence, and the convolution residual block can be expressed by the following formula:
M_n = λ_2(f_2(λ_1(f_1(M_{n-1})))) + M_{n-1}
wherein M_{n-1} and M_n respectively represent the input and output of the convolution residual block, f_1 and f_2 both represent convolution layers with a convolution kernel size of 3×3, and λ_1 and λ_2 both represent the ReLU activation function.
3. The face recognition method based on the convolutional neural network as claimed in claim 1, wherein: the first spatial attention module can be represented by the following formula:
Z1 = θ1(f_CS1(<AP1(T1), MP1(T1), DP1(T1)>))
wherein T1 is the feature map input into the first spatial attention module, Z1 represents the first calibration map output by the first spatial attention module, f_CS1 represents a convolution operation with a convolution kernel size of 1×1, θ1 represents the sigmoid activation function, <·> represents a splicing operation, AP1() represents the average pooling operation layer, MP1() represents the maximum pooling operation layer, and DP1() represents the median pooling operation layer.
4. The face recognition method based on the convolutional neural network as claimed in claim 3, wherein: the second spatial attention module can be represented by the following mathematical model:
X1 = AP1(T1) + MP1(T1) - AP2(T3)
X2 = MP1(T1) + MP2(T3)
X3 = θ2(f_CS2(<X1, X2, AP2(T3), DP1(T1), DP2(T3)>))
wherein T1 is the feature map input into the first spatial attention module, T3 is the feature map input into the second spatial attention module, X3 represents the second calibration map output by the second spatial attention module, AP1() and AP2() respectively represent the average pooling operation layers in the first and second spatial attention modules, MP1() and MP2() respectively represent the maximum pooling operation layers in the first and second spatial attention modules, DP1() and DP2() respectively represent the median pooling operation layers in the first and second spatial attention modules, f_CS2 represents a convolution operation with a convolution kernel size of 1×1, θ2 represents the sigmoid activation function, and <·> represents a splicing operation.
5. The face recognition method based on the convolutional neural network as claimed in claim 1, wherein: the pooling window size of the GM pooling layer is 3×3 with a step size of 2, and the operation of the GM pooling layer can be represented by the following mathematical model:
K1 = sort(P)
K2 = Avg(max1(K1) + max2(K1) + max3(K1)) + max1(K1)
wherein P is a 3×3 matrix input into the GM pooling layer, sort(P) represents sorting the elements of the matrix P from large to small, max1(K1) represents taking the value of the element at the first position of the sorted sequence K1, max2(K1) represents taking the value of the element at the second position of K1, max3(K1) represents taking the value of the element at the third position of K1, and Avg() represents an averaging operation.
CN202210049539.0A, filed 2022-01-17 (priority date 2022-01-17): Face recognition method based on convolutional neural network. Status: pending. Published as CN114360030A.

Priority Applications (1)

Application Number: CN202210049539.0A; Priority Date: 2022-01-17; Filing Date: 2022-01-17; Title: Face recognition method based on convolutional neural network

Applications Claiming Priority (1)

Application Number: CN202210049539.0A; Priority Date: 2022-01-17; Filing Date: 2022-01-17; Title: Face recognition method based on convolutional neural network

Publications (1)

Publication Number: CN114360030A; Publication Date: 2022-04-15

Family

ID=81091275

Family Applications (1)

Application Number: CN202210049539.0A (pending); Priority Date: 2022-01-17; Filing Date: 2022-01-17; Title: Face recognition method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114360030A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205614A (en) * 2022-05-20 2022-10-18 钟家兴 Ore X-ray image identification method for intelligent manufacturing
CN115205614B (en) * 2022-05-20 2023-12-22 深圳市沃锐图像技术有限公司 Ore X-ray image identification method for intelligent manufacturing
CN115187814A (en) * 2022-07-25 2022-10-14 重庆芸山实业有限公司 Chrysanthemum mosaic disease diagnosis method and device based on artificial intelligence
CN115187814B (en) * 2022-07-25 2024-05-10 重庆芸山实业有限公司 Artificial intelligence-based chrysanthemum mosaic disease diagnosis method and equipment
CN115661911A (en) * 2022-12-23 2023-01-31 四川轻化工大学 Face feature extraction method, device and storage medium
CN115937956A (en) * 2023-01-05 2023-04-07 广州蚁窝智能科技有限公司 Face recognition method and board system for kitchen
CN115984949A (en) * 2023-03-21 2023-04-18 威海职业学院(威海市技术学院) Low-quality face image recognition method and device with attention mechanism


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination