CN109543606A

CN109543606A - A kind of face identification method that attention mechanism is added

Info

Publication number: CN109543606A
Application number: CN201811396296.8A
Authority: CN
Inventors: 郑伟诗; 叶海佳
Original assignee: National Sun Yat Sen University
Current assignee: National Sun Yat Sen University
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2019-03-29
Anticipated expiration: 2038-11-22
Also published as: CN109543606B

Abstract

The invention discloses a kind of face identification methods that attention mechanism is added, Face datection and face registration process first are carried out to data set with cascade neural network, the deep neural network that attention mechanism is added is constructed again, test sample input is finally trained attention mechanism network and carries out recognition of face by training attention mechanism network.The present invention constructs attention mechanism using STN module, and different STN modules is input to after each stage of deep neural network, the output result fusion of the series connection output result and deep neural network of STN module is got up, as output feature.In order to which network can adaptively be learnt to the region of interest characteristic of field with judgement index, the present invention uses the method for carrying out affine transformation operation to input by STN module, strengthen understanding and study of the network to local message, on existing recognition of face network, the accuracy rate for improving recognition of face enhances the robustness of identifying system.

Description

A kind of face identification method that attention mechanism is added

Technical field

The present invention relates to machine deep learning, image procossing identification field more particularly to a kind of addition attention mechanism Face identification method.

Background technique

Recognition of face is one of project most challenging in computer vision field and machine learning field in recent years, by The effective recognition of face of extensive concern success for having arrived researchers has broad application prospects, can be in national defense safety, video The scenes such as monitoring, human-computer interaction and video index are played a great role.

Currently, the feature extraction Web vector graphic Classification Loss (Softmax Loss) for being mostly based on CNN is instructed as network Experienced supervisory signals, these networks are the distance between different classes of in the training process gradually to increase to be classified as learning objective Greatly.Deepface uses sorter network method, while using complicated 3D alignment thereof and a large amount of training data.DeepID is then It is that piecemeal is carried out to face picture first, feature extraction then is carried out to different faces block using multiple sorter networks, is finally made With joint bayesian algorithm these features are merged, due to the technology be to different faces block carry out feature extraction, so Data set increases several times than original image, and the training time greatly increases, and computing resource consumption is big.In addition these face blocks are all tight Lattice fix division mode, and for side face or irregular face picture, then the accuracy rate can have a greatly reduced quality, and algorithm is inadequate Robust.

Summary of the invention

In order to overcome the shortcomings of the prior art, the present invention provides a kind of recognition of face side that attention mechanism is added Method, by paying attention to power module, neural network can learn automatically to the face block feature with identification, rather than fixed partition Face block is more advantageous to promotion classification accuracy with the feature that such method is extracted, and robustness is stronger.Simultaneously because paying attention to Power module is simple for structure, so computing resource consumption is few, network convergence rate is fast.

In order to achieve the above object, the invention adopts the following technical scheme:

The present invention discloses a kind of face identification method that attention mechanism is added, and includes the following steps:

S1: image preprocessing, the facial image being aligned are carried out using cascade convolutional neural networks；

S2: data augmentation is carried out to pretreated image, the data augmentation includes random cropping and random overturning behaviour Make, go out the size area set by step S1 treated image random cropping, image is overturn with the probability of setting, Whitening processing finally is done to image, the image being sized then directly is normalized into for test sample, is then carried out at albefaction Reason, it is described be sized it is identical as being sized for random cropping；

S3: setting attention mechanism module learns to the face block feature with identification automatically for network, utilizes note Power mechanism module anticipate for the image progress convolution operation of input, then carries out connecting to return entirely exporting M angle value, M is nature Number constructs matrix based on M angle value, the local feature of image is extracted by matrix operation；

S4: building attention mechanism network, extracts characteristics of image using deep neural network, and attention mechanism mould is added Block, the attention mechanism network include main road and branch, and the main road is that picture is defeated by obtaining after deep neural network Out, the branch is that different attention mechanism modules is passed through in the output in each stage of deep neural network, then successively carries out The output of main road and branch is finally carried out merging features, obtains final image by the output obtained after elementwise-add Characteristic pattern, for calculating loss function and as the feature of recognition of face；

S5: training attention mechanism network is trained simultaneously attention mechanism network using recognition of face loss function And it saves；

S6: characteristics of image is extracted, test sample is input in trained attention mechanism network, good figure is obtained As feature；

S7: recognition of face classifies the characteristics of image that extraction obtains with softmax homing method, completes test specimens This identification.

As a preferred technical solution, cascade convolutional neural networks described in step S1 use MTCNN, including P-Net, R-Net and O-Net gives any one testing image, zooms to different proportion, constructs image pyramid, then sequentially inputs P-Net, R-Net and O-Net extract face candidate frame, further include that fitting face and non-face classification, frame recurrence and face are special The target training that sign point coordinate returns, specific loss function are as described below:

MTCNN carries out face and non-face classification uses cross entropy as loss function, is denoted as L_det, calculation formula is as follows:

Wherein, p⁽ⁱ⁾For the probability of model prediction,For test sample x⁽ⁱ⁾Label,

MTCNN carries out frame recurrence and uses L2Loss as loss function, is denoted as L_box, calculation formula is as follows:

Wherein,It is the regressand value of model prediction,It is test sample x⁽ⁱ⁾True coordinate value, and

MTCNN carries out the recurrence of human face characteristic point coordinate and equally uses L2Loss as loss function, is denoted as L_landmark, calculate Formula is as follows:

Wherein,It is the regressand value of model prediction,It is test sample x⁽ⁱ⁾The coordinate of real human face characteristic point Value, and

The MTCNN introduces catalogue scalar functions as a preferred technical solution, participates in damage for excluding non-face data The calculating of function is lost, the catalogue scalar functions calculation formula is as follows:

Wherein, N indicates training sample sum, α_jIndicate significance level of the corresponding objective function in total objective function, Associated weight for P-Net or R-Net is (α_det=1, α_box=0.5, α_landmark=0.5)；For the associated weight of ONet For (α_det=1, α_box=0.5, α_landmark=1).

Attention mechanism module described in step S3 uses STN module, the STN module packet as a preferred technical solution, Localized network module, mesh generator and sampler are included,

The picture of input is carried out convolution operation by the localized network module, is then carried out full connection and is returned out 6 angles Angle value forms the matrix of 2*3,

The mesh generator calculates each position in target figure V by matrix operation and corresponds to coordinate in original image U Position generates T_θ(G_i), specific formula for calculation is as described below:

Wherein,The coordinate of original graph is represented,Represent the coordinate of target figure, A_θTo localize net 6 angle values that network module network returns out,

The sampler is sampled in original graph U according to the coordinate information in T (G), and the pixel in U is copied to In target figure V.

As a preferred technical solution, in step S4, the basic network of the deep neural network uses resnet50, Resnet50 includes 5 stage, described in detail below:

Stage0: including convolutional layer and pond layer, the convolution kernel size of the convolutional layer is 7x7, and output channel number is 64, Step-length is 2, and the pond layer uses the pond mode of maxpooling, window size 3x3, step-length 2；

Stage1: it is made of the block that 3 output channel numbers are 256；

Stage2: it is made of the block that 4 output channel numbers are 512；

Stage3: it is made of the block that 5 output channel numbers are 1024；

Stage4: it is made of the block that 6 output channel numbers are 2048；

The characteristics of image figure difference that the branch network obtains the stage0,1,2,3,4 of basic network resnet50 is defeated Enter into each STN module, obtain feature L0, L1, L2, L3, L4, the L1-L4 does a convolution operation, convolution kernel size For 1x1, step-length 1, output channel number is the port number of a upper feature, with the mode of elementwise-add these spies Sign is successively added, specific calculation are as follows:

L0+f(L1)+f(L2)+f(L3)+f(L4)

Wherein "+" it is that elsemenwise-add is operated, f () is convolution operation.

Described piece of structure forming step is described in detail below as a preferred technical solution:

Dimensionality reduction is carried out using 1x1 convolution, then carries out 3x3 convolution operation, then rise dimension with 1x1 convolution, export with it is defeated It is entering to carry out to obtain after elementwise-add operation as a result,

The full articulamentum for being eventually adding one 128 dimension carries out dimensionality reduction.

Recognition of face loss function described in step S5 uses Softmax function as a preferred technical solution, is based on The road K of the disaggregated model of Softmax function exports are as follows:

Whereinb_kFor Softmax layers of two parameters, K group weight and biasing are indicated.

Use unactivated full articulamentum for described Softmax layers as a preferred technical solution,.

As a preferred technical solution, after the Softmax layers of output transform K class posterior probability are as follows:

For the maximum probability of each test sample generic, Softmax Loss is defined are as follows:

Wherein θ indicates model parameter, x⁽ⁱ⁾Indicate test sample y⁽ⁱ⁾Generic.

The disaggregated model based on Softmax function further includes optimizer as a preferred technical solution, and optimizer is adopted Use Adam.

Compared with the prior art, the invention has the following advantages and beneficial effects:

It (1) is starting point the present invention is based on the face local feature for more having identification is extracted, in the frame of base neural network Attention mechanism module is devised under frame, and is combined with unique connection type and deep neural network, is formd unique The face identification method of attention mechanism is added, the face characteristic of abundant classification relevant information can be extracted.

(2) present invention carries out data augmentation, including random cropping and random turning operation to pretreated image, is used for Increase the sample data of training, the robustness of Strengthens network is capable of in the data amplification of training set.

(3) attention mechanism module of the invention uses STN module, and STN module includes localized network module, and grid is raw It grows up to be a useful person and sampler, the STN modular structure is succinct, and computing resource consumption is few, and network convergence rate is fast.

Detailed description of the invention

Fig. 1 is the structural schematic diagram that face of the present invention is aligned network；

Fig. 2 is the structural schematic diagram of STN module of the present invention；

Fig. 3 is the structural schematic diagram of depth of foundation convolutional neural networks of the present invention；

Fig. 4 is the block structure schematic diagram in depth of foundation convolutional neural networks of the present invention；

Fig. 5 is the structural schematic diagram of attention mechanism network of the present invention.

Specific embodiment

In order to make the objectives, technical solutions, and advantages of the present invention clearer, with reference to the accompanying drawings and embodiments, right The present invention is further elaborated.It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention, and It is not used in the restriction present invention.

The present embodiment discloses a kind of face recognition algorithms based on addition attention mechanism, and the algorithm includes following step It is rapid:

Step 1: the data prediction of Face datection face alignment, the cascade of use are carried out using cascade neural network Convolutional neural networks be MTCNN, MTCNN cascade structure is mainly made of 3 convolutional neural networks, respectively P-Net, R- Net and O-Net.A picture to be detected is given, picture can be first scaled to different ratios, empty with the scale for constructing picture Between, three networks are sequentially input, then to extract face candidate frame.As shown in Figure 1, there are three stage compositions for the algorithm: first Stage, the CNN of shallow-layer quickly generate candidate forms；Second stage refines candidate forms by more complicated CNN, abandons a large amount of Overlapping forms；Phase III, the more powerful CNN of use realize candidate forms going or staying, while showing five facial key points Positioning.When carrying out model training, in order to merge Face datection and face alignment task, MTCNN is fitted 3 mesh simultaneously Mark: face/non-face classification, frame return and human face characteristic point coordinate returns.Three loss functions are respectively:

(1) face/non-face classification

Face/non-face is that two classification problems so MTCNN uses cross entropy as loss function are denoted as L_det。 For each test sample x⁽ⁱ⁾,

(2) frame returns: the purpose that frame returns is to estimate and neighbouring real human face area each face candidate frame The offset in domain, including the left side, top, width and height.So frame recurrence be a regression problem, using above-mentioned 4 numerical value as Regressive object is denoted as L so MTCNN uses L2Loss as loss function_box.For each test sample x⁽ⁱ⁾,

It wherein, is the regressand value of model prediction,It is test sample x⁽ⁱ⁾True coordinate value, because of target to be returned There are 4 values, so

(3) human face characteristic point coordinate returns

The recurrence of human face characteristic point coordinate is equally a regression problem, since MTCNN only detects 5 human face characteristic points, and Each characteristic point includes x, y-coordinate, so one shares 10 regressive objects.Equally use L2Loss as loss function here, It is denoted as L_landmark.For each test sample x⁽ⁱ⁾:

Wherein,It is the regressand value of model prediction,It is test sample x⁽ⁱ⁾The coordinate of real human face characteristic point Value, because target to be returned has 10 values,

(4) catalogue scalar functions

It allows and model while being fitted different targets, need using different types of training data, such as non-face picture, portion Divide face picture, band characteristic point mark human face data etc., but not all data are all significant to all objective functions, such as Non-face data are to L_landmarkIt is not significant.It is not that every kind of sample requires to participate in being damaged thus when training The calculating of function is lost, in order to carry out distinguishing to different samples, MTCNN introduces sample type labelIndicate sample x⁽ⁱ⁾Whether type j is belonged to, and then general objective function representation is

Wherein, N indicates training sample sum, α_jIndicate significance level of the corresponding objective function in total objective function, For P-Net and R-Net, associated weight is (α_det=1, α_box=0.5, α_landmark=0.5)；And for ONet, in order to guarantee The accuracy of human face characteristic point improves the weight of characteristic point coordinate regressive object function, becomes (α_det=1, α_box=0.5, α_landmark=1)

Step 2: data augmentation

Data augmentation uses random cropping and random turning operation, the former will by step 1 treated picture with Machine cuts out the region 160x160, and the latter overturns picture with 0.5 probability.Albefaction finally is carried out to picture.Test sample It is then directly normalized into the picture of 160x160 size, then equally carries out albefaction.

Step 3: design attention mechanism module

Attention mechanism module is using STN module: as shown in Fig. 2, STN module is by localized network module (Localisation Network), mesh generator (Grid generator), 3 parts of sampler (Sampler) form.

Localisation Network: the network is exactly a simple Recurrent networks.The picture of input is carried out several A convolution operation, then full connection returns out 6 angle values (assuming that being affine transformation), the matrix of 2*3.

Grid generator: mesh generator is responsible for the coordinate position in V by matrix operation, calculating target Each position in figure V corresponds to the coordinate position in original image U, i.e. generation T_θ(G_i)。

Here Grid sampling process is exactly simple square for two dimensional affine transformation (rotation translates, scaling) Battle array operation:

In above formula,The coordinate of original graph is represented,Represent the coordinate of target figure.A_θFor 6 angle values that Localisation Network net regression goes out.

Sampler: sampler is according to T_θ(G_i) in coordinate information, sampled in original graph U, the pixel in U answered It makes in target figure V.

Step 3: attention mechanism network is built

The method that feature extraction uses deep neural network, the basic network of use is resnet50, then this base again Attention mechanism module is added on plinth.And attention mechanism module is using STN module: input feature vector is carried out several convolution Operation, then full connection returns out 6 angle values (assuming that being affine transformation), the matrix of 2*3.Then it inputs multiplied by this matrix It can obtain the significant feature in part.

Network is divided into main road and branch, and main road is the output that picture is obtained by resnet50, and branch is by different The output that elementwise-add is obtained successively is carried out after STN module again.

Main road: resnet50 is made of 5 stages, wherein each stage includes the operation of several convolution sum pondizations.

As shown in figure 3, resnet50 first is divided by output characteristic pattern size, 5 stage, each stage can be divided into The characteristic pattern size of output is all different.

Stage0 has a convolutional layer and pond layer, and convolution kernel size is 7x7, and output channel number is 64, step-length 2.Pond Change using maxpooling, window size 3x3, step-length 2.

Stage1 is made of the block (block) that 3 output channel numbers are 256.

Stage2 is made of the block (block) that 4 output channel numbers are 512.

Stage3 is made of the block (block) that 5 output channel numbers are 1024.

Stage4 is made of the block (block) that 6 output channel numbers are 2048.

As shown in figure 4, wherein the structure of each block is first to carry out dimensionality reduction with a 1x1 convolution, then carry out 3x3 convolution, finally rises dimension with 1x1 convolution again, and output does elementwise-add operation with input, obtains result.

The full articulamentum for being most followed by one 128 dimension carries out information integration.

Branch: the characteristic pattern that stage0,1,2,3,4 is obtained respectively is input in each STN module and obtains respective spy Sign:

Output of the stage0 after STN is L0；

Output of the Stage1 after STN is L1；

Output of the Stage2 after STN is L2；

Output of the Stage3 after STN is L3；

Output of the Stage4 after STN is L4；

As shown in figure 5, remaining feature all does a convolution operation in addition to first feature, convolution kernel size is 1x1, Step-length is 1, output channel number be a upper feature port number, with the mode of elementwise-add these features successively Fusion is got up, so the meaning for doing convolution operation is exactly for changing characteristic dimension, so as to feature phase add operation.Specific addition side Method is as follows:

L0+f(L1)+f(L2)+f(L3)+f(L4)

Main road output and branch output can be thus obtained, the output of two-way is finally carried out merging features, is obtained final Feature.This feature, which will be directly used in, calculates loss function and as the feature of recognition of face.

Step 5: training attention mechanism neural network

In the present embodiment, when constructing Softmax disaggregated model, feature is exported and inputs the road K Softmax floor for x by we (being realized using unactivated full articulamentum), to calculate sample about different classes of posterior probabilityWherein K represents classification number Mesh.Softmax layers include two parameters, W and b, and then kth road exportsIt can be expressed as again:

But since the output of full articulamentum is any number, in order to which sample is about different classes of normalization probability, we Need the posterior probability about kth class then obtained to Softmax layers of output transform are as follows:

In the present embodiment, in order to maximize maximum probability of each sample about generic, we can be defined Softmax Loss are as follows:

θ indicates model parameter, x⁽ⁱ⁾Indicate sample y⁽ⁱ⁾Generic.

In the present embodiment, optimizer uses Adam, and weight decays to 5e-5, and batch size is 128, average pond layer Output is operated using dropout, and keeping probability is 0.8.Learning rate adjustable strategies are as follows: first using 0.1 as learning rate to training set Then 3 wheel of training is reduced to 0.01 training, 2 wheel, be then reduced to 0.001 training, 2 wheel again, totally 7 wheel.Every classification for having instructed a wheel Model can all be verified on LFW, and finally trained disaggregated model is saved.

Step 6: learn the high-level characteristic and abstract characteristics of image

Characteristics of image is extracted, test sample is input in trained attention mechanism network, good image is obtained Feature.

Step 7: recognition of face

The characteristics of image that extraction obtains is classified with softmax homing method, completes the identification of test sample.

The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not by above-described embodiment Limitation, other any changes, modifications, substitutions, combinations, simplifications made without departing from the spirit and principles of the present invention, It should be equivalent substitute mode, be included within the scope of the present invention.

Claims

1. a kind of face identification method that attention mechanism is added, which is characterized in that include the following steps:

S2: data augmentation is carried out to pretreated image, the data augmentation includes random cropping and random turning operation, warp It crosses step S1 treated that image random cropping goes out the size area set, image is overturn with the probability of setting, finally Whitening processing is done to image, the image being sized then directly is normalized into for test sample, then carries out whitening processing, institute State be sized it is identical as being sized for random cropping；

S3: setting attention mechanism module learns to the face block feature with identification automatically for network, utilizes attention The image of input is carried out convolution operation by mechanism module, is then carried out full connection and is returned M angle value of output, M is natural number, base Matrix is constructed in M angle value, the local feature of image is extracted by matrix operation；

S4: building attention mechanism network, extracts characteristics of image using deep neural network, and attention mechanism module is added, The attention mechanism network includes main road and branch, and the main road is that picture passes through the output obtained after deep neural network, The branch is that different attention mechanism modules is passed through in the output in each stage of deep neural network, then successively carries out The output of main road and branch is finally carried out merging features, obtains final image by the output obtained after elementwise-add Characteristic pattern, for calculating loss function and as the feature of recognition of face；

S5: training attention mechanism network is trained and is protected to attention mechanism network using recognition of face loss function It deposits；

S6: extracting characteristics of image, test sample be input in trained attention mechanism network, and it is special to obtain good image Sign；

S7: recognition of face classifies the characteristics of image that extraction obtains with softmax homing method, completes test sample Identification.

2. a kind of face identification method that attention mechanism is added according to claim 1, which is characterized in that in step S1 The cascade convolutional neural networks use MTCNN, including P-Net, R-Net and O-Net, give any one testing image, Different proportion is zoomed to, image pyramid is constructed, then sequentially inputs P-Net, R-Net and O-Net, extracts face candidate frame, Further include the target training for being fitted face and non-face classification, frame recurrence and the recurrence of human face characteristic point coordinate, specifically loses letter Number is as described below:

MTCNN carries out the recurrence of human face characteristic point coordinate and equally uses L2Loss as loss function, is denoted as L_landmark, calculation formula It is as follows:

Wherein,It is the regressand value of model prediction,It is test sample x⁽ⁱ⁾The coordinate value of real human face characteristic point, And

3. a kind of face identification method that attention mechanism is added according to claim 2, which is characterized in that described MTCNN introduces catalogue scalar functions, and the calculating of loss function is participated in for excluding non-face data, and the catalogue scalar functions calculate Formula is as follows:

Wherein, N indicates training sample sum, α^jSignificance level of the corresponding objective function in total objective function is indicated, for P- The associated weight of Net or R-Net is (α_det=1, α_box=0.5, α_landmark=0.5)；Associated weight for ONet is (α_det =1, α_box=0.5, α_landmark=1).

4. a kind of face identification method that attention mechanism is added according to claim 1, which is characterized in that step S3 institute Attention mechanism module is stated using STN module, the STN module includes localized network module, mesh generator and sampler,

The picture of input is carried out convolution operation by the localized network module, is then carried out full connection and is returned out 6 angle values, The matrix of 2*3 is formed,

The mesh generator calculates each position in target figure V by matrix operation and corresponds to coordinate position in original image U, Generate T_θ(G_i), specific formula for calculation is as described below:

Wherein,The coordinate of original graph is represented,Represent the coordinate of target figure, A_θFor localized network mould 6 angle values that block net regression goes out,

The sampler is sampled in original graph U according to the coordinate information in T (G), and the pixel in U is copied to target Scheme in V.

5. a kind of face identification method that attention mechanism is added according to claim 1, which is characterized in that step S4 In, it includes 5 stage that the basic network of the deep neural network, which uses resnet50, resnet50, described in detail below:

Stage0: including convolutional layer and pond layer, the convolution kernel size of the convolutional layer is 7x7, and output channel number is 64, step-length It is 2, the pond layer uses the pond mode of maxpooling, window size 3x3, step-length 2；

Stage1: it is made of the block that 3 output channel numbers are 256；

Stage2: it is made of the block that 4 output channel numbers are 512；

Stage3: it is made of the block that 5 output channel numbers are 1024；

Stage4: it is made of the block that 6 output channel numbers are 2048；

The characteristics of image figure that the stage0,1,2,3,4 of basic network resnet50 is obtained is separately input to by the branch network In each STN module, feature L0, L1, L2, L3, L4 are obtained, the L1-L4 does a convolution operation, and convolution kernel size is 1x1, step-length 1, output channel number is the port number of a upper feature, with the mode of elementwise-add these features It is successively added, specific calculation are as follows:

L0+f(L1)+f(L2)+f(L3)+f(L4)

6. a kind of face identification method that attention mechanism is added according to claim 5, which is characterized in that described piece Structure forming step is described in detail below:

Dimensionality reduction is carried out using 1x1 convolution, then carries out 3x3 convolution operation, then rise dimension with 1x1 convolution, export and input into Row elementwise-add operation after obtain as a result,

7. a kind of face identification method that attention mechanism is added according to claim 1, which is characterized in that in step S5 The recognition of face loss function uses Softmax function, and the road K of the disaggregated model based on Softmax function exports are as follows:

8. a kind of face identification method that attention mechanism is added according to claim 7, which is characterized in that described Softmax layers use unactivated full articulamentum.

9. a kind of face identification method that attention mechanism is added according to claim 8, which is characterized in that described The posterior probability of K class after Softmax layers of output transform are as follows:

10. a kind of face identification method that attention mechanism is added according to claim 7, which is characterized in that the base It further include optimizer in the disaggregated model of Softmax function, optimizer uses Adam.