CN115661911A - Face feature extraction method, device and storage medium

Face feature extraction method, device and storage medium

Info

Publication number
CN115661911A
CN115661911A
Authority
CN
China
Prior art keywords
feature
feature map
layer
feature extraction
activation
Prior art date
Legal status
Granted
Application number
CN202211658800.3A
Other languages
Chinese (zh)
Other versions
CN115661911B (en)
Inventor
朱文忠
肖顺兴
车璇
李韬
杜洪文
谢康康
谢林森
Current Assignee
Sichuan University of Science and Engineering
Original Assignee
Sichuan University of Science and Engineering
Priority date
Filing date
Publication date
Application filed by Sichuan University of Science and Engineering filed Critical Sichuan University of Science and Engineering
Priority to CN202211658800.3A priority Critical patent/CN115661911B/en
Publication of CN115661911A publication Critical patent/CN115661911A/en
Application granted granted Critical
Publication of CN115661911B publication Critical patent/CN115661911B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a face feature extraction method, a face feature extraction device and a storage medium, belonging to the technical field of face recognition. The feature extraction method comprises: obtaining a face image and a trained feature extraction network model; extracting basic feature information of the face image with a basic operation layer to generate a basic feature map; inputting the basic feature map into the first deep tone feature extraction monomer; letting each subsequent deep tone feature extraction monomer take the feature map output by its upstream deep tone feature extraction monomer as input and generate and output a corresponding intermediate feature map; repeating this until the last deep tone feature extraction monomer generates and outputs a final-stage feature map; and inputting the final-stage feature map into a feature shaping unit to generate a face feature vector, completing face feature extraction. By arranging a plurality of spatial attention mechanisms, the feature extraction network model of the invention modulates the feature information step by step, so that the network can effectively suppress noise and extract the core feature information.

Description

Face feature extraction method, device and storage medium
Technical Field
The invention belongs to the technical field of face recognition, and particularly relates to a face feature extraction method, face feature extraction equipment and a storage medium.
Background
With the improvement and popularization of hardware performance, face recognition technology has gradually moved out of the laboratory and into people's daily lives. After long-term development, many face recognition algorithms can now handle common real-world scenes well and achieve satisfactory recognition accuracy. However, when the quality of the captured face image is poor (for example under unsatisfactory illumination, large pose changes or varied expressions), existing algorithms still suffer from poor robustness; in particular, for facial appearance changes caused by aging, they struggle to extract the required feature information effectively.
Disclosure of Invention
In view of the above deficiencies in the prior art, the present invention provides a face feature extraction method, device and storage medium, so as to effectively learn and extract feature information from age-spanning face images and improve the accuracy of recognizing age-spanning face images.
In order to achieve the above purpose, the solution adopted by the invention is as follows: a face feature extraction method comprises the following steps:
s100, obtaining a face image, and obtaining a trained feature extraction network model; the characteristic extraction network model is sequentially provided with a basic operation layer, a plurality of deep tone characteristic extraction monomers and a characteristic shaping unit, wherein the plurality of deep tone characteristic extraction monomers are sequentially connected in series;
s200, inputting the face image into the feature extraction network model, extracting basic feature information of the face image by using the basic operation layer, and then generating a basic feature map;
s300, inputting the basic feature map into a first deep tone feature extraction monomer, and outputting a primary feature map by the first deep tone feature extraction monomer after feature extraction operation;
s400, the next deep tone feature extraction monomer takes the feature graph output by the upstream deep tone feature extraction monomer as input, then carries out feature extraction operation, generates and outputs a corresponding intermediate feature graph;
s500, continuously repeating the step S400 until the last deep tone feature extraction monomer generates and outputs a final-stage feature map;
s600, inputting the final-stage feature map into the feature shaping unit, and after shaping operation is carried out on the final-stage feature map, generating a face feature vector to finish face feature extraction;
the calculation operation process inside the deep tone feature extraction monomer is represented as the following mathematical model:
(the formula is reproduced in the original publication only as an image and is not shown here; the symbols it uses are defined as follows)
where X_in denotes the precursor feature map input to the deep tone feature extraction monomer; Conv_1, Conv_2 and Conv_3 denote the first, second and third convolution operations respectively; A_1, A_2, A_3, A_4 and A_5 denote the first, second, third, fourth and fifth attention modules respectively; ⊙ denotes the element-wise product; σ_1, σ_2 and σ_3 denote the first, second and third activation functions respectively; F_1, F_2 and F_3 denote the first, second and third feature maps generated after activation by σ_1, σ_2 and σ_3 respectively; F_4 denotes the fourth feature map obtained by calibrating the third feature map with the third attention module; MSF denotes the multi-scale fusion unit; S denotes the side-branch feature map generated by the multi-scale fusion unit fusing the first, second and third feature maps; F_5 denotes the fifth feature map obtained by adding the side-branch feature map, after calibration by the fourth attention module, to the fourth feature map; P_1 denotes the first process feature map output from the first attention module; P_2 denotes the second process feature map output from the second attention module; P_3 denotes the third process feature map output from the third attention module; P_4 and P_5 denote the fourth and fifth process feature maps, both output from the fourth attention module; V denotes the dimension-changing unit, which increases the number of feature map channels and reduces the feature map width and height; and X_out denotes the successor feature map output by the deep tone feature extraction monomer.
Further, the convolution kernel sizes of the first convolution operation, the second convolution operation and the third convolution operation are all 3×3, and the step sizes are all 1; the first activation function, the second activation function and the third activation function are all ReLU functions.
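The formula for this computation appears in the published text only as an image, so it cannot be quoted verbatim. As a reading aid, the following PyTorch-style sketch shows one data flow that is consistent with the symbol definitions above and with embodiment 1; the class names (DeepToneMonomer, SpatialAttention, ChannelAttention, IntegratedAttention, MultiScaleFusion, DimensionChange) are illustrative placeholders that are sketched after the corresponding paragraphs below, and the exact points at which the first and second attention modules calibrate the convolution chain are an inference, not the patent's verbatim formula.

```python
# Hypothetical sketch of one deep tone feature extraction monomer.
# The exact formula is published only as an image; the ordering below is
# inferred from the symbol definitions and from embodiment 1. The sub-module
# classes are sketched after the corresponding paragraphs later in this text.
import torch.nn as nn

class DeepToneMonomer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        def conv3x3():
            # 3x3 convolution, step size 1; padding assumed so the spatial size is preserved
            return nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv1, self.conv2, self.conv3 = conv3x3(), conv3x3(), conv3x3()
        self.act1, self.act2, self.act3 = nn.ReLU(), nn.ReLU(), nn.ReLU()
        self.att1, self.att2, self.att3 = SpatialAttention(), SpatialAttention(), SpatialAttention()
        self.att4 = ChannelAttention(channels)      # fourth attention module A_4
        self.att5 = IntegratedAttention(channels)   # fifth attention module A_5
        self.msf = MultiScaleFusion(channels)       # multi-scale fusion unit
        self.var_dim = DimensionChange(channels)    # doubles channels, halves width and height

    def forward(self, x_in):
        f1 = self.act1(self.conv1(x_in))            # first feature map F_1
        a1, p1 = self.att1(f1)                      # spatial attention map and process map P_1
        f2 = self.act2(self.conv2(a1 * f1))         # second feature map F_2 (calibration point inferred)
        a2, p2 = self.att2(f2)
        f3 = self.act3(self.conv3(a2 * f2))         # third feature map F_3
        a3, p3 = self.att3(f3)
        f4 = a3 * f3                                # fourth feature map F_4
        s = self.msf(f1, f2, f3)                    # side-branch feature map S
        w4, p4, p5 = self.att4(s)                   # channel weights and process maps P_4, P_5
        f5 = f4 + w4 * s                            # fifth feature map F_5
        a5 = self.att5(p1, p2, p3, p4, p5)          # integrated attention map
        return self.var_dim(a5 * f5)                # successor feature map X_out
```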
Further, the internal operation process of the multi-scale fusion unit is represented as the following mathematical model:
(the formula is reproduced in the original publication only as an image and is not shown here; the symbols it uses are defined as follows)
where F_1, F_2 and F_3 denote the first, second and third feature maps, which the multi-scale fusion unit takes as input; S denotes the side-branch feature map output by the multi-scale fusion unit; ⊙ denotes the element-wise product; M_1 denotes the first fusion feature map generated by adding the first, second and third feature maps; M_2 denotes the second fusion feature map generated by element-wise multiplication of the first, second and third feature maps; Cat denotes the splicing (concatenation) of feature maps; Conv_4 and Conv_5 denote the fourth and fifth convolution operations respectively, whose convolution kernel sizes are both 1×1 with step size 1; σ_4 and σ_5 denote the fourth and fifth activation functions respectively, both of which are ReLU functions; and M_3 denotes the third fusion feature map generated after activation by the fourth activation function.
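As a reading aid, a minimal PyTorch-style sketch of the multi-scale fusion unit under the definitions above follows; the composition of the second splicing stage (here M_3 together with F_1, F_2 and F_3) is inferred from the embodiment description rather than quoted from the formula image, and the class name is illustrative.

```python
# Minimal sketch of the multi-scale fusion unit under the stated assumptions:
# element-wise sum and product of F1, F2, F3, then two 1x1 convolution + ReLU
# stages over concatenated maps. The inputs to the second concatenation are
# partly inferred.
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv4 = nn.Conv2d(2 * channels, channels, kernel_size=1, stride=1)  # fourth convolution
        self.conv5 = nn.Conv2d(4 * channels, channels, kernel_size=1, stride=1)  # fifth convolution
        self.act = nn.ReLU()  # fourth and fifth activation functions (both ReLU)

    def forward(self, f1, f2, f3):
        m1 = f1 + f2 + f3                                              # first fusion feature map M_1
        m2 = f1 * f2 * f3                                              # second fusion feature map M_2
        m3 = self.act(self.conv4(torch.cat([m1, m2], dim=1)))          # third fusion feature map M_3
        s = self.act(self.conv5(torch.cat([m3, f1, f2, f3], dim=1)))   # side-branch feature map S
        return s
```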
Further, a hierarchical pooling layer and a hierarchical activation function are arranged in each of the first attention module, the second attention module and the third attention module, the hierarchical pooling layer is arranged at an upstream end of the hierarchical activation function, the hierarchical pooling layer is used for performing global maximum pooling operation on the feature map in the channel direction, and the hierarchical activation function is sigmoid;
the first, second, and third process profiles are matrices of hierarchical pooling outputs of the first, second, and third attention modules, respectively.
Furthermore, a branch pooling layer, a lead-in full-connection layer, a lead-in activation layer, a lead-out full-connection layer and a lead-out activation layer are sequentially arranged in the fourth attention module; the branch pooling layer is used for performing a global maximum pooling operation on the feature map in the spatial direction, the lead-in activation layer is the nonlinear activation function ReLU, and the lead-out activation layer is the nonlinear activation function sigmoid;
the fourth process feature map is the vector output after the operation of the branch pooling layer, and the fifth process feature map is the vector output after activation by the lead-out activation layer.
Further, the mathematical model of the fifth attention module is:
(the formula is reproduced in the original publication only as an image and is not shown here; the symbols it uses are defined as follows)
where P_1, P_2, P_3, P_4 and P_5 denote the first, second, third, fourth and fifth process feature maps input to the fifth attention module; Cat denotes the splicing (concatenation) operation applied to the feature maps within it; R_1 denotes the first internal reference feature map generated by splicing the first, second and third process feature maps; FC_b1 and FC_b2 denote the first and second bridging full-connection layers respectively; σ_b1, σ_b2 and σ_int denote the first bridging activation function, the second bridging activation function and the integration activation function respectively; R_2 and R_3 denote the second and third internal reference feature maps generated after activation by the first and second bridging activation functions respectively; ⊙ denotes the element-wise product; Pool_int denotes the integrated pooling layer, which is used for performing a global maximum pooling operation on the feature map in the channel direction; and A_int denotes the integrated attention map output by the fifth attention module.
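A minimal sketch of the fifth attention module under the definitions above follows. Because the formula itself is published only as an image, the routing shown here (the bridging full-connection layers acting on the fourth and fifth process feature maps, whose sigmoid outputs re-weight the three layers of the spliced process feature maps before channel-wise max pooling and a final sigmoid) is an inference from the symbol definitions and embodiment 1, not a verbatim reproduction.

```python
# Minimal sketch of the fifth attention module under stated assumptions.
import torch
import torch.nn as nn

class IntegratedAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bridge_fc1 = nn.Linear(channels, 3)   # first bridging full-connection layer (C in, 3 out)
        self.bridge_fc2 = nn.Linear(channels, 3)   # second bridging full-connection layer (C in, 3 out)

    def forward(self, p1, p2, p3, p4, p5):
        r1 = torch.cat([p1, p2, p3], dim=1)                          # first internal reference map, (B, 3, H, W)
        r2 = torch.sigmoid(self.bridge_fc1(p4)).view(-1, 3, 1, 1)    # second internal reference map from P_4
        r3 = torch.sigmoid(self.bridge_fc2(p5)).view(-1, 3, 1, 1)    # third internal reference map from P_5
        weighted = r1 * r2 * r3                                       # per-layer re-weighting of R_1
        pooled, _ = weighted.max(dim=1, keepdim=True)                 # integrated pooling over channels
        return torch.sigmoid(pooled)                                  # integrated attention map, (B, 1, H, W)
```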
Furthermore, the dimension-changing unit comprises a dimension-changing convolution layer, a dimension-changing activation layer and a dimension-changing pooling layer which are sequentially arranged, the convolution kernel size of the dimension-changing convolution layer is 3×3 and the step size is 1, the dimension-changing activation layer is a ReLU function, the dimension-changing pooling layer is used for performing a maximum pooling operation on the feature map, and the pooling window size of the dimension-changing pooling layer is 2×2 with a step size of 2.
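A minimal sketch of the dimension-changing unit as described follows; the class name is illustrative, and the doubling of the channel count by the dimension-changing convolution layer is taken from embodiment 1.

```python
# Minimal sketch of the dimension-changing unit: 3x3 stride-1 convolution that
# doubles the channel count, ReLU, then 2x2 stride-2 max pooling that halves
# the width and height.
import torch.nn as nn

class DimensionChange(nn.Sequential):
    def __init__(self, channels):
        super().__init__(
            nn.Conv2d(channels, 2 * channels, kernel_size=3, stride=1, padding=1),  # dimension-changing convolution layer
            nn.ReLU(),                                                              # dimension-changing activation layer
            nn.MaxPool2d(kernel_size=2, stride=2),                                  # dimension-changing pooling layer
        )
```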
Further, the feature shaping unit comprises a shaping pooling layer, a trunk full-connection layer and a shaping activation layer which are connected in sequence, the shaping pooling layer is used for performing global average pooling operation on the feature map in the spatial direction, and the shaping activation layer is a sigmoid function.
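A minimal sketch of the feature shaping unit as described follows; the class name and the embedding dimension parameter are illustrative, since the output size of the trunk full-connection layer is not specified here.

```python
# Minimal sketch of the feature shaping unit: global average pooling in the
# spatial direction, the trunk full-connection layer, then a sigmoid.
import torch
import torch.nn as nn

class FeatureShaping(nn.Module):
    def __init__(self, channels, embedding_dim):
        super().__init__()
        self.trunk_fc = nn.Linear(channels, embedding_dim)  # trunk full-connection layer

    def forward(self, x):
        pooled = x.mean(dim=(2, 3))                  # shaping pooling: global average over H and W
        return torch.sigmoid(self.trunk_fc(pooled))  # face feature vector
```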
The invention also provides a face feature extraction device, which comprises a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the face feature extraction method by loading the computer program.
The present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the face feature extraction method as described above.
The invention has the beneficial effects that:
(1) As the network deepens, the receptive field of the feature map gradually increases. By arranging a plurality of spatial attention mechanisms, the feature extraction network model of the invention modulates the feature information step by step and calibrates different spatial positions more accurately, so that, for image differences caused by different ages, the network can effectively suppress noise and extract the core feature information;
(2) Research shows that, after the multi-scale fusion unit is used to fuse the hierarchical information, a single channel attention mechanism improves network performance about as much as placing a channel attention mechanism at every level, while requiring less computation and keeping the network lighter. This is because, in the process of fusing the hierarchical information, the multi-scale fusion unit removes the interference contained in that information and thereby greatly improves the calibration efficiency of the channel attention mechanism;
(3) Network architectures based on a pure attention mechanism, such as the Transformer, prove that a large amount of effective information still resides inside the attention mechanism and that, when this information is fully used, complex feature mappings can be realized. Existing convolutional neural networks, by contrast, emphasize only the output end of the attention mechanism and lack information exchange between the interior of the attention mechanism and the other parts of the network, which limits the calibration effect of the attention mechanism and reduces the network's nonlinear fitting capability for complex scenes. In the present invention, the fifth attention module integrates and exploits the intermediate information of the other four attention modules, achieving front-to-back cooperative modulation between the fifth attention module and the other four and enhancing modulation consistency and overall performance. Test results show that the accuracy of the network on age-spanning face recognition improves markedly after the fifth attention module provided by the invention is adopted.
Drawings
Fig. 1 is an architecture diagram of the feature extraction network model of example 1;
Fig. 2 is an internal architecture diagram of a deep tone feature extraction monomer in example 1;
Fig. 3 is an internal architecture diagram of the first attention module in example 1;
Fig. 4 is an internal architecture diagram of the multi-scale fusion unit in example 1;
Fig. 5 is an internal architecture diagram of the fourth attention module in example 1;
Fig. 6 is an internal architecture diagram of the fifth attention module in example 1;
Fig. 7 is an internal architecture diagram of the deep tone feature extraction monomer in example 2;
in the drawings:
1-face image, 2-basic operation layer, 3-deep tone feature extraction monomer, 31-first attention module, 311-hierarchical pooling layer, 312-hierarchical activation function, 32-second attention module, 33-third attention module, 34-fourth attention module, 341-branch pooling layer, 342-lead-in full-connection layer, 343-lead-in activation layer, 344-lead-out full-connection layer, 345-lead-out activation layer, 35-fifth attention module, 36-multi-scale fusion unit, 37-dimension-changing unit, 371-dimension-changing convolution layer, 372-dimension-changing activation layer, 373-dimension-changing pooling layer, 4-shaping pooling layer, 5-trunk full-connection layer, 6-shaping activation layer.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
Example 1:
Fig. 1 shows the architecture of the feature extraction network model in this embodiment; the entire network model is implemented on a computer in Python with the PyTorch framework. After the face image 1 is input into the network model, a convolution operation is first performed by the basic operation layer 2 to generate the output basic feature map. The convolution kernel size of the basic operation layer 2 is 3×3 with a step size of 1. There are 5 deep tone feature extraction monomers 3, connected end to end in sequence; the feature information passes through each deep tone feature extraction monomer 3 in turn, and each time a feature map passes through one monomer its width and height are halved and its number of channels is doubled. The feature shaping unit comprises a shaping pooling layer 4, a trunk full-connection layer 5 and a shaping activation layer 6 connected in sequence; the shaping pooling layer 4 is used for performing a global average pooling operation on the feature map in the spatial direction, and the shaping activation layer is a sigmoid function. Let the size of the face image 1 be W × H × 3 (width × height × channels, the same below); the feature map output by each module then has the following size:
TABLE 1 feature extraction network model output feature map size for each module
(the table is reproduced in the original publication as an image and is not shown here)
Fig. 2 shows the internal architecture of a deep tone feature extraction monomer 3 in this embodiment. The precursor feature map X_in input to a given deep tone feature extraction monomer 3 has size K × G × C, and the first, second, third, fourth and fifth feature maps and the side-branch feature map all have size K × G × C. The dimension-changing unit 37 comprises a dimension-changing convolution layer 371, a dimension-changing activation layer 372 and a dimension-changing pooling layer 373 arranged in sequence; the feature maps output by the dimension-changing convolution layer 371 and the dimension-changing activation layer 372 have size K × G × 2C, and the successor feature map X_out output by the dimension-changing pooling layer 373 has size K/2 × G/2 × 2C.
The first attention module 31, the second attention module 32 and the third attention module 33 share the same internal operation process. As shown in Fig. 3, each is provided with a hierarchical pooling layer 311 and a hierarchical activation function 312 connected in sequence; the hierarchical pooling layer 311 is used for performing a global maximum pooling operation on the feature map in the channel direction, the hierarchical activation function 312 is a sigmoid function, and the outputs of the hierarchical pooling layer 311 and the hierarchical activation function 312 are matrices of size K × G × 1. The first process feature map P_1, the second process feature map P_2 and the third process feature map P_3, which are the outputs of the hierarchical pooling layer 311 in the first attention module 31, the second attention module 32 and the third attention module 33 respectively, are therefore also all of size K × G × 1.
Fig. 4 shows the internal architecture of the multi-scale fusion unit 36 in this embodiment. The feature maps F_1, F_2 and F_3 are first fused preliminarily by element-wise addition and element-wise multiplication; the resulting first fusion feature map M_1 and second fusion feature map M_2 both have size K × G × C. Then, through splicing, the fourth convolution operation and activation by the fourth activation function σ_4, the third fusion feature map M_3 (of size K × G × C) is obtained as the second stage of fusion. Finally, through splicing, the fifth convolution operation and activation by the fifth activation function σ_5, M_3 is fused with F_1, F_2 and F_3 to generate the side-branch feature map S. By fusing the feature maps F_1, F_2 and F_3 in this progressive, multi-pass and multi-dimensional way, the multi-scale fusion unit 36 achieves high efficiency and a fine denoising capability.
Fig. 5 shows the internal architecture of the fourth attention module 34 in this embodiment, which is provided with a branch pooling layer 341, a lead-in full-connection layer 342, a lead-in activation layer 343, a lead-out full-connection layer 344 and a lead-out activation layer 345 connected in sequence. The branch pooling layer 341 is used for performing a global maximum pooling operation on the feature map in the spatial direction, and the fourth process feature map P_4, i.e. the vector output after the operation of the branch pooling layer 341, has size 1 × C. The lead-in full-connection layer 342 has C input elements and C/8 output elements, the lead-in activation layer 343 is the nonlinear activation function ReLU, the lead-out full-connection layer 344 has C/8 input elements and C output elements, and the lead-out activation layer 345 is the nonlinear activation function sigmoid. The fifth process feature map P_5 is the vector of size 1 × C output after activation by the lead-out activation layer 345.
Fig. 6 shows the internal architecture of the fifth attention module 35 in this embodiment. The first internal reference feature map R_1, obtained by splicing the first process feature map, the second process feature map and the third process feature map, has size K × G × 3. The first bridging full-connection layer and the second bridging full-connection layer each have C input elements and 3 output elements; the first bridging activation function, the second bridging activation function and the integration activation function are all sigmoid functions, and the generated second internal reference feature map R_2 and third internal reference feature map R_3 both have size 1 × 3. Through the element-wise product operation, R_2 and R_3 assign weight parameters of different sizes to each layer of R_1. The integrated pooling layer is used for performing a global maximum pooling operation on the feature map in the channel direction, and the integration activation layer outputs an integrated attention map of size K × G × 1. After the element-wise product operation between the integrated attention map and the fifth feature map F_5, the integrated attention map assigns different weight parameters to different spatial positions of F_5, thereby realizing the modulation.
In practical application, the face image 1 is input into the trained feature extraction network model and the corresponding face feature vector is obtained through feature extraction; the distance (the L1 distance in this embodiment) between this feature vector and the vectors in a preset sample library is then calculated, and the identity corresponding to the sample vector that is closest to the face feature vector, provided that the distance is smaller than a preset threshold, is taken as the identity of the face image 1, completing face recognition. In this embodiment the data set VGGFace2 is used as the training set to train the network model, and the loss function is a triplet loss. The model is then tested with the commonly used age-spanning face recognition test data set CPLFW as the test set; the test results show that the feature extraction network model of this embodiment reaches a recognition accuracy of 94.71% on CPLFW, whereas among existing advanced algorithms VGGFace2 achieves 84.00% and ArcFace achieves 88.36%, both lower than this embodiment.
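The identification step described above can be illustrated with the following sketch; the function and variable names and the threshold value are assumptions, and only the use of the L1 distance and a preset threshold follows the text.

```python
# Illustrative sketch of the identification step: compare the extracted face
# feature vector with a gallery of enrolled vectors by L1 distance and accept
# the closest match only if it is below a preset threshold.
import torch

def identify(face_vector, gallery_vectors, gallery_ids, threshold=0.5):
    # gallery_vectors: (N, D) tensor of enrolled feature vectors
    distances = (gallery_vectors - face_vector).abs().sum(dim=1)  # L1 distance to each sample
    best = torch.argmin(distances)
    if distances[best] < threshold:
        return gallery_ids[best]   # identity of the closest enrolled sample
    return None                    # no match within the threshold
```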
Example 2:
in the embodiment, only the internal structure of the deep tone feature extraction unit 3 is modified on the basis of the embodiment 1, and other parts of the network model are kept unchanged. Fig. 7 shows an internal architecture diagram of the deep tone feature extraction cell 3 in example 2, and the fifth attention module 35 was removed for comparative experiments compared to example 1. After the same training and testing, the result shows that the recognition accuracy of the network model in example 2 on CPLFW is 89.24%, which is lower than that in example 1, and it fully illustrates that the fifth attention module 35 in the present invention has an important promoting role in the network model.
The above embodiments only express specific embodiments of the present invention, and the description is specific and detailed, but not to be understood as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.

Claims (10)

1. A face feature extraction method, characterized by comprising the following steps:
s100, obtaining a face image, and obtaining a trained feature extraction network model; the characteristic extraction network model is sequentially provided with a basic operation layer, a plurality of deep tone characteristic extraction monomers and a characteristic shaping unit, wherein the plurality of deep tone characteristic extraction monomers are sequentially connected in series;
s200, inputting the face image into the feature extraction network model, extracting basic feature information of the face image by using the basic operation layer, and then generating a basic feature map;
s300, inputting the basic feature map into a first deep tone feature extraction monomer, and outputting a primary feature map by the first deep tone feature extraction monomer after feature extraction operation;
s400, the next deep tone feature extraction monomer takes the feature graph output by the upstream deep tone feature extraction monomer as input, then carries out feature extraction operation, generates and outputs a corresponding intermediate feature graph;
s500, continuously repeating the step S400 until the last deep tone feature extraction monomer generates and outputs a final-stage feature map;
s600, inputting the final stage feature map into the feature shaping unit, and generating a face feature vector after shaping operation is carried out on the final stage feature map to finish face feature extraction;
the calculation operation process inside the deep tone feature extraction monomer is represented as the following mathematical model:
(the formula is reproduced in the original publication only as an image and is not shown here; the symbols it uses are defined as follows)
wherein X_in denotes the precursor feature map input to said deep tone feature extraction monomer; Conv_1, Conv_2 and Conv_3 denote the first, second and third convolution operations respectively; A_1, A_2, A_3, A_4 and A_5 denote the first, second, third, fourth and fifth attention modules respectively; ⊙ denotes the element-wise product; σ_1, σ_2 and σ_3 denote the first, second and third activation functions respectively; F_1, F_2 and F_3 denote the first, second and third feature maps generated after activation by σ_1, σ_2 and σ_3 respectively; F_4 denotes the fourth feature map obtained by calibrating the third feature map with the third attention module; MSF denotes the multi-scale fusion unit; S denotes the side-branch feature map generated by the multi-scale fusion unit fusing the first, second and third feature maps; F_5 denotes the fifth feature map obtained by adding the side-branch feature map, after calibration by the fourth attention module, to the fourth feature map; P_1 denotes the first process feature map output from the first attention module; P_2 denotes the second process feature map output from the second attention module; P_3 denotes the third process feature map output from the third attention module; P_4 and P_5 denote the fourth and fifth process feature maps, both output from said fourth attention module; V denotes the dimension-changing unit, which increases the number of feature map channels and reduces the feature map width and height; and X_out denotes the successor feature map output by said deep tone feature extraction monomer.
2. The face feature extraction method of claim 1, wherein: the convolution kernel sizes of the first convolution operation, the second convolution operation and the third convolution operation are all 3×3, and the step sizes are all 1; the first activation function, the second activation function, and the third activation function are all ReLU functions.
3. The face feature extraction method of claim 1, wherein: the internal operation process of the multi-scale fusion unit is expressed as the following mathematical model:
(the formula is reproduced in the original publication only as an image and is not shown here; the symbols it uses are defined as follows)
wherein F_1, F_2 and F_3 denote the first, second and third feature maps, which the multi-scale fusion unit takes as input; S denotes the side-branch feature map output by the multi-scale fusion unit; ⊙ denotes the element-wise product; M_1 denotes the first fusion feature map generated by adding the first, second and third feature maps; M_2 denotes the second fusion feature map generated by element-wise multiplication of the first, second and third feature maps; Cat denotes the splicing (concatenation) of feature maps; Conv_4 and Conv_5 denote the fourth and fifth convolution operations respectively, whose convolution kernel sizes are both 1×1 with step size 1; σ_4 and σ_5 denote the fourth and fifth activation functions respectively, both of which are ReLU functions; and M_3 denotes the third fusion feature map generated after activation by the fourth activation function.
4. The face feature extraction method of claim 1, wherein: the first attention module, the second attention module and the third attention module are respectively provided with a hierarchical pooling layer and a hierarchical activation function, the hierarchical pooling layer is arranged at the upstream end of the hierarchical activation function and is used for performing global maximum pooling operation on the feature map in the channel direction, and the hierarchical activation function is sigmoid;
the first, second, and third process profiles are matrices of hierarchical pooling outputs of the first, second, and third attention modules, respectively.
5. The face feature extraction method of claim 4, wherein: a branch pooling layer, a lead-in full-connection layer, a lead-in activation layer, a lead-out full-connection layer and a lead-out activation layer are sequentially arranged in the fourth attention module; the branch pooling layer is used for performing global maximum pooling operation on the feature map in the space direction, the lead-in activation layer is a nonlinear activation function ReLU, and the lead-out activation layer is a nonlinear activation function sigmoid;
the fourth process characteristic diagram is a vector output after the operation of the branch pooling layer, and the fifth process characteristic diagram is a vector output after the activation of the lead-out activation layer.
6. The face feature extraction method of claim 5, wherein: the mathematical model of the fifth attention module is:
(the formula is reproduced in the original publication only as an image and is not shown here; the symbols it uses are defined as follows)
wherein P_1, P_2, P_3, P_4 and P_5 denote the first, second, third, fourth and fifth process feature maps input to the fifth attention module; Cat denotes the splicing (concatenation) operation applied to the feature maps within it; R_1 denotes the first internal reference feature map generated by splicing the first, second and third process feature maps; FC_b1 and FC_b2 denote the first and second bridging full-connection layers respectively; σ_b1, σ_b2 and σ_int denote the first bridging activation function, the second bridging activation function and the integration activation function respectively; R_2 and R_3 denote the second and third internal reference feature maps generated after activation by the first and second bridging activation functions respectively; ⊙ denotes the element-wise product; Pool_int denotes the integrated pooling layer, which is used for performing a global maximum pooling operation on the feature map in the channel direction; and A_int denotes the integrated attention map output by the fifth attention module.
7. The face feature extraction method of claim 1, characterized in that: the dimension-changing unit comprises a dimension-changing convolution layer, a dimension-changing activation layer and a dimension-changing pooling layer which are sequentially arranged, the convolution kernel size of the dimension-changing convolution layer is 3×3 and the step size is 1, the dimension-changing activation layer is a ReLU function, the dimension-changing pooling layer is used for performing a maximum pooling operation on the feature map, and the pooling window size of the dimension-changing pooling layer is 2×2 with a step size of 2.
8. The face feature extraction method of claim 1, wherein: the feature shaping unit comprises a shaping pooling layer, a trunk full-connection layer and a shaping activation layer which are sequentially connected, wherein the shaping pooling layer is used for performing global average pooling operation on the feature map in the space direction, and the shaping activation layer is a sigmoid function.
9. A face feature extraction device comprising a processor and a memory, the memory storing a computer program, characterized in that: the processor is configured to execute the face feature extraction method according to any one of claims 1 to 8 by loading the computer program.
10. A storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the face feature extraction method according to any one of claims 1 to 8.
CN202211658800.3A 2022-12-23 2022-12-23 Face feature extraction method, device and storage medium Active CN115661911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211658800.3A CN115661911B (en) 2022-12-23 2022-12-23 Face feature extraction method, device and storage medium

Publications (2)

Publication Number Publication Date
CN115661911A true CN115661911A (en) 2023-01-31
CN115661911B CN115661911B (en) 2023-03-17

Family

ID=85023076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211658800.3A Active CN115661911B (en) 2022-12-23 2022-12-23 Face feature extraction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115661911B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3690721A1 (en) * 2019-01-31 2020-08-05 StradVision, Inc. Method for recognizing face using multiple patch combination based on deep neural network
WO2021027555A1 (en) * 2019-08-15 2021-02-18 华为技术有限公司 Face retrieval method and apparatus
CN115496651A (en) * 2021-06-02 2022-12-20 武汉Tcl集团工业研究院有限公司 Feature processing method and device, computer-readable storage medium and electronic equipment
CN114120406A (en) * 2021-11-22 2022-03-01 四川轻化工大学 Face feature extraction and classification method based on convolutional neural network
CN114187261A (en) * 2021-12-07 2022-03-15 天津大学 Non-reference stereo image quality evaluation method based on multi-dimensional attention mechanism
CN114360030A (en) * 2022-01-17 2022-04-15 重庆锐云科技有限公司 Face recognition method based on convolutional neural network
CN114743014A (en) * 2022-03-28 2022-07-12 西安电子科技大学 Laser point cloud feature extraction method and device based on multi-head self-attention
CN114998958A (en) * 2022-05-11 2022-09-02 华南理工大学 Face recognition method based on lightweight convolutional neural network
CN115100720A (en) * 2022-07-04 2022-09-23 威海职业学院(威海市技术学院) Low-resolution face recognition method
CN115223221A (en) * 2022-07-04 2022-10-21 网易(杭州)网络有限公司 Face detection method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙劲光 (Sun Jinguang); 荣文钊 (Rong Wenzhao): "基于区域的年龄估计模型研究" (Research on a region-based age estimation model) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984949A (en) * 2023-03-21 2023-04-18 威海职业学院(威海市技术学院) Low-quality face image recognition method and device with attention mechanism
CN116311479A (en) * 2023-05-16 2023-06-23 四川轻化工大学 Face recognition method, system and storage medium for unlocking automobile
CN116311479B (en) * 2023-05-16 2023-07-21 四川轻化工大学 Face recognition method, system and storage medium for unlocking automobile

Also Published As

Publication number Publication date
CN115661911B (en) 2023-03-17

Similar Documents

Publication Publication Date Title
CN110245665B (en) Image semantic segmentation method based on attention mechanism
CN108764471B (en) Neural network cross-layer pruning method based on feature redundancy analysis
CN115661911A (en) Face feature extraction method, device and storage medium
CN112257794B (en) YOLO-based lightweight target detection method
CN110188768B (en) Real-time image semantic segmentation method and system
CN113537138B (en) Traffic sign identification method based on lightweight neural network
CN110096968B (en) Ultra-high-speed static gesture recognition method based on depth model optimization
CN109783910B (en) Structure optimization design method for accelerating by using generation countermeasure network
CN111626300A (en) Image semantic segmentation model and modeling method based on context perception
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN112580515B (en) Lightweight face key point detection method based on Gaussian heat map regression
CN110059593B (en) Facial expression recognition method based on feedback convolutional neural network
CN116645716B (en) Expression recognition method based on local features and global features
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN112270300A (en) Method for converting human face sketch image into RGB image based on generating type confrontation network
CN112991493A (en) Gray level image coloring method based on VAE-GAN and mixed density network
CN111079767A (en) Neural network model for segmenting image and image segmentation method thereof
CN114742985A (en) Hyperspectral feature extraction method and device and storage medium
CN110414516B (en) Single Chinese character recognition method based on deep learning
CN115240259A (en) Face detection method and face detection system based on YOLO deep network in classroom environment
CN116051534A (en) Warehouse ceiling solar panel defect detection method based on artificial intelligence
CN111414988A (en) Remote sensing image super-resolution method based on multi-scale feature self-adaptive fusion network
CN112837212B (en) Image arbitrary style migration method based on manifold alignment
CN115587628A (en) Deep convolutional neural network lightweight method
CN116152128A (en) High dynamic range multi-exposure image fusion model and method based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant