CN115984949B

CN115984949B - Low-quality face image recognition method and equipment with attention mechanism

Info

Publication number: CN115984949B
Application number: CN202310272773.4A
Authority: CN
Inventors: 梁海丽
Original assignee: Weihai Vocational College
Current assignee: Weihai Vocational College
Priority date: 2023-03-21
Filing date: 2023-03-21
Publication date: 2023-07-04
Anticipated expiration: 2043-03-21
Also published as: CN115984949A

Abstract

The invention discloses a low-quality face image recognition method and equipment with an attention mechanism, and relates to the technical field of artificial neural networks and face recognition. The characteristic information extraction layer adopts convolution operation with different convolution kernel sizes and various characteristic fusion modes, and is matched with a plurality of stride convolution layers and pooling layers, so that the effect of enhancing the image characteristic extraction effect is achieved. Tests show that the network model provided by the invention has a good recognition effect on low-quality face images, and compared with the prior art, the network model has obvious progress.

Description

Low-quality face image recognition method and equipment with attention mechanism

Technical Field

The invention belongs to the technical field of artificial neural networks and face recognition, and particularly relates to a low-quality face image recognition method and equipment with an attention mechanism.

Background

In the past, face recognition is an important research direction in the field of image technology, and with the development of deep learning technology, many models based on neural networks can obtain more than 99% of recognition accuracy on a high-quality face image dataset. However, low quality face image recognition is still a problem with great difficulty, and especially for low resolution face images, the existing algorithm is still not mature enough, and the recognition accuracy is still to be further improved.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a low-quality face image recognition method and equipment with an attention mechanism so as to improve the recognition accuracy of the low-quality face image.

In order to achieve the above object, the present invention adopts the following solutions: a method for recognizing a low quality face image with an attention mechanism, comprising the steps of:

s10, acquiring an unknown face image, and acquiring a trained face image recognition network model;

the face image recognition network model comprises a trunk and a global pooling processing layer, wherein the global pooling processing layer is connected to the tail of the trunk, the trunk comprises a plurality of characteristic information extraction layers connected in sequence, and the mathematical model of the characteristic information extraction layers is as follows:

wherein->

Input representing the feature information extraction layer, < >>

Representing a pre-residual module; />

A convolution operation with a convolution kernel of 3*3 and a step size of 1 is represented by +.>

A convolution operation with a convolution kernel of 5*5 and a step size of 1 is represented by +.>

A convolution operation with a convolution kernel size of 3*3 and a step size of 2 is represented by +.>

A convolution operation with a convolution kernel size of 5*5 and a step size of 2 is represented by +.>

And->

All represent maximum pooling operation with pooling window size 2 x 2 and step size 2,/->

A convolution operation with a convolution kernel of 1*1 and a step size of 1 is represented by +.>

Representing element-corresponding product operation,/->

Representing stitching of the characteristic diagrams therein, +.>

、/>

、/>

、/>

、/>

All represent activation functions ReLU, ">

A characteristic diagram representing the output of the pre-residual module,>

representation->

Feature map generated after activation, ++>

Representation->

Feature map generated after activation, ++>

Representation->

And->

Feature map generated after addition, ++>

Representation->

And->

Feature map generated after element corresponding product is made, < ->

Representation->

Feature map generated after activation, ++>

Representation->

Feature map generated after activation, ++>

Representation of

Feature map generated after pooling operation, +.>

Representation->

Feature map generated after pooling operation, +.>

Representing a compound attention calibration unit,/->

Representing the attention force diagram generated and output by the compound attention calibration unit, < >>

A feature map indicating the output of the feature information extraction layer;

s20, inputting the unknown face image into the face image recognition network model, and sequentially carrying out feature extraction operation on the image information by each feature information extraction layer along with the transmission of the image information along the trunk until the last feature information extraction layer outputs an abstract feature image;

s30, inputting the abstract feature map into the global pooling processing layer, performing global pooling operation on each layer of the abstract feature map by using the global pooling processing layer, and outputting to obtain a face feature vector;

s40, calculating the distance between the face feature vector and all target feature vectors in the search library, wherein the identity corresponding to the target feature vector which is closest to the face feature vector and meets the threshold condition is the identity of the unknown face image.

Further, the global pooling processing layer is a global average pooling layer.

Further, the mathematical model of the compound attention calibration unit is:

wherein the compound attention calibration unit is characterized by +.>

、/>

、

And->

As input; />

And->

All represent global maximum pooling operations on feature graphs,>

is to operate the feature map in the spatial direction, +.>

Is to operate the feature map in the channel direction, +.>

The representation splices the characteristic diagrams; />

A convolution operation with a convolution kernel size of 1*1 and a step length of 1 is represented; />

Representing element-corresponding product operation,/->

And->

All represent sigmoid activation functions, +.>

Representation of

Feature map generated after convolution operation, +.>

Representation->

Feature map generated after function activation, +.>

Representing the characteristic diagram

、/>

、/>

And->

Respectively carrying out global maximum pooling on the channel directions and splicing to obtain a characteristic diagram, namely ++>

Representing an attention force map output by the compound attention calibration unit.

Further, the mathematical model of the pre-residual module is:

wherein->

Input representing the pre-residual module, < >>

And->

All represent activation functions ReLU, ">

And->

All represent convolution operations with convolution kernel size 3*3 and step size 1, +.>

Representation->

Activating the generated feature map, < >>

Representing the characteristic diagram output by the prepositive residual error module.

The invention also provides a low-quality face image recognition device with an attention mechanism, comprising a processor and a memory, the memory storing a computer program, the processor being arranged to perform the low-quality face image recognition method with an attention mechanism as described above by loading the computer program.

The beneficial effects of the invention are as follows:

(1) The prior art shows that for clear high-resolution face images, required characteristic information can be fully extracted from the images through a plurality of simply overlapped convolution layers, and for low-quality (such as low-resolution) face images, useful face image information in the face images is very limited, and the information is often mixed in a large amount of interference signals, so that the conventional artificial neural network is difficult to well cope with the low-quality image input, and the data fitting effect is poor; in order to improve the utilization rate of the original input image, the invention not only arranges convolution layers with different convolution kernel sizes in each characteristic information extraction layer

And->

) But also to apply +.>

And->

The feature images output by the two convolutions are fused (the products of matrix addition and element correspondence), so that the feature information extraction layer can perform feature extraction operations from different angles, and the feature extraction operations from different angles are mutually complemented and verified, thereby achieving the effect of strengthening the image feature extraction effect;

(2) In the conventional face recognition neural network, the feature extraction units and the pooling layers are alternately arranged, and as the whole network is of a serial structure, only one pooling layer exists between the two feature extraction units, so that the information pooling mode is single, and the feature extraction effect of the whole network on low-quality images is limited; the invention simultaneously sets a plurality of stride convolutions

And->

) And pooling layer ()>

And->

) The feature graphs of different branches are operated respectively, so that the length and width of the feature graphs are reduced, and meanwhile, the information loss caused by a single pooling mode is well relieved;

(3) Feature map generated by multiple branches

、/>

、/>

And->

) In the parallel input compound attention calibration unit, compared with the conventional single feature map as input, the compound attention calibration unit can learn the importance of different information from different angles, and finally obtain an attention map (>

) The information calibration effect is better.

Drawings

FIG. 1 is a schematic diagram of a face image recognition network model according to an embodiment;

FIG. 2 is a schematic structural diagram of a feature information extraction layer in the face image recognition network model shown in FIG. 1;

FIG. 3 is a schematic structural diagram of a composite attention calibration unit in the feature information extraction layer shown in FIG. 2;

fig. 4 is a schematic structural view of a feature information extraction layer in the comparative example;

in the accompanying drawings: the face recognition system comprises a 1-unknown face image, a 2-feature information extraction layer, a 21-preposed residual error module, a 3-global pooling processing layer, a 4-composite attention calibration unit and a 5-face feature vector.

Detailed Description

Examples:

as shown in the drawings, fig. 1, 2 and 3 are a schematic view of a face image recognition network model structure, a schematic view of a feature information extraction layer 2, and a schematic view of a composite attention calibration unit 4 of the present embodiment, respectively. The global pooling processing layer 3 is implemented by adopting a global average pooling layer, four feature information extraction layers 2 are arranged in a backbone of a network, and the feature information extraction layers 2 can be expressed as the following mathematical model:

wherein->

Input representing the feature information extraction layer, < >>

Representing a pre-residual module; />

And->

Representing element-corresponding product operation,/->

Representing stitching of the characteristic diagrams therein, +.>

、/>

、/>

、/>

、/>

All represent activation functions ReLU, ">

A characteristic diagram representing the output of the pre-residual module,>

representation->

Feature map generated after activation, ++>

Representation->

Activation ofPost-generated feature map, < >>

Representation->

And->

Feature map generated after addition, ++>

Representation->

And->

Feature map generated after element corresponding product is made, < ->

Representation->

Feature map generated after activation, ++>

Representation->

Feature map generated after activation, ++>

Representation of

Feature map generated after pooling operation, +.>

Representation->

Feature map generated after pooling operation, +.>

Representing a compound attention calibration unit,/->

And the characteristic diagram is output by the characteristic information extraction layer.

For the first characteristic information extraction layer 2, the input is an unknown face image 1 (the number of channels is 3), and the unknown face image is subjected to a first convolution operation in a pre-residual error module 21

) Then, a feature map (the length and width dimensions are equal to those of the unknown face image 1) with the number of channels being 48 is output, and a second convolution operation is performed>

) The size of the feature map is kept unchanged from front to back. For the latter three feature information extraction layers 2, two convolution operations (++) in the pre-residual block 21>

And->

) The length and width dimensions and the number of channels of the feature map remain unchanged.

In all of the feature information extraction layers 2,

and->

The length and width dimensions and the channel number of the feature map are kept unchanged before and after the two convolution operations, so that the feature map is +.>

And (4) feature map>

Is of the same size (B)>

And->

Added to get->

，/>

And->

The corresponding product of the elements is->

. By means of->

、/>

、/>

And->

Filling the appropriate padding value such that +.>

、/>

、/>

And->

After the operation, the obtained characteristic diagram->

、/>

、/>

And

the length and width dimensions are (in the same characteristic information extraction layer 2)/(>

Half of the feature map, ++>

、/>

、/>

And->

The number of channels is equal to (in the same feature information extraction layer 2)>

The feature maps are equal. Then by stitching and->

Convolution operation, feature map->

、/>

、/>

And->

Fusion, finally output profile->

The number of channels is (in the same feature information extraction layer 2) feature map +.>

2 times the number of channels, characteristic diagram->

The length and width dimensions of (a) are (in the same feature information extraction layer 2) feature map ∈ ->

Half the length and width dimensions.

For the compound attention calibration unit 4,

、/>

、/>

and->

After input, on one hand, the method comprises the steps of splicing and

convolving to obtain a characteristic diagram of 4 channels +.>

The method comprises the steps of carrying out a first treatment on the surface of the Then a global max pooling operation is performed in the spatial direction,/->

After activation of the function, a vector of length 4 is obtained>

. On the other hand, separately pair->

、/>

、/>

And->

Performing global maximum pooling operation in the channel direction, and then splicing the obtained 4 two-dimensional matrixes to obtain a characteristic diagram with 4 channels>

. The vector is then +.>

And->

Performing element-corresponding product operation by +.>

For->

Modulating each layer of the pattern; finally use->

Convolution compresses the number of channels to 1, via +.>

Function activation, generating attention force diagram->

。/>

Is a two-dimensional matrix with length and width dimensions and input +.>

Characteristic map is equal, +.>

And->

And (3) performing element corresponding product on the feature map obtained after function activation, and calibrating feature information at different positions in the space direction of the feature map. The compound attention calibration unit 4 uses the vector +.>

For->

Is modulated by the individual layers of (a) so that not only +.>

、/>

、/>

And->

The characteristic relations of different positions in the space direction are utilized, and the characteristic relations in the channel direction are utilized, so that the utilization rate of the composite attention calibration unit 4 on the input information is high, and the characteristic diagram can be more fully mined and learned>

、/>

、/>

And->

The relative relation of the different position information and the feature map can realize more effective calibration after element corresponding product operation.

After the image information is transmitted by the network backbone, the number of channels of the abstract feature map output by the last feature information extraction layer 2 is 768, and after global average pooling processing is carried out on each layer of the abstract feature map, a face feature vector 5 with the length of 768 is generated. The length of all target feature vectors in the search library is 768, and the corresponding target feature vectors are obtained by inputting the high-definition face images into the trained face image recognition network model. In this embodiment, the calculated euclidean distance between the face feature vector 5 and all the target feature vectors in the search library is the identity corresponding to the target feature vector which has the closest distance to the face feature vector 5 and the euclidean distance smaller than the threshold value, namely the identity of the unknown face image 1. If the Euclidean distance between the face feature vector 5 and all the target feature vectors in the search library is greater than or equal to the threshold value, the unknown face image 1 is judged not to be in the search library.

In the embodiment, the SCface data set is adopted to train and test the network model, and the SCface data set comprises high-definition face images and low-resolution face images of 130 persons. The high-definition face image of each person is only 1, the low-resolution face image is 15, and the 15 images are obtained by arranging 5 cameras at three different distances (1 meter, 2.6 meters and 4.2 meters). In the specific implementation, 3 images are randomly extracted from 5 images shot from each distance to form a training set containing 1170 images together, and the rest 780 low-resolution face images are used as a test set. In the training process, a ternary loss function is adopted to optimize the network model. After training, inputting the high-definition images of 130 persons into a network model, forming the target feature vectors in a search library by the output feature vectors, and then inputting the images in the test set into the trained network model for testing. In contrast, the present embodiment also trains the present advanced low-resolution face recognition model MIND-Net using the same training set, and tests on the same test set, and the comparison results are shown in Table 1.

Table 1 examples and bond-Net model test comparative results on test set

By comparing the final recognition accuracy, the face image recognition accuracy of the invention is higher than MIND-Net for face images with different resolutions, and particularly for images with lower resolution (4.2 m distance shooting), the face image recognition method and device can achieve great improvement.

Comparative example:

this comparative example is to more fully explain the role of the compound attention calibration unit 4 proposed by the present invention in the entire model. In this comparative example, the composite attention calibration unit 4 in the embodiment is removed, and the CBAM module is used instead to calibrate the feature map, and the new feature extraction layer 2 structure is shown in fig. 4. The rest of the network model remained unchanged, the training and testing procedure was also completely consistent with the examples, and the modified network model test results are shown in table 2.

Table 2 comparative examples test results on test sets

As can be seen from comparing the data of the above two tables, the feature information in the original input image is rich and the composite attention calibration unit 4 is limited in terms of performance improvement at a relatively high resolution. But in the case of relatively low resolution, it becomes particularly important to make full use of the originally not much high-value information, and the effect of the composite attention calibration unit 4 on improving the network performance is very obvious.

The foregoing examples merely illustrate specific embodiments of the invention, which are described in greater detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention.

Claims

1. A low-quality face image recognition method with an attention mechanism is characterized by comprising the following steps: the method comprises the following steps:

wherein->

Input representing the feature information extraction layer, < >>

Representing a pre-residual module; />

And->

Representing element-corresponding product operation,/->

Representing stitching of the characteristic diagrams therein, +.>

、/>

、/>

、/>

、/>

All represent activation functions ReLU, ">

A characteristic diagram representing the output of the pre-residual module,>

representation->

Feature map generated after activation, ++>

Representation->

Feature map generated after activation, ++>

Representation->

And->

Feature map generated after addition, ++>

Representation->

And->

Feature map generated after element corresponding product is made, < ->

Representation->

Feature map generated after activation, ++>

Representation->

Feature map generated after activation, ++>

Representation->

Feature map generated after pooling operation, +.>

Representation->

Feature map generated after pooling operation, +.>

Representing a compound attention calibration unit,/->

s40, calculating the distance between the face feature vector and all target feature vectors in a search library, wherein the identity corresponding to the target feature vector which is closest to the face feature vector and meets the threshold condition is the identity of the unknown face image;

the mathematical model of the composite attention calibration unit is:

wherein the compound attention calibration unit is characterized by +.>

、/>

、/>

And->

As input; />

And->

All represent global maximum pooling operations on feature graphs,>

is to operate the feature map in the spatial direction, +.>

Is to operate the feature map in the channel direction, +.>

The representation splices the characteristic diagrams; />

Representing element-corresponding product operation,/->

And->

All represent sigmoid activation functions, +.>

Representation->

Feature map generated after convolution operation, +.>

Representation->

Feature map generated after function activation, +.>

Representing the characteristic diagram->

、

、/>

And->

2. The method for recognizing a low-quality face image with an attention mechanism according to claim 1, characterized in that: the global pooling processing layer is a global average pooling layer.

3. The method for recognizing a low-quality face image with an attention mechanism according to claim 1, characterized in that: the mathematical model of the prepositive residual error module is as follows:

wherein->

Input representing the pre-residual module, < >>

And->

All represent activation functions ReLU, ">

And->

Representation->

Activating the generated feature map, < >>

4. A low quality face image recognition device with an attention mechanism, characterized by: comprising a processor and a memory, said memory storing a computer program, said processor being adapted to perform the method of low quality facial image recognition with attention mechanism of any of claims 1-3 by loading said computer program.