CN115984949A

CN115984949A - Low-quality face image recognition method and device with attention mechanism

Info

Publication number: CN115984949A
Application number: CN202310272773.4A
Authority: CN
Inventors: 梁海丽
Original assignee: Weihai Vocational College
Current assignee: Weihai Vocational College
Priority date: 2023-03-21
Filing date: 2023-03-21
Publication date: 2023-04-18
Anticipated expiration: 2043-03-21
Also published as: CN115984949B

Abstract

The invention discloses a low-quality face image recognition method and equipment with an attention mechanism, and relates to the technical field of artificial neural networks and face recognition. The feature information extraction layer of the invention adopts convolution operation with different convolution kernel sizes and a plurality of feature fusion modes, and is matched with a plurality of step convolution and pooling layers to achieve the effect of strengthening the image feature extraction. Tests show that the network model provided by the invention has a good recognition effect on low-quality face images, and compared with the prior art, the network model makes an obvious progress.

Description

Low-quality face image recognition method and device with attention mechanism

Technical Field

The invention belongs to the technical field of artificial neural networks and face recognition, and particularly relates to a low-quality face image recognition method and device with an attention mechanism.

Background

Since face recognition is an important research direction in the field of image technology, many neural network-based models have been able to achieve over 99% recognition accuracy on high-quality face image data sets with the development of deep learning technology. However, the low-quality face image recognition is still a problem with great difficulty, especially for the low-resolution face image, the existing algorithm is not mature enough, and the recognition accuracy rate is to be further improved.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a low-quality face image recognition method and equipment with an attention mechanism so as to improve the recognition accuracy of the low-quality face image.

In order to achieve the purpose, the invention adopts the following solution: a low-quality face image recognition method with an attention mechanism comprises the following steps:

s10, acquiring an unknown face image, and acquiring a trained face image recognition network model;

the human face image recognition network model comprises a trunk and a global pooling processing layer, the global pooling processing layer is connected to the tail of the trunk, the trunk comprises a plurality of feature information extraction layers which are connected in sequence, and the mathematical model of the feature information extraction layers is as follows:

wherein is present>

Represents an input of the characteristic information extraction layer, is selected>

Representing a pre-residual module; />

Represents a convolution operation with a convolution kernel size of 3 x 3 and a step size of 1, and->

Represents a convolution operation with a convolution kernel size of 5 x 5 and a stride of 1, and->

Denotes the convolution kernel size of 3 ×3. Convolution operation with a step size of 2->

Represents a convolution operation with a convolution kernel size of 5 x 5 and a step size of 2, and->

And &>

Each representing a maximum pooling operation with a pooling window size of 2 x 2, step size of 2>

Represents a convolution operation with a convolution kernel size of 1 x 1 and a step size of 1, and->

Indicates that the element corresponds to a product operation, and->

Showing the splicing together of the feature maps therein,

、/>

、/>

、/>

、/>

both represent an activation function ReLU, <' >>

A characteristic map representing the output of the pre-residual module, based on the comparison result of the pre-residual module>

Represents->

Characteristic map generated after activation, ->

Represents->

Characteristic map generated upon activation>

Represents->

And/or>

The feature map generated after the addition is taken>

Represents->

And/or>

A characteristic map generated after the corresponding multiplication of the elements is made, is based on the result of the evaluation>

Represents->

Characteristic map generated after activation, ->

Represents->

Characteristic map generated after activation, ->

Represents->

Characteristic map generated after pooling operation,. Sup.>

Represents->

Feature maps generated after pooling operations>

Represents a combined attention calibration unit>

Represents an attention diagram generated and output by the combined attention calibration unit, -a>

A feature map representing an output of the feature information extraction layer;

s20, inputting the unknown face image into the face image recognition network model, and carrying out feature extraction operation on image information by each feature information extraction layer in sequence along with the transfer of the image information along the backbone until an abstract feature image is output by the last feature information extraction layer;

s30, inputting the abstract feature map into the global pooling processing layer, performing global pooling operation on each layer of the abstract feature map by using the global pooling processing layer, and outputting to obtain a face feature vector;

s40, calculating the distance between the face feature vector and all target feature vectors in a retrieval library, wherein the identity corresponding to the target feature vector which is closest to the face feature vector and meets the threshold condition is the identity of the unknown face image.

Further, the global pooling processing layer is a global average pooling layer.

Further, the mathematical model of the composite attention calibration unit is:

wherein the combined attention calibration unit holds the characteristic map>

、/>

、/>

And &>

As an input; />

And &>

All represent global max pooling operations on the feature map, device for selecting or keeping>

Is to operate the characteristic map in the spatial direction, based on the characteristic map>

Is to operate the characteristic map in the direction of the channel, based on the result of the evaluation of the characteristic map>

Showing the splicing together of the feature maps therein; />

Convolution operation representing convolution kernel size of 1 × 1 and step size of 1; />

Indicates that the element corresponds to a product operation, and->

And &>

Each represents a sigmoid activation function, <' >>

Represents->

The feature map generated after the convolution operation is->

Represents->

Feature map generated after function activation, ->

Representing a feature map>

、/>

、/>

And &>

Respectively performing global maximum pooling in the channel direction, and splicing to obtain characteristic maps, and selecting the corresponding characteristic map based on the characteristic map>

An attention map of the output of the composite attention calibration unit is represented.

Further, the mathematical model of the pre-residual module is:

wherein is present>

Represents the input of the pre-residual module, is combined with the input of the pre-residual module>

And &>

Each represents an activation function ReLU +>

And &>

Each represents a convolution operation with a convolution kernel size of 3 x 3 and a step size of 1, and->

Represents->

Activation of the generated characteristic map +>

And representing the characteristic graph output by the preposed residual error module.

The invention also provides a low-quality face image recognition device with an attention mechanism, which comprises a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the low-quality face image recognition method with the attention mechanism by loading the computer program.

The invention has the beneficial effects that:

(1) The prior art shows that for a clear high-resolution face image, required characteristic information can be fully extracted and obtained from the image through a plurality of convolution layers which are simply superposed, while for a low-quality (such as low-resolution) face image, useful face image information is very limited and is often included in a large amount of interference signals, the existing artificial neural network is difficult to well deal with the low-quality image input, and the data fitting effect is poor; in order to improve the utilization rate of the original input image, the invention not only arranges convolution layers (with different convolution kernel sizes) in each characteristic information extraction layer

And &>

) And also in different ways>

And &>

The feature graphs output by the two convolutions are fused (matrix addition and element corresponding product), so that the feature information extraction layer can perform feature extraction operation from different angles, and the feature extraction operation from different angles is mutually supplemented and verified, thereby achieving the effect of enhancing the image feature extraction effect;

(2) In the conventional face recognition neural network, the feature extraction units and the pooling layer are alternately arranged, and the whole network is of a serial structure, and only one pooling layer exists between the two feature extraction units, so that the information pooling mode is single, and the feature extraction effect of the whole network on low-quality images is limited; the invention sets a plurality of step convolutions (

And &>

) And pooling layer (` Harbin `)>

And &>

) The characteristic diagrams of different branches are respectively operated, so that the length and width of the characteristic diagrams are reduced, and information loss caused by a single pooling mode is well relieved;

(3) Feature map generated by multiple branches (

、/>

、/>

And &>

) Parallel input into the composite attention calibration unit, the composite attention calibration unit can learn the importance of different information from different angles and the resulting attention diagram (or @) compared to a conventional single profile as input>

) The information calibration effect is better.

Drawings

FIG. 1 is a schematic diagram of a structure of a face image recognition network model according to an embodiment;

FIG. 2 is a schematic structural diagram of a feature information extraction layer in the face image recognition network model shown in FIG. 1;

FIG. 3 is a schematic diagram of a composite attention calibration unit in the feature information extraction layer shown in FIG. 2;

FIG. 4 is a schematic view of a structure of a characteristic information extraction layer in a comparative example;

in the drawings: the method comprises the steps of 1-unknown face image, 2-feature information extraction layer, 21-preposed residual module, 3-global pooling processing layer, 4-composite attention calibration unit and 5-face feature vector.

Detailed Description

Example (b):

as shown in the drawings of the specification, fig. 1, fig. 2, and fig. 3 are a schematic structural diagram of a face image recognition network model, a schematic structural diagram of a feature information extraction layer 2, and a schematic structural diagram of a composite attention calibration unit 4, respectively, according to this embodiment. The global pooling processing layer 3 is realized by adopting a global average pooling layer, four feature information extraction layers 2 are arranged in a network backbone, and the feature information extraction layers 2 can be expressed as the following mathematical models:

in which>

Representing a pre-residual module; />

Represents a convolution operation with a convolution kernel size of 5 x 5 with a step size of 1, and->

Represents a convolution operation with a convolution kernel size of 3 x 3 and a step size of 2, and->

Represents the convolution operation with convolution kernel size of 5 x 5 and step size of 2,

and &>

Each represents a maximum pooling operation with a pooling window size of 2 x 2 and a step size of 2, and->

Indicates that the element corresponds to a product operation, and->

Showing the splicing together of the feature maps therein,

、/>

、/>

、/>

、/>

each represents an activation function ReLU +>

Represents->

Characteristic map generated upon activation>

Represents->

Characteristic map generated after activation, ->

Represents->

And/or>

The feature map generated after the addition is taken>

Represents->

And/or>

Characteristic map generated after element corresponding product is taken, and>

represents->

ActivationPost-generated feature map>

Represents->

Characteristic map generated after activation, ->

Represents->

Characteristic map generated after pooling operation,. Sup.>

Represents->

Characteristic map generated after pooling operation,. Sup.>

Represents a combined attention calibration unit>

Represents an attention map generated and output by the combined attention calibration unit, <' >>

And a feature map representing an output of the feature information extraction layer.

For the first feature information extraction layer 2, the unknown face image 1 (the number of channels is 3) is input, and the first convolution operation in the pre-residual module 21 is performed (i.e., (1)

) Then, the output obtains a feature map with the channel number of 48 (the length and width are equal to the unknown face image 1), and the second convolution operation (based on the length and width of the unknown face image 1)>

) Before and after, the size of the characteristic diagram is kept unchanged. For the following three feature information extraction layers 2, in the pre-residual module 21Two convolution operations (` vs `)>

And &>

) Before and after, the length and width dimensions of the characteristic diagram and the number of channels are kept unchanged.

In all of the feature information extraction layers 2,

and &>

Before and after two convolution operations, the length and width sizes and the channel number of the feature map are kept unchanged, so that the feature map is/are reserved in the same feature information extraction layer 2>

And the characteristic map->

Are all the same in size, are combined in a manner known per se>

And/or>

Add to obtain>

，/>

And/or>

Make the corresponding product of elements to get->

. By means of pairs>

、/>

、/>

And &>

Fills in the appropriate padding value so that->

、/>

、/>

And &>

After operation, the characteristic diagram is obtained

、/>

、/>

And &>

The length and the width of the paper are (in the same characteristic information extraction layer 2)>

Half of the characteristic map, is taken>

、/>

、/>

And &>

Of (2)The number is equal to (in the same characteristic information extraction layer 2) <>

The characteristic diagrams are equal. Then by splicing and->

Convolution operation of feature map>

、/>

、/>

And &>

Fusion, the finally output characteristic map->

The number of channels is (in the same feature information extraction layer 2) the feature map pick>

Characteristic map ^ 2 times the number of channels>

The length and width of the feature map (in the same feature information extraction layer 2) is a feature map>

Half the length and width dimensions.

For the composite attention calibration unit 4,

、/>

、/>

and &>

After the input, on the one hand, by means of splicing and->

Convolution to obtain a feature map with a channel number of 4 +>

(ii) a A global max pooling operation is then performed in the spatial direction,

after the function has been activated, a vector of length 4 is obtained>

. On the other hand, each is individually paired->

、/>

、/>

And

performing global maximum pooling operation in the channel direction, and splicing the obtained 4 two-dimensional matrices to obtain feature maps with the channel number of 4>

. Then the vector is->

And &>

Performing a multiplication operation corresponding to the element, and using the result>

Is paired and/or matched>

Modulating each layer; finally utilizes>

Convolution compresses the channel number to 1, pass->

Function activation, generating an attention map->

。/>

Is a two-dimensional matrix whose length and width dimensions correspond to those of the input->

Characteristic map equals->

And/or>

And performing element corresponding product on the feature diagram obtained after the function activation to realize the calibration of feature information of different positions in the space direction of the feature diagram. The combined attention calibration unit 4 uses the vector ≥>

Is paired and/or matched>

Each layer of (a) is modulated, thus not only utilizing

、/>

、/>

And &>

The characteristic relation of different positions in the space directionThe feature relation in the channel direction is also utilized, the utilization rate of the input information is high by the composite attention calibration unit 4, and the feature map is more fully mined and learned>

、/>

、/>

And &>

The relative relation of different position information and the characteristic diagram can realize more effective calibration after element corresponding product operation.

After the image information is transmitted on the network backbone, the number of the abstract feature map channels output by the last feature information extraction layer 2 is 768, and after global average pooling processing is performed on each layer of the abstract feature map, a face feature vector 5 with the length of 768 is generated. The length of all target feature vectors in the retrieval library is 768, and corresponding target feature vectors are obtained by inputting high-definition face images into the trained face image recognition network model. In this embodiment, the calculated euclidean distance between the face feature vector 5 and all target feature vectors in the search library, and the identity corresponding to the target feature vector which is closest to the face feature vector 5 and whose euclidean distance is smaller than the threshold value is the identity of the unknown face image 1. And if the Euclidean distances between the face feature vector 5 and all the target feature vectors in the search library are larger than or equal to the threshold value, judging that the unknown face image 1 is not in the search library.

In this embodiment, the SCface data set is used to train and test the network model, and the SCface data set includes high-definition face images and low-resolution face images of 130 persons. The high-definition face image of each person is only 1, the low-resolution face image is 15, and the 15 images are obtained by arranging 5 cameras at three different distances (1 meter, 2.6 meters and 4.2 meters). In specific implementation, 3 images are randomly extracted from 5 images captured at each distance, a training set including 1170 images is formed, and the remaining 780 low-resolution face images are used as a test set. And in the training process, optimizing the network model by adopting a ternary loss function. After training is finished, high-definition images of 130 persons are input into the network model, the output feature vectors form target feature vectors in a search library, and then the images in the test set are input into the trained network model for testing. For comparison, the present embodiment also trains the currently advanced low-resolution face recognition model MIND-Net by using the same training set, and tests are performed on the same test set, and the comparison result is shown in Table 1.

TABLE 1 examples and MIND-Net model comparative results were tested on the test set

Compared with the final recognition accuracy, the recognition accuracy of the human face image with different resolutions is higher than MIND-Net, and particularly for the image with lower resolution (shot at a distance of 4.2 meters), the human face image recognition accuracy is greatly improved.

Comparative example:

this comparative example is intended to more fully illustrate the role of the composite attention calibration unit 4 proposed by the present invention in the overall model. In this comparative example, the composite attention calibration unit 4 in the example is removed, and the CBAM module is used to calibrate the feature map, and the new feature extraction layer 2 structure is shown in fig. 4. The rest of the network model remains unchanged, the training and testing process is also completely consistent with the embodiment, and the modified network model test results are shown in table 2.

Table 2 comparative examples test results on test set

Comparing the data of the above two tables, it can be seen that, under the condition of relatively high resolution, the feature information in the original input image is relatively rich, and the composite attention calibration unit 4 has a relatively limited performance improvement. However, under the condition of relatively low resolution, it becomes important to fully utilize the original few high-value information, and the effect of the composite attention calibration unit 4 on improving the network performance is very obvious.

The above-mentioned embodiments only express the specific embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, and these changes and modifications are all within the scope of the invention.

Claims

1. A low-quality face image recognition method with attention mechanism is characterized in that: the method comprises the following steps:

wherein is present>

An input representing the feature information extraction layer,

representing a pre-residual module; />

Represents the convolution operation with convolution kernel size of 3 x 3 and step size of 1, and/or the convolution operation of the convolution kernel size of more than one step size>

Represents the convolution operation with convolution kernel size of 3 x 3 and step size of 2, and/or the convolution operation of the convolution kernel size of more than 3 x 3>

Represents a convolution operation with a convolution kernel size of 5 x 5 and a stride of 2, and->

And &>

Indicates that the element corresponds to a product operation, and->

Means for concatenating the characteristic maps thereof>

、/>

、/>

、/>

、/>

Both represent an activation function ReLU, <' >>

A feature map representing the output of said pre-residual module, based on a pre-determined characteristic value>

Represents->

The feature map generated after the activation is carried out,

represents->

Characteristic map generated after activation, ->

Represents->

And/or>

The feature map generated after the addition is taken>

Represents->

And/or>

Represents->

Characteristic map generated upon activation>

To represent

Characteristic map generated upon activation>

Represents->

Characteristic map generated after pooling operation,. Sup.>

Represents->

Feature maps generated after pooling operations>

Represents a combined attention calibration unit>

s40, calculating the distances between the face feature vectors and all target feature vectors in a retrieval library, wherein the identity corresponding to the target feature vector which is closest to the face feature vector and meets the threshold condition is the identity of the unknown face image.

2. The method of claim 1, wherein the method comprises the following steps: the global pooling processing layer is a global average pooling layer.

3. The method of claim 1, wherein the method comprises the steps of: the mathematical model of the composite attention calibration unit is:

wherein the combined attention calibration unit holds the characteristic map>

、/>

、/>

And &>

As an input; />

And &>

All represent a global max pooling operation on the feature map, device for selecting or keeping>

Operating on a characteristic map in the direction of a channel>

Showing the splicing together of the feature maps therein; />

Indicates that the element corresponds to a product operation, and->

And &>

Each represents a sigmoid activation function, <' >>

Represents->

The feature map generated after the convolution operation is->

Represents->

Feature map generated after function activation, ->

Indicates that the characteristic map is to be taken>

、/>

、/>

And &>

An attention map representing the output of the composite attention calibration unit.

4. The method of claim 1, wherein the method comprises the following steps: the mathematical model of the preposed residual error module is as follows:

in which>

Represents the input of the pre-residual module, is greater than or equal to>

And &>

Both represent an activation function ReLU, <' >>

And &>

All represent convolution kernel sizes of 3 x 3,Convolution operation with a step size of 1, -and>

represents->

Activation of a generated feature map>

5. A low-quality face image recognition device with attention mechanism is characterized in that: comprising a processor and a memory, said memory storing a computer program, said processor being adapted to perform the method of low quality face image recognition with attention mechanism according to any of claims 1-4 by loading said computer program.