CN113706404A - Depression angle human face image correction method and system based on self-attention mechanism

Info

Publication number
CN113706404A
CN113706404A (application CN202110899936.2A, granted as CN113706404B)
Authority
CN
China
Prior art keywords
layer
input
fusion
module
convolution
Prior art date
Legal status
Granted
Application number
CN202110899936.2A
Other languages
Chinese (zh)
Other versions
CN113706404B (en)
Inventor
邹华
斯马依力江·木萨汗
王中元
邬欢欢
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN202110899936.2A
Publication of CN113706404A
Application granted
Publication of CN113706404B
Active legal status
Anticipated expiration

Classifications

    • G06T 5/80 Geometric correction (image enhancement or restoration)
    • G06F 18/253 Fusion techniques of extracted features (pattern recognition)
    • G06N 3/045 Combinations of networks (neural network architectures)
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06T 7/0012 Biomedical image inspection (image analysis)
    • G06T 2207/10016 Video; image sequence
    • G06T 2207/20081 Training; learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30196 Human being; person
    • G06T 2207/30201 Face
    • Y02T 10/40 Engine management systems


Abstract

The invention discloses a method and a system for correcting depression angle (downward-looking) face images based on a self-attention mechanism. To make full use of the complementary information in several pictures, a convolutional gated recurrent unit is introduced: the pictures are fed in sequentially, from the one with the largest deflection angle to the one with the smallest, to simulate the process of face frontalization, so that correlated facial feature information can be better predicted and extracted from several depression-angle face pictures and the face information of the pictures complements one another. Compared with correcting a single face picture, performing depression-angle face correction on several pictures yields a generated picture that is closer to the real frontal picture and richer in detail.

Description

Depression angle human face image correction method and system based on self-attention mechanism
Technical Field
The invention belongs to the technical field of computer vision, relates to a depression angle face image correction method and system, and particularly relates to a depression angle face image correction method and system based on a self-attention mechanism.
Background
In modern society, payment, security checks, suspect tracking and even office check-in all place high demands on identifying and authenticating each individual, and there are many means of identification and authentication, such as fingerprint identification and genetic identification. Because it requires little contact and no special cooperation from users and can collect information at a distance, face recognition has become the most widely applied and most widely deployed form of identity recognition in today's society. Research around the human face has also branched in many directions, including estimating a person's age from the face, simulating the person's past appearance and future aging by modifying the face, and analyzing a person's mental state by recognizing facial expressions.
The technology for recognizing standard frontal faces is by now quite mature: the important regions of the face (eyes, nose, mouth and so on) are mapped to specific locations by some method without losing identity information. In controlled, supervised scenes, such as face-verification checkpoints, the subject consciously adjusts the face to a fixed position so that a frontal face image can be obtained accurately and effectively. As recognition accuracy on standard faces approaches its peak, researchers have turned from controlled face images to uncontrolled natural images. In everyday life, uncontrolled images account for the larger share, and the overall facial changes caused by variations in lighting, expression, pose and so on remain nearly insurmountable barriers for the methods used in standard face recognition.
This is especially true in the surveillance field. Because surveillance cameras are usually mounted high up, the captured picture is usually a depression-angle picture of the face. When the camera is installed in a relatively confined space, such as a corner, a frontal face image is hard to obtain at all; even when it can be obtained in an open space, the person is usually far away, so extracting a frontal face image places high demands on the camera equipment. Such high-performance cameras are necessarily expensive and cannot be used in every situation. If, on the other hand, the frontal face image can be restored from a depression-angle picture, the requirements on the camera are much lower and the approach can be applied in almost all cases. Multi-pose face correction therefore arose to meet this need.
As people rely ever more heavily on face information and face-processing problems diversify, face correction has become a field in its own right, separate from face recognition. Compared with problems of illumination, expression, resolution and so on, the effect of pose on face recognition is not negligible and can even be decisive: changes of face pose, especially large ones, make face recognition very unstable. Like any three-dimensional object, the human face can be imaged at any angle through rotation about three axes, namely pitch, yaw and roll. Most current research addresses roll and yaw correction, while the pitch angle is corrected only slightly. Pitch rotation produces depression-angle or elevation-angle images; depression-angle images occur frequently in surveillance and have a very wide range of application, yet research in this area is relatively scarce and effective results are few. Research on the face depression-angle correction problem therefore has great practical significance.
Disclosure of Invention
In order to solve the technical problem, the invention provides a depression angle human face image correction method and system based on a self-attention mechanism.
The method adopts the technical scheme that: a depression angle face image correction method based on a self-attention mechanism comprises the following steps:
step 1: constructing a multi-input fusion countermeasure generation network based on an attention mechanism;
the multi-input fusion confrontation generation network comprises a multi-input fusion coding module, a self-attention module, a single-layer fusion module, a multi-input fusion decoding module and a confrontation generation network identification module;
the multi-input fusion coding module comprises four convolutional layers which are arranged in series, the first layer is a convolutional layer with the convolutional kernel size of 7, and the step length is 1; the second layer is a convolution layer with convolution kernel size of 5 and step length of 2; the third and fourth layers are convolution layers with convolution kernel size of 3, and step length is 2; a residual block is added behind the first layer and the second layer of convolution layers, and a normalization layer, an activation layer and a residual block are sequentially added behind the third layer and the fourth layer of convolution layers;
the self-attention module is used for constructing, through convolution kernels of size 1, three feature maps f, g and h from the feature map F output by the multi-input fusion coding module; matrix multiplication and a softmax operation are performed on feature map f and feature map g to obtain the matrix feature map β_(i,j); β_(i,j) is then multiplied with feature map h to obtain the weighted value o_j, and the weighted value is added to the feature map F before being output;
the single-layer fusion module is used for fusing a plurality of picture features of the C pictures output by each convolution layer in the multi-input fusion coding module through the C ConvGRU modules;
the multilayer fusion module is used for enabling all the characteristics to be in the same scale by respectively passing through a deconvolution layer for four single-layer fusion characteristics G1, G2, G3 and G4 output by the single-layer fusion module, and respectively passing through a ConvGRU module according to the sequence of G4, G3, G2 and G1 to finally obtain multilayer fusion characteristics, wherein the multilayer fusion characteristics pass through a convolution layer with a convolution kernel size of 3 and a step size of 2 and two full-connection layers to obtain overall characteristics;
the multi-input fusion decoding module consists of four deconvolution layers, two self-attention layers and two convolution layers; the system is used for adding Gaussian noise information into the total features output by the multilayer fusion module for reconstruction to obtain new features F1, and then performing up-sampling on the features F1 to respectively form three features F2, F3 and F4 with different scales and then inputting the three features into a deconvolution layer; entering a deconvolution operation; the input of the deconvolution network of the first layer of the multi-input fusion decoding module is an up-sampling value of the output of the fourth layer of the multi-input fusion coding module after passing through the residual block and fused with F1; the input of the second layer of the convolution layer of the multi-input fusion decoding module is the result of the output of the previous layer of convolution layer after passing through the residual block, F2 and the output of the third layer of the convolution layer of the multi-input fusion encoding module after passing through the residual block are fused; the input of the deconvolution layer of the third layer of the multi-input fusion decoding module is the residual output of the deconvolution layer of the previous layer, the result of the output of the attention module after passing through the residual block, F3, the cross-layer input of the second layer of the convolution layer of the multi-input fusion coding module, and the fusion of the four values after the input picture is subjected to resize to be in a certain size; the input of the deconvolution layer of the fourth layer of the multi-input fusion decoding module is the result of the output of the attention module after passing through the residual block, the result of the output of the first convolution layer of the multi-input fusion coding module after passing through the parameter block, and the fusion input of the input picture; the face correction fine picture is output through the two convolution layers after the fourth layer of the multi-input fusion decoding module; after the input feature map passes through the unit, each feature map has a weight map which represents the association degree of each part in the feature map;
the generation confrontation network identification module consists of seven convolutional layers, and residual blocks are added to the penultimate and antepenultimate layers;
step 2: and inputting the depression angle face image to be corrected into the multi-input fusion confrontation generation network to obtain a face correction fine picture.
The technical scheme adopted by the system of the invention is as follows: a depression angle human face image correction system based on a self-attention mechanism comprises the following modules:
the module 1 is used for constructing a multi-input fusion countermeasure generation network based on an attention mechanism;
the multi-input fusion confrontation generation network comprises a multi-input fusion coding module, a self-attention module, a single-layer fusion module, a multi-input fusion decoding module and a confrontation generation network identification module;
the multi-input fusion coding module comprises four convolutional layers which are arranged in series, the first layer is a convolutional layer with the convolutional kernel size of 7, and the step length is 1; the second layer is a convolution layer with convolution kernel size of 5 and step length of 2; the third and fourth layers are convolution layers with convolution kernel size of 3, and step length is 2; a residual block is added behind the first layer and the second layer of convolution layers, and a normalization layer, an activation layer and a residual block are sequentially added behind the third layer and the fourth layer of convolution layers;
the self-attention module is used for constructing, through convolution kernels of size 1, three feature maps f, g and h from the feature map F output by the multi-input fusion coding module; matrix multiplication and a softmax operation are performed on feature map f and feature map g to obtain the matrix feature map β_(i,j); β_(i,j) is then multiplied with feature map h to obtain the weighted value o_j, and the weighted value is added to the feature map F before being output;
the single-layer fusion module is used for fusing a plurality of picture features of the C pictures output by each convolution layer in the multi-input fusion coding module through the C ConvGRU modules;
the multilayer fusion module is used for enabling all the characteristics to be in the same scale by respectively passing through a deconvolution layer for four single-layer fusion characteristics G1, G2, G3 and G4 output by the single-layer fusion module, and respectively passing through a ConvGRU module according to the sequence of G4, G3, G2 and G1 to finally obtain multilayer fusion characteristics, wherein the multilayer fusion characteristics pass through a convolution layer with a convolution kernel size of 3 and a step size of 2 and two full-connection layers to obtain overall characteristics;
the multi-input fusion decoding module consists of four deconvolution layers, two self-attention layers and two convolution layers; the system is used for adding Gaussian noise information into the total features output by the multilayer fusion module for reconstruction to obtain new features F1, and then performing up-sampling on the features F1 to respectively form three features F2, F3 and F4 with different scales and then inputting the three features into a deconvolution layer; entering a deconvolution operation; the input of the deconvolution network of the first layer of the multi-input fusion decoding module is an up-sampling value of the output of the fourth layer of the multi-input fusion coding module after passing through the residual block and fused with F1; the input of the second layer of the convolution layer of the multi-input fusion decoding module is the result of the output of the previous layer of convolution layer after passing through the residual block, F2 and the output of the third layer of the convolution layer of the multi-input fusion encoding module after passing through the residual block are fused; the input of the deconvolution layer of the third layer of the multi-input fusion decoding module is the residual output of the deconvolution layer of the previous layer, the result of the output of the attention module after passing through the residual block, F3, the cross-layer input of the second layer of the convolution layer of the multi-input fusion coding module, and the fusion of the four values after the input picture is subjected to resize to be in a certain size; the input of the deconvolution layer of the fourth layer of the multi-input fusion decoding module is the result of the output of the attention module after passing through the residual block, the result of the output of the first convolution layer of the multi-input fusion coding module after passing through the parameter block, and the fusion input of the input picture; the face correction fine picture is output through the two convolution layers after the fourth layer of the multi-input fusion decoding module; after the input feature map passes through the unit, each feature map has a weight map which represents the association degree of each part in the feature map;
the generation confrontation network identification module consists of seven convolutional layers, and residual blocks are added to the penultimate and antepenultimate layers;
and the module 2 is used for inputting the depression angle face image needing to be corrected into the multi-input fusion confrontation generation network to obtain a face correction fine picture.
The invention designs a multi-input depression-angle face correction method and system based on an attention mechanism, in accordance with the characteristics of multi-pose face image correction. By using the self-attention unit and the convolutional gated recurrent unit, complementary information in several pictures can be effectively retained while relations are established between global pixels within a single picture. Useful information is extracted while the interference of low-value redundant information is removed, and a residual layer added after each convolutional layer strengthens the gradient flow, improving training efficiency and learning quality. The invention uses single-scale and multi-scale feature fusion to establish feature relations at the deep levels of the picture, so that the generated picture has accurate face features overall as well as fine face picture information.
Drawings
Fig. 1 is a schematic diagram of a multi-input fusion countermeasure generation network structure of an attention mechanism according to an embodiment of the invention.
Fig. 2 is a schematic structural diagram of a self-attention module in a multiple-input fusion countermeasure generation network of an attention mechanism according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a single-layer fusion module in a multi-input fusion countermeasure generation network of an attention mechanism according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a multi-layer fusion module in a multi-input fusion countermeasure generation network of an attention mechanism according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an authentication module of a generation countermeasure network in a multi-input fusion countermeasure generation network of an attention mechanism according to an embodiment of the present invention.
FIG. 6 shows face correction results of the depression-angle face image correction method based on the self-attention mechanism according to an embodiment of the present invention, for a single input picture on the M2FPA data set; each group of three face pictures, from left to right, consists of the input image, the real image and the generated result.
Fig. 7 shows the face correction results of the depression-angle face image correction method based on the self-attention mechanism according to an embodiment of the present invention with two input pictures on the DFW data set; each group of four pictures, from left to right, consists of the two input pictures, the generated picture and the real face picture.
FIG. 8 shows the effect of the self-attention mechanism-based dip angle face image correction method on the DFW data set in comparison with the DA-GAN and TP-GAN methods.
Fig. 9 is a structural diagram of a ConvGRU module according to an embodiment of the present invention.
Fig. 10 is a block diagram of single-layer fusion and multi-layer fusion modules according to an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The invention provides a depression angle face image correction method based on a self-attention mechanism, which comprises the following steps of:
step 1: constructing a multi-input fusion countermeasure generation network based on an attention mechanism;
the multi-input fusion confrontation generation network of the embodiment comprises a multi-input fusion coding module, a self-attention module, a single-layer fusion module, a multi-input fusion decoding module and a generation confrontation network identification module;
the multi-input fusion coding module of the embodiment comprises four convolutional layers which are arranged in series, wherein the first layer is a convolutional layer with a convolutional kernel size of 7, and the step length is 1; the second layer is a convolution layer with convolution kernel size of 5 and step length of 2; the third and fourth layers are convolution layers with convolution kernel size of 3, and step length is 2; a residual block is added after the first layer and the second layer of the convolution layer, and a normalization layer, an activation layer and a residual block are sequentially added after the third layer and the fourth layer of the convolution layer;
in this embodiment, a picture passes through the first layer of convolutional layer and then outputs a feature value 128x128x64, the feature value is transmitted to the same-layer fusion structure of the first layer of convolutional layer while being transmitted to the next layer of convolutional layer, the second layer of convolutional layer outputs a feature value 64x64x64, the feature value is transmitted to the third layer of convolutional layer as an input and is simultaneously transmitted to the same-layer fusion structure of the second layer, the outputs of the third layer and the fourth layer are features 32x32x128 and 16x16x256, and the subsequent steps are consistent with those of the previous two layers. The output of the fourth layer is also input to the multi-input fusion decoding module via one convolutional layer and two fully-connected layers.
In this embodiment, the multi-input fusion coding module includes four convolution layers and four integrally fused ConvGRU units corresponding to each convolution layer.
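Purely for illustration, a minimal PyTorch sketch of such a four-layer encoder might look as follows; the residual-block design, the use of instance normalization and leaky ReLU as the normalization and activation layers, and the padding values are assumptions, while the kernel sizes, strides and the 64/64/128/256 channel widths follow the description above:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simple residual block used after each encoder convolution (illustrative only)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class MultiInputEncoder(nn.Module):
    """Four convolutional layers with kernel sizes 7/5/3/3 and strides 1/2/2/2."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, 64, 7, stride=1, padding=3), ResidualBlock(64))
        self.conv2 = nn.Sequential(nn.Conv2d(64, 64, 5, stride=2, padding=2), ResidualBlock(64))
        self.conv3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1),
                                   nn.InstanceNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
                                   ResidualBlock(128))
        self.conv4 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1),
                                   nn.InstanceNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
                                   ResidualBlock(256))

    def forward(self, x):                     # x: (B, 3, 128, 128)
        f1 = self.conv1(x)                    # (B, 64, 128, 128)
        f2 = self.conv2(f1)                   # (B, 64, 64, 64)
        f3 = self.conv3(f2)                   # (B, 128, 32, 32)
        f4 = self.conv4(f3)                   # (B, 256, 16, 16)
        return f1, f2, f3, f4                 # each output also feeds its same-layer fusion branch
```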
Referring to fig. 2, the self-attention module of this embodiment is configured to construct, through convolution kernels of size 1, three feature maps f(x), g(x) and h(x) from the feature map F output by the multi-input fusion coding module; matrix multiplication and a softmax operation are performed on feature map f(x) and feature map g(x) to obtain the matrix feature map β_(i,j); β_(i,j) is then multiplied with feature map h(x) to obtain the weighted value o_j, which is added to the feature map F before being output;
the calculation steps are as follows:
f(x) = W_f·x,  g(x) = W_g·x,  h(x) = W_h·x,  v(x) = W_v·x    (1)
s_ij = f(x_i)^T · g(x_j)    (2)
β_(i,j) = exp(s_ij) / Σ_i exp(s_ij)    (3)
o_j = v( Σ_i β_(i,j) · h(x_i) )    (4)
y_i = α·o_i + x_i    (5)
In the above formulas, x denotes the input feature map and x_i denotes the i-th feature of that map. W_f, W_g, W_h and W_v are four different weight matrices through which the input feature map is processed in different ways; they are also the quantities the network has to learn. v(x) is the final processing of the attended feature values, which are then added back to the original input to obtain the final self-attention value.
The weight matrices W_f, W_g and W_h that generate f(x), g(x) and h(x) are all implemented with 1x1 convolution kernels; s_ij is an intermediate quantity, and y_i is the output of the self-attention network, obtained by multiplying the output of the attention layer by the scale parameter α and then adding back the input feature map.
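For illustration, a minimal PyTorch sketch of a self-attention layer following equations (1) to (5) might look as follows; the channel-reduction factor of 8 and the zero initialisation of the scale parameter α are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Self-attention over a feature map, following Eqs. (1)-(5)."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 8, 1)   # W_f (1x1 convolution)
        self.g = nn.Conv2d(channels, channels // 8, 1)   # W_g
        self.h = nn.Conv2d(channels, channels, 1)        # W_h
        self.v = nn.Conv2d(channels, channels, 1)        # W_v
        self.alpha = nn.Parameter(torch.zeros(1))        # scale parameter alpha

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, hgt, wid = x.shape
        n = hgt * wid
        fx = self.f(x).view(b, -1, n)                    # (B, C/8, N)
        gx = self.g(x).view(b, -1, n)                    # (B, C/8, N)
        hx = self.h(x).view(b, -1, n)                    # (B, C,   N)
        s = torch.bmm(fx.transpose(1, 2), gx)            # s_ij = f(x_i)^T g(x_j), shape (B, N, N)
        beta = F.softmax(s, dim=1)                       # Eq. (3): normalise over i
        o = self.v(torch.bmm(hx, beta).view(b, c, hgt, wid))  # Eq. (4)
        return self.alpha * o + x                        # Eq. (5)
```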
Referring to fig. 3, the single-layer fusion module of the present embodiment is configured to fuse characteristics of multiple pictures for characteristics of C pictures output by each convolution layer in the multi-input fusion coding module through C ConvGRU modules;
in this embodiment, the ConvGRU module has two usage forms in the network, one is to merge the convolution results of each layer, and the number of input in this layer determines the number of ConvGRU modules, that is, the number of C is the number of input pictures and the number of ConvGRU modules in the same layer, that is, how many input pictures have to be merged in the same layer. Each ConvGRU module input is the result of the output of the kth convolutional layer passing through a residual block, which in this embodiment has only 4 convolutional layers, so K < 4.
In this embodiment, each control gate of the ConvGRU module is preceded by a weight map, and the feature map is updated and retained according to the weight map.
Please refer to fig. 9, which shows the structure of the ConvGRU module. Each module takes as input the value x_t of one picture after convolution and the hidden-state value h_(t-1) produced when the previous picture was processed by the ConvGRU module. Unlike the LSTM, the GRU module has only a reset gate and an update gate. After the input image is fused with the hidden state of the previous step, a sigmoid function is applied and the resulting value is multiplied point-wise with the previous hidden feature value; this is the update-gate operation of the GRU. After the hidden-state value is fused with the current picture feature value and given a weight, the sigmoid operation decides which feature information is retained: because the sigmoid pushes smaller values towards 0 and larger values towards 1, multiplying these near-0 or near-1 numbers with the previous hidden feature values keeps part of the features almost intact and resets the part multiplied by values close to 0, so the sigmoid function acts here as a filter. The update gate controls how much of the state at the previous time step is written into the current state; a larger update-gate value means a larger proportion of the previous input. The information of the reset gate, multiplied point-wise with the previous hidden-state information, is normalized by a tanh function; the tanh result is multiplied by the update-gate information and added to the previous state information to obtain the hidden-state information at this time step, which is then output or passed to the next ConvGRU module. This constitutes one cycle of the recurrent ConvGRU network.
The ConvGRU update equations are as follows:
Z_t = σ(W_xz * X_t + W_hz * H_(t-1))
R_t = σ(W_xr * X_t + W_hr * H_(t-1))
H'_t = f(W_xh * X_t + R_t ⊙ (W_hh * H_(t-1)))
H_t = (1 - Z_t) ⊙ H'_t + Z_t ⊙ H_(t-1)
In the above formulas, Z_t denotes the update gate, R_t the reset gate, and X_t the input at this time step; * denotes the convolution operation. W is a weight value whose subscript indicates the transition it performs; for example, W_xz maps the original input to the update-gate input, and the other weights are analogous. H denotes a state value: H_(t-1) is the state produced by the previous input after processing, H_t is the state finally output for the current input after the above operations, and ⊙ denotes the Hadamard product. All convolutions above are weighting operations with 1x1 kernels; f() is an activation function, a leaky ReLU with negative slope 0.2, used to normalize the output after each operation. The ConvGRU module itself is prior art.
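A minimal sketch of a ConvGRU cell implementing the update equations above might look as follows (1x1 convolutions and a leaky ReLU with negative slope 0.2 as f(·), per the description; everything else, such as zero initialisation of the hidden state for the first picture, is an assumption):

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: update gate Z_t, reset gate R_t, candidate state H'_t."""
    def __init__(self, channels):
        super().__init__()
        self.w_xz = nn.Conv2d(channels, channels, 1); self.w_hz = nn.Conv2d(channels, channels, 1)
        self.w_xr = nn.Conv2d(channels, channels, 1); self.w_hr = nn.Conv2d(channels, channels, 1)
        self.w_xh = nn.Conv2d(channels, channels, 1); self.w_hh = nn.Conv2d(channels, channels, 1)
        self.act = nn.LeakyReLU(0.2, inplace=True)          # f(.) in the update equations

    def forward(self, x_t, h_prev=None):
        if h_prev is None:                                   # first picture in the sequence
            h_prev = torch.zeros_like(x_t)
        z_t = torch.sigmoid(self.w_xz(x_t) + self.w_hz(h_prev))        # update gate
        r_t = torch.sigmoid(self.w_xr(x_t) + self.w_hr(h_prev))        # reset gate
        h_cand = self.act(self.w_xh(x_t) + r_t * self.w_hh(h_prev))    # candidate state H'_t
        return (1 - z_t) * h_cand + z_t * h_prev                       # new hidden state H_t
```

In single-layer fusion, the per-layer features of the C input pictures would be fed through such a cell one after another, from the picture with the largest deflection angle to the one with the smallest.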
Referring to fig. 4, the multilayer fusion module of the present embodiment is configured to enable all the four single-layer fusion features G1, G2, G3, and G4 output by the single-layer fusion module to be in the same size through one deconvolution layer, and pass through a ConvGRU module according to the sequence of G4, G3, G2, and G1, respectively, to obtain a multilayer fusion feature, where the multilayer fusion feature passes through one convolution layer with a convolution kernel size of 3 and a step size of 2, and two full-connection layers to obtain an overall feature;
please refer to fig. 10, which is a network fusion structure, the ConvGRU module is used for both single-layer fusion and multi-layer fusion, the left part of the upper diagram is an encoder, the right part of the upper diagram is a decoder, and the middle part of the upper diagram is a four-layer fusion module.
In this embodiment, the single-layer fusion structure fuses the outputs of each convolutional layer; the network has four convolutional layers, so there are 4 single-layer fusion branches. On top of the single-layer fusion structure, the multi-layer fusion structure summarizes the single-layer fusion results; it is more inclined to extract deep information from the several inputs and looks for dependencies among the features in a higher-dimensional space. The first ConvGRU module of the multi-layer fusion structure takes as input the result of the first convolutional layer after its single-layer fusion; the whole multi-layer fusion structure has 4 ConvGRU units, and the formula of this process is as follows:
H_4 = ConvGRU(G_4^↑),  H_k = ConvGRU(G_k^↑, H_(k+1)^↑) for k = 3, 2, 1,  H_F = conv(H_1)    (6)

In the above formula, H_F denotes the final output of the multi-layer fusion module, conv() denotes a convolution, G_n^↑ denotes the output of the n-th single-layer fusion branch after one up-sampling, and H^↑ denotes an intermediate output of the multi-layer fusion module after one up-sampling. In the multi-layer fusion structure the input size of each layer is different, so when data is passed from the bottom layer to a higher layer it is up-sampled once and fused with the input of that layer; after the last convolution of the fusion layer, the result is passed to the top deconvolution layer for deconvolution, finally generating the face picture.
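Reusing the ConvGRUCell sketched above, the single-layer and multi-layer fusion logic could be outlined roughly as follows; the assumption that the G_k features have already been brought to a common scale and channel width by their deconvolution layers, and the use of one ConvGRU cell per fused feature, are simplifications:

```python
def single_layer_fusion(cell, feature_seq):
    """Fuse the features of C pictures from one encoder layer,
    ordered from largest to smallest deflection angle."""
    h = None
    for x_t in feature_seq:          # C feature maps of identical shape
        h = cell(x_t, h)
    return h                         # G_k: single-layer fusion feature of this layer

def multi_layer_fusion(cells, g_feats):
    """Fuse the single-layer features in the order G4, G3, G2, G1.
    Assumes each G_k was already brought to a common scale by its deconvolution layer."""
    h = None
    for cell, g in zip(cells, g_feats):   # g_feats = (g4, g3, g2, g1), cells = 4 ConvGRUCell modules
        h = cell(g, h)
    return h                              # multi-layer fusion feature, before the final conv + two FC layers
```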
The multi-input fusion decoding module of the embodiment is composed of four deconvolution layers, two self-attention layers and two convolution layers; the system is used for adding Gaussian noise information into the overall features output by the multilayer fusion module for reconstruction to obtain new features F1, and then performing up-sampling on the features F1 to respectively form three features F2, F3 and F4 with different scales and then inputting the three features into a deconvolution layer; entering a deconvolution operation; the input of the deconvolution network of the first layer of the multi-input fusion decoding module is an up-sampling value of the output of the fourth layer of the multi-input fusion coding module after passing through the residual block and fused with F1; the input of the second layer of the deconvolution layer of the multi-input fusion decoding module is the result of the output of the previous layer of the deconvolution layer after passing through the residual block, F2 and the output of the third layer of the convolution layer of the multi-input fusion coding module after passing through the residual block are fused; the input of the deconvolution layer of the third layer of the multi-input fusion decoding module is the residual output of the deconvolution layer of the previous layer, the result of the output of the attention module after passing through the residual block, F3, the cross-layer input of the second layer of the convolution layer of the multi-input fusion coding module, and the fusion of the four values after the input picture is subjected to resize to be in a certain size; the input of the deconvolution layer of the fourth layer of the multi-input fusion decoding module is the result of the output of the attention module after passing through the residual block, the result of the output of the first convolution layer of the multi-input fusion coding module after passing through the parameter block, and the fusion input of the input picture; the face correction fine picture is output through the two convolution layers after the fourth layer of the multi-input fusion decoding module; after the input feature map passes through the unit, each feature map has a weight map which represents the association degree of each part in the feature map.
In this embodiment, cross-layer input means that every layer's input includes the output of the previous layer and, in addition, extra inputs taken across layers: the output of the first convolutional layer is input across layers to the second self-attention module, the output of the second convolutional layer is input across layers to the first self-attention module, and the outputs of the third and fourth convolutional layers are input across layers to the second and the first deconvolution layers, respectively. All cross-layer inputs are inputs that have first been fused by the ConvGRU modules.
In this embodiment, the network first adds Gaussian noise information to the 256 features that the encoding module provides through same-layer fusion, reconstructing them into an 8 x 8 x 64 feature; this feature is then up-sampled into three features of different sizes (32 x 32 x 64, 64 x 64 x 16 and 128 x 128 x 8), which are fed into the subsequent deconvolution layers. The up-sampling step here can be subdivided into a deconvolution followed by a ReLU activation. The deconvolution stage then begins. The input of the first deconvolution layer of the decoding module is the fused output of the fourth convolutional layer of the encoding module combined with the up-sampled 8 x 8 x 64 feature. The input of the second deconvolution layer is the fusion of the previous deconvolution layer's output after a residual block with the same-layer-fused output of the third convolutional layer of the encoding module. The input of the third deconvolution layer is the fusion of four values: the residual output of the previous deconvolution layer, the 32 x 32 x 64 up-sampled value, the cross-layer input of the fused second convolutional layer of the encoding module, and the input picture resized to 32 x 32. The third deconvolution layer is followed by the self-attention network layer. The input of the fourth deconvolution layer is the fusion of the self-attention output after a residual block, the output of the first convolutional layer of the encoding module after a residual block, and the input picture resized to 64 x 64 x 3. The last deconvolution layer is followed by a convolutional layer that fuses the pictures generated by the two generators; its input is the result of the local face generator after the organs have been arranged, the residual-block output of the first layer of the encoding module, and the cross-layer input of the original input picture. A 128 x 128 x 3 fine face correction picture can then be output after two convolutional layers.
Referring to fig. 5, the generation countermeasure network identification module of this embodiment is composed of seven convolutional layers, with residual blocks added to the penultimate and antepenultimate layers; better convergence can thus be obtained with a shallower network structure.
This embodiment uses a convolution filter of size 1 in the last layer to reduce dimensionality while preserving the spatial structure of the image. The discriminator finally produces a 4x4 probability map. Compared with a single probability value for the authenticity of the whole picture, this map can concentrate on the organ receptive fields of different regions of the face; in particular, taking the coordinates of the upper-left cell as (0,0), the six receptive fields (1,1), (1,2), (1,3), (2,1), (2,2) and (2,3) just cover the facial organ regions. This improves the discrimination ability of the generation countermeasure network identification module with respect to the organs and, fed back to the multi-input fusion countermeasure generation network, helps it generate finer face pictures.
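As a rough sketch only, a seven-layer convolutional discriminator that ends in a 1x1 convolution and outputs a 4x4 probability map from a 128x128 input might be built as below; the channel widths, strides and padding are assumptions, and ResidualBlock refers to the block sketched with the encoder above:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Seven convolutional layers; residual blocks on the two layers before the last;
    a final 1x1 convolution produces a 4x4 probability map from a 128x128 input."""
    def __init__(self, in_ch=3):
        super().__init__()
        chs = [64, 128, 256, 512, 512, 512]                   # assumed channel widths
        layers, prev = [], in_ch
        for i, ch in enumerate(chs):
            stride = 2 if i < 5 else 1                        # 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 4
            layers += [nn.Conv2d(prev, ch, 3, stride=stride, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            if i >= 4:                                        # residual blocks on the 5th and 6th layers
                layers.append(ResidualBlock(ch))
            prev = ch
        layers.append(nn.Conv2d(prev, 1, 1))                  # seventh layer: 1x1 convolution
        self.net = nn.Sequential(*layers)

    def forward(self, x):                                     # x: (B, 3, 128, 128)
        return torch.sigmoid(self.net(x))                     # (B, 1, 4, 4) probability map
```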
Step 2: inputting the depression angle face image to be corrected into a multi-input fusion confrontation generation network to obtain a face correction fine picture.
In the embodiment, a multi-input fusion countermeasure generation network needs to be trained, and the trained multi-input fusion countermeasure generation network is obtained; the specific implementation comprises the following substeps:
step 1.1: making a training set comprising a front image dataset IFAnd depression angle image data set IP
In this embodiment, a face depression-angle data set TFD is used. The TFD data set contains N individuals; each individual has one frontal face picture and K depression-angle face pictures, with exactly one picture per depression angle. These depression-angle photographs cover almost all depression angles at which the frontal face might be pictured. All frontal images are first taken from the TFD data set to form the frontal image data set I_F, and the other depression-angle images form the depression-angle image data set I_P; all depression-angle images of each individual are kept in folders distinguished by the individual's name. A individuals are used as the training data set and B individuals as the test data set. In both the training data set and the test data set, each depression-angle face picture forms a face picture pair with the corresponding frontal face picture.
In this embodiment, the TFD data set contains 926 individuals with 6 pictures each: one frontal picture and 5 face depression-angle pictures at 15°, 30°, 45°, 60° and 75°, giving 5556 pictures in total. 700 individuals are taken as the training data set and 226 individuals as the test data set, and the depression-angle pictures of each individual are paired with that individual's frontal picture; all pictures are strictly required to be of size 128 x 128.
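A small illustrative helper for building the (depression-angle picture, frontal picture) training pairs could look as follows; the directory layout and file names (front.jpg, pitch_XX.jpg) are purely hypothetical:

```python
from pathlib import Path

def build_pairs(root, identities):
    """Build (depression-angle picture, frontal picture) pairs for the listed identities.
    Assumes a hypothetical layout root/<identity>/front.jpg and root/<identity>/pitch_XX.jpg."""
    pairs = []
    for ident in identities:
        person_dir = Path(root) / ident
        frontal = person_dir / "front.jpg"
        for pitch in (15, 30, 45, 60, 75):
            profile = person_dir / f"pitch_{pitch}.jpg"
            if profile.exists() and frontal.exists():
                pairs.append((profile, frontal))
    return pairs
```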
Step 1.2: a depression angle image data set IPInputting the depression angle picture into a multi-input fusion countermeasure generation network, and collecting a front image data set IFAgainst a generator of a multiple-input fusion generation network, a generated picture I to be generatedGCalculating pixel loss, identity retention loss, confrontation loss, total variation regularization and total loss;
In this embodiment, the identity-preserving loss evaluates the difference between the generated frontal photo and the real frontal face photo, which makes it possible to judge whether the face generated by the model is accurate and reliable. The LightCNN algorithm is used to extract the face features, and the Euclidean distance between the generated face features and the real face features is computed as the identity loss of the network.
L_ID = Σ_i || d_i(G(I_input)) − d_i(I_gt) ||_2    (7)
where d_i() is the feature extracted by the i-th layer from the end of LightCNN's feature-extraction network, ||·|| is the Euclidean distance, G(I_input) is the frontal face picture generated by feeding the depression-angle picture I_input into the generator, and I_gt is the real frontal face picture. LightCNN is a model trained on many thousands of pictures; it can accurately extract the key features of the face and provides a reliable classification effect.
Total variation regularization:
L_tv = Σ_(c=1..C) Σ_(w,h) ( |I_G^(w+1,h,c) − I_G^(w,h,c)| + |I_G^(w,h+1,c) − I_G^(w,h,c)| )    (8)
pixel loss:
L_pixel = Σ_(s∈S) 1/(W_s·H_s·C) Σ_(w=1..W_s) Σ_(h=1..H_s) Σ_(c=1..C) | I_G^(s,w,h,c) − I_gt^(s,w,h,c) |    (9)
where S is the set of picture scales: in this embodiment a 128 x 128 image is used as input and a 128 x 128 picture is regenerated, and by reducing its scale the pixel loss also captures information over a range of sizes; three image scales of 128 x 128, 64 x 64 and 32 x 32 are used. W_s and H_s denote the width and height of the picture at scale s, C denotes the color channels, and G() denotes the generator output. In formulas 8 and 9, w and h denote the width and height coordinates of the picture; I_G^(w,h,c) denotes the pixel of the generated picture at channel c and position (w, h) when computing the total-variation regularization; I_G^(s,w,h,c) denotes the pixel value of the generated picture at scale s and position (w, h, c) when computing the pixel loss; and I_gt^(s,w,h,c) denotes the corresponding pixel value of the real picture at scale s.
The adversarial loss is an essential part of the adversarial generation framework: through the adversarial learning of the generator G and the discriminator D the whole network can be made to perform better. This embodiment adopts the conventional adversarial loss; the loss function of the adversarial network is as follows:
L_adv = E_(I_gt)[ log D_θD(I_gt) ] + E_(I_input)[ log(1 − D_θD(G_n(I_input))) ]    (10)
the method is characterized in that a classical confrontation generation network formula is adopted, the identifier evaluates the pictures, gives the highest value of the real front picture and the minimum value of the generated picture to the greatest extent, and judges that the generator is trained when the identifier evaluates the real picture and the generated picture to be consistent. In equation 10, E () represents the expected value, I, of the distribution functiongtRepresenting a real picture, IinputSide face picture representing input, GnA generator representing an nth iteration. DθDRepresenting a pre-trained discriminator.
The total loss of the network proposed by this embodiment is a linear combination of the above losses, and its expression form is as follows:
L = λ_1·L_ID + λ_2·L_pixel + λ_3·L_adv + λ_4·L_tv    (11)
In the above formula, L denotes the calculated total loss; L_ID, L_pixel, L_adv and L_tv denote the identity loss, pixel loss, adversarial loss and total-variation regularization respectively; λ_1, λ_2, λ_3 and λ_4 are the weights of the different loss terms, set to 0.1, 20, 0.1 and 10^(-4) respectively.
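A hedged sketch of how the loss terms of equations (7) to (11) could be combined, with the weights 0.1, 20, 0.1 and 10^(-4) given above; light_cnn is a placeholder for the pretrained feature extractor and is assumed to return a list of feature tensors, and the normalized TV term and the non-saturating generator loss are implementation choices, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def total_variation(img):
    """Total-variation regularization over the generated image (cf. Eq. 8, mean-normalized)."""
    return (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean() + \
           (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()

def multiscale_pixel_loss(fake, real, scales=(128, 64, 32)):
    """Mean absolute pixel error at the three scales of Eq. 9."""
    loss = 0.0
    for s in scales:
        loss = loss + F.l1_loss(
            F.interpolate(fake, size=(s, s), mode="bilinear", align_corners=False),
            F.interpolate(real, size=(s, s), mode="bilinear", align_corners=False))
    return loss

def generator_loss(fake, real, disc, light_cnn,
                   w_id=0.1, w_pix=20.0, w_adv=0.1, w_tv=1e-4):
    """Linear combination of identity, pixel, adversarial and TV terms (Eq. 11)."""
    # squared Euclidean distance between deep features (cf. Eq. 7)
    id_loss = sum(F.mse_loss(a, b) for a, b in zip(light_cnn(fake), light_cnn(real)))
    pix_loss = multiscale_pixel_loss(fake, real)
    adv_loss = -torch.log(disc(fake) + 1e-8).mean()      # generator side of Eq. 10 (non-saturating form)
    tv_loss = total_variation(fake)
    return w_id * id_loss + w_pix * pix_loss + w_adv * adv_loss + w_tv * tv_loss
```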
Step 1.3: using an optimizer Adam, setting parameters as defaults, performing iterative training on the multi-input fusion countermeasure generation network, and continuously optimizing the model through a back propagation and gradient descent method according to the error of each forward propagation calculation to finally obtain the trained multi-input fusion countermeasure generation network;
In this embodiment the network parameters are set as follows: Adam is used as the network optimizer, the batch size is set to 32, the initial learning rate is set to 0.001, and the learning rate decays to 0.9 times its value every 96 batches. The network is trained for 400,000 iterations in total.
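A minimal training-loop sketch under the stated settings (Adam with default parameters, batch size 32, initial learning rate 0.001, learning rate multiplied by 0.9 every 96 batches, 400,000 iterations); the generator, discriminator, light_cnn and data iterator are assumed to exist, and generator_loss is the sketch above:

```python
import torch

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)          # default betas
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
g_sched = torch.optim.lr_scheduler.StepLR(g_opt, step_size=96, gamma=0.9)
d_sched = torch.optim.lr_scheduler.StepLR(d_opt, step_size=96, gamma=0.9)

for step in range(400_000):
    profiles, frontal = next(train_iter)                            # depression-angle pictures + ground truth
    fake = generator(profiles)

    # discriminator step: real frontal pictures scored high, generated pictures scored low
    d_loss = -(torch.log(discriminator(frontal) + 1e-8).mean()
               + torch.log(1 - discriminator(fake.detach()) + 1e-8).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # generator step: total loss of Eq. (11)
    g_loss = generator_loss(fake, frontal, discriminator, light_cnn)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    g_sched.step(); d_sched.step()
```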
Step 1.4: test the test set with the trained model, compare the obtained images with the frontal pictures in the frontal image data set I_F, and calculate the rank-1 index. The calculated indices show that, in multi-pose correction at all angles, the method achieves almost the best results, with rank-1 values all above 95.
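One illustrative way to compute the rank-1 index, assuming an embedding function embed (for example LightCNN features) that maps a batch of images to one feature vector per image:

```python
import torch

def rank1_accuracy(embed, generated_imgs, gallery_imgs, gallery_ids, query_ids):
    """A query counts as correct when its nearest gallery embedding has the same identity."""
    q = torch.nn.functional.normalize(embed(generated_imgs), dim=1)   # (Nq, D)
    g = torch.nn.functional.normalize(embed(gallery_imgs), dim=1)     # (Ng, D)
    nearest = (q @ g.t()).argmax(dim=1)                               # cosine-similarity nearest neighbour
    correct = sum(query_ids[i] == gallery_ids[j] for i, j in enumerate(nearest.tolist()))
    return correct / len(query_ids)
```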
For comparison with other methods, this embodiment uses the M2FPA, DFW and CAS-PEAL-R1 data sets for training and testing. As shown in Table 1, at a 15° depression angle the accuracy of the invention is always among the top two for the different side-face angles, and the invention is especially robust at large angles. Fig. 6 shows the effect of face correction at a depression angle on the M2FPA data set: the method recovers a clear picture that is sufficiently similar to the frontal picture. Fig. 7 shows the corrected faces obtained on the DFW data set with two pictures as input: with several input pictures the skin and facial organs are recovered better, which is sufficient to demonstrate the effectiveness of the method.
Table 1. Rank-1 accuracy comparison with other methods at different angles (reproduced as an image in the original document).
FIG. 8 compares the depression-angle face image correction method based on the self-attention mechanism with DA-GAN, TP-GAN and other methods on the DFW data set. Table 1 shows the advantage of our method: its rank-1 accuracy is the highest at almost all angles, and even at the 30° angle, where it is not the best, it achieves the second-best value of 99.5.
The invention adopts several state-of-the-art techniques:
(1) A self-attention module is added to the face generation network in order to make the correction effect finer.
(2) The picture is divided into multiple regions for discrimination. Experiments show that the self-attention module fuses the fine features retained over the whole face with the rich local features provided by the local face generation network, and can generate a face correction picture with fine features.
(3) An input-enhanced multi-input face depression-angle correction network is provided. In the enhanced input module a ConvGRU module is added; it can extract the correlated features in several depression-angle face images so that their information complements one another, reduce the number of network parameters, lower the complexity of network training and improve the efficiency of the model.
(4) The invention can efficiently reconstruct a real, accurate and high-precision frontal face image from a single depression-angle picture or from several of them.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A depression angle face image correction method based on a self-attention mechanism is characterized by comprising the following steps:
step 1: constructing a multi-input fusion countermeasure generation network based on an attention mechanism;
the multi-input fusion confrontation generation network comprises a multi-input fusion coding module, a self-attention module, a single-layer fusion module, a multi-input fusion decoding module and a confrontation generation network identification module;
the multi-input fusion coding module comprises four convolutional layers which are arranged in series, the first layer is a convolutional layer with the convolutional kernel size of 7, and the step length is 1; the second layer is a convolution layer with convolution kernel size of 5 and step length of 2; the third and fourth layers are convolution layers with convolution kernel size of 3, and step length is 2; a residual block is added behind the first layer and the second layer of convolution layers, and a normalization layer, an activation layer and a residual block are sequentially added behind the third layer and the fourth layer of convolution layers;
the self-attention module is used for constructing, through convolution kernels of size 1, three feature maps f, g and h from the feature map F output by the multi-input fusion coding module; matrix multiplication and a softmax operation are performed on feature map f and feature map g to obtain the matrix feature map β_(i,j); β_(i,j) is then multiplied with feature map h to obtain the weighted value o_j, and the weighted value is added to the feature map F before being output;
the single-layer fusion module is used for fusing a plurality of picture features of the C pictures output by each convolution layer in the multi-input fusion coding module through the C ConvGRU modules;
the multilayer fusion module is used for enabling all the characteristics to be in the same scale by respectively passing through a deconvolution layer for four single-layer fusion characteristics G1, G2, G3 and G4 output by the single-layer fusion module, and respectively passing through a ConvGRU module according to the sequence of G4, G3, G2 and G1 to finally obtain multilayer fusion characteristics, wherein the multilayer fusion characteristics pass through a convolution layer with a convolution kernel size of 3 and a step size of 2 and two full-connection layers to obtain overall characteristics;
the multi-input fusion decoding module consists of four deconvolution layers, two self-attention layers and two convolution layers; the system is used for adding Gaussian noise information into the total features output by the multilayer fusion module for reconstruction to obtain new features F1, and then performing up-sampling on the features F1 to respectively form three features F2, F3 and F4 with different scales and then inputting the three features into a deconvolution layer; entering a deconvolution operation; the input of the deconvolution network of the first layer of the multi-input fusion decoding module is an up-sampling value of the output of the fourth layer of the multi-input fusion coding module after passing through the residual block and fused with F1; the input of the second layer of the convolution layer of the multi-input fusion decoding module is the result of the output of the previous layer of convolution layer after passing through the residual block, F2 and the output of the third layer of the convolution layer of the multi-input fusion encoding module after passing through the residual block are fused; the input of the deconvolution layer of the third layer of the multi-input fusion decoding module is the residual output of the deconvolution layer of the previous layer, the result of the output of the attention module after passing through the residual block, F3, the cross-layer input of the second layer of the convolution layer of the multi-input fusion coding module, and the fusion of the four values after the input picture is subjected to resize to be in a certain size; the input of the deconvolution layer of the fourth layer of the multi-input fusion decoding module is the result of the output of the attention module after passing through the residual block, the result of the output of the first convolution layer of the multi-input fusion coding module after passing through the parameter block, and the fusion input of the input picture; the face correction fine picture is output through the two convolution layers after the fourth layer of the multi-input fusion decoding module; after the input feature map passes through the unit, each feature map has a weight map which represents the association degree of each part in the feature map;
the generative adversarial network discriminator module consists of seven convolution layers, with residual blocks added to the penultimate and antepenultimate layers;
Step 2: inputting the depression angle face image to be corrected into the multi-input fusion adversarial generation network to obtain a refined corrected face picture.
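A minimal PyTorch-style sketch of the self-attention operation described in the claim above: 1x1 convolutions build f, g and h, a softmax over the product of f and g gives the attention map βi,j, which weights h, and the weighted values are added back to the input feature map. The channel reduction factor and the learnable scale gamma are illustrative assumptions not specified in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Self-attention block: f, g, h come from 1x1 convolutions; beta = softmax(f^T g)
    weights h, and the weighted values are added back to the input feature map."""
    def __init__(self, channels, reduction=8):  # reduction factor is an assumption
        super().__init__()
        self.f = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.g = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.h = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable scale, assumed

    def forward(self, x):
        b, c, hgt, wid = x.shape
        n = hgt * wid
        f = self.f(x).view(b, -1, n)                                # B x C' x N
        g = self.g(x).view(b, -1, n)                                # B x C' x N
        h = self.h(x).view(b, c, n)                                 # B x C  x N
        beta = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=-1)   # B x N x N attention map
        o = torch.bmm(h, beta).view(b, c, hgt, wid)                 # weighted values o_j
        return x + self.gamma * o                                    # add back to the input
```

For example, SelfAttention(256)(torch.randn(1, 256, 16, 16)) returns a tensor of the same shape as its input.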
2. The depression-angle face image correction method based on the self-attention mechanism according to claim 1, characterized in that: in step 1, the multi-input fusion adversarial generation network is trained to obtain a trained multi-input fusion adversarial generation network; the specific implementation comprises the following substeps:
Step 1.1: making a training set comprising a front image data set IF and a depression angle image data set IP;
Step 1.2: inputting the depression angle pictures of the depression angle image data set IP into the multi-input fusion adversarial generation network, and calculating the pixel loss, identity preservation loss, adversarial loss, total variation regularization and total loss between the picture IG produced by the generator of the multi-input fusion adversarial generation network and the front image data set IF (a loss sketch follows this claim);
Step 1.3: using the Adam optimizer with default parameters, iteratively training the multi-input fusion adversarial generation network, and continuously optimizing the model through back propagation and gradient descent according to the error computed at each forward propagation, to finally obtain the trained multi-input fusion adversarial generation network;
Step 1.4: testing the test set with the trained model, comparing the obtained images with the front pictures in the front image data set IF, and calculating the rank-1 index.
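A hedged sketch of the loss terms named in step 1.2 and the optimizer setup of step 1.3. The claim only names the terms; the concrete forms used here (L1 for the pixel and identity terms, binary cross-entropy for the adversarial term) and the weights w_pix, w_id, w_adv and w_tv are assumptions, as are the placeholder tensors d_fake, id_feat_fake and id_feat_real.

```python
import torch
import torch.nn.functional as F
from torch import optim

def total_variation(img):
    """Total variation regularization over a batch of images of shape (B, C, H, W)."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

def generator_loss(I_G, I_F, d_fake, id_feat_fake, id_feat_real,
                   w_pix=10.0, w_id=1.0, w_adv=0.1, w_tv=1e-4):  # weights are assumptions
    pixel_loss = F.l1_loss(I_G, I_F)                       # pixel loss against the frontal target
    identity_loss = F.l1_loss(id_feat_fake, id_feat_real)  # identity preservation loss on ID features
    adv_loss = F.binary_cross_entropy_with_logits(         # adversarial loss from discriminator scores
        d_fake, torch.ones_like(d_fake))
    tv_loss = total_variation(I_G)                         # total variation regularization
    return w_pix * pixel_loss + w_id * identity_loss + w_adv * adv_loss + w_tv * tv_loss

# Step 1.3: Adam with default parameters; generator and discriminator are
# placeholders for the networks defined elsewhere in the patent.
# optimizer_G = optim.Adam(generator.parameters())
# optimizer_D = optim.Adam(discriminator.parameters())
```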
3. The depression-angle face image correction method based on the self-attention mechanism according to claim 1, characterized in that: in step 1, each control gate of the ConvGRU module is preceded by a weight map, and the feature map is updated and retained according to the weight map.
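One possible reading of claim 3, sketched as a ConvGRU cell in which a weight map is computed before each control gate and modulates the gate input, so the feature map is updated or retained according to the weight maps; the exact placement and form of the weight maps are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class WeightedConvGRUCell(nn.Module):
    """ConvGRU cell in which a weight map is computed before each control gate
    and used to decide how much of the feature map is updated or retained."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.weight_z = nn.Conv2d(2 * channels, 1, kernel_size, padding=pad)  # weight map for update gate
        self.weight_r = nn.Conv2d(2 * channels, 1, kernel_size, padding=pad)  # weight map for reset gate
        self.conv_z = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.conv_r = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.conv_h = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x, h_prev):
        xh = torch.cat([x, h_prev], dim=1)
        z = torch.sigmoid(self.conv_z(torch.sigmoid(self.weight_z(xh)) * xh))  # update gate
        r = torch.sigmoid(self.conv_r(torch.sigmoid(self.weight_r(xh)) * xh))  # reset gate
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h_prev], dim=1)))   # candidate state
        return (1 - z) * h_prev + z * h_tilde                                   # retain vs. update
```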
4. The depression-angle face image correction method based on the self-attention mechanism according to any one of claims 1 to 3, characterized in that: in step 1, the multi-input fusion coding module comprises four convolution layers and four ConvGRU units for integral fusion, with one ConvGRU unit corresponding to each convolution layer.
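A sketch of the convolutional trunk of the multi-input fusion coding module, using the kernel sizes 7/5/3/3 and strides 1/2/2/2 stated in the system claim below; the channel widths, padding and normalization type are assumptions, and the residual blocks and the per-layer ConvGRU fusion units are omitted for brevity.

```python
import torch
import torch.nn as nn

def conv_block(in_c, out_c, k, s, with_norm=False):
    layers = [nn.Conv2d(in_c, out_c, kernel_size=k, stride=s, padding=k // 2)]
    if with_norm:  # third and fourth layers add a normalization layer and an activation layer
        layers += [nn.InstanceNorm2d(out_c), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

class MultiInputEncoder(nn.Module):
    """Convolutional trunk of the multi-input fusion coding module: kernel sizes 7/5/3/3
    and strides 1/2/2/2 follow the claim; channel widths and norm type are assumptions."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.layer1 = conv_block(in_channels, 64, k=7, s=1)
        self.layer2 = conv_block(64, 128, k=5, s=2)
        self.layer3 = conv_block(128, 256, k=3, s=2, with_norm=True)
        self.layer4 = conv_block(256, 512, k=3, s=2, with_norm=True)

    def forward(self, x):
        f1 = self.layer1(x)
        f2 = self.layer2(f1)
        f3 = self.layer3(f2)
        f4 = self.layer4(f3)
        return f1, f2, f3, f4  # per-layer features, later fused across the C input pictures
```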
5. A depression angle human face image correction system based on a self-attention mechanism is characterized by comprising the following modules:
module 1 is used for constructing a multi-input fusion adversarial generation network based on an attention mechanism;
the multi-input fusion adversarial generation network comprises a multi-input fusion coding module, a self-attention module, a single-layer fusion module, a multi-input fusion decoding module and a generative adversarial network discriminator module;
the multi-input fusion coding module comprises four convolution layers arranged in series: the first layer is a convolution layer with convolution kernel size 7 and stride 1; the second layer is a convolution layer with convolution kernel size 5 and stride 2; the third and fourth layers are convolution layers with convolution kernel size 3 and stride 2; a residual block is added after the first and second convolution layers, and a normalization layer, an activation layer and a residual block are added in sequence after the third and fourth convolution layers;
the self-attention module is used for constructing three feature maps f, g and h from the feature map F output by the multi-input fusion coding module through convolution kernels of size 1; matrix multiplication and a softmax operation are performed on feature map f and feature map g to obtain the attention feature map βi,j, βi,j is then multiplied by feature map h to obtain the weighted values oj, and the weighted values are added to the feature map F before it is output;
the single-layer fusion module is used for fusing, through C ConvGRU modules, the features of the C input pictures output by each convolution layer of the multi-input fusion coding module;
the multilayer fusion module is used for bringing the four single-layer fusion features G1, G2, G3 and G4 output by the single-layer fusion module to the same scale by passing each through a deconvolution layer, and then passing them through a ConvGRU module in the order G4, G3, G2, G1 to obtain the multilayer fusion feature; the multilayer fusion feature then passes through a convolution layer with convolution kernel size 3 and stride 2 and two fully connected layers to obtain the overall feature;
the multi-input fusion decoding module consists of four deconvolution layers, two self-attention layers and two convolution layers; it is used for adding Gaussian noise information to the overall feature output by the multilayer fusion module and reconstructing it to obtain a new feature F1, and for up-sampling the feature F1 into three features F2, F3 and F4 of different scales before the deconvolution operations; the input of the first deconvolution layer of the multi-input fusion decoding module is the up-sampled output of the fourth layer of the multi-input fusion coding module after passing through a residual block, fused with F1; the input of the second deconvolution layer of the multi-input fusion decoding module is the fusion of the output of the previous deconvolution layer after passing through a residual block, F2, and the output of the third convolution layer of the multi-input fusion coding module after passing through a residual block; the input of the third deconvolution layer of the multi-input fusion decoding module is the fusion of four values: the output of the previous deconvolution layer after passing through the self-attention module and a residual block, F3, the cross-layer input from the second convolution layer of the multi-input fusion coding module, and the input picture resized to a certain size; the input of the fourth deconvolution layer of the multi-input fusion decoding module is the fusion of the output of the self-attention module after passing through a residual block, the output of the first convolution layer of the multi-input fusion coding module after passing through the parameter block, and the input picture; the refined corrected face picture is output through the two convolution layers after the fourth deconvolution layer of the multi-input fusion decoding module; after the input feature maps pass through this unit, each feature map has a weight map representing the degree of association of each part of the feature map;
the generative adversarial network discriminator module consists of seven convolution layers, with residual blocks added to the penultimate and antepenultimate layers (a discriminator sketch follows this claim);
and module 2 is used for inputting the depression angle face image to be corrected into the multi-input fusion adversarial generation network to obtain a refined corrected face picture.
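A hedged sketch of the generative adversarial network discriminator module: seven convolution layers with residual blocks inserted near the end. Only the seven-convolution structure and the late residual blocks follow the claim; the channel widths, strides, activation and the patch-level scoring head are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain 3x3 residual block; the internal structure is an assumption."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Discriminator(nn.Module):
    """Seven-convolution discriminator with residual blocks after the antepenultimate
    and penultimate convolutions; channel widths and strides are assumptions."""
    def __init__(self, in_channels=3):
        super().__init__()
        widths = [64, 64, 128, 128, 256, 256, 512]  # assumed channel progression
        layers, c_in = [], in_channels
        for i, c_out in enumerate(widths):
            stride = 2 if i % 2 == 0 else 1
            layers += [nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1), nn.LeakyReLU(0.2)]
            if i in (4, 5):                          # residual blocks near the end, per the claim
                layers.append(ResidualBlock(c_out))
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.score = nn.Conv2d(c_in, 1, kernel_size=3, padding=1)  # patch-level real/fake score

    def forward(self, x):
        return self.score(self.features(x))
```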
CN202110899936.2A 2021-08-06 2021-08-06 Depression angle face image correction method and system based on self-attention mechanism Active CN113706404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110899936.2A CN113706404B (en) 2021-08-06 2021-08-06 Depression angle face image correction method and system based on self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110899936.2A CN113706404B (en) 2021-08-06 2021-08-06 Depression angle face image correction method and system based on self-attention mechanism

Publications (2)

Publication Number Publication Date
CN113706404A true CN113706404A (en) 2021-11-26
CN113706404B CN113706404B (en) 2023-11-21

Family

ID=78651853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110899936.2A Active CN113706404B (en) 2021-08-06 2021-08-06 Depression angle face image correction method and system based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN113706404B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114639156A (en) * 2022-05-17 2022-06-17 武汉大学 Depression angle face recognition method and system based on axial attention weight distribution network
CN115731243A (en) * 2022-11-29 2023-03-03 北京长木谷医疗科技有限公司 Spine image segmentation method and device based on artificial intelligence and attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902667A (en) * 2019-04-02 2019-06-18 电子科技大学 Human face in-vivo detection method based on light stream guide features block and convolution GRU
CN110059602A (en) * 2019-04-10 2019-07-26 武汉大学 A kind of vertical view face antidote based on orthographic projection eigentransformation
US20200265219A1 (en) * 2017-09-18 2020-08-20 Board Of Trustees Of Michigan State University Disentangled representation learning generative adversarial network for pose-invariant face recognition
CN112418074A (en) * 2020-11-20 2021-02-26 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN112818850A (en) * 2021-02-01 2021-05-18 华南理工大学 Cross-posture face recognition method based on progressive neural network and attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265219A1 (en) * 2017-09-18 2020-08-20 Board Of Trustees Of Michigan State University Disentangled representation learning generative adversarial network for pose-invariant face recognition
CN109902667A (en) * 2019-04-02 2019-06-18 电子科技大学 Human face in-vivo detection method based on light stream guide features block and convolution GRU
CN110059602A (en) * 2019-04-10 2019-07-26 武汉大学 A kind of vertical view face antidote based on orthographic projection eigentransformation
CN112418074A (en) * 2020-11-20 2021-02-26 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN112818850A (en) * 2021-02-01 2021-05-18 华南理工大学 Cross-posture face recognition method based on progressive neural network and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG ZHUANG; WU BIN; LIAN WEIWEN; HAN XING: "Face recognition based on attention mechanism and deep identity mapping", Transducer and Microsystem Technologies, no. 09, pages 150 - 153 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114639156A (en) * 2022-05-17 2022-06-17 武汉大学 Depression angle face recognition method and system based on axial attention weight distribution network
CN114639156B (en) * 2022-05-17 2022-07-22 武汉大学 Depression angle face recognition method and system based on axial attention weight distribution network
CN115731243A (en) * 2022-11-29 2023-03-03 北京长木谷医疗科技有限公司 Spine image segmentation method and device based on artificial intelligence and attention mechanism
CN115731243B (en) * 2022-11-29 2024-02-09 北京长木谷医疗科技股份有限公司 Spine image segmentation method and device based on artificial intelligence and attention mechanism

Also Published As

Publication number Publication date
CN113706404B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
CN108537743A (en) A kind of face-image Enhancement Method based on generation confrontation network
CN107292225B (en) Face recognition method
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN113706404B (en) Depression angle face image correction method and system based on self-attention mechanism
CN112801015A (en) Multi-mode face recognition method based on attention mechanism
CN113379655B (en) Image synthesis method for generating antagonistic network based on dynamic self-attention
CN115050064A (en) Face living body detection method, device, equipment and medium
CN113205002B (en) Low-definition face recognition method, device, equipment and medium for unlimited video monitoring
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN117079098A (en) Space small target detection method based on position coding
CN114596622A (en) Iris and periocular antagonism adaptive fusion recognition method based on contrast knowledge drive
CN111444957B (en) Image data processing method, device, computer equipment and storage medium
Zhao et al. [Retracted] Hybrid Depth‐Separable Residual Networks for Hyperspectral Image Classification
CN112836605B (en) Near-infrared and visible light cross-modal face recognition method based on modal augmentation
CN110728238A (en) Personnel re-detection method of fusion type neural network
CN114429646A (en) Gait recognition method based on deep self-attention transformation network
CN118053232A (en) Enterprise safety intelligent management system and method thereof
CN117934849A (en) Deep learning-based RGB-D image semantic segmentation method
CN113450297A (en) Fusion model construction method and system for infrared image and visible light image
CN113221683A (en) Expression recognition method based on CNN model in teaching scene
CN112907692A (en) SFRC-GAN-based sketch-to-face reconstruction method
CN109583406B (en) Facial expression recognition method based on feature attention mechanism
CN111461061A (en) Pedestrian re-identification method based on camera style adaptation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant