CN116664435A - Face restoration method based on multi-scale face analysis map integration - Google Patents
Face restoration method based on multi-scale face analysis map integration
- Publication number
- CN116664435A (application number CN202310643998.6A)
- Authority
- CN
- China
- Prior art keywords
- face
- image
- network
- restoration
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T5/77 Retouching; Inpainting; Scratch removal (under G06T5/00 Image enhancement or restoration)
- G06N3/0455 Auto-encoder networks; Encoder-decoder networks
- G06N3/0464 Convolutional networks [CNN, ConvNet]
- G06N3/08 Learning methods
- G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06T2207/20081 Training; Learning
- G06T2207/20084 Artificial neural networks [ANN]
- G06T2207/20221 Image fusion; Image merging
- G06T2207/30201 Face
- Y02T10/40 Engine management systems
Abstract
The invention discloses a face restoration method based on the integration of multi-scale face analysis maps, belonging to the technical fields of computer vision, machine learning and the like. First, the invention builds a base network on an encoder-decoder structure and adds skip connections equipped with channel attention modules, so that the decoder can fully exploit the effective information contained in the extracted feature maps for face restoration. Meanwhile, the face analysis map is treated as a style image and fused into the generated face image. Finally, the invention adds to training a loss function that helps preserve identity information, constraining the identity of the restored face at the feature level extracted by a face recognition network, and adds an adversarial loss and a multi-scale discriminator, so that the network further generates face images with more realistic details.
Description
Technical Field
The invention belongs to the technical fields of computer vision, machine learning and the like, and particularly relates to a face restoration method based on deep learning.
Background
Face restoration is an important problem in the field of computer vision. Face restoration technology recovers high-quality face images from low-quality ones, and lays the foundation for downstream high-level applications such as face recognition and expression recognition. At present, face restoration technology is applied in everyday image-processing software, in the restoration of old photographs and old films, in digital-camera zoom compensation, in intelligent surveillance systems, and more.
Most current face restoration methods can be divided into two categories according to whether GAN prior information is used. The first category uses a pre-trained StyleGAN model as a GAN prior and designs the network around it. Its advantage is superior visual quality: even extremely low-quality inputs can yield very high-definition outputs, which has made it a popular approach in recent years. However, the core idea of these GAN-prior-based methods is to encode the degraded face image into the latent space of the pre-trained GAN, which requires designing complex modules to modify the already-packaged GAN prior, or fine-tuning the GAN prior with additional training. The resulting models therefore have many parameters, placing high demands on the GPU memory and RAM of the training equipment. Moreover, the latent space of the pre-trained GAN is low-dimensional and its spatial expressiveness is limited, so the face structure of the degraded image cannot be fully captured and the face structure of the restored image is unnatural.
The second category comprises face restoration networks that use other prior information or none at all. These methods have fewer network parameters and are easy to train, but they share two common problems. (1) The output high-definition image does not preserve the identity information of the original input. The ultimate purpose of face restoration is to serve deeper visual tasks such as face detection and recognition, so the identity information of the face in the image is critical both to humans and to network models. However, most existing methods focus only on the structural design of the generator network, aiming at images with better visual quality, and neglect that the information in the low-quality input must be fully extracted and effectively utilized; otherwise, the identity of the restored image becomes inconsistent with that of the input. (2) Face structure information is not effectively utilized. The structural information used in such networks, e.g. face analysis maps or facial landmark maps, is usually estimated from the low-quality input image, whose information is coarse; if a simple face-structure estimation network is applied directly, the estimated prior map is inaccurate, which in turn directly degrades the restoration result. In addition, existing approaches typically exploit face structure information through direct feature concatenation, attention mechanisms and the like, which cannot make full use of global and local information, so the help that face structure information provides to face restoration is very limited.
Based on the above analysis, we wish to design a face restoration network that satisfies the following three points: (1) the network model has few parameters and is easy to train, i.e. it uses no GAN prior; (2) it makes effective use of the pixel, structure and other information in the low-definition input, i.e. it fully extracts the pixel information of the low-definition face and reasonably exploits the facial prior information; (3) its loss functions are designed to help preserve identity information and maintain image fidelity.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a face restoration network that can fully exploit face structure and identity information, thereby improving the fidelity of the restored face. First, the invention builds a base network on the encoder-decoder structure commonly used for image restoration, in which the encoder gradually extracts features from the input low-definition face image and the decoder gradually upsamples back to the input resolution. Second, skip connections equipped with channel attention modules are added to the base network so that the decoder can fully exploit the effective information contained in the extracted feature maps for face restoration. Meanwhile, the face analysis map is chosen as the structural prior; to integrate this prior into the network efficiently, the AdaIN method from style transfer is borrowed so that the face analysis map acts as a style image fused into the generated face image. Finally, the invention adds to training a loss function that helps preserve identity information, constraining the identity of the restored face at the feature level extracted by a face recognition network, and adds an adversarial loss and a multi-scale discriminator, so that the network further generates face images with more realistic details.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A face restoration method based on an encoder-decoder structure with multi-scale face analysis map integration, characterized by comprising the following steps: resizing the original low-definition face image to a face image with a resolution of 512×512, and inputting it into a face restoration network to obtain the restored high-definition face image with a resolution of 512×512.
The face restoration network comprises an initialization layer, a backbone network and an RGB conversion layer which are sequentially connected.
The initialization layer adjusts the face image with a resolution of 512×512 to obtain a 512×512×32 feature map F_0.
The backbone network adopts an encoder-decoder structure with jump connection added;
wherein the encoder-decoder structure comprises four downsampling blocks and four upsampling blocks connected in sequence; the feature map F_0 is gradually reduced to a resolution of 32×32 by the four downsampling blocks and then gradually enlarged by the four upsampling blocks, finally generating a feature map F_4^Up of the same size as F_0;

the skip connections operate as follows: the feature map F_i^Down output by a downsampling block is feature-fused, by vector splicing, with the upsampling-block output feature map of the same resolution, wherein the feature map F_1^Down output by the first downsampling block is feature-fused by splicing with itself; the feature map F'_i^Up obtained by feature fusion and channel-dimension reduction serves as the input to the next upsampling block;

the RGB conversion layer converts the 512×512×32 feature map F_4^Up into the restored high-definition face image with a resolution of 512×512.
Preferably, based on the backbone network, the feature map F'_i^Up is passed through a style modulation branch to obtain a feature map F''_i^Up, which serves as the input feature map of the next upsampling block, so that the structural information in the face analysis map is integrated into the face restoration network in a lightweight manner;

specifically, the style modulation branch comprises a convolution layer, an activation layer and a two-branch structure connected in sequence; the two-branch structure consists of two identical convolution layers, the two branches respectively outputting the style parameters a_i and b_i, which are merged into the face restoration network through the following formula:

F''_i^Up = a_i · (F'_i^Up − μ(F'_i^Up)) / σ(F'_i^Up) + b_i

where μ denotes the mean and σ denotes the standard deviation.
Preferably, a channel attention module is inserted in the skip connections to select the features that assist face restoration.
Preferably, a multi-scale discriminator is used to stabilize training when training the face restoration network.
The beneficial effects of the invention are as follows:
(1) A face restoration network based on an encoder-decoder structure is built, with skip connections equipped with a channel attention mechanism, so that the network can fully extract and utilize the effective information in the low-definition face.
(2) A style modulation branch that effectively exploits the face analysis map is added to the base restoration network, yielding a face restoration network with good performance.
(3) Loss functions are designed that help preserve texture and identity details.
Drawings
Fig. 1 is a schematic diagram of a backbone network and a hop connection network.
Fig. 2 is a schematic diagram of a style modulation branch added to the network architecture of fig. 1.
Fig. 3 is a schematic diagram of the face restoration method based on the encoder-decoder structure with multi-scale face analysis map integration.
Fig. 4 is a schematic diagram of the channel attention structure.
Fig. 5 is a schematic diagram of the face analysis map integration module.
Fig. 6 is a qualitative comparison with state-of-the-art face restoration methods on the Helen dataset.
Fig. 7 is a quantitative comparison with state-of-the-art face restoration methods on the Helen dataset.
Detailed Description
A detailed description of each aspect of the technical scheme of the present invention is given below.
(1) Building the encoder-decoder face restoration network with skip connections
The backbone network of the present invention is an encoder-decoder structure with added skip connections.
The encoder-decoder structure consists of 4 downsampling blocks and 4 upsampling blocks. The low-definition face image is first resized to 512×512 and input into the backbone network; its resolution is gradually reduced to 32×32 through the 4 downsampling blocks and then gradually increased through the 4 upsampling blocks, finally generating an output of the same size as the input image.
The features of different network layers differ: shallow layers attend to texture features while deep layers attend to global features, and both are important for the face restoration task. In addition, the downsampling operations inevitably lose some edge features. Therefore, to exploit more comprehensive information from the low-definition image, skip connections are added to the backbone network, and feature maps of the same resolution in the encoder and decoder are vector-spliced to achieve feature fusion.
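For illustration only, a minimal PyTorch sketch of such a skip-connected encoder-decoder backbone follows; the channel widths, the strided/transposed convolutions, and the 1×1 channel-reduction convolutions are assumptions made for the sketch, not the exact configuration of the disclosed network.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Halves spatial resolution with a strided convolution (assumed design)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True))

    def forward(self, x):
        return self.body(x)

class UpBlock(nn.Module):
    """Doubles spatial resolution with a transposed convolution (assumed design)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True))

    def forward(self, x):
        return self.body(x)

class Backbone(nn.Module):
    """512x512x32 feature map -> 32x32 bottleneck -> 512x512x32, with
    concatenation skip connections and 1x1 channel-reduction convolutions."""
    def __init__(self):
        super().__init__()
        enc = [32, 64, 128, 256, 512]                  # assumed channel widths
        dec = [512, 256, 128, 64, 32]
        self.downs = nn.ModuleList(DownBlock(enc[i], enc[i + 1]) for i in range(4))
        self.fuse = nn.ModuleList(nn.Conv2d(2 * dec[i], dec[i], 1) for i in range(4))
        self.ups = nn.ModuleList(UpBlock(dec[i], dec[i + 1]) for i in range(4))

    def forward(self, f0):
        skips, x = [], f0
        for down in self.downs:                        # 512->256->128->64->32 px
            x = down(x)
            skips.append(x)                            # encoder feature maps
        for i, (fuse, up) in enumerate(zip(self.fuse, self.ups)):
            skip = skips[3 - i]                        # same-resolution feature;
            x = fuse(torch.cat([x, skip], dim=1))      # at i=0 this is x itself,
            x = up(x)                                  # i.e. bottleneck self-fusion
        return x                                       # same size as f0

# sanity check: one 512x512x32 initialization feature map in, same size out
out = Backbone()(torch.rand(1, 32, 512, 512))
assert out.shape == (1, 32, 512, 512)
```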
(2) Channel attention module of the face restoration network
Not all features extracted by the encoder provide effective information for face restoration; for example, the low-definition input may contain invalid information such as noise and artifacts, and introducing all of it can cause blotches and other poor-restoration phenomena in the output image. To solve this problem, the invention applies a channel attention module (CAM) to the feature map extracted by each downsampling block in the encoder to learn which features contribute to face restoration, and then concatenates the result with the feature map generated by the corresponding upsampling block in the decoder. The working mechanism of channel attention is to model the importance of individual feature channels and, through learning, to enhance or suppress different channels, acting here like a "filter". This embodiment is described using the channel attention module SENet (Squeeze-and-Excitation Network) as an example.
As shown in Fig. 4, the feature map F_i^Down output by a downsampling block is fed into a two-branch structure. In the branch that computes the weights, a pooling layer first collapses the spatial dimensions of the feature map, i.e. each two-dimensional feature map is aggregated into a single scalar; this is equivalent to pooling over the global receptive field, and the number of feature channels is unchanged. Then two fully connected layers model the correlation between channels and generate a weight for each feature channel; the weights are normalized by an activation layer. Finally, the weights are multiplied with the feature map F_i^Down, thereby changing the importance of the different channels.
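A minimal sketch of an SE-style channel attention module consistent with the above description is given below; the reduction ratio r = 16 is an assumed hyper-parameter.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze (global average pooling over the
    spatial dimensions), excitation (two fully connected layers), and a
    sigmoid that normalizes the per-channel weights."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # HxW -> 1x1 per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                  # per-channel descriptor
        w = self.fc(w).view(b, c, 1, 1)              # learned channel weights
        return x * w                                 # enhance/suppress channels
```

In the backbone sketched above, this module would be applied to each F_i^Down before the skip concatenation.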
(3) Adding the style modulation branch to the face restoration network
To let the face analysis map provide information that helps restore the face while remaining sufficiently lightweight, the invention uses the classical adaptive instance normalization method (AdaIN) from style transfer. Specifically, a face analysis map is taken as input, and the style modulation branch comprises a convolution layer, an activation layer and a two-branch structure connected in sequence; the two-branch structure consists of two identical convolution layers, the two branches respectively outputting the style parameters a_i and b_i, which are merged into the face restoration network through the following formula:

F''_i^Up = a_i · (F'_i^Up − μ(F'_i^Up)) / σ(F'_i^Up) + b_i

where μ denotes the mean and σ denotes the standard deviation.
AdaIN builds on a feed-forward neural network, generates quickly, and supports transfer of arbitrary styles. StyleGAN uses AdaIN to blend the style of a real face into a generated face in order to synthesize realistic faces, showing that AdaIN can fuse a style face image into a synthesized face image well. In this implementation, unlike AdaIN, which uses the mean and standard deviation of the style input as the style parameters, the two style parameters of the invention are learned adaptively by simple convolution layers, because the task here is not simple style transfer: the face analysis map is expected to provide the face with additional prior knowledge, i.e. information beneficial to the final task. Each face analysis map input used by the face restoration network is extracted from the output of the previous upsampling block; a higher-definition input yields a more accurate face analysis map, and thus a higher-quality generated face image.
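The style modulation branch described above might be sketched as follows; the kernel sizes, channel widths, and the nearest-neighbor resizing of the face analysis map are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleModulation(nn.Module):
    """AdaIN-style modulation whose scale a_i and shift b_i are learned from
    the face analysis map by convolutions, rather than taken directly as the
    style input's statistics."""
    def __init__(self, parsing_ch, feat_ch):
        super().__init__()
        self.shared = nn.Sequential(                 # convolution + activation
            nn.Conv2d(parsing_ch, feat_ch, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True))
        self.to_a = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)  # style scale a_i
        self.to_b = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)  # style shift b_i

    def forward(self, feat, parsing):
        # resize the analysis map to the feature resolution (assumed handling)
        parsing = F.interpolate(parsing, size=feat.shape[-2:], mode="nearest")
        h = self.shared(parsing)
        a, b = self.to_a(h), self.to_b(h)
        mu = feat.mean(dim=(2, 3), keepdim=True)           # channel-wise mean
        sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-8  # channel-wise std
        return a * (feat - mu) / sigma + b                 # modulated F''_i^Up
```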
(4) The model is trained and its effectiveness is verified experimentally.
The loss terms of the face restoration network mainly comprise the following five parts:
(a) Texture detail loss
The invention adds, as the texture loss, the difference between the Gram matrices of corresponding semantic regions in the generated image and the high-definition image. Specifically, features are extracted with VGG19 and the loss is computed on its relu3_1, relu4_1 and relu5_1 features. Denoting the m-th layer feature of VGG19 as φ_m and the parsing mask of the n-th of the 19 regions as M_n, L_ss is:

L_ss = Σ_m Σ_{n=1}^{19} || G(φ_m(Î_H), M_n) − G(φ_m(I_H), M_n) ||

where Î_H is the generated face image, I_H is the real high-definition face image, and G(·) computes the Gram matrix of the semantic region M_n within the feature φ_m:

G(φ, M) = (φ ⊙ M)(φ ⊙ M)^T / (Σ M + ε)

where ε is a constant added to avoid division by zero.
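A sketch of such a region-wise Gram texture loss follows, assuming torchvision's VGG19 (where relu3_1, relu4_1 and relu5_1 correspond to feature indices 11, 20 and 29), inputs already normalized for VGG, and the mask-sum normalization implied by the ε term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class TextureLoss(nn.Module):
    """Region-wise Gram-matrix loss on VGG19 relu3_1/relu4_1/relu5_1 features."""
    def __init__(self):
        super().__init__()
        feats = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in feats.parameters():
            p.requires_grad_(False)
        # slices ending at relu3_1 (idx 11), relu4_1 (idx 20), relu5_1 (idx 29)
        self.slices = nn.ModuleList([feats[:12], feats[12:21], feats[21:30]])

    @staticmethod
    def gram(phi, mask, eps=1e-8):
        """Gram matrix of one masked semantic region; the mask-sum
        normalization is an assumption consistent with the eps term."""
        f = (phi * mask).flatten(2)                  # (B, C, H*W)
        return f @ f.transpose(1, 2) / (mask.sum() + eps)

    def forward(self, fake, real, masks):            # masks: 19 (B,1,H,W) maps
        loss, x, y = 0.0, fake, real
        for sl in self.slices:
            x, y = sl(x), sl(y)                      # features phi_m
            for m in masks:
                mm = F.interpolate(m, size=x.shape[-2:])   # resize mask M_n
                loss = loss + (self.gram(x, mm) - self.gram(y, mm)).abs().mean()
        return loss
```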
(b) Reconstruction loss
Reconstruction loss, also called generation loss, measures the difference between the output image of the generator network and the real image. The reconstruction loss used in the invention combines the mean square error (MSE) in pixel space and in feature space, and aims to constrain the network output Î_H to be as close as possible to the real image I_H:

L_rec = Σ_{i=1}^{4} ( || Î_i − I_i ||_2^2 + Σ_k || D_k(Î_i) − D_k(I_i) ||_2^2 )

where Î_i is the intermediate face image generated by the i-th upsampling block, I_i is the real high-definition face image downsampled to the corresponding resolution, and the second term of L_rec is the multi-scale feature-matching loss, which matches the discriminator features D_k(·) of Î_i and I_i.
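A sketch of this multi-scale reconstruction loss; the equal weighting of the pixel term and the feature-matching term is an assumption.

```python
import torch

def reconstruction_loss(fake_pyramid, real_pyramid,
                        fake_feats=None, real_feats=None):
    """Pixel-space MSE between each upsampling block's intermediate image and
    the ground truth at that resolution, plus an optional multi-scale
    feature-matching MSE on discriminator features."""
    loss = sum(torch.mean((f - r) ** 2)
               for f, r in zip(fake_pyramid, real_pyramid))
    if fake_feats is not None:
        loss = loss + sum(torch.mean((ff - rf) ** 2)
                          for ff, rf in zip(fake_feats, real_feats))
    return loss
```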
(c) Identity loss
To prevent the situation in which the qualitative and quantitative indices of the output are good but its identity hardly matches that of the original input, the invention introduces an identity loss that constrains the distance between the high-dimensional features of the generated image and the real image, thereby improving identity similarity. Concretely, the high-dimensional features of the generated image are extracted with the pre-trained face recognition model ArcFace and the gap is measured with the Euclidean distance:

L_id = || φ(Î_H) − φ(I_H) ||_2

where φ(·) denotes the pre-trained face recognition model ArcFace.
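A sketch of the identity loss, assuming `arcface` is a frozen pre-trained ArcFace model that maps a batch of face images to one embedding per image.

```python
import torch

def identity_loss(arcface, fake, real):
    """Euclidean distance between ArcFace embeddings of the restored face
    and the ground-truth face."""
    with torch.no_grad():
        target = arcface(real)                # no gradient through the GT branch
    return torch.norm(arcface(fake) - target, p=2, dim=1).mean()
```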
(d) Adversarial loss
To give the face images generated by the EDSM network high fidelity, the invention designs a discriminator and corresponding loss functions, extending the network into EDSM-GAN. The specific loss is the non-saturating loss, whose goal is to maximize the probability that the generated image is judged real, thereby providing the generator with larger gradients early in GAN training:

L_GAN_D = Σ_i E[ −log D_i(I_i) ] + E[ −log(1 − D_i(Î_i)) ]
L_GAN_G = Σ_i E[ −log D_i(Î_i) ]

where L_GAN_D is the optimization target of the discriminator, L_GAN_G is the optimization target of the generator, i indexes the i-th sub-discriminator, D_i(·) denotes the discriminator, Î_i is the intermediate face image generated by the i-th upsampling block, and I_i is the real high-definition face image at the corresponding resolution.
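The non-saturating losses can be written compactly with the identity −log sigmoid(x) = softplus(−x); a sketch over all sub-discriminators:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discs, fake_pyramid, real_pyramid):
    """Non-saturating discriminator loss summed over all sub-discriminators:
    softplus(-x) = -log(sigmoid(x)); softplus(x) = -log(1 - sigmoid(x))."""
    loss = 0.0
    for d, f, r in zip(discs, fake_pyramid, real_pyramid):
        loss = loss + F.softplus(-d(r)).mean() + F.softplus(d(f.detach())).mean()
    return loss

def generator_gan_loss(discs, fake_pyramid):
    """Non-saturating generator loss: maximize log D_i(fake) at every scale."""
    return sum(F.softplus(-d(f)).mean() for d, f in zip(discs, fake_pyramid))
```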
(e) Training targets
The EDSM-GAN is trained by alternately minimizing L_G and L_GAN_D, where:

L_G = λ_ss·L_ss + λ_rec·L_rec + λ_id·L_id + λ_adv·L_GAN_G

and λ_ss, λ_rec, λ_id and λ_adv are the weights of the texture detail loss, reconstruction loss, identity loss and adversarial loss, respectively.
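A sketch of the combined generator objective; the default λ values below are assumptions, as the actual weights are not disclosed.

```python
def generator_objective(l_ss, l_rec, l_id, l_gan_g,
                        lam_ss=1.0, lam_rec=1.0, lam_id=1.0, lam_adv=0.1):
    """L_G = lam_ss*L_ss + lam_rec*L_rec + lam_id*L_id + lam_adv*L_GAN_G."""
    return lam_ss * l_ss + lam_rec * l_rec + lam_id * l_id + lam_adv * l_gan_g
```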
In addition to the above loss terms, to generate more realistic details and improve the restoration result on subjective quality indices, this embodiment further addresses the difficulty GAN networks have in generating high-resolution images. Early in training, the generator is not yet capable of imitating the many fine details of high-resolution images, so the discriminator can distinguish real from fake images far too easily and fails to provide useful feedback to the generator; this amplifies the gradient problems of GAN training, destabilizing it and even collapsing the model. To circumvent this problem, the invention uses a multi-scale discriminator to stabilize training.
The multi-scale discriminator consists of several sub-discriminators; the input of each sub-discriminator is the intermediate face restoration image of the corresponding upsampling block together with the ground truth obtained by downsampling the high-definition image to the corresponding resolution. Each sub-discriminator can obtain only limited information from the image at its resolution, so different sub-discriminators solve discrimination tasks of different levels. For example, it is harder to tell a low-resolution generated image from the real one, so the low-resolution sub-discriminators play an important role in stabilizing the early stage of GAN training. As training proceeds, the focus gradually shifts to the higher-resolution sub-discriminators, so that all discriminators attain excellent discrimination ability and provide effective feedback to the face restoration network.
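A sketch of a multi-scale discriminator built from small PatchGAN-style sub-discriminators (the sub-discriminator architecture is assumed), together with a helper that builds the ground-truth pyramid by downsampling the high-definition image:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubDiscriminator(nn.Module):
    """Small PatchGAN-style sub-discriminator (architecture assumed)."""
    def __init__(self, c_in=3, base=64):
        super().__init__()
        layers, c = [], c_in
        for c_out in (base, base * 2, base * 4):
            layers += [nn.Conv2d(c, c_out, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            c = c_out
        layers += [nn.Conv2d(c, 1, 4, padding=1)]    # patch-level real/fake logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# one sub-discriminator per upsampling block
discriminators = nn.ModuleList(SubDiscriminator() for _ in range(4))

def real_pyramid(hq, sizes=(64, 128, 256, 512)):
    """Downsample the HQ ground truth to each sub-discriminator's resolution."""
    return [F.interpolate(hq, size=(s, s), mode="bilinear",
                          align_corners=False, antialias=True) for s in sizes]
```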
To show the visual quality of the face restoration model more comprehensively, three groups of low-definition inputs with degradation ranging from slight to severe were selected; Fig. 6 compares the qualitative results on the Helen dataset. When the degradation of the input image is slight, all face restoration algorithms recover results of good visual quality; when the degradation is severe, the performance of all algorithms declines to varying degrees. The method of Wan et al. shows no restorative effect on the low-definition input; GFPGAN and GCFSR perform poorly, retaining the color blocks of the low-definition image in the restored output; GPEN and the proposed EDSM-GAN restore better face structure and texture details.
Fig. 7 shows the comparison of image quality indices of different face restoration algorithms on the Helen dataset. The proposed network achieves the best results on all four image quality indices, demonstrating that the proposed method has good restoration performance.
Claims (4)
1. A face restoration method based on an encoder-decoder structure with multi-scale face analysis map integration, characterized by comprising: resizing the original low-definition face image to a face image with a resolution of 512×512 and inputting it into a face restoration network to obtain the restored high-definition face image with a resolution of 512×512;
the face restoration network comprises an initialization layer, a backbone network and an RGB conversion layer connected in sequence;
the initialization layer adjusts the face image with a resolution of 512×512 to obtain a 512×512×32 feature map F_0;
the backbone network adopts an encoder-decoder structure with added skip connections;
wherein the encoder-decoder structure comprises four downsampling blocks and four upsampling blocks connected in sequence; the feature map F_0 is gradually reduced to a resolution of 32×32 by the four downsampling blocks and then gradually enlarged by the four upsampling blocks, finally generating a feature map F_4^Up of the same size as F_0;
the skip connections operate as follows: the feature map F_i^Down output by a downsampling block is feature-fused, by vector splicing, with the upsampling-block output feature map of the same resolution, wherein the feature map F_1^Down output by the first downsampling block is feature-fused by splicing with itself; the feature map F'_i^Up obtained by feature fusion and channel-dimension reduction serves as the input to the next upsampling block;
the RGB conversion layer converts the 512×512×32 feature map F_4^Up into the restored high-definition face image with a resolution of 512×512.
2. The face restoration method based on an encoder-decoder structure with multi-scale face analysis map integration according to claim 1, characterized in that: based on the backbone network, the feature map F'_i^Up is passed through a style modulation branch to obtain a feature map F''_i^Up, which serves as the input feature map of the next upsampling block, so that the structural information in the face analysis map is integrated into the face restoration network in a lightweight manner;
specifically, the style modulation branch comprises a convolution layer, an activation layer and a two-branch structure connected in sequence; the two-branch structure consists of two identical convolution layers, the two branches respectively outputting the style parameters a_i and b_i, which are merged into the face restoration network through the following formula:
F''_i^Up = a_i · (F'_i^Up − μ(F'_i^Up)) / σ(F'_i^Up) + b_i
where μ denotes the mean and σ denotes the standard deviation.
3. The face restoration method based on an encoder-decoder structure with multi-scale face analysis map integration according to claim 1 or 2, characterized in that: during the skip connection, a channel attention module is inserted for picking out features that assist face restoration.
4. The face restoration method based on an encoder-decoder structure with multi-scale face analysis map integration according to claim 3, characterized in that: when training the face restoration network, a multi-scale discriminator is used to stabilize the training.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310643998.6A | 2023-06-01 | 2023-06-01 | Face restoration method based on multi-scale face analysis map integration |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310643998.6A | 2023-06-01 | 2023-06-01 | Face restoration method based on multi-scale face analysis map integration |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116664435A | 2023-08-29 |
Family
ID=87714856
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310643998.6A | Face restoration method based on multi-scale face analysis map integration | 2023-06-01 | 2023-06-01 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN116664435A (en) |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117391995A | 2023-10-18 | 2024-01-12 | 山东财经大学 | Progressive face image restoration method, system, equipment and storage medium |
| CN118072332A | 2024-04-19 | 2024-05-24 | 西北工业大学 | Self-evolution zero sample target identification method based on sketch and text double prompt |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |