CN116109510A - Face image restoration method based on structure and texture dual generation - Google Patents
Face image restoration method based on structure and texture dual generation
- Publication number: CN116109510A
- Application number: CN202310141472.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T5/77
- G06N3/04 — Neural network architecture, e.g. interconnection topology
- G06N3/08 — Neural network learning methods
- G06V10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V40/161 — Human faces: detection; localisation; normalisation
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30201 — Face
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a face image restoration method based on structure and texture dual generation, relating to the technical field of image restoration. It restores damaged face images with a deep learning method, solves the problem of inconsistent structure and texture in restored face images, and improves the restoration of images with large damaged areas. The method comprises the following steps. Step S1: preprocess the input image to obtain the face image to be repaired. Step S2: build a face image restoration model based on structure and texture dual generation, and input the image obtained in step S1 into the model for training. Step S3: iterate the training until the network converges, obtaining the trained face image restoration model. Step S4: input the damaged face image into the trained face image restoration model to obtain the restored face image.
Description
Technical Field
The invention relates to the technical field of image restoration, and in particular to a face image restoration method based on structure and texture dual generation.
Background
Image restoration aims to restore the pixels of damaged areas in an image while keeping the filled image as consistent as possible with the original image at the visual and semantic levels. It is not only critical to computer vision tasks but also an important cornerstone for research on other image processing tasks. As one of its important branches, face restoration plays an important role in practical applications. Compared with general image restoration, human faces carry stronger semantics and more complex texture details, so the restoration process must consider not only the plausibility of the face structure but also the preservation of identity information.
Image restoration has made great progress, from early traditional methods to current deep learning-based methods. Traditional methods are only suitable for repairing single images with simple, small missing regions and lack semantic consistency. Deep learning-based methods have therefore become mainstream.
Pathak et al. first proposed Context Encoders, using an encoder-decoder network to extract features and output reconstruction results; this was also the first GAN-based restoration method. Iizuka et al. introduced a local-global dual discriminator on the basis of Context Encoders and used dilated convolution, proposing the GLCIC network. Yu et al. proposed the DeepFill network, which borrows or copies feature information from known background patches through a contextual attention mechanism to generate the missing foreground patches. Nazeri et al. designed EdgeConnect as a two-stage model: an edge generator first produces an edge sketch of the irregular missing region as a prior result, and an image inpainting network then fills the missing region based on that edge sketch.
However, these methods do not exploit structural and texture features jointly, resulting in inconsistent structure and texture in the output image. Defect repair involves both high-level semantic knowledge and low-level pixel information, and only by deeply fusing these two kinds of information can the repair quality approach that of the human visual system. To this end, Guo et al. proposed a novel dual-stream network for image restoration that couples structure-constrained texture synthesis with texture-guided structure reconstruction to obtain more reasonable outputs. Although this method improves the consistency between structure and texture, two problems remain: 1) the relationship between structure and texture is not fully considered, so the degree of consistency between them is limited; 2) the lack of context reasoning over the global and local pixel continuity of the image leads to structural distortion and texture blurring in the repaired image, especially when large areas are damaged. Addressing these two defects, this scheme proposes a face image restoration method based on structure and texture dual generation, which enhances the texture and structure consistency of face image restoration while achieving high-quality restoration of large damaged areas.
Disclosure of Invention
The invention aims to solve the above technical problems, and provides a face image restoration method based on structure and texture dual generation.
The invention adopts the following technical scheme to achieve this aim:
a face image restoration method based on structure and texture dual generation comprises the following steps:
step S1: preprocessing an input image to obtain a face image to be repaired;
step S2: establishing a face image restoration model generated based on structure and texture dual, and inputting the image obtained in the step S1 into the image restoration model for training;
step S3: the face image restoration model is obtained by continuous iterative training until the network finally converges;
step S4: inputting the damaged face image into the trained face image restoration model to obtain the restored face image.
As an optional technical solution, in step S2, the face image restoration model is a generative adversarial network structure consisting of a generator and a discriminator;
the generator comprises a dual encoder-decoder and a feature fusion part, and the discriminator consists of a texture discriminator and a structure discriminator.
As an optional technical solution, the convolution layers of the dual encoder-decoder use gated convolution to encode and decode features, and a batch normalization layer is added after each gated convolution layer, expressed as:

Gating = ΣΣ W_g · I

Feature = ΣΣ W_f · I

Output = BN(φ(Feature) ⊙ σ(Gating))

where I denotes the input feature map; Gating denotes the gating map; Feature denotes the feature map after convolution; Output denotes the final output feature map; W_g and W_f denote different convolution kernels; φ is the LeakyReLU activation function; σ is the Sigmoid activation function; ⊙ denotes element-wise multiplication; and BN denotes batch normalization. Compared with hard gating, the gating value of gated convolution lies between 0 and 1, and the closer the gating value is to 1, the more valid the pixels.
As an optional technical solution, in the dual encoder-decoder:
during the encoding stage, the left and right encoders respectively receive the damaged image and the damaged structure image to encode texture and structural features;
during the decoding stage, the texture decoder synthesizes structure-constrained texture by borrowing structural features from the structure encoder, while the structure decoder recovers texture-guided structure by retrieving texture features from the texture encoder.
As an optional technical solution, the discriminator is a dual-stream discriminator with a texture branch and a structure branch, and the structure branch is additionally provided with an edge detector for edge extraction; the two discriminator backbones consist of standard convolutions, and the edge detector consists of convolutional neural network residual blocks.
As an optional technical solution, the preprocessing in step S1 is:
first, resize the image to 256×256 by cropping and padding;
then, obtain a binary mask M from the irregular mask dataset provided by NVIDIA to artificially damage the image, producing a damaged image, and convert the damaged image to grayscale to obtain a damaged gray image;
finally, extract face contour information from the damaged gray image with the Canny edge detection algorithm to obtain a damaged edge map.
As an optional technical solution, step S3 uses the CelebA-HQ dataset for training, comprising training images and test images; the experimental equipment is an NVIDIA V100, and the whole model is implemented in PyTorch. When training the model, the batch size is set to 8 and optimization is performed with the Adam optimizer.
As an optional technical solution, initial training is first performed with a learning rate of 2×10⁻⁴, and the model is then fine-tuned with a learning rate of 5×10⁻⁵; the model is trained using a joint loss comprising reconstruction loss, perceptual loss, style loss, and adversarial loss.
As an optional technical solution, the four loss functions are as follows.

Reconstruction loss: L_rec = E[ ||I_out − I_gt||_1 ]

where E denotes expectation, I_out denotes the generated picture, I_gt denotes the real picture, and ||·||_1 denotes the L1 norm.

Perceptual loss, which uses a VGG-16 network pre-trained on ImageNet to simulate human visual perception of image quality: L_perc = E[ Σ_i ||φ_i(I_out) − φ_i(I_gt)||_1 ]

where φ_i denotes the activation map of the i-th pooling layer of VGG-16; in practice, i ∈ [1, 3].

Style loss: L_style = E[ Σ_i ||G_i^φ(I_out) − G_i^φ(I_gt)||_1 ]

where G_i^φ denotes the Gram matrix of the activation map φ_i.

Adversarial loss: L_adv = min_G max_D E[log D(I_gt, E_gt)] + E[log(1 − D(I_out, E_out))]

where G denotes the generator, D denotes the discriminator, E_gt denotes the real edge map, and E_out denotes the generated edge map.
As an optional technical solution, in order to guide the dual encoder-decoder to generate structural and texture features, an intermediate loss is also introduced on F_s and F_t:

L_inter = L_structure + L_texture = BCE(E_gt, P_s(F_s)) + l_1(I_gt, P_t(F_t))

where I_gt denotes the real picture, E_gt denotes the real edge map, and P_s and P_t are mapping functions composed of convolutional residual blocks that map the structural features F_s and texture features F_t to the corresponding edge map and RGB picture, respectively.
The beneficial effects of the invention are as follows:
1. Current image restoration models cannot simultaneously and fully exploit structural and texture information, so the restored image suffers from inconsistent structure and texture; the dual-generation design of the invention fuses the two kinds of features and improves this consistency.
2. Existing research still suffers from structural distortion or texture blurring when repairing large irregular missing areas, mainly because the context of the image is not fully utilized, leaving the link from local features to overall consistency insufficient. The invention makes full use of the context information of the image and achieves a better repair effect when the image is damaged over a large area.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a network structure diagram of the method of the present invention.
FIG. 3 is the gated aggregated contextual transformation (Gated Aggregated Contextual Transformations, GACT) module in the generator of the method of the present invention.
FIG. 4 is the adaptive dual feature fusion module (Adaptive Dual Feature Fusion, ADFF) in the generator of the method of the present invention.
FIG. 5 is a schematic diagram showing the qualitative comparison effect of the method of the present invention with other methods.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
A face image restoration method based on structure and texture dual generation, referring to fig. 1, comprises the following steps:
and step S1, preprocessing an input image to obtain a face image to be repaired. The size of the image is first adjusted, and the image is adjusted to 256×256 size by clipping and filling. And then, acquiring a binarization mask M from the irregular mask data set provided by NVIDIA to artificially damage the image, so as to obtain a damaged image. And carrying out graying treatment on the damaged image to obtain a damaged gray image, and finally extracting face contour information from the damaged gray image through a Canny edge detection algorithm to obtain a damaged edge image.
Step S2: build the face image restoration model based on structure and texture dual generation, and input the image obtained in step S1 into the model for training.
The face image restoration model based on structure and texture dual generation is shown in fig. 2; the model adopts a generative adversarial network structure composed of a generator and a discriminator. The generator comprises a dual encoder-decoder and a feature fusion part, and the discriminator consists of a texture discriminator and a structure discriminator.
Specifically, the dual encoder-decoder uses U-Net-like skip connections. In the encoding stage, the left and right encoders respectively receive the damaged image and the damaged structure image to encode texture and structural features. In the decoding stage, the texture decoder synthesizes structure-constrained texture by borrowing structural features from the structure encoder, and the structure decoder recovers texture-guided structure by retrieving texture features from the texture encoder. With this dual structure, structure and texture complement each other well, improving the consistency between texture and structure.
The convolution layers of the dual encoder-decoder use gated convolution to encode and decode features. Compared with partial convolution, gated convolution learns features end-to-end and dynamically updates the mask, so it adapts effectively to unevenly distributed valid pixels and makes the repair result clearer and more consistent with the surrounding semantics. Meanwhile, a batch normalization layer is added after each gated convolution layer to prevent vanishing gradients during training, which can be expressed as:

Gating = ΣΣ W_g · I

Feature = ΣΣ W_f · I

Output = BN(φ(Feature) ⊙ σ(Gating))

where I denotes the input feature map; Gating denotes the gating map; Feature denotes the feature map after convolution; Output denotes the final output feature map; W_g and W_f denote different convolution kernels; BN denotes batch normalization; φ is the LeakyReLU activation function; σ is the Sigmoid activation function; and ⊙ denotes element-wise multiplication. Compared with hard gating, the gating value of gated convolution lies between 0 and 1, and the closer the gating value is to 1, the more valid the pixels.
In addition, six gated aggregated contextual transformation (Gated Aggregated Contextual Transformations, GACT) modules are introduced into the dual encoder-decoder. They are embedded between the encoder and decoder with gated residual connections, enabling the network to capture long-range context information of the image and enrich the patterns of interest. As shown in FIG. 3, the GACT module adopts a split-transform-aggregate strategy. (i) Split: the input 256-channel feature map x_1 is reduced to four 64-channel sub-feature maps using four 3×3 gated convolutions. (ii) Transform: the convolution kernel of each gated convolution has a different dilation rate; a larger dilation rate lets the kernel attend to a larger area of the input image, while a kernel with a smaller dilation rate focuses on local patterns within a smaller receptive field. (iii) Aggregate: the four context-transformed features from different receptive fields are aggregated by channel-dimension concatenation and a standard gated convolution to obtain the fused feature x_2. A residual connection structure is also borrowed: x_1 is passed through a 3×3 standard gated convolution and a Sigmoid operation to form a threshold g, and the transformed fused feature and the original feature are gate-weighted to obtain the final output feature, with the weighting formula: x_1 × g + x_2 × (1 − g).
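The split-transform-aggregate idea of the GACT module might be sketched as below. Plain convolutions with ReLU stand in for the gated convolutions of the branches, and the dilation rates (1, 2, 4, 8) are illustrative assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class GACT(nn.Module):
    """Sketch of the gated aggregated contextual transformation block:
    split a 256-channel map into four 64-channel branches with different
    dilation rates, aggregate them by concatenation + convolution, and
    blend with the input through a learned gate:
    out = x1 * g + x2 * (1 - g)."""
    def __init__(self, ch=256, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch // 4, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(ch, ch, 3, padding=1)   # aggregate branches -> x2
        self.gate = nn.Conv2d(ch, ch, 3, padding=1)   # threshold g

    def forward(self, x1):
        ctx = torch.cat([torch.relu(b(x1)) for b in self.branches], dim=1)
        x2 = self.fuse(ctx)
        g = torch.sigmoid(self.gate(x1))
        return x1 * g + x2 * (1 - g)                  # gated residual blend
```

With `padding == dilation` each dilated branch preserves spatial size, so the four context features concatenate cleanly back to 256 channels.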
After the dual encoder-decoder part of the generator produces the complete structural and texture features, the two features are further fused using an adaptive dual feature fusion module (Adaptive Dual Feature Fusion, ADFF). By controlling the fusion ratio of texture and structure, the module adaptively fuses the two semantic features, making the repair result more reasonable while enhancing structural continuity and texture. The ADFF module is shown in FIG. 4.
Specifically, denote the texture feature map output by the decoder as F_t and the structural feature map as F_s. To build texture-aware structural features, a soft gating G_t is formulated as:

G_t = σ(SE(g([F_s, F_t])))

where [·] denotes concatenation along the channel dimension, g(·) denotes a convolution with kernel size 3, SE(·) denotes the channel attention mechanism used to obtain important channel-dimension information, and σ(·) is the Sigmoid activation function. Using G_t, texture features can be dynamically fused into the structural features, with the fusion formula:

F_s' = F_s ⊕ (α · G_t ⊙ F_t)

where α is a learnable parameter, and ⊙ and ⊕ denote pixel-wise multiplication and pixel-wise addition, respectively. Structure-aware texture features are computed in the same way, with the fusion formulas:

G_s = σ(SE(h([F_s, F_t])))

F_t' = F_t ⊕ (β · G_s ⊙ F_s)

where h(·) is likewise a convolution with kernel size 3 and β is a learnable parameter. Finally, the texture and structural features are fused by the following formula to obtain the final fused feature:

F_b = SK(k([F_s', F_t']))

where k(·) is a convolution and SK is a selective-kernel attention mechanism that can adaptively select an appropriate convolution kernel, helping to repair the consistency of image structure and texture.
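One direction of the soft-gated fusion (texture into structure, F_s' = F_s ⊕ (α · G_t ⊙ F_t)) might be sketched as follows. The SE block here is a standard squeeze-and-excitation channel-attention layer, the reduction ratio is an assumption, and initializing α to zero is an illustrative choice rather than something the text specifies; the symmetric structure-to-texture gate and the final SK fusion would be built analogously.

```python
import torch
import torch.nn as nn

class SoftGateFusion(nn.Module):
    """Sketch of one direction of the ADFF soft gating:
    G_t = sigmoid(SE(g([F_s, F_t]))),  F_s' = F_s + alpha * (G_t * F_t)."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.g = nn.Conv2d(2 * ch, ch, 3, padding=1)   # g(.): 3x3 conv on [F_s, F_t]
        self.se = nn.Sequential(                       # squeeze-and-excitation SE(.)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1))
        self.alpha = nn.Parameter(torch.zeros(1))      # learnable blend weight

    def forward(self, f_s, f_t):
        z = self.g(torch.cat([f_s, f_t], dim=1))       # channel concat then conv
        w = torch.sigmoid(self.se(z))                  # channel attention weights
        gate = torch.sigmoid(z * w)                    # soft gate G_t in (0, 1)
        return f_s + self.alpha * gate * f_t           # F_s' = F_s + alpha*(G_t . F_t)
```

With α starting at zero the module initially passes F_s through unchanged and learns how much gated texture to inject during training.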
The resulting fused feature is finally fed into a contextual feature aggregation (Contextual Feature Aggregation, CFA) module, which generates more vivid details by modeling long-range spatial dependencies.
The discriminator is a dual-stream discriminator with a texture branch and a structure branch; the structure branch also has an additional edge detector for edge extraction. The two discriminator backbones consist of standard convolutions and, to improve the stability of the generative adversarial network, spectral normalization is also applied. The edge detector consists of convolutional neural network residual blocks.
Step S3: iterate the training until the network finally converges, obtaining the face image restoration model.
The invention uses the CelebA-HQ dataset for training, comprising 28000 training images and 2000 test images. The experimental equipment is an NVIDIA V100, and the whole model is implemented in PyTorch. When training the model, the batch size is set to 8 and optimization is performed with the Adam optimizer. Initial training is first performed with a learning rate of 2×10⁻⁴, and the model is then fine-tuned with a learning rate of 5×10⁻⁵.
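The optimizer schedule described (Adam, batch size 8, initial learning rate 2×10⁻⁴, fine-tuning at 5×10⁻⁵) could be set up as below. The stand-in module and the Adam betas are assumptions; `generator` is a placeholder for the full restoration model.

```python
import torch

# Stand-in for the full dual-generation restoration model.
generator = torch.nn.Conv2d(3, 3, 3, padding=1)

# Initial training phase: Adam at 2e-4 (betas are an assumption).
opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.9, 0.999))

def switch_to_finetune(optimizer, lr=5e-5):
    """Drop the learning rate for the fine-tuning phase described in the text."""
    for group in optimizer.param_groups:
        group['lr'] = lr
```

In practice the switch would happen after the initial phase converges; here it is exposed as a helper so the two phases share one optimizer state.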
The model is trained using a joint loss, comprising reconstruction loss, perceptual loss, style loss, and adversarial loss, to obtain visually realistic and semantically reasonable repair results.
Reconstruction loss: L_rec = E[ ||I_out − I_gt||_1 ]

where E denotes expectation, I_out denotes the generated picture, I_gt denotes the real picture, and ||·||_1 denotes the L1 norm.

Perceptual loss, which uses a VGG-16 network pre-trained on ImageNet to simulate human visual perception of image quality: L_perc = E[ Σ_i ||φ_i(I_out) − φ_i(I_gt)||_1 ]

where φ_i denotes the activation map of the i-th pooling layer of VGG-16; in practice, i ∈ [1, 3].

Style loss: L_style = E[ Σ_i ||G_i^φ(I_out) − G_i^φ(I_gt)||_1 ]

where G_i^φ denotes the Gram matrix of the activation map φ_i.

Adversarial loss: L_adv = min_G max_D E[log D(I_gt, E_gt)] + E[log(1 − D(I_out, E_out))]

where G denotes the generator, D denotes the discriminator, E_gt denotes the real edge map, and E_out denotes the generated edge map.
To guide the dual encoder-decoder to generate structural and texture features, an intermediate loss is also introduced on F_s and F_t:

L_inter = L_structure + L_texture = BCE(E_gt, P_s(F_s)) + l_1(I_gt, P_t(F_t))

where I_gt denotes the real picture, E_gt denotes the real edge map, and P_s and P_t are mapping functions composed of convolutional residual blocks that map the structural features F_s and texture features F_t to the corresponding edge map and RGB picture, respectively.
The total loss is:

L_joint = λ_rec·L_rec + λ_perc·L_perc + λ_style·L_style + λ_adv·L_adv + λ_inter·L_inter

where λ_rec, λ_perc, λ_style, λ_adv and λ_inter are the weights of the corresponding loss terms, set as λ_rec = 10, λ_perc = 0.1, λ_style = 250, λ_adv = 0.1 and λ_inter = 1.
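The non-adversarial part of the weighted joint loss can be sketched as below, using the weights given in the text (λ_rec = 10, λ_perc = 0.1, λ_style = 250). `feats_out`/`feats_gt` stand in for the VGG-16 pooling activations φ_i of the generated and real images; the adversarial and intermediate terms are omitted because they need the discriminator and the dual-decoder internals.

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix of an activation map, as used by the style loss."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def joint_loss(i_out, i_gt, feats_out, feats_gt,
               w=dict(rec=10.0, perc=0.1, style=250.0)):
    """Sketch of L_joint without the adversarial and intermediate terms:
    lambda_rec*L_rec + lambda_perc*L_perc + lambda_style*L_style."""
    l_rec = F.l1_loss(i_out, i_gt)                       # L1 reconstruction
    l_perc = sum(F.l1_loss(a, b)                         # perceptual (phi_i)
                 for a, b in zip(feats_out, feats_gt))
    l_style = sum(F.l1_loss(gram(a), gram(b))            # style (Gram matrices)
                  for a, b in zip(feats_out, feats_gt))
    return w['rec'] * l_rec + w['perc'] * l_perc + w['style'] * l_style
```

In a real training loop the feature lists would come from a frozen, ImageNet-pretrained VGG-16 applied to both images.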
Step S4: input the damaged face image into the trained face image restoration model to obtain the restored face image. To verify the effectiveness of the algorithm, the experiments use the test set of the CelebA-HQ dataset and compare the algorithm qualitatively and quantitatively with the EdgeConnect, RFR-Inpainting, and CTSDG algorithms under different mask area ratios.
Qualitative analysis: as shown in fig. 5, fig. 5a shows the damaged face image to be repaired. In fig. 5b, EdgeConnect produces distorted and severely warped face structures when large areas are damaged, and yields good results only for small damaged areas. In fig. 5c, RFR-Inpainting produces overly smooth content and, for large damaged areas, suffers from color inconsistency, artifacts, and texture blurring. In fig. 5d, CTSDG also exhibits texture blurring and distortion. Fig. 5e shows the repair results of the present invention: the repaired face structure and texture are more consistent and semantically reasonable, and good results are produced even when large areas are damaged. Fig. 5f shows the real image corresponding to the damaged image.
Quantitative analysis: experiments were performed on the CelebA-HQ dataset, with 10% to 50% different proportions of masks representing the size of the damaged area, and the results generated were quantitatively compared. Mainly, three evaluation indexes are needed, and the PSNR, SSIM and MAE are shown in the following table, and compared with other methods, the method has the optimal result on all three indexes. (+.cndot.C. representing the larger and better value, +.cndot.C. representing the smaller and better value, +.cndot.C. representing the best result by bold)
Table 1: objective evaluation index comparison of CelebA-HQ data set experimental results
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
Claims (10)
1. The face image restoration method based on the structure and texture dual generation is characterized by comprising the following steps of:
step S1: preprocessing an input image to obtain a face image to be repaired;
step S2: establishing a face image restoration model generated based on structure and texture dual, and inputting the image obtained in the step S1 into the image restoration model for training;
step S3: the face image restoration model is obtained by continuous iterative training until the network finally converges;
step S4: and inputting the damaged face image into a trained face image restoration model to obtain a restored face image.
2. The face image restoration method based on structure and texture dual generation according to claim 1, wherein in the step S2, the face image restoration model is a generative adversarial network composed of a generator and a discriminator;
the generator comprises a dual encoder-decoder and a feature fusion part, and the discriminator consists of a texture discriminator and a structure discriminator.
3. A face image restoration method based on structure and texture dual generation according to claim 2, wherein the convolution layers of the dual encoder-decoder employ gated convolution to encode and decode features, and a batch normalization layer is added after each gated convolution layer, expressed as:
Gating = ΣΣ W_g · I
Feature = ΣΣ W_f · I
Output = BN(φ(Feature) ⊙ σ(Gating))
wherein I represents the input feature map; Gating represents the gating map; Feature represents the feature map after convolution; Output represents the final output feature map; W_g and W_f represent different convolution kernels; φ is the LeakyReLU activation function and σ is the Sigmoid activation function. In contrast to hard gating, the gating value of the gated convolution lies between 0 and 1: the closer the gating value is to 1, the more valid the pixels. BN denotes batch normalization.
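A minimal numpy sketch of the gating combination in claim 3, taking the two convolution outputs (Feature and Gating) as given and reducing batch normalization to a per-map standardization for illustration:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """phi: LeakyReLU activation."""
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    """sigma: soft gate in (0, 1), unlike a hard 0/1 gate."""
    return 1.0 / (1.0 + np.exp(-x))

def gated_output(feature, gating):
    """Output = BN(phi(Feature) * sigma(Gating)); BN reduced to standardization here."""
    gated = leaky_relu(feature) * sigmoid(gating)
    return (gated - gated.mean()) / (gated.std() + 1e-5)
```

Pixels whose gating value is driven toward 0 are suppressed in the output, which is how the gated convolution distinguishes valid from invalid (damaged) regions.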
4. A face image restoration method based on structure and texture dual generation according to claim 2, wherein said dual encoder-decoder,
during the encoding stage, the left and right encoders respectively receive the damaged image and the damaged structural image to encode texture and structural features;
at the decoding stage, the texture decoder synthesizes structure-constrained textures by borrowing structural features from the structure encoder, while the structure decoder recovers texture-guided structures by retrieving texture features from the texture encoder.
5. A face image restoration method based on structure and texture dual generation according to claim 2, wherein the discriminator is a dual-flow discriminator with texture branches and structure branches, the structure branches of the discriminator are also provided with an additional edge detector for edge extraction, wherein two discriminator trunks are composed of common convolutions, and the edge detector is composed of convolutional neural network residual blocks.
6. The face image restoration method based on structure and texture dual generation according to claim 1, wherein the preprocessing in step S1 is:
firstly, the size of the image is adjusted, the image is adjusted to 256 multiplied by 256 by clipping and filling,
then, a binarization mask M is obtained from an irregular mask data set provided by NVIDIA to artificially damage the image, so that a damaged image is obtained; graying the damaged image to obtain a damaged gray image;
and finally, extracting the face contour information from the damaged gray level image through a Canny edge detection algorithm to obtain a damaged edge image.
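The preprocessing pipeline of claim 6 can be sketched as follows. This is a hedged illustration, not the patent's implementation: the image is assumed already resized to 256×256, the mask uses 1 for damaged pixels, and a simple gradient-magnitude threshold stands in for the Canny detector (in practice one would call e.g. OpenCV's Canny):

```python
import numpy as np

def preprocess(rgb, mask):
    """rgb: 256x256x3 uint8 image; mask: 256x256 binary map (1 = damaged).
    Returns the damaged image, its grayscale version, and a damaged edge map."""
    damaged = rgb * (1 - mask)[..., None]               # artificially damage the image
    gray = damaged @ np.array([0.299, 0.587, 0.114])    # luminance grayscale
    # crude edge strength; a Canny detector would be used in practice
    gx = np.abs(np.diff(gray, axis=1, prepend=gray[:, :1]))
    gy = np.abs(np.diff(gray, axis=0, prepend=gray[:1, :]))
    edges = ((gx + gy) > 30).astype(np.uint8)           # binarized damaged edge map
    return damaged, gray, edges
```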
7. The face image restoration method based on structure and texture dual generation according to claim 1, wherein in the step S3 a CelebA-HQ dataset is adopted for training, comprising training images and test images; the experimental equipment adopts an NVIDIA V100, and the whole model is implemented with PyTorch; when training the model, the batch size is set to 8, and optimization is performed using the Adam optimizer.
8. The face image restoration method based on structure and texture dual generation according to claim 7, wherein initial training is first performed with a learning rate of 2×10⁻⁴, and the model is then fine-tuned with a learning rate of 5×10⁻⁵; the model is trained using a joint loss, including reconstruction loss, perceptual loss, style loss, and adversarial loss.
9. The face image restoration method based on structure and texture dual generation according to claim 8, wherein four loss functions are as follows:
reconstruction loss function: L_rec = E[||I_out − I_gt||_1]
wherein E represents the expectation, I_out represents the generated picture, I_gt represents the real picture, and ||·||_1 represents the L1 norm;
the perceived loss of pre-training by VGG-16 on ImageNet is used to simulate human visual perception of image quality, where E represents the desire, I out Representing the generated picture, I gt A picture representing a true image is displayed, I.I 1 Represents L 1 Norms, phi i Representing the activation diagram of the ith pooling layer of Vgg16, in actual process, i E [1,3 ]];
style loss function: L_style = E[||G_i^φ(I_out) − G_i^φ(I_gt)||_1]
wherein E represents the expectation, I_out represents the generated picture, I_gt represents the real picture, and G_i^φ represents the Gram matrix of the activation map φ_i;
adversarial loss function: L_adv = min_G max_D E[log D(I_gt, E_gt)] + E[log(1 − D(I_out, E_out))]
wherein E represents the expectation, G represents the generator, D represents the discriminator, I_gt represents the real picture, E_gt represents the true edge map, I_out represents the generated picture, and E_out represents the generated edge map.
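The style loss compares Gram matrices of activation maps rather than the activations themselves. A minimal numpy sketch (an illustration with random features standing in for VGG-16 activations, not the patent's implementation):

```python
import numpy as np

def gram(features):
    """Gram matrix of an activation map with shape (C, H, W):
    correlations between channels, normalized by the number of elements."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(feat_out, feat_gt):
    """L1 distance between the Gram matrices of the two activation maps."""
    return float(np.mean(np.abs(gram(feat_out) - gram(feat_gt))))
```

Because the Gram matrix discards spatial layout and keeps only channel correlations, this term penalizes texture-statistics mismatches independently of where the texture appears in the image.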
10. The face image restoration method based on structure and texture dual generation according to claim 1, wherein, in order to guide the dual encoder-decoder to generate structural and textural features, intermediate losses are also introduced on F_s and F_t:
L_inter = L_structure + L_texture = BCE(E_gt, P_s(F_s)) + l_1(I_gt, P_t(F_t))
wherein I_gt represents the real picture, E_gt represents the true edge map, and P_s and P_t are mapping functions composed of convolution and residual blocks, which map the structural feature F_s and the texture feature F_t to the corresponding edge map and RGB picture, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310141472.8A CN116109510A (en) | 2023-02-21 | 2023-02-21 | Face image restoration method based on structure and texture dual generation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116109510A true CN116109510A (en) | 2023-05-12 |
Family
ID=86263723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310141472.8A Pending CN116109510A (en) | 2023-02-21 | 2023-02-21 | Face image restoration method based on structure and texture dual generation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116109510A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116895091A (en) * | 2023-07-24 | 2023-10-17 | 山东睿芯半导体科技有限公司 | Facial recognition method and device for incomplete image, chip and terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111292264B (en) | Image high dynamic range reconstruction method based on deep learning | |
CN113658051B (en) | Image defogging method and system based on cyclic generation countermeasure network | |
CN114463209B (en) | Image restoration method based on deep multi-feature collaborative learning | |
CN110717868B (en) | Video high dynamic range inverse tone mapping model construction and mapping method and device | |
CN110689495B (en) | Image restoration method for deep learning | |
CN111709900A (en) | High dynamic range image reconstruction method based on global feature guidance | |
CN114627006B (en) | Progressive image restoration method based on depth decoupling network | |
CN114066747A (en) | Low-illumination image enhancement method based on illumination and reflection complementarity | |
CN115018727A (en) | Multi-scale image restoration method, storage medium and terminal | |
CN114897742B (en) | Image restoration method with texture and structural features fused twice | |
CN115829876A (en) | Real degraded image blind restoration method based on cross attention mechanism | |
CN116109510A (en) | Face image restoration method based on structure and texture dual generation | |
Liu et al. | Facial image inpainting using multi-level generative network | |
CN113066025A (en) | Image defogging method based on incremental learning and feature and attention transfer | |
CN113034388A (en) | Ancient painting virtual repairing method and construction method of repairing model | |
CN117408924A (en) | Low-light image enhancement method based on multiple semantic feature fusion network | |
CN116681621A (en) | Face image restoration method based on feature fusion and multiplexing | |
CN116934613A (en) | Branch convolution channel attention module for character repair | |
CN116416216A (en) | Quality evaluation method based on self-supervision feature extraction, storage medium and terminal | |
CN116309171A (en) | Method and device for enhancing monitoring image of power transmission line | |
CN116051407A (en) | Image restoration method | |
CN115035170A (en) | Image restoration method based on global texture and structure | |
CN114494387A (en) | Data set network generation model and fog map generation method | |
CN116958317A (en) | Image restoration method and system combining edge information and appearance stream operation | |
CN113888417A (en) | Human face image restoration method based on semantic analysis generation guidance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||