CN113379606A - Face super-resolution method based on pre-training generation model - Google Patents

Face super-resolution method based on pre-training generation model Download PDF

Info

Publication number
CN113379606A
CN113379606A
Authority
CN
China
Prior art keywords
convolution
resolution
training
module
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110934749.3A
Other languages
Chinese (zh)
Other versions
CN113379606B (en)
Inventor
孙立剑
王军
徐晓刚
曹卫强
朱岳江
虞舒敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110934749.3A priority Critical patent/CN113379606B/en
Publication of CN113379606A publication Critical patent/CN113379606A/en
Application granted granted Critical
Publication of CN113379606B publication Critical patent/CN113379606B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and image processing, and relates to a face super-resolution method based on a pre-trained generative model, which comprises the following steps. Step one: collect low-resolution images and input them to a feature extraction module E to extract feature information. Step two: input the feature information to an encoder to obtain an implicit matrix whose channel number is 8 times the input size; after feature decomposition by a separation module, the implicit matrix yields implicit vectors z, which are input, cascaded with the face label data, to a pre-trained generative model to obtain generated features. Step three: transmit the generated features to a decoder, fuse them with the feature information extracted by the feature extraction module E, and output the target high-resolution image after the decoding operation. The invention can upscale low-resolution faces at high magnification, obtaining up to a 64× super-resolution result that retains good fidelity, so that the enlarged image improves in both fidelity and texture realism.

Description

Face super-resolution method based on pre-training generation model
Technical Field
The invention belongs to the field of computer vision and image processing, and relates to a face super-resolution method based on a pre-trained generative model.
Background
Image resolution is directly related to image quality: higher resolution means more detailed information and hence greater application potential. In practice, however, many images suffer from low resolution, which hampers subsequent high-level visual processing. With the continuous development of computer vision technology, especially deep learning, image quality enhancement methods have multiplied, and super-resolution is an effective means of enhancing image quality that can significantly improve image resolution. Image super-resolution technology upsamples a low-resolution image to a high-resolution one by algorithmic means and has very important application value in fields such as security monitoring, medical detection, and criminal investigation. For example, in a security monitoring scene, factors such as the camera and the surrounding environment can blur the photographed target so that it cannot be recognized; super-resolution can reconstruct a clear picture and improve the resolution of the target face, thereby helping to quickly locate the target person. As a low-level image processing method, image super-resolution technology can therefore provide effective support for subsequent high-level processing such as target detection and recognition.
Many networks now exist for image super-resolution, with clear improvements across various scenes and objects, but networks dedicated to face super-resolution remain few. Many methods construct corresponding face data and then train with existing networks; although some progress has been made, the super-resolution effect on very low-resolution faces is still poor. Generative adversarial networks are now widely applied to super-resolution tasks with the aim of enriching the texture details of the restored image. However, common generative adversarial methods limit the ability to approximate the natural image manifold, or, because low-dimensional latent codes and constraints in image space are insufficient to guide the recovery process, these methods often produce artifacts and unnatural textures on low-fidelity faces.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a face super-resolution method based on a pre-trained generative model. A large pre-trained face generation model is introduced to provide rich facial detail features, which are embedded into an encoding-decoding module based on residual attention; the information extracted by the encoding module can guide the pre-trained generative model to enhance toward the features of the input face, and the decoder fuses the features of the pre-trained generative model with the original input features to further improve the quality of face image restoration. The specific technical scheme is as follows:
a face super-resolution method based on a pre-training generated model comprises the following steps:
step one, collecting and inputting low-resolution images to a feature extraction module
Figure 444755DEST_PATH_IMAGE001
Extracting characteristic information;
step two, inputting the characteristic information into a coder to obtain an implicit matrix with the channel number being 8 times of the input size, and obtaining the implicit matrix after characteristic decomposition of a separation moduleDeriving implicit vectors
Figure 84815DEST_PATH_IMAGE002
Respectively inputting the generated feature and the face label data into a pre-training generation model in a cascading mode to obtain a generation feature;
step three, transmitting the generated features to a decoder and fusing the feature extraction module
Figure 712105DEST_PATH_IMAGE001
And outputting the target high-resolution image after decoding operation of the extracted characteristic information.
Further, the feature extraction module E consists of 2 convolutional layers of size 3×3×64×1 and 6 serially connected residual channel attention units, where in 3×3×64×1 the 3×3 denotes the convolution kernel size, 64 the number of convolution kernels, and the final 1 the stride of the kernel; each residual channel attention unit comprises a residual unit and a channel attention unit, the residual unit extracting features of the input low-resolution image and feeding them to the channel attention unit to obtain a channel calibration coefficient vector β, which recalibrates the input features of the channel attention unit to give the output of the residual channel attention unit.
Further, the channel attention unit comprises a global average pooling layer, a ReLU nonlinear transformation layer, two convolution layers and a Sigmoid nonlinear transformation layer.
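As an illustration only, the following is a minimal PyTorch sketch of such a channel attention unit; PyTorch itself, the 64-channel default width, and the reduction ratio of 16 are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Pool -> conv -> ReLU -> conv -> Sigmoid, yielding the calibration vector beta."""
    def __init__(self, channels=64, reduction=16):   # reduction ratio is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling layer
        self.down = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)            # ReLU nonlinear transformation layer
        self.up = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.gate = nn.Sigmoid()                     # Sigmoid nonlinear transformation layer

    def forward(self, x):
        beta = self.gate(self.up(self.relu(self.down(self.pool(x)))))
        return x * beta                              # recalibrate the input features with beta
```

Here beta has shape (N, C, 1, 1), so the multiplication rescales each feature channel independently.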
Further, the second step specifically comprises: the feature information is input to the 3 convolution modules adopted by the encoder, each convolution module containing two convolution layers (one of stride 1 and one of stride 2) and an activation layer. The first two convolution modules each comprise a 3×3×64×2 convolutional layer, an LReLU activation layer, and a 3×3×64×1 convolutional layer; the last convolution module comprises a 3×3×128×2 convolutional layer, an LReLU activation layer, and three (input size/8)×128×1 convolutional layers. A 3×128 implicit matrix is finally output and decomposed by feature decomposition into three implicit vectors z1, z2, z3, which are input, each cascaded with the face label data, to the residual modules of the pre-trained generative model to obtain the corresponding generated features G1, G2, G3.
Furthermore, the pre-trained generative model adopts a pre-trained BigGAN model; each residual module of the model contains an upsampling convolution and outputs the corresponding generated features G1, G2, G3.
Further, the third step specifically comprises: the decoder comprises decoding modules D1, D2, D3, and D4. The feature information extracted by the feature extraction module E is input to decoding module D1; the output of D1 and the generated feature G1 are input to decoding module D2; the output of D2 and the generated feature G2 are input to decoding module D3; the output of D3 and the generated feature G3 are input to decoding module D4, finally obtaining the face image at the target resolution.
Further, the first three decoding modules D1, D2, D3 in the decoder each comprise a 3×3×64×1 convolutional layer, an LReLU nonlinear transformation layer, two residual units, and a 2× upsampling sub-pixel convolutional layer; each residual unit comprises a first branch and a second branch, the first branch passing the input sequentially through a 3×3×64×1 convolution, an LReLU nonlinear transformation layer, and another 3×3×64×1 convolution, the second branch adding the input directly to the output of the first branch; the final decoding module D4 comprises a 3×3 convolutional layer with stride 1.
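For concreteness, a minimal PyTorch sketch of one such decoding module follows; the use of nn.PixelShuffle for the 2× sub-pixel convolution is standard, while the LReLU slope of 0.2 and the variable input channel width (to accommodate fused generated features) are assumptions:

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two branches: conv -> LReLU -> conv, plus an identity shortcut."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),          # slope 0.2 is an assumption
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return x + self.branch(x)

class DecodingModule(nn.Module):
    """Sketch of D1-D3: conv, LReLU, two residual units, 2x sub-pixel upsampling."""
    def __init__(self, in_ch=64, channels=64):        # in_ch grows when generated features are fused
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, channels, 3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            ResidualUnit(channels),
            ResidualUnit(channels),
            nn.Conv2d(channels, channels * 4, 3, padding=1),  # expand channels for the shuffle
            nn.PixelShuffle(2),                               # 2x upsampled sub-pixel convolution
        )

    def forward(self, x):
        return self.body(x)
```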
According to the invention, an encoding and decoding network based on a residual structure and channel attention convolution is used, with a pre-trained generative model embedded in the middle of the encoder-decoder structure. The encoding network produces implicit vectors that guide the pre-trained generator to generate rich high-frequency face information, providing a prior for texture and detail generation, so that low-resolution faces can be upscaled at high magnification. Through the setting of the number of residual modules in the pre-trained generative model and the number of upsampling convolutions in the decoder, up to a 64× super-resolution result can be obtained while retaining good fidelity, so the enlarged image improves in both fidelity and texture realism; the diversified loss functions and the introduced LPIPS evaluation index help enhance visual perceptual quality.
Drawings
FIG. 1 is an overall flow chart of a high-magnification face super-resolution method based on a pre-training generated model according to the present invention;
FIG. 2 is a structure diagram of the feature extraction module E of the present invention;
fig. 3 is a diagram of the residual channel attention unit structure of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
The embodiment of the invention is explained taking 8× image super-resolution as an example. As shown in FIG. 1, a face super-resolution method based on a pre-trained generative model comprises the following steps:
step one, inputting a face image with the resolution of 16 multiplied by 16, and adopting a feature extraction module consisting of a plurality of residual error channel attention units
Figure 947564DEST_PATH_IMAGE001
Extracting feature information, including: contour features and texture features;
as shown in fig. 2 and 3, the feature extraction module
Figure 122194DEST_PATH_IMAGE001
The method comprises the steps that 2 convolution layers of 3 x 64 x 1 and 6 residual channel attention units connected in series are formed, the concerned convolution layers Conv are 3 x 64 x 1, 3 x 3 represents the size of a convolution kernel, 64 represents the number of the convolution kernels, the last bit represents the motion step of the convolution kernel, each residual channel attention unit comprises a residual unit and a channel attention unit, the features of an input image are extracted through the residual units, the features are input into the channel attention units to obtain channel calibration coefficient vectors beta, the channel calibration coefficient vectors beta and the input features of the channel attention units are recalibrated to serve as the output of the residual channel attention units, and the channel attention units comprise a global average pooling layer, a ReLU nonlinear transformation layer, two convolution layers and a Sigmoid nonlinear lamination transformation layer.
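Building on the ChannelAttention sketch above, the residual channel attention unit and the feature extraction module E could be assembled as follows; the placement of the two 3×3×64×1 convolutions before and after the six units, the conv → LReLU → conv layout of the residual unit, and the skip addition are assumptions:

```python
import torch.nn as nn

class ResidualChannelAttentionUnit(nn.Module):
    """Residual unit whose output is recalibrated by the ChannelAttention sketch above."""
    def __init__(self, channels=64):
        super().__init__()
        self.residual = nn.Sequential(                # assumed conv -> LReLU -> conv layout
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
        )
        self.attention = ChannelAttention(channels)

    def forward(self, x):
        return x + self.attention(self.residual(x))   # recalibrated features plus skip path

class FeatureExtractor(nn.Module):
    """Module E: one 3x3x64x1 conv, six serial units, one 3x3x64x1 conv (ordering assumed)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=1, padding=1),
            *[ResidualChannelAttentionUnit(64) for _ in range(6)],
            nn.Conv2d(64, 64, 3, stride=1, padding=1),
        )

    def forward(self, x):                             # x: (N, 3, 16, 16) low-resolution face
        return self.body(x)
```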
Step two, the features extracted in step one are input to the encoder structure, which adopts 3 convolution modules, each containing two convolution layers (one of stride 1 and one of stride 2) and an activation layer; passing through the convolution modules in turn yields intermediate features and finally an implicit matrix Z whose channel number is 8 times the input size. The implicit matrix Z passes through a separation module to obtain implicit vectors z, which are input jointly with the face label data, in cascaded form, to the pre-trained generative model. The model uses the pre-trained high-resolution image generation model BigGAN to provide rich texture and detail prior knowledge for generating the high-resolution image; the implicit vectors required by the pre-trained generative model supply high-level information, and the face label data guides the pre-trained model to generate more high-resolution face textures and detail features.
the encoder structure adopts 3 convolution modules, specifically, the first two convolution modules comprise a 3 × 3 × 64 × 02 convolution layer, an LReLU active layer and a 3 × 13 × 264 × 1 convolution layer, the last convolution module comprises a 3 × 3 × 128 × 2 convolution layer, an LReLU active layer and three (input size/8) × 128 × 1 convolution layers, and finally a 3 × 128 implicit matrix is output, and the implicit matrix is subjected to feature decomposition to obtain three implicit vectors
Figure 256612DEST_PATH_IMAGE002
Respectively inputting the residual error signals into a residual error module in a pre-training generation model, and in addition, because the generation module adopts a pre-training BigGAN model, in order to ensure that the model develops towards a high-resolution face direction, a face label and an implicit vector are cascaded and are jointly input into the residual error module;
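A possible PyTorch reading of this encoder is sketched below for the 16×16 embodiment; interpreting the three (input size/8)×128×1 layers as parallel heads that produce the three rows of the 3×128 implicit matrix is an assumption, as are the LReLU slope and the stride-2-first ordering:

```python
import torch
import torch.nn as nn

def conv_module(in_ch, out_ch):
    """One encoder convolution module: stride-2 conv, LReLU, stride-1 conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
    )

class Encoder(nn.Module):
    """Maps 16x16 features from module E to three 128-dimensional implicit vectors."""
    def __init__(self):
        super().__init__()
        self.m1 = conv_module(64, 64)                 # 16x16 -> 8x8
        self.m2 = conv_module(64, 64)                 # 8x8 -> 4x4
        self.m3 = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1),   # 4x4 -> 2x2
            nn.LeakyReLU(0.2, inplace=True),
        )
        # three (input size/8) x 128 x 1 layers: kernel 16/8 = 2 collapses 2x2 maps to 1x1
        self.heads = nn.ModuleList(nn.Conv2d(128, 128, kernel_size=2) for _ in range(3))

    def forward(self, feats):
        h = self.m3(self.m2(self.m1(feats)))
        z1, z2, z3 = (head(h).flatten(1) for head in self.heads)  # the separation module
        return z1, z2, z3                             # rows of the 3x128 implicit matrix
```

Usage: `z1, z2, z3 = Encoder()(torch.randn(1, 64, 16, 16))` yields three latent vectors of length 128.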
saidThe structure of the pre-training generation model is the structure of a BigGAN model, and different from the BigGAN model, the method mainly utilizes the high-resolution detail generation capability of the BigGAN, each residual error module comprises an up-sampling convolution, and corresponding generation characteristics are output
Figure 405834DEST_PATH_IMAGE005
And input to the final decoder, i.e., the decoding module.
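The following schematic block illustrates how a latent vector cascaded with a face-label embedding could condition one upsampling residual module; this is a simplified stand-in for BigGAN's actual class-conditional residual blocks, not the BigGAN code itself, and the conditioning dimensions and injection mechanism are assumptions:

```python
import torch
import torch.nn as nn

class UpResBlock(nn.Module):
    """Schematic upsampling residual module conditioned on (latent, label)."""
    def __init__(self, channels=128, cond_dim=128 + 64):  # 128-d latent + 64-d label embedding (assumed)
        super().__init__()
        self.inject = nn.Linear(cond_dim, channels)   # project the condition onto the channels
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, z, label_emb):
        cond = torch.cat([z, label_emb], dim=1)       # cascade the latent with the face label
        h = x + self.inject(cond)[:, :, None, None]   # broadcast the condition over the map
        h = self.conv2(self.act(self.conv1(self.up(self.act(h)))))
        return self.up(x) + h                         # one generated feature G_i
```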
Step three, the output features G1, G2, G3 of the pre-trained generation module are transferred to the decoder and fused with the feature information extracted by the feature extraction module E; after the decoder's operations, the image at the target high resolution is finally output.
For the decoder, the feature information extracted by the feature extraction module E is input to decoding module D1; the output of D1 and the generated feature G1 are input to decoding module D2; the output of D2 and the generated feature G2 are input to decoding module D3; the output of D3 and the generated feature G3 are input to decoding module D4, finally obtaining the face image at the target resolution. The first three decoding modules D1, D2, D3 in the decoder each comprise a 3×3×64×1 convolutional layer, an LReLU nonlinear transformation layer, two residual units, and a 2× upsampling sub-pixel convolutional layer. Each residual unit contains two branches: one branch passes the input sequentially through a 3×3×64×1 convolution, an LReLU nonlinear transformation layer, and a 3×3×64×1 convolution; the other branch leaves the input unchanged and adds it directly to the output of the first branch. The final decoding module D4 comprises a 3×3 convolutional layer with stride 1.
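Putting the pieces together, the decoder wiring of step three might look like the sketch below, reusing the DecodingModule sketch given earlier; fusing by channel concatenation, the 3-channel RGB output of D4, and the spatial sizes of G1-G3 matching each stage are all assumptions:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Step-three wiring: D1 takes the module-E features, D2-D4 fuse in G1-G3."""
    def __init__(self, g_channels=128):
        super().__init__()
        self.d1 = DecodingModule(in_ch=64)                      # 16x16 -> 32x32
        self.d2 = DecodingModule(in_ch=64 + g_channels)         # 32x32 -> 64x64
        self.d3 = DecodingModule(in_ch=64 + g_channels)         # 64x64 -> 128x128
        self.d4 = nn.Conv2d(64 + g_channels, 3, 3, padding=1)   # final 3x3 conv (RGB output assumed)

    def forward(self, feats, g1, g2, g3):
        h = self.d1(feats)
        h = self.d2(torch.cat([h, g1], dim=1))        # fuse generated feature G1
        h = self.d3(torch.cat([h, g2], dim=1))        # fuse generated feature G2
        return self.d4(torch.cat([h, g3], dim=1))     # face image at the target resolution
```

Three 2× sub-pixel stages take the 16×16 input to 128×128, consistent with the 8× embodiment.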
The networks involved in steps one to three together form the face image super-resolution network, whose training process is specifically as follows:
the loss function consists of three parts: content perception based on LPIPSKnown loss, pixel loss, i.e. smoothing
Figure 48800DEST_PATH_IMAGE017
And (4) loss, updating the network by using a back propagation strategy, wherein the pre-training generation model and the network parameters for calculating the content perception loss are fixed and do not participate in the training process. Using PSNR: peak Signal to Noise Ratio, Peak Signal-to-Noise Ratio, and SSIM: structural similarity index, LPIPS is used as an evaluation index of picture quality, a high-resolution face data set CelebA is selected, then the image is cut, only the face part is cut, the influence of hair hat clothes on the face is avoided, the cut picture is down-sampled to 128 x 128 by using the imresize in matlab as a high-resolution image and is down-sampled to 16 x 16 as a corresponding low-resolution image, the high-resolution face image and the low-resolution face image are used as a training set, a verification set and a test set, the whole training process is divided into two stages, the first stage adopts pixel loss for training, RMSprop is used for training, and the learning rate is set to be 0.0005; and in the second stage, content loss is introduced to carry out model fine adjustment, the learning rate is set to be 0.0001, the network is updated by using a back propagation strategy, and if the network is converged, the trained network model is stored and used as a final reasoning. Using this generator network as the final inference, 100 additional pictures of low resolution were selected as the test set. In addition, training and testing were performed on the hellen data set in the same manner, with the test results shown in table 1:
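A compact sketch of this two-stage objective is given below; it assumes the lpips pip package for the LPIPS content perceptual loss and a `model` variable holding the full super-resolution network of FIG. 1, and the optimizer would be re-created with lr=1e-4 when entering stage two:

```python
import torch
import lpips  # LPIPS implementation released by its authors (pip install lpips)

pixel_loss = torch.nn.SmoothL1Loss()            # the smooth L1 pixel loss
perceptual = lpips.LPIPS(net='alex').eval()     # content-loss network: fixed, not trained
for p in perceptual.parameters():
    p.requires_grad = False

optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-4)  # stage one: lr 0.0005

def training_step(lr_img, hr_img, stage):
    """One update; image tensors are assumed scaled to [-1, 1] as LPIPS expects."""
    sr = model(lr_img)
    loss = pixel_loss(sr, hr_img)
    if stage == 2:                               # stage two: fine-tune with the content loss
        loss = loss + perceptual(sr, hr_img).mean()
    optimizer.zero_grad()
    loss.backward()                              # back-propagation strategy
    optimizer.step()
    return loss.item()
```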
TABLE 1. Performance comparison of the present invention with other methods on different datasets at 8× magnification (PSNR/SSIM/LPIPS)
Table 1 reports tests on both Helen and CelebA. Compared with existing super-resolution methods, including bicubic upsampling, ESRGAN, RCAN, RDN, and FSRNet, all trained and tested on the same datasets, the present invention achieves higher average PSNR and SSIM over the 100 test pictures, and additionally the lowest LPIPS, maintaining the best visual perceptual quality; the overall picture sharpness is also the best.
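For reference, the three indices could be computed per test picture roughly as follows, assuming scikit-image ≥ 0.19 (for the channel_axis argument) and the lpips package; the uint8 input convention is an assumption:

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')

def evaluate_pair(sr, hr):
    """sr, hr: HxWx3 uint8 numpy arrays of the super-resolved and ground-truth images."""
    psnr = peak_signal_noise_ratio(hr, sr)                       # higher is better
    ssim = structural_similarity(hr, sr, channel_axis=2)         # higher is better
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_tensor(sr), to_tensor(hr)).item()           # lower is better
    return psnr, ssim, lp
```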

Claims (7)

1. A face super-resolution method based on a pre-trained generative model, characterized by comprising the following steps:
step one, collecting low-resolution images and inputting them to a feature extraction module E to extract feature information;
step two, inputting the feature information to an encoder to obtain an implicit matrix whose channel number is 8 times the input size, the implicit matrix yielding implicit vectors z after feature decomposition by a separation module, the implicit vectors being input, each cascaded with the face label data, to a pre-trained generative model to obtain generated features;
step three, transmitting the generated features to a decoder, fusing them with the feature information extracted by the feature extraction module E, and outputting the target high-resolution image after the decoding operation.
2. The face super-resolution method based on a pre-trained generative model according to claim 1, wherein the feature extraction module E consists of 2 convolutional layers of size 3×3×64×1 and 6 serially connected residual channel attention units, where 3×3 denotes the convolution kernel size, 64 the number of convolution kernels, and the final 1 the stride of the kernel; each residual channel attention unit comprises a residual unit and a channel attention unit, the residual unit extracting features of the input low-resolution image and feeding them to the channel attention unit, which obtains a channel calibration coefficient vector β; the channel calibration coefficient vector β recalibrates the input features of the channel attention unit, and the result is the output of the residual channel attention unit.
3. The method of claim 2, wherein the channel attention unit comprises a global average pooling layer, a ReLU nonlinear transformation layer, two convolution layers and a Sigmoid nonlinear transformation layer.
4. The face super-resolution method based on a pre-trained generative model according to claim 1, wherein the second step specifically comprises: the feature information is input to the 3 convolution modules adopted by the encoder, each convolution module containing two convolution layers (one of stride 1 and one of stride 2) and an activation layer; the first two convolution modules each comprise a 3×3×64×2 convolutional layer, an LReLU activation layer, and a 3×3×64×1 convolutional layer, and the last convolution module comprises a 3×3×128×2 convolutional layer, an LReLU activation layer, and three (input size/8)×128×1 convolutional layers; a 3×128 implicit matrix is finally output and decomposed by feature decomposition into three implicit vectors z1, z2, z3, which are input, each cascaded with the face label data, to the residual modules of the pre-trained generative model to obtain the corresponding generated features G1, G2, G3.
5. The face super-resolution method based on a pre-trained generative model according to claim 1, wherein the pre-trained generative model is a pre-trained BigGAN model, each residual module of which contains an upsampling convolution and outputs the corresponding generated features G1, G2, G3.
6. The face super-resolution method based on a pre-trained generative model according to claim 5, wherein the third step specifically comprises: the decoder comprises decoding modules D1, D2, D3, and D4; the feature information extracted by the feature extraction module E is input to decoding module D1; the output of D1 and the generated feature G1 are input to decoding module D2; the output of D2 and the generated feature G2 are input to decoding module D3; the output of D3 and the generated feature G3 are input to decoding module D4, finally obtaining the face image at the target resolution.
7. The method as claimed in claim 6, wherein the first three decoding modules D1, D2, D3 in the decoder each comprise a 3×3×64×1 convolutional layer, an LReLU nonlinear transformation layer, two residual units, and a 2× upsampling sub-pixel convolutional layer; each residual unit comprises a first branch and a second branch, the first branch passing the input sequentially through a 3×3×64×1 convolution, an LReLU nonlinear transformation layer, and another 3×3×64×1 convolution, the second branch adding the input directly to the output of the first branch; the final decoding module D4 comprises a 3×3 convolutional layer with stride 1.
CN202110934749.3A 2021-08-16 2021-08-16 Face super-resolution method based on pre-training generation model Active CN113379606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110934749.3A CN113379606B (en) 2021-08-16 2021-08-16 Face super-resolution method based on pre-training generation model


Publications (2)

Publication Number Publication Date
CN113379606A true CN113379606A (en) 2021-09-10
CN113379606B CN113379606B (en) 2021-12-07

Family

ID=77577259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110934749.3A Active CN113379606B (en) 2021-08-16 2021-08-16 Face super-resolution method based on pre-training generation model

Country Status (1)

Country Link
CN (1) CN113379606B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107958246A (en) * 2018-01-17 2018-04-24 深圳市唯特视科技有限公司 A kind of image alignment method based on new end-to-end human face super-resolution network
CN109255831A (en) * 2018-09-21 2019-01-22 南京大学 The method that single-view face three-dimensional reconstruction and texture based on multi-task learning generate
WO2020099957A1 (en) * 2018-11-12 2020-05-22 Sony Corporation Semantic segmentation with soft cross-entropy loss
CN110148085A (en) * 2019-04-22 2019-08-20 智慧眼科技股份有限公司 Face image super-resolution reconstruction method and computer-readable storage medium
CN110288537A (en) * 2019-05-20 2019-09-27 湖南大学 Facial image complementing method based on the depth production confrontation network from attention
CN110378979A (en) * 2019-07-04 2019-10-25 公安部第三研究所 The method automatically generated based on the generation confrontation customized high-resolution human face picture of network implementations
US20210118099A1 (en) * 2019-10-18 2021-04-22 Retrace Labs Generative Adversarial Network for Dental Image Super-Resolution, Image Sharpening, and Denoising
CN110889332A (en) * 2019-10-30 2020-03-17 中国科学院自动化研究所南京人工智能芯片创新研究院 Lie detection method based on micro expression in interview
CN111080527A (en) * 2019-12-20 2020-04-28 北京金山云网络技术有限公司 Image super-resolution method and device, electronic equipment and storage medium
CN112488923A (en) * 2020-12-10 2021-03-12 Oppo广东移动通信有限公司 Image super-resolution reconstruction method and device, storage medium and electronic equipment
CN112507997A (en) * 2021-02-08 2021-03-16 之江实验室 Face super-resolution system based on multi-scale convolution and receptive field feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KELVIN C.K. CHAN ET AL: "GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution", arXiv.org *
HUANG HUAIBO: "Face Image Synthesis and Analysis Based on Generative Models", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610861A (en) * 2022-05-11 2022-06-10 之江实验室 End-to-end dialogue method for integrating knowledge and emotion based on variational self-encoder
CN114610861B (en) * 2022-05-11 2022-08-26 之江实验室 End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder
CN115311720A (en) * 2022-08-11 2022-11-08 山东省人工智能研究院 Deepfake generation method based on Transformer
CN115311720B (en) * 2022-08-11 2023-06-06 山东省人工智能研究院 Method for generating deepfake based on Transformer

Also Published As

Publication number Publication date
CN113379606B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
WO2022267641A1 (en) Image defogging method and system based on cyclic generative adversarial network
CN113284051B (en) Face super-resolution method based on frequency decomposition multi-attention machine system
CN113191953B (en) Transformer-based face image super-resolution method
CN109685716B (en) Image super-resolution reconstruction method for generating countermeasure network based on Gaussian coding feedback
Zhao et al. Invertible image decolorization
CN109636721B (en) Video super-resolution method based on countermeasure learning and attention mechanism
CN109035267B (en) Image target matting method based on deep learning
CN110363068B (en) High-resolution pedestrian image generation method based on multiscale circulation generation type countermeasure network
CN113379606B (en) Face super-resolution method based on pre-training generation model
Luo et al. Lattice network for lightweight image restoration
CN112381716B (en) Image enhancement method based on generation type countermeasure network
CN109949217B (en) Video super-resolution reconstruction method based on residual learning and implicit motion compensation
CN112949636B (en) License plate super-resolution recognition method, system and computer readable medium
CN110047038B (en) Single-image super-resolution reconstruction method based on hierarchical progressive network
CN115713462A (en) Super-resolution model training method, image recognition method, device and equipment
CN117408924A (en) Low-light image enhancement method based on multiple semantic feature fusion network
CN116797541A (en) Transformer-based lung CT image super-resolution reconstruction method
CN116433516A (en) Low-illumination image denoising and enhancing method based on attention mechanism
CN113205005B (en) Low-illumination low-resolution face image reconstruction method
CN111861877A (en) Method and apparatus for video hyper-resolution
CN116266336A (en) Video super-resolution reconstruction method, device, computing equipment and storage medium
CN115018733A (en) High dynamic range imaging and ghost image removing method based on generation countermeasure network
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception
CN115311149A (en) Image denoising method, model, computer-readable storage medium and terminal device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant