CN116703719A - Face super-resolution reconstruction device and method based on face 3D priori information - Google Patents

Face super-resolution reconstruction device and method based on face 3D priori information

Info

Publication number
CN116703719A
Authority
CN
China
Prior art keywords
face
module
image
resolution
super
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310031762.7A
Other languages
Chinese (zh)
Inventor
刘广文
姚汉群
付强
马智勇
王伟刚
李英超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202310031762.7A priority Critical patent/CN116703719A/en
Publication of CN116703719A publication Critical patent/CN116703719A/en
Pending legal-status Critical Current


Abstract

A face super-resolution reconstruction device and method based on face 3D prior information belong to the technical field of image quality enhancement. According to the invention, a 3D rendering module (3D Rendering Block) is added to the existing face super-resolution scheme that fuses face prior information: high-dimensional face prior information at different scales is converted and passed into the prior network, where it is fused with face super-resolution images at the corresponding scales; finally, the fused images are combined with the 3D high-resolution detail images reconstructed by the 3D rendering module, so that high-resolution face details are preferentially retained. This improves face super-resolution performance, captures the stereoscopic structure of the face more accurately, and improves the realism and fidelity of the images, making the generated images more natural.

Description

Face super-resolution reconstruction device and method based on face 3D priori information
Technical Field
The invention belongs to the technical field of image quality enhancement, and particularly relates to a face super-resolution reconstruction device and method based on face 3D priori information.
Background
Super-resolution reconstruction (SR) is an important field of image quality enhancement research. Research shows that deep learning methods clearly outperform traditional shallow learning methods at image reconstruction. Many deep-learning-based face super-resolution methods exist: besides the common convolutional network models, unsupervised learning models, and GAN-based adversarial models that also apply to general images, a number of methods exploit the specific structure of the face, including face super-resolution based on face prior information and face super-resolution based on attribute constraints. These methods add prior or attribute information of the face to the network structure; such information carries very important semantics and helps the network model better restore the original appearance of the face image. Face super-resolution based on face prior information works particularly well because prior information that a convolutional network cannot learn on its own, such as facial landmark coordinates and contours, can guide model training. As a super-resolution technique for a specific domain, face super-resolution reconstruction can be used to recover lost face details.
However, existing face super-resolution methods based on face prior information have limitations. The key to successful face super-resolution reconstruction is how to effectively use prior knowledge of the face, from one-dimensional vectors (identity and attributes), to two-dimensional images (facial landmarks, heatmaps, and parsing maps), to three-dimensional models. How to model and rationally utilize this prior knowledge, and how to effectively integrate it into the training framework, remains an open challenge.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: existing face super-resolution methods based on face prior information cannot effectively utilize prior knowledge of the face or effectively integrate this information into the training framework. The invention provides a face super-resolution reconstruction device and method based on face 3D prior information to address this problem.
The face super-resolution reconstruction device based on face 3D prior information comprises a face low-resolution image input, a multi-scale feature extraction module, a prior module, a 3D rendering module, and a face super-resolution image output. The prior module comprises a storage and data processing module for a learnable tensor C and a plurality of upsampling modules. The 3D rendering module comprises a 3D face reconstruction module, a 3D face prior rendering module, and a spatial feature transformation module. The multi-scale feature extraction module consists of an encoder-decoder pair; it receives the low-resolution image from the input, extracts texture features and a latent variable W, sends W to the storage and data processing module of the learnable tensor C in the prior module, and also produces pictures at different scales that are sent to the 3D face reconstruction module to reconstruct a 3D face image. The 3D face prior rendering module receives the reconstructed 3D face image, performs optimized rendering of the face, obtains a 3D prior feature map carrying 3D face prior information, and sends it to the spatial feature transformation module. The spatial feature transformation module applies noise injection to the rendered 3D prior feature map to obtain fused feature maps at different scales. The storage and data processing module convolves the latent variable W with the learnable tensor C; the result serves as the input of the first upsampling module and is fused with the fused feature map of the corresponding scale. Each fused image then serves as the input of the next upsampling module and is fused with the fused feature map of the next scale. This process is repeated several times, and the final super-resolution image is output through the face super-resolution image output.
The face super-resolution reconstruction method based on the face 3D priori information utilizes the face super-resolution reconstruction device based on the face 3D priori information, and comprises the following steps of:
firstly, inputting a 2D image into a multi-scale feature extraction module through a low-resolution image input end, extracting texture features of the 2D image by the multi-scale feature extraction module, simultaneously extracting potential variables W by the multi-scale feature extraction module, sending the potential variables W to a storage and data processing module of a learnable tensor C in a priori module, and obtaining a plurality of pictures with different scales by the multi-scale feature extraction module and sending the pictures to a 3D rendering module;
a 3D face reconstruction module in the 3D rendering module performs 3D face reconstruction according to the texture features extracted in the first step, obtains a 3D face image and sends the 3D face image to the 3D face priori rendering module;
the 3D face prior rendering module performs optimized rendering of the received 3D face image using an overall loss function L_total; by minimizing L_total, a 3D prior feature map carrying the 3D face prior information is obtained and then sent to the spatial feature transformation module; the spatial feature transformation module applies noise injection to the rendered 3D prior feature map to obtain fused feature maps at different scales;
and thirdly, the storage and data processing module of the learnable tensor C convolves the latent variable W with the learnable tensor C, whose dimensions are the same as those of W and which is initialised with random numbers; the convolved result serves as the input of the first upsampling module and is fused with the fused feature map of the corresponding scale; each fused image then serves as the input of the next upsampling module and is fused with the fused feature map of the next scale; this is repeated several times until the super-resolution image is obtained and output through the face super-resolution image output.
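The multi-scale fusion in step three can be sketched as follows. This is a hypothetical simplification: an element-wise product stands in for the learned convolution of W with C, nearest-neighbour repetition stands in for the learned upsampling modules, and addition stands in for the fusion operation; none of these exact operators is specified in the patent.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling: repeat each pixel along both spatial axes.
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def fuse_pipeline(w, c, fused_maps):
    """Sketch of step three: combine latent W with learnable tensor C
    (element-wise product as a stand-in for convolution), then repeatedly
    upsample and fuse (add) with the multi-scale fused feature maps."""
    x = w * c  # stands in for the convolution of W with C
    for f in fused_maps:
        x = upsample2x(x)
        assert x.shape == f.shape  # scales must correspond
        x = x + f                  # fusion with the feature map at this scale
    return x

rng = np.random.default_rng(0)
w = rng.standard_normal((1, 16, 16))
c = rng.standard_normal((1, 16, 16))   # same dimensions as W, random-filled
maps = [rng.standard_normal((1, 32, 32)),
        rng.standard_normal((1, 64, 64)),
        rng.standard_normal((1, 128, 128))]
sr = fuse_pipeline(w, c, maps)
print(sr.shape)  # (1, 128, 128)
```

Each pass through the loop doubles the spatial resolution, so three fused feature maps carry a 16x16 latent up to a 128x128 output, matching the coarse-to-fine scheme described above.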
The 3D face reconstruction module adopts an improved 3DMM-based model, which uses a convolutional neural network to regress the parameters of the 3DMM model, obtaining a 3D coefficient vector that forms the 3D face image and thus an accurately positioned face structure.
The 3D coefficient vector is expressed as x = (α, β, δ, γ, ρ) ∈ ℝ^239, where α ∈ ℝ^80, β ∈ ℝ^64, δ ∈ ℝ^80, γ ∈ ℝ^9, ρ ∈ ℝ^6 represent identity, facial expression, texture, illumination, and face pose, respectively;
α, β, and δ are the coefficient vectors used to generate the 3D face, converting the face coefficient vector into the 3D shape S and texture T of the face image:

S = S̄ + B_id·α + B_exp·β
T = T̄ + B_t·δ

where S̄ is the average face shape and T̄ is the average face texture; B_id, B_exp, and B_t are the principal component analysis (PCA) bases of identity, expression, and texture, respectively, with the basis matrices scaled by their standard deviations.
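The linear 3DMM model described above can be sketched directly. The vertex count N and all basis values below are placeholders; a real system would load the Basel Face Model or a similar morphable model.

```python
import numpy as np

N = 1000  # hypothetical vertex count; a real 3DMM has tens of thousands
rng = np.random.default_rng(0)
S_mean = rng.standard_normal(3 * N)        # average face shape (x,y,z per vertex)
T_mean = rng.standard_normal(3 * N)        # average face texture (r,g,b per vertex)
B_id  = rng.standard_normal((3 * N, 80))   # identity PCA basis
B_exp = rng.standard_normal((3 * N, 64))   # expression PCA basis
B_t   = rng.standard_normal((3 * N, 80))   # texture PCA basis

def shape_and_texture(alpha, beta, delta):
    """S = S_mean + B_id a + B_exp b ;  T = T_mean + B_t d"""
    S = S_mean + B_id @ alpha + B_exp @ beta
    T = T_mean + B_t @ delta
    return S, T

alpha = rng.standard_normal(80)
beta  = rng.standard_normal(64)
delta = rng.standard_normal(80)
S, T = shape_and_texture(alpha, beta, delta)
```

With zero coefficients the model reduces to the average face, which is a quick sanity check on the linear formulation.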
The overall loss function L_total is expressed as:

L_total = L_s + L_local + L_id

where L_s is the skin-aware illumination-intensity loss function, L_local is the local discrimination and feature-style loss function, and L_id is the face identity consistency loss function.
The skin-aware illumination-intensity loss function L_s is expressed as:

L_s = Σ_{j=1}^{L} ω_j · ( Σ_{i∈M} A_i ‖ŷ_i − y_i(B(y))‖₂ ) / ( Σ_{i∈M} A_i )

where ω_j denotes the weight of training pair j, j is the paired-image index, L is the total number of training pairs, i is the pixel index, and M is the face region; ŷ denotes the clear image; A_i is a skin-colour attention mask of the training image obtained by Bayesian training; y denotes the input image, B(y) the regression coefficients obtained with y as input, and y(B(y)) the image rendered using the 3D coefficients B(y);
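A minimal sketch of this skin-aware photometric loss for a single training pair follows; the per-pixel L2 colour error is weighted by the attention mask A, restricted to the face region M, and normalised by the mask sum. The mask values and image sizes are illustrative.

```python
import numpy as np

def skin_photo_loss(y_clear, y_rendered, A, M):
    """Skin-aware photometric loss for one image pair: per-pixel L2 error
    over RGB, weighted by skin-attention mask A within face region M,
    normalised by the total mask weight."""
    err = np.linalg.norm(y_clear - y_rendered, axis=-1)  # L2 over colour channels
    w = A * M
    return float((w * err).sum() / (w.sum() + 1e-8))

rng = np.random.default_rng(0)
y = rng.random((64, 64, 3))
A = rng.random((64, 64))      # skin-colour attention mask in [0, 1]
M = np.ones((64, 64))         # face region mask (all-ones for illustration)
loss_same = skin_photo_loss(y, y, A, M)        # identical images: zero loss
loss_diff = skin_photo_loss(y, 1.0 - y, A, M)  # inverted image: positive loss
```

The normalisation by Σ A_i keeps the loss scale independent of how much of the image the skin mask covers.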
The local discrimination and feature-style loss function L_local further enhances perceptually significant facial components, and is expressed as:

L_local = λ_local Σ_ROI E[ log(1 − D_ROI(ŷ_ROI)) ] + λ_fs Σ_ROI ‖ Gram(ψ(ŷ_ROI)) − Gram(ψ(y_ROI)) ‖₁

where the Gram matrix is used to compute feature correlations and capture texture information; λ_local and λ_fs are the loss weights of the local discrimination loss and the feature-style loss, respectively; ROI denotes the regions of interest (left eye, right eye, and mouth); D_ROI is the local discriminator for each region; ψ denotes the multi-resolution features of the learned discriminator; ŷ_ROI denotes the predicted value of a region of interest, y_ROI its ground-truth value, and E the expectation;
The face identity consistency loss function L_id keeps the face identity information consistent, and is expressed as:

L_id = λ_id ‖ η(ŷ) − η(y) ‖₁

Facial features are extracted here with a pre-trained ArcFace model to capture the most prominent identity features, where η denotes the face feature extractor and λ_id the weight of the identity-preservation loss; ŷ denotes the predicted value and y the ground-truth value.
The spatial feature transformation module obtains the fused feature maps at different scales from the rendered 3D prior feature map, after noise injection, as follows:
the spatial feature transformation module takes the 3D face prior information Ψ in the 3D prior feature map as input, learns a mapping function M, and obtains the output of the spatial feature transform (SFT) layer:

(A, B) = M(Ψ)
SFT(F | A, B) = A ⊙ F + B

where F denotes a feature map, ⊙ denotes element-wise multiplication, and A and B are prior features. The mapping function provides the prior features A and B as a modulation parameter pair (A, B) derived from the prior information; by spatially applying this affine transformation to each intermediate feature map, the output of the SFT layer is adaptively controlled.
Through the design scheme, the invention has the following beneficial effects:
the invention adds a 3D rendering module into the existing face super-resolution technical scheme for fusing face priori information based on the generation of the countermeasure network. In this way, gao Weiren face prior information with different scales can be transferred into the prior network after being converted and fused with face super-resolution images with different scales. Finally, the fused images are fused with the 3D high-resolution detail images reconstructed by the 3D rendering module so as to preferentially reserve the high-resolution face details, thereby improving the performance of the super-resolution of the face.
In the face super-resolution technology, the addition of the 3D rendering module plays a vital role. The method enables Gao Weiren face prior information of different scales to be converted into a form which can be understood by a prior network, and the form is fused with face super-resolution images of different scales. Thus, we can improve the performance of super-resolution of the face while retaining high resolution face details.
The addition of the 3D rendering module has another benefit in that we can preserve the detail information of the 3D reconstruction while fusing the high resolution face detail images. This information is critical to the performance of improving super-resolution of faces, as they can help us capture the stereo structure of faces more accurately. At the same time, the information also helps to improve the realism and fidelity of the image, so that the generated image is more natural.
Drawings
The invention is further described with reference to the drawings and detailed description which follow:
fig. 1 is a block diagram of a face super-resolution reconstruction device and a face super-resolution reconstruction method based on face 3D prior information.
Fig. 2 is a functional block diagram of a spatial feature transformation module in a face super-resolution reconstruction device and method based on face 3D prior information.
In the figures: 1, face low-resolution image input; 2, multi-scale feature extraction module; 3, prior module; 4, 3D rendering module; 5, face super-resolution image output.
Detailed Description
As shown in the figures, the face super-resolution reconstruction device based on face 3D prior information comprises a face low-resolution image input 1, a multi-scale feature extraction module 2, a prior module 3, a 3D rendering module 4, and a face super-resolution image output 5. The prior module 3 comprises a storage and data processing module for a learnable tensor C and a plurality of upsampling modules. The 3D rendering module 4 comprises a 3D face reconstruction module, a 3D face prior rendering module, and a spatial feature transformation module. The multi-scale feature extraction module 2 consists of an encoder-decoder pair; it receives the low-resolution image from input 1, extracts texture features and a latent variable W, and sends W to the storage and data processing module of the learnable tensor C in the prior module 3; in addition, module 2 produces pictures at different scales and sends them to the 3D face reconstruction module to reconstruct the 3D face image. The 3D face prior rendering module receives the reconstructed 3D face image, performs optimized rendering of the face, obtains a 3D prior feature map carrying 3D face prior information, and sends it to the spatial feature transformation module. The spatial feature transformation module applies noise injection to the rendered 3D prior feature map to obtain fused feature maps at different scales. The storage and data processing module convolves the latent variable W with the learnable tensor C; the result serves as the input of the first upsampling module and is fused with the fused feature map of the corresponding scale; each fused image then serves as the input of the next upsampling module and is fused with the fused feature map of the next scale. This process is repeated several times, and the final super-resolution image is output through the face super-resolution image output 5.
The face super-resolution reconstruction method based on the face 3D priori information utilizes the face super-resolution reconstruction device based on the face 3D priori information, and comprises the following steps of:
firstly, inputting a 2D image into a multi-scale feature extraction module through a low-resolution image input end, extracting texture features of the 2D image by the multi-scale feature extraction module, simultaneously extracting potential variables W by the multi-scale feature extraction module, sending the potential variables W to a storage and data processing module of a learnable tensor C in a priori module, and obtaining a plurality of pictures with different scales by the multi-scale feature extraction module and sending the pictures to a 3D rendering module;
a 3D face reconstruction module in the 3D rendering module performs 3D face reconstruction according to the texture features extracted in the first step, obtains a 3D face image and sends the 3D face image to the 3D face priori rendering module;
the 3D face prior rendering module performs optimized rendering of the received 3D face image using an overall loss function L_total; by minimizing L_total, a 3D prior feature map carrying the 3D face prior information is obtained and then sent to the spatial feature transformation module; the spatial feature transformation module applies noise injection to the rendered 3D prior feature map to obtain fused feature maps at different scales;
and thirdly, the storage and data processing module of the learnable tensor C convolves the latent variable W with the learnable tensor C, whose dimensions are the same as those of W and which is initialised with random numbers; the convolved result serves as the input of the first upsampling module and is fused with the fused feature map of the corresponding scale; each fused image then serves as the input of the next upsampling module and is fused with the fused feature map of the next scale; this is repeated several times until the super-resolution image is obtained and output through the face super-resolution image output.
The 3D face reconstruction module adopts an improved 3DMM-based model, which uses a convolutional neural network to regress the parameters of the 3DMM model, obtaining a 3D coefficient vector that forms the 3D face image and thus an accurately positioned face structure.
The 3D coefficient vector is expressed as x = (α, β, δ, γ, ρ) ∈ ℝ^239, where α ∈ ℝ^80, β ∈ ℝ^64, δ ∈ ℝ^80, γ ∈ ℝ^9, ρ ∈ ℝ^6 represent identity, facial expression, texture, illumination, and face pose, respectively;
α, β, and δ are the coefficient vectors used to generate the 3D face, converting the face coefficient vector into the 3D shape S and texture T of the face image:

S = S̄ + B_id·α + B_exp·β
T = T̄ + B_t·δ

where S̄ is the average face shape and T̄ is the average face texture; B_id, B_exp, and B_t are the principal component analysis (PCA) bases of identity, expression, and texture, respectively, with the basis matrices scaled by their standard deviations.
The overall loss function L_total is expressed as:

L_total = L_s + L_local + L_id

where L_s is the skin-aware illumination-intensity loss function, L_local is the local discrimination and feature-style loss function, and L_id is the face identity consistency loss function.
For more stable rendering of the face, a skin-aware illumination-intensity loss function L_s is introduced, expressed as:

L_s = Σ_{j=1}^{L} ω_j · ( Σ_{i∈M} A_i ‖ŷ_i − y_i(B(y))‖₂ ) / ( Σ_{i∈M} A_i )

where ω_j denotes the weight of training pair j, j is the paired-image index, L is the total number of training pairs, i is the pixel index, and M is the face region; ŷ denotes the clear image; A_i is a skin-colour attention mask of the training image obtained by Bayesian training; y denotes the input image, B(y) the regression coefficients obtained with y as input, and y(B(y)) the image rendered using the 3D coefficients B(y);
The local discrimination and feature-style loss function L_local further enhances perceptually significant facial components, and is expressed as:

L_local = λ_local Σ_ROI E[ log(1 − D_ROI(ŷ_ROI)) ] + λ_fs Σ_ROI ‖ Gram(ψ(ŷ_ROI)) − Gram(ψ(y_ROI)) ‖₁

where the Gram matrix is used to compute feature correlations and capture texture information; λ_local and λ_fs are the loss weights of the local discrimination loss and the feature-style loss, respectively; ROI denotes the regions of interest (left eye, right eye, and mouth); D_ROI is the local discriminator for each region; ψ denotes the multi-resolution features of the learned discriminator; ŷ_ROI denotes the predicted value of a region of interest, y_ROI its ground-truth value, and E the expectation;
The face identity consistency loss function L_id keeps the face identity information consistent, and is expressed as:

L_id = λ_id ‖ η(ŷ) − η(y) ‖₁

Facial features are extracted here with a pre-trained ArcFace model to capture the most prominent identity features, where η denotes the face feature extractor and λ_id the weight of the identity-preservation loss; ŷ denotes the predicted value and y the ground-truth value.
The spatial feature transformation module obtains the fused feature maps at different scales from the rendered 3D prior feature map, after noise injection, as follows:
the spatial feature transformation module takes the 3D face prior information Ψ in the 3D prior feature map as input, learns a mapping function M, and obtains the output of the spatial feature transform (SFT) layer:

(A, B) = M(Ψ)
SFT(F | A, B) = A ⊙ F + B

where F denotes a feature map, ⊙ denotes element-wise multiplication, and A and B are prior features. The mapping function provides the prior features A and B as a modulation parameter pair (A, B) derived from the prior information; by spatially applying this affine transformation to each intermediate feature map, the output of the SFT layer is adaptively controlled.
The CelebA dataset was used in verifying the algorithm performance. The training phase used 162,080 images of the CelebA dataset. In the test phase 40,519 images of the CelebA test set were used. The low resolution input is generated by a bicubic downsampling method, following the protocol of the existing related face super resolution method. The high resolution image is obtained by center cropping the face images and then adjusting them to 128×128 pixels. The low resolution face image is generated by downsampling the high resolution image to 32 x 32 pixels (four times magnification) and 16 x 16 pixels (eight times magnification).
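The data preparation described above (center crop to 128x128, then 4x and 8x downsampling) can be sketched as follows. The patent specifies bicubic downsampling; block averaging is used here only as a self-contained stand-in, and the 178x218 input size matches aligned CelebA images.

```python
import numpy as np

def center_crop(img, size):
    """Crop the central size x size region of an H x W x C image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def block_downsample(img, factor):
    # Stand-in for the patent's bicubic downsampling: average factor x factor blocks.
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

hr_raw = np.random.rand(218, 178, 3)   # aligned CelebA image: 178 wide, 218 tall
hr = center_crop(hr_raw, 128)          # 128x128 high-resolution target
lr_x4 = block_downsample(hr, 4)        # 32x32 input for 4x super-resolution
lr_x8 = block_downsample(hr, 8)        # 16x16 input for 8x super-resolution
```

A production pipeline would replace `block_downsample` with a bicubic resize (e.g. from an image library) to match the evaluation protocol exactly.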
The evaluation results of each related algorithm are shown in the table. PSNR is the peak signal-to-noise ratio; the larger its value, the less the image distortion. SSIM is the structural similarity index; the larger its value, the more similar the images.

Claims (6)

1. A face super-resolution reconstruction device based on face 3D prior information, characterized in that: it comprises a face low-resolution image input, a multi-scale feature extraction module, a prior module, a 3D rendering module, and a face super-resolution image output; the prior module comprises a storage and data processing module for a learnable tensor C and a plurality of upsampling modules; the 3D rendering module comprises a 3D face reconstruction module, a 3D face prior rendering module, and a spatial feature transformation module; the multi-scale feature extraction module consists of an encoder-decoder pair, receives the low-resolution image from the input, extracts texture features and a latent variable W, sends W to the storage and data processing module of the learnable tensor C in the prior module, and produces pictures at different scales that are sent to the 3D face reconstruction module to reconstruct a 3D face image; the 3D face prior rendering module receives the reconstructed 3D face image, performs optimized rendering of the face, obtains a 3D prior feature map carrying 3D face prior information, and sends it to the spatial feature transformation module; the spatial feature transformation module applies noise injection to the rendered 3D prior feature map to obtain fused feature maps at different scales; the storage and data processing module convolves the latent variable W with the learnable tensor C, the result serving as the input of the first upsampling module and being fused with the fused feature map of the corresponding scale; each fused image then serves as the input of the next upsampling module and is fused with the fused feature map of the next scale; this process is repeated several times, and a super-resolution image is finally obtained.
2. A face super-resolution reconstruction method based on face 3D prior information, using the face super-resolution reconstruction device based on face 3D prior information of claim 1, characterized in that it comprises the following steps, carried out in sequence:
firstly, inputting a 2D image into a multi-scale feature extraction module through a low-resolution image input end, extracting texture features of the 2D image by the multi-scale feature extraction module, simultaneously extracting potential variables W by the multi-scale feature extraction module, sending the potential variables W to a storage and data processing module of a learnable tensor C in a priori module, and obtaining a plurality of pictures with different scales by the multi-scale feature extraction module and sending the pictures to a 3D rendering module;
a 3D face reconstruction module in the 3D rendering module performs 3D face reconstruction according to the texture features extracted in the first step, obtains a 3D face image and sends the 3D face image to the 3D face priori rendering module;
the 3D face prior rendering module performs optimized rendering of the received 3D face image using an overall loss function L_total; by minimizing L_total, a 3D prior feature map carrying the 3D face prior information is obtained and then sent to the spatial feature transformation module; the spatial feature transformation module applies noise injection to the rendered 3D prior feature map to obtain fused feature maps at different scales;
and thirdly, the storage and data processing module of the learnable tensor C convolves the latent variable W with the learnable tensor C, whose dimensions are the same as those of W and which is initialised with random numbers; the convolved result serves as the input of the first upsampling module and is fused with the fused feature map of the corresponding scale; each fused image then serves as the input of the next upsampling module and is fused with the fused feature map of the next scale; this is repeated several times until the super-resolution image is obtained and output through the face super-resolution image output.
3. The face super-resolution reconstruction method based on face 3D prior information according to claim 2, characterized in that: the 3D face reconstruction module adopts an improved 3DMM-based model, which uses a convolutional neural network to regress the parameters of the 3DMM model, obtaining a 3D coefficient vector that forms the 3D face image and thus an accurately positioned face structure.
4. The face super-resolution reconstruction method based on face 3D prior information according to claim 2, characterized in that: the 3D coefficient vector is expressed as x = (α, β, δ, γ, ρ) ∈ ℝ^239, where α ∈ ℝ^80, β ∈ ℝ^64, δ ∈ ℝ^80, γ ∈ ℝ^9, ρ ∈ ℝ^6 represent identity, facial expression, texture, illumination, and face pose, respectively;
α, β, and δ are the coefficient vectors used to generate the 3D face, converting the face coefficient vector into the 3D shape S and texture T of the face image:

S = S̄ + B_id·α + B_exp·β
T = T̄ + B_t·δ

where S̄ is the average face shape and T̄ is the average face texture; B_id, B_exp, and B_t are the principal component analysis (PCA) bases of identity, expression, and texture, respectively, with the basis matrices scaled by their standard deviations.
5. The face super-resolution reconstruction method based on face 3D prior information according to claim 2, characterized in that: the overall loss function L_total is expressed as:

L_total = L_s + L_local + L_id

where L_s is the skin-aware illumination-intensity loss function, L_local is the local discrimination and feature-style loss function, and L_id is the face identity consistency loss function.
The skin-aware illumination-intensity loss function L_s is expressed as:

L_s = Σ_{j=1}^{L} ω_j · ( Σ_{i∈M} A_i ‖ŷ_i − y_i(B(y))‖₂ ) / ( Σ_{i∈M} A_i )

where ω_j denotes the weight of training pair j, j is the paired-image index, L is the total number of training pairs, i is the pixel index, and M is the face region; ŷ denotes the clear image; A_i is a skin-colour attention mask of the training image obtained by Bayesian training; y denotes the input image, B(y) the regression coefficients obtained with y as input, and y(B(y)) the image rendered using the 3D coefficients B(y);
local discrimination loss and feature type loss functionFor further enhancing perceptually significant facial components, expressed as follows:
in the formula, calculating characteristic correlation by using a Gram matrix, and capturing texture information; lambda (lambda) local Loss weight, lambda, representing local discrimination loss fs Loss weights representing feature type losses; ROI represents the region of interest of the left, right, mouth; d (D) ROI Is a local discriminator for each region; psi represents the multi-resolution features of the learning discriminator;predictive value, y, representing a region of interest RI A true value representing the region of interest,/->Representing the expected value;
face identity consistency loss functionThe identity information of the face is consistent, and the identity information of the face is expressed as follows:
facial feature extraction is performed here using a pre-trained ArcFace model to capture the most prominent features of identity information, where η represents a face feature extractor, λ id Weights representing identity retention losses;representing the predicted value, y representing the actual value.
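Two of the loss terms of claim 5 reduce to short NumPy expressions. The sketch below assumes an attention-weighted L2 photometric term and a cosine-distance identity term; the image tensors, the mask, and the stand-in "ArcFace embedding" are all illustrative placeholders, not the patent's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def photometric_loss(y_clear, y_rendered, attention):
    """Skin-aware photometric loss for one image pair:
    sum_i A_i * ||y_i - y'_i||_2 / sum_i A_i over the face region."""
    diff = np.linalg.norm(y_clear - y_rendered, axis=-1)  # per-pixel L2 norm
    return float((attention * diff).sum() / attention.sum())

def identity_loss(feat_pred, feat_true, lam_id=1.0):
    """Identity consistency as 1 - cosine similarity between the face
    embeddings eta(y_hat) and eta(y)."""
    cos = feat_pred @ feat_true / (
        np.linalg.norm(feat_pred) * np.linalg.norm(feat_true))
    return lam_id * (1.0 - float(cos))

H = W = 8
y_clear    = rng.random((H, W, 3))      # clear (ground-truth) image
y_rendered = rng.random((H, W, 3))      # image rendered from 3D coefficients
attention  = rng.random((H, W))         # skin-colour attention mask A_i

loss_p = photometric_loss(y_clear, y_rendered, attention)

f = rng.normal(size=(512,))             # stand-in face embedding
loss_i = identity_loss(f, f)            # near zero for identical embeddings

print(loss_p, loss_i)
```

The identity term rewards embeddings that point in the same direction, which is why it vanishes when prediction and ground truth share one embedding.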
6. The face super-resolution reconstruction method based on face 3D prior information according to claim 2, characterized in that: the specific method by which the spatial feature transformation module performs noise addition on the rendered 3D prior feature map and obtains the fused feature maps of different scales is as follows:
the spatial feature transformation module takes the 3D face prior information in the 3D prior feature map as input to learn a mapping function and obtain the output of the spatial feature transformation (SFT) layer; the mapping function is expressed as follows:

SFT(F | A, B) = A ⊙ F + B

where F denotes the feature map, ⊙ denotes element-wise multiplication, and A and B are prior features;
the function derives the prior features A and B from the prior information as the modulation parameter pair (A, B); by applying an affine transformation spatially to each intermediate feature map, the output of the SFT layer is adaptively controlled through the transformation.
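The SFT modulation itself is a single element-wise affine operation. A minimal sketch, in which the condition network that would predict A and B from the 3D prior is stubbed out with random tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

def sft(F, A, B):
    """Spatial feature transform: SFT(F | A, B) = A * F + B, element-wise."""
    return A * F + B

# Intermediate feature map and modulation parameters (A, B); in the described
# device these would be predicted from the 3D prior features, here random.
F = rng.normal(size=(16, 16, 32))
A = rng.normal(size=(16, 16, 32))   # per-position scale from prior features
B = rng.normal(size=(16, 16, 32))   # per-position shift from prior features

out = sft(F, A, B)
print(out.shape)                    # (16, 16, 32)

# Identity check: A = 1, B = 0 leaves the feature map unchanged.
assert np.allclose(sft(F, np.ones_like(F), np.zeros_like(F)), F)
```

Because A and B vary per spatial position, the prior can modulate eyes, mouth, and skin regions differently within one feature map.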
CN202310031762.7A 2023-01-10 2023-01-10 Face super-resolution reconstruction device and method based on face 3D priori information Pending CN116703719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310031762.7A CN116703719A (en) 2023-01-10 2023-01-10 Face super-resolution reconstruction device and method based on face 3D priori information

Publications (1)

Publication Number Publication Date
CN116703719A true CN116703719A (en) 2023-09-05

Family

ID=87828152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310031762.7A Pending CN116703719A (en) 2023-01-10 2023-01-10 Face super-resolution reconstruction device and method based on face 3D priori information

Country Status (1)

Country Link
CN (1) CN116703719A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993929A (en) * 2023-09-27 2023-11-03 北京大学深圳研究生院 Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium
CN116993929B (en) * 2023-09-27 2024-01-16 北京大学深圳研究生院 Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination