Disclosure of Invention
The invention aims to provide a face super-resolution method based on prior information and an attention fusion mechanism, which addresses the insufficient use of face prior information in existing methods and effectively improves the quality of face image super-resolution reconstruction, as measured by PSNR and SSIM.
The technical scheme adopted by the invention is a face super-resolution method based on prior information and an attention fusion mechanism, implemented according to the following steps:
step 1: construct an original image data set and perform data enhancement; input the data-enhanced face images into a degradation model to obtain a low-resolution image data set; apply bicubic up-sampling to each low-resolution image to obtain images of the same size as the high-resolution images, which form the low-resolution data set; finally, divide the data set into a training set and a test set;
step 2: input the image obtained in step 1 into a coarse super-resolution network for processing to obtain the coarsely super-resolved image I_SR1;
step 3: input the training-set image I_SR1 obtained in step 2 into an encoder network for feature extraction to obtain a feature map f;
step 4: input the image I_SR1 obtained in step 2 into a prior information extraction network to extract prior information and obtain a parsing map p, where the prior information extraction network consists of a ResNet and a stacked hourglass network;
step 5: input the feature map f obtained in step 3 and the parsing map p obtained in step 4 into a feature fusion network, which fuses the parsing map with the feature map to obtain a fused feature map f_Fusion;
step 6: input the feature map f_Fusion obtained in step 5 into a decoder network for decoding to obtain the final super-resolution result I_SR;
step 7: input the I_SR1 obtained in step 2 and the original image into a pixel-by-pixel loss function to obtain l_1; input the parsing map p obtained in step 4 and the ground-truth parsing map p̃ of the original data set into a pixel-by-pixel loss function to obtain l_2; input the final result I_SR obtained in step 6 and the original image into a pixel-by-pixel loss function to obtain l_3; add the above losses to obtain L_total. Iterate continuously to minimize the loss function, finally producing a trained super-resolution network model;
and step 8: set the hyper-parameters of the super-resolution network model, input the test-set images preprocessed in step 1 into the model, and finally generate a high-resolution face image with clear detail and texture through residual network processing and loss-minimizing iteration.
The present invention is also characterized in that,
the step 1 specifically comprises the following steps:
step 1.1: download the CelebAMask-HQ data set, which contains a total of 30000 high-definition 1024x1024 face images, and crop the images to 128x128 with the resize function of Matlab as the original image size, reducing the computation cost.
Step 1.2: mirror-flip all images in the data set to obtain 60000 face images, yielding the data-enhanced face data set.
Step 1.3: perform degradation on the data set obtained in step 1.2 by inputting all images into a previously prepared degradation model to generate the corresponding low-resolution face images, simulating the real-world degradation process.
The degradation function is particularly complex, and super-resolution is difficult, because many factors in a real environment (including blur, noise, etc.) reduce image resolution. Therefore, existing super-resolution research simplifies the degradation process to consider only blurring, down-sampling and noise, as shown in formula 1:
I_LR = (I_HR ⊗ k)↓_s + n (1)
where I_HR denotes the high-resolution face image, k denotes the blur kernel convolved with the high-resolution face image, ↓_s denotes the down-sampling operation with down-sampling factor s, and n denotes noise. The degradation process can thus be described as blurring the high-resolution face image, down-sampling the blurred image by a factor of 8, and adding noise to the result, yielding a degraded low-resolution face image of size 16x16.
Step 1.4: apply bicubic up-sampling to the low-resolution face image obtained in step 1.3 to obtain a low-resolution face image I_LR consistent in size with the original image, i.e. 128x128.
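To make the degradation pipeline of steps 1.3-1.4 concrete, the following is a minimal Python sketch assuming a Gaussian blur kernel and additive Gaussian noise; the patent does not specify the exact kernel or noise level, so sigma and noise_std are illustrative assumptions.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter

def degrade(hr, sigma=1.6, scale=8, noise_std=5.0):
    """128x128 HR face image -> 16x16 degraded LR -> 128x128 bicubic I_LR."""
    # blur: convolve the HR image with a (here: Gaussian) blur kernel k
    blurred = gaussian_filter(np.asarray(hr, dtype=np.float32), sigma=(sigma, sigma, 0))
    # 8x down-sampling
    lr = Image.fromarray(blurred.clip(0, 255).astype(np.uint8)).resize(
        (hr.width // scale, hr.height // scale), Image.BICUBIC)
    # additive noise n
    noisy = np.asarray(lr, dtype=np.float32) + np.random.normal(
        0.0, noise_std, (lr.height, lr.width, 3))
    lr16 = Image.fromarray(noisy.clip(0, 255).astype(np.uint8))  # 16x16 LR image
    # bicubic up-sampling back to the original 128x128 size (step 1.4) -> I_LR
    return lr16.resize((hr.width, hr.height), Image.BICUBIC)
```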
Step 1.5: divide the data set from step 1.4 into a training set, a validation set and a test set in a 6:2:2 ratio, with 36000 images in the training set and 12000 images each in the validation set and the test set.
The step 2 specifically comprises the following steps:
Perform coarse super-resolution on the low-resolution face image I_LR obtained in step 1.5, i.e. feed I_LR into the CoarseSRNet network to obtain I_SR1, as shown in formula 2:
I_SR1 = CoarseSRNet(I_LR) (2)
where I_LR denotes the bicubically up-sampled low-resolution image and CoarseSRNet denotes the coarse super-resolution network employed.
The CoarseSRNet network in step 2 uses 3x3 convolution kernels with a ReLU activation function; 64 filters generate 64 feature maps, and a final 3x3 convolution yields the coarse super-resolution result I_SR1, whose size remains 128x128.
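A minimal PyTorch sketch of a CoarseSRNet along these lines is given below: 3x3 convolutions with ReLU, 64 filters producing 64 feature maps, and a final 3x3 convolution back to 3 channels. The exact number of intermediate layers is not given (the Examples describe it as a simplified SRCNN), so num_layers is an assumption.

```python
import torch
import torch.nn as nn

class CoarseSRNet(nn.Module):
    def __init__(self, num_layers=3):
        super().__init__()
        layers = [nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(64, 3, 3, padding=1)]    # final 3x3 conv -> I_SR1
        self.body = nn.Sequential(*layers)

    def forward(self, i_lr):                          # i_lr: (N, 3, 128, 128)
        return self.body(i_lr)                        # I_SR1 keeps the 128x128 size
```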
The step 3 specifically comprises the following steps:
As shown in formula 3:
f = Encoder(I_SR1) (3)
Step 3.1: input the I_SR1 obtained in step 2 into the feature extraction network, which uses an encoder structure. The encoder uses 64 convolution kernels of 3x3 with stride 2, followed by a batch normalization operation, down-sampling the input image I_SR1 to 64x64 and obtaining a 64-channel feature map of size 64x64, realizing the mapping from image space to feature space.
Step 3.2: combine an attention mechanism with residual blocks to form a residual attention network for feature extraction. Input the feature map obtained in step 3.1 into the residual attention network to extract deep features, obtaining a multi-channel feature map.
Step 3.3: input the feature map obtained in step 3.2 into a 3x3 convolution layer; after convolution, normalization and a Tanh activation function, the extracted feature map f is obtained, with 64 channels and size 64x64.
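The encoder of steps 3.1-3.3 could be sketched as follows; the residual attention body is shown here with plain residual blocks for brevity (a channel-attention RAB variant is sketched in the Examples), and the block count of 12 follows step 3.2.2.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Encoder(nn.Module):
    def __init__(self, num_blocks=12):
        super().__init__()
        self.down = nn.Sequential(                        # 128x128 -> 64x64
            nn.Conv2d(3, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResBlock() for _ in range(num_blocks)])
        self.tail = nn.Sequential(                        # conv + norm + Tanh -> f
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.Tanh())

    def forward(self, i_sr1):                             # (N, 3, 128, 128)
        return self.tail(self.blocks(self.down(i_sr1)))  # f: (N, 64, 64, 64)
```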
The step 4 is specifically that,
As shown in formula 4:
p = PriorEstimate(I_SR1) (4)
Step 4.1: input the coarse super-resolution result I_SR1 obtained in step 2 into the prior information extraction network; convolve I_SR1 with 128 convolution kernels of size 7x7, then apply normalization and ReLU operations to obtain 128 feature maps of size 64x64;
Step 4.2: construct a stacked hourglass network for prior information extraction, stacking 4 hourglass networks to extract the face parsing map. To effectively merge features across scales and retain spatial information at different scales, the stacked hourglass network uses skip connections between symmetrical layers. The resulting features are post-processed by a 1x1 convolution layer. Finally, the shared features are connected to two separate 1x1 convolution layers to generate a landmark heat map and a parsing map.
Step 4.3: input the feature maps obtained in step 4.1 into the stacked hourglass network to obtain, after processing, a 128-channel face parsing map p of size 128x64x64.
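A heavily simplified sketch of the prior estimation network of steps 4.1-4.3 follows: a 7x7, 128-filter head, a stack of 4 hourglass modules with skip connections at symmetric levels, shared 1x1 post-processing, and two separate 1x1 heads for the landmark heat map and the 128-channel parsing map. The hourglass internals and the number of landmarks (68) are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    """One recursive hourglass level with a skip connection at the symmetric layer."""
    def __init__(self, depth=4, ch=128):
        super().__init__()
        self.skip = nn.Conv2d(ch, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, ch, 3, padding=1)
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        low = F.max_pool2d(self.down(x), 2)          # go down one scale
        low = self.inner(low)
        return self.skip(x) + F.interpolate(self.up(low), scale_factor=2)

class PriorEstimate(nn.Module):
    def __init__(self, num_landmarks=68, num_classes=128):
        super().__init__()
        self.head = nn.Sequential(                   # 7x7 conv, 128 filters: 128x128 -> 64x64
            nn.Conv2d(3, 128, 7, stride=2, padding=3),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True))
        self.hourglasses = nn.Sequential(*[Hourglass() for _ in range(4)])
        self.post = nn.Conv2d(128, 128, 1)           # shared 1x1 post-processing
        self.to_heatmap = nn.Conv2d(128, num_landmarks, 1)
        self.to_parsing = nn.Conv2d(128, num_classes, 1)

    def forward(self, i_sr1):
        feat = self.post(self.hourglasses(self.head(i_sr1)))
        # landmark heat map and 128-channel parsing map p at 64x64
        return self.to_heatmap(feat), self.to_parsing(feat)
```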
The step 5 specifically comprises the following steps:
Input the 64-channel feature map f obtained in step 3.3 and the 128-channel face parsing map p obtained in step 4.3 into the feature fusion network, fusing the parsing map with the feature map to obtain a fused feature map f_Fusion of size 64x64x11. Each of the 11 channels corresponds to one face component, namely face skin, left eyebrow, right eyebrow, left eye, right eye, left ear, right ear, nose, mouth, upper lip and lower lip, 11 face components in total.
Step 5.1: construct the feature fusion network, which mainly comprises three parts: the first part is a 1x1 convolution that reduces the dimensionality of the face parsing map; the second part is the attention module CBAM, which weights the feature maps through a channel attention mechanism and a spatial attention mechanism to obtain feature maps describing the 11 different face components; the third part combines the feature maps describing the different face components with the corresponding parsing maps and averages them to obtain the final fused feature map f_Fusion.
Step 5.2: use 11 convolution kernels of 1x1 to reduce the 128-channel face parsing map p obtained in step 4.3 to 11 channels, obtaining p_j, where j ranges from 1 to 11 and each p_j represents the parsing map corresponding to one face component, so that each component can be constrained by the pixel-by-pixel parsing loss defined in step 7.
Step 5.3: process the feature map with the attention mechanism to obtain a weighted feature map for each face component, then concatenate the results.
An attention module is formed by a channel attention mechanism and a spatial attention mechanism in series; the importance of different spatial positions and different channels in each feature is obtained automatically through learning, and multiplying them by different weights enhances useful features and suppresses features unimportant to the current task.
Step 5.4: execute step 5.3 in a loop 11 times, weighting the feature maps corresponding to the 11 face components respectively to obtain the attention-processed features f_j of size 64x64x64, where j ranges from 1 to 11; each f_j is then cascaded with the parsing map of the corresponding face component.
Step 5.5: perform a weighted-average operation on the face parsing map p_j obtained in step 5.2 and the attention-processed feature map f_j with the corresponding subscript obtained in step 5.4, obtaining the fused feature map f_Fusion^j of size 64x64x1, as shown in formula 5:
f_Fusion^j = Mean(Cbam(f_j) ⊙ p_j) (5)
where f_Fusion^j denotes the fused feature of the jth channel, Mean denotes the cross-channel averaging operation, Cbam denotes the attention weighting applied to f_j, and ⊙ denotes element-by-element multiplication.
Step 5.6: cascade the fused feature maps f_Fusion^j obtained in step 5.5 to obtain the output f_Fusion of the feature fusion network, of size 64x64x11, as shown in formula 6:
f_Fusion = cat(f_Fusion^1, f_Fusion^2, ..., f_Fusion^11) (6)
where f_Fusion denotes the output of the feature fusion network, cat denotes the concatenation operation, and f_Fusion^j denotes the fusion feature corresponding to the jth face component; j from 1 to 11 indexes the 11 face components.
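The fusion of formulas 5 and 6 could be sketched as follows: a 1x1 convolution reduces the 128-channel parsing map p to 11 component maps p_j; for each component, the CBAM-weighted features Cbam(f_j) are multiplied element-wise by p_j and averaged across channels (formula 5), and the 11 single-channel results are concatenated (formula 6). The CBAM block below is a standard channel-plus-spatial attention implementation; the exact variant used is an assumption.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, ch=64, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(                  # channel attention weights
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True), nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(                  # spatial attention weights
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)
        s = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1)
        return x * self.spatial(s)

class FeatureFusion(nn.Module):
    def __init__(self, num_components=11):
        super().__init__()
        self.reduce = nn.Conv2d(128, num_components, 1)   # p (128 ch) -> p_j (11 ch)
        self.cbams = nn.ModuleList([CBAM(64) for _ in range(num_components)])

    def forward(self, f, p):                # f: (N,64,64,64), p: (N,128,64,64)
        p = self.reduce(p)
        fused = [self.cbams[j](f).mul(p[:, j:j + 1]).mean(1, keepdim=True)  # formula 5
                 for j in range(p.size(1))]
        return torch.cat(fused, dim=1)      # f_Fusion: (N,11,64,64), formula 6
```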
The step 6 is specifically that,
Input the fused feature map f_Fusion obtained in step 5.6 into the decoder for decoding; a deconvolution layer for up-sampling is added after the network's convolution, normalization and ReLU activation. A final 3x3 convolution yields the result I_SR. Meanwhile, skip connections splice the low-resolution image I_LR and the coarsely super-resolved image I_SR1 with the output of the feature fusion module, achieving a better reconstruction.
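A minimal sketch of such a decoder is given below; where exactly I_LR and I_SR1 are spliced into the decoder is not specified, so resizing both to the 64x64 feature resolution before concatenation is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, fusion_ch=11):
        super().__init__()
        in_ch = fusion_ch + 3 + 3                      # f_Fusion + I_LR + I_SR1
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1),  # 64x64 -> 128x128
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1))            # final 3x3 conv -> I_SR

    def forward(self, f_fusion, i_lr, i_sr1):
        # skip connections: resize the two images to the feature resolution and splice
        skips = [F.interpolate(x, size=f_fusion.shape[-2:], mode='bilinear',
                               align_corners=False) for x in (i_lr, i_sr1)]
        return self.body(torch.cat([f_fusion, *skips], dim=1))
```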
The step 7 is specifically that,
Step 7.1: define the joint loss function, as shown in formula 7:
L_total = (1/N) Σ_{i=1}^{N} ( ‖hr^(i) − I_SR^(i)‖² + ‖hr^(i) − I_SR1^(i)‖² + λ‖p̃^(i) − p^(i)‖² ) (7)
where each loss term adopts the mean-square-error loss, N denotes the number of images in the training set, hr^(i) denotes the high-resolution image corresponding to the ith low-resolution image, I_SR1^(i) denotes the result of the ith image after coarse super-resolution, p̃^(i) denotes the ground-truth parsing map corresponding to the ith image, p^(i) denotes the parsing map obtained from the ith image by the prior information estimation network, I_SR^(i) denotes the final super-resolution result of the ith image, and λ weights the parsing loss.
Step 7.2: input the I_SR1 output in step 2, the original image hr, the ground-truth parsing map p̃, the parsing map p extracted by the network, and the final result I_SR into the pixel-by-pixel loss functions; the high-resolution image is generated through pixel-by-pixel loss processing while the loss function is continuously minimized by iteration.
Step 7.3: iterate step 7.2 continuously, and take the set of weight parameters minimizing the joint loss function L_total as the trained model parameters, obtaining the trained super-resolution network model.
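A minimal sketch of the joint loss of formula 7, assuming all three terms are mean-square-error losses and that λ weights only the parsing-map term (λ is set to 0.8 in the Examples):

```python
import torch
import torch.nn.functional as F

def joint_loss(i_sr, i_sr1, p, hr, p_gt, lam=0.8):
    l1 = F.mse_loss(i_sr1, hr)    # coarse result vs. original image
    l2 = F.mse_loss(p, p_gt)      # estimated parsing map vs. ground-truth parsing map
    l3 = F.mse_loss(i_sr, hr)     # final result vs. original image
    return l1 + lam * l2 + l3     # L_total
```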
The step 8 is specifically that,
Train the model with the RMSprop algorithm, input the test-set data preprocessed in step 1 into the model generated in step 7.3, and finally generate the super-resolved high-definition face image through residual network processing and joint-loss-minimizing iteration.
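Putting the pieces together, a minimal training-loop sketch for the whole pipeline might look as follows, assuming the sub-network sketches given earlier, the joint_loss sketched in step 7, and a DataLoader yielding (I_LR, hr, p̃) triples; all names are illustrative.

```python
import torch

def train(coarse, encoder, prior, fusion, decoder, loader, epochs=10):
    params = [q for m in (coarse, encoder, prior, fusion, decoder)
              for q in m.parameters()]
    opt = torch.optim.RMSprop(params, lr=2.5e-4)   # RMSprop, as in the Examples
    for _ in range(epochs):
        for i_lr, hr, p_gt in loader:
            i_sr1 = coarse(i_lr)                   # step 2: coarse super-resolution
            f = encoder(i_sr1)                     # step 3: feature map f
            _, p = prior(i_sr1)                    # step 4 (landmark heat map unused here)
            i_sr = decoder(fusion(f, p), i_lr, i_sr1)   # steps 5-6
            loss = joint_loss(i_sr, i_sr1, p, hr, p_gt) # step 7, formula 7
            opt.zero_grad(); loss.backward(); opt.step()
```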
The invention has the beneficial effects that:
(1) The method of the invention introduces a channel attention mechanism into the residual blocks for feature extraction, so that the network learns purposefully, adaptively adjusts feature-channel information, strengthens the representation ability of the features, and helps recover more details such as contour and texture.
(2) The method of the invention fuses the feature map and the parsing map with an attention mechanism, fusing the parsing maps and feature maps corresponding to different face components separately. This strengthens the guiding role of the parsing map in face image super-resolution, exploits the extracted useful features more effectively, and suppresses useless features. The network can allocate computing resources accurately according to the weights, improving reconstruction efficiency and enhancing the reconstruction effect.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a face super-resolution method based on prior information and an attention fusion mechanism, which is implemented according to the following steps:
Step 1: construct an original image data set and perform data enhancement; input the data-enhanced face images into a degradation model to obtain a low-resolution image data set; apply bicubic up-sampling to each low-resolution image to obtain images of the same size as the high-resolution images, which form the low-resolution data set; divide the data set into a training set and a test set.
Step 2: input the image obtained in step 1 into the coarse super-resolution network for processing to obtain the coarsely super-resolved image I_SR1;
The step 2 specifically comprises the following steps: perform coarse super-resolution on the low-resolution face image I_LR obtained in step 1.5, i.e. feed I_LR into the CoarseSRNet network to obtain I_SR1, as shown in formula 2:
I_SR1 = CoarseSRNet(I_LR) (2)
where I_LR denotes the bicubically up-sampled low-resolution image and CoarseSRNet denotes the coarse super-resolution network employed, which performs the coarse super-resolution processing on the LR image.
Step 3: input the training-set image I_SR1 obtained in step 2 into the encoder network for feature extraction to obtain the feature map f, as shown in formula 3:
f = Encoder(I_SR1) (3)
the method specifically comprises the following steps:
Step 3.1: input the I_SR1 obtained in step 2 into the feature extraction network, which uses an encoder structure. Considering the computational cost, the parsing map is down-sampled to 64x64. To keep the feature sizes consistent, the encoder uses 64 convolution kernels of 3x3 with stride 2, followed by a batch normalization operation, down-sampling the input image I_SR1 to 64x64 and obtaining a 64-channel feature map of size 64x64, realizing the mapping from image space to feature space.
Step 3.2: combine an attention mechanism with residual blocks to form a residual attention network for feature extraction. Input the feature map obtained in step 3.1 into the residual attention network to extract deep features, obtaining a multi-channel feature map.
Step 3.3: input the feature map obtained in step 3.2 into a 3x3 convolution layer; after convolution, normalization and a Tanh activation function, the extracted feature map f is obtained, with 64 channels and size 64x64.
Step 4: input the image I_SR1 obtained in step 2 into the prior information extraction network to extract prior information and obtain the parsing map p, where the prior information extraction network consists of a ResNet and a stacked hourglass network, specifically as follows:
As shown in formula 4:
p = PriorEstimate(I_SR1) (4)
Step 4.1: input the coarse super-resolution result I_SR1 obtained in step 2 into the prior information extraction network; convolve I_SR1 with 128 convolution kernels of size 7x7, then apply normalization and ReLU operations to obtain 128 feature maps of size 64x64;
Step 4.2: adopt a stacked hourglass network to extract the parsing map, stacking 4 hourglass networks to extract the face parsing map. The resulting features are post-processed by a 1x1 convolution layer. Finally, the shared features are connected to two separate 1x1 convolution layers to generate a landmark heat map and a parsing map.
Step 4.3: input the feature maps obtained in step 4.1 into the stacked hourglass network to obtain, after processing, a 128-channel face parsing map p of size 128x64x64.
Step 5: input the feature map f obtained in step 3 and the parsing map p obtained in step 4 into the feature fusion network to fuse the parsing map with the feature map, obtaining the fused feature map f_Fusion. The method specifically comprises the following steps:
Input the 64-channel feature map f obtained in step 3.3 and the 128-channel face parsing map p obtained in step 4.3 into the feature fusion network, obtaining a fused feature map f_Fusion of size 64x64x11; each of the 11 channels corresponds to one face component, namely face skin, left eyebrow, right eyebrow, left eye, right eye, left ear, right ear, nose, mouth, upper lip and lower lip, 11 face components in total.
Step 5.1: construct the feature fusion network, which mainly comprises three parts: the first part is a 1x1 convolution that reduces the dimensionality of the face parsing map; the second part is the attention module CBAM, which weights the feature maps through a channel attention mechanism and a spatial attention mechanism to obtain feature maps describing the 11 different face components; the third part combines the feature maps describing the different face components with the corresponding parsing maps and averages them to obtain the final fused feature map f_Fusion.
Step 5.2: use 11 convolution kernels of 1x1 to reduce the 128-channel face parsing map p obtained in step 4.3 to 11 channels, obtaining p_j, where j ranges from 1 to 11 and each p_j represents the parsing map corresponding to one face component, so that each component can be constrained by the pixel-by-pixel parsing loss defined in step 7.
Step 5.3: process the feature map with the attention mechanism to obtain a weighted feature map for each face component, then concatenate the results.
An attention module is formed by a channel attention mechanism and a spatial attention mechanism in series. The 64-channel feature map obtained in step 3.3 is input into the channel attention module for weighting: the importance of each feature channel is obtained automatically through learning, and each channel is multiplied by its weight. The channel-attended feature map is then input into the spatial attention module, which learns in a similar way the importance of different spatial positions in each feature and multiplies different spatial positions by different weights, enhancing useful features and suppressing features unimportant to the current task.
Step 5.4: execute step 5.3 in a loop 11 times, weighting the feature maps corresponding to the 11 face components respectively to obtain the attention-processed features f_j of size 64x64x64, where j ranges from 1 to 11; each f_j is then cascaded with the parsing map of the corresponding face component.
Step 5.5: perform a weighted-average operation on the face parsing map p_j obtained in step 5.2 and the attention-processed feature map f_j with the corresponding subscript obtained in step 5.4, obtaining the fused feature map f_Fusion^j of size 64x64x1, as shown in formula 5:
f_Fusion^j = Mean(Cbam(f_j) ⊙ p_j) (5)
where f_Fusion^j denotes the fused feature of the jth channel, Mean denotes the cross-channel averaging operation, Cbam denotes the attention weighting applied to f_j, and ⊙ denotes element-by-element multiplication.
Step 5.6: cascade the fused feature maps f_Fusion^j obtained in step 5.5 to obtain the output f_Fusion of the feature fusion network, of size 64x64x11, as shown in formula 6:
f_Fusion = cat(f_Fusion^1, f_Fusion^2, ..., f_Fusion^11) (6)
where f_Fusion denotes the output of the feature fusion network, cat denotes the concatenation operation, and f_Fusion^j denotes the fusion feature corresponding to the jth face component; j from 1 to 11 indexes the 11 face components.
Step 6: input the feature map f_Fusion obtained in step 5 into the decoder network for decoding to obtain the final super-resolution result I_SR.
Step 7: input the I_SR1 obtained in step 2 and the original image into a pixel-by-pixel loss function to obtain l_1; input the parsing map p obtained in step 4 and the ground-truth parsing map p̃ of the original data set into a pixel-by-pixel loss function to obtain l_2; input the final result I_SR obtained in step 6 and the original image into a pixel-by-pixel loss function to obtain l_3; add the above losses to obtain L_total. Iterate continuously to minimize the loss function, and train to generate the super-resolution network model;
the step 7 is specifically that,
Step 7.1: define the joint loss function, as shown in formula 7:
L_total = (1/N) Σ_{i=1}^{N} ( ‖hr^(i) − I_SR^(i)‖² + ‖hr^(i) − I_SR1^(i)‖² + λ‖p̃^(i) − p^(i)‖² ) (7)
where each loss term adopts the mean-square-error loss, N denotes the number of images in the training set, hr^(i) denotes the high-resolution image corresponding to the ith low-resolution image, I_SR1^(i) denotes the result of the ith image after coarse super-resolution, p̃^(i) denotes the ground-truth parsing map corresponding to the ith image, p^(i) denotes the parsing map obtained from the ith image by the prior information estimation network, I_SR^(i) denotes the final super-resolution result of the ith image, and λ weights the parsing loss.
Step 7.2: input the I_SR1 output in step 2, the original image hr, the ground-truth parsing map p̃, the parsing map p extracted by the network, and the final result I_SR into the pixel-by-pixel loss functions; the high-resolution image is generated through pixel-by-pixel loss processing while the loss function is continuously minimized by iteration, finally producing the super-resolution network model.
Step 7.3: iterate step 7.2 continuously, and take the set of weight parameters minimizing the joint loss function L_total as the trained model parameters, obtaining the trained super-resolution network model.
Step 8: set the hyper-parameters of the super-resolution network model, input the test-set images preprocessed in step 1 into the model, and finally generate a high-resolution face image with clear detail and texture through residual network processing and loss-minimizing iteration.
Examples
A face super-resolution method based on prior information and an attention fusion mechanism, as shown in FIG. 1, is implemented according to the following steps:
Step 1: construct an original image data set and perform data enhancement; input the data-enhanced face images into a degradation model to obtain a low-resolution image data set; apply bicubic up-sampling to each low-resolution image to obtain images of the same size as the high-resolution images, which form the low-resolution data set; divide the data set into a training set and a test set. The method specifically comprises the following steps:
Step 1.1: download the CelebAMask-HQ data set and crop the images to 128x128 with the resize function of Matlab as the original image size.
Step 1.2: mirror-flip all images in the data set for data enhancement.
Step 1.3: input the data set obtained in step 1.2 into the prepared degradation model to generate the corresponding low-resolution face images, simulating the real-world degradation process, as shown in formula 1:
I_LR = (I_HR ⊗ k)↓_s + n (1)
where k denotes the blur kernel convolved with the high-resolution face image I_HR, ↓_s denotes the down-sampling operation with down-sampling factor s, and n denotes noise; the degraded low-resolution face image obtained has size 16x16.
Step 1.4: apply bicubic up-sampling to the low-resolution face image obtained in step 1.3 to obtain a low-resolution face image I_LR consistent in size with the original image, i.e. 128x128.
Step 1.5: divide the data set from step 1.4 into a training set, a validation set and a test set, with 36000 images in the training set and 12000 images each in the validation set and the test set.
Step 2: input the image obtained in step 1 into the coarse super-resolution network for processing to obtain the coarsely super-resolved image I_SR1;
The step 2 specifically comprises the following steps:
To ensure the accuracy of the subsequent feature extraction and prior information extraction, first perform coarse super-resolution on the low-resolution face image I_LR obtained in step 1.5, i.e. feed I_LR into the CoarseSRNet network to obtain I_SR1, as shown in formula 2:
I_SR1 = CoarseSRNet(I_LR) (2)
where I_LR denotes the bicubically up-sampled low-resolution image and CoarseSRNet denotes the coarse super-resolution network employed. CoarseSRNet is a simplified version of the general image super-resolution network SRCNN and performs the coarse super-resolution processing on the LR image.
The CoarseSRNet network in step 2 uses 3x3 convolution kernels with a ReLU activation function; 64 filters generate 64 feature maps, and a final 3x3 convolution yields the coarse super-resolution result I_SR1, whose size remains 128x128.
Step 3: input the training-set image I_SR1 obtained in step 2 into the encoder network for feature extraction to obtain the feature map f, as shown in fig. 2.
The method specifically comprises the following steps:
Step 3.1: input the I_SR1 obtained in step 2 into the feature extraction network, which uses an encoder structure, as shown in formula 3:
f = Encoder(I_SR1) (3)
Considering the computational cost, the parsing map is down-sampled to 64x64. Therefore, to keep the feature sizes consistent, the encoder uses 64 convolution kernels of 3x3 with stride 2, followed by a batch normalization operation, down-sampling the input image I_SR1 to 64x64 and obtaining a 64-channel feature map of size 64x64, realizing the mapping from image space to feature space.
And 3.2, inspiring by a residual error attention module (RCAN), and combining an attention mechanism and a residual error block to form a residual error attention network to extract features. And (4) inputting the feature map obtained in the step (3.1) into a residual error attention network to extract deep features, so as to obtain a multi-channel feature map.
The step 3.2 is specifically as follows:
Step 3.2.1: traditional deep learning methods process channel domains of different importance equally, so that considerable computing resources are wasted on unimportant features. To address this, a residual attention block (RAB) is constructed by introducing a channel attention mechanism into the residual block, so that the network can learn purposefully, extract effective features more efficiently, and suppress useless features. The attention mechanism captures the weight information implied by the channel domain, allocating computing resources more efficiently and accelerating network convergence;
Step 3.2.2: combine 12 residual attention blocks (RABs) to form the residual attention network, as shown in fig. 2.
Step 3.2.3: input the multi-channel feature map obtained in step 3.1 into the residual attention network to extract deep features, as sketched below.
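A minimal sketch of such a residual attention block: a residual branch whose output is re-weighted by squeeze-and-excite style channel attention (as in RCAN) before the skip addition; the reduction ratio of 16 is an assumption.

```python
import torch.nn as nn

class RAB(nn.Module):
    def __init__(self, ch=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.ca = nn.Sequential(                   # channel attention weights
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True), nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        res = self.body(x)
        return x + res * self.ca(res)              # weighted residual + skip connection

# 12 RABs form the residual attention network of step 3.2.2:
residual_attention_net = nn.Sequential(*[RAB() for _ in range(12)])
```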
Step 3.3: input the feature map obtained in step 3.2.3 into a 3x3 convolution layer; after convolution, normalization and a Tanh activation function, the extracted feature map f is obtained, with 64 channels and size 64x64.
Step 4: input the image I_SR1 obtained in step 2 into the prior information extraction network to extract prior information and obtain the parsing map p, as shown in fig. 3, where the prior information extraction network consists of a ResNet and a stacked hourglass network, specifically as follows:
Face prior information mainly includes face landmark images, landmark heat maps, face parsing maps and the like. When the image resolution is too low, the detected face key points are not accurate enough, and the resulting prior information would mislead the guidance of face super-resolution. Therefore, the face parsing map, rather than face key points, is selected as the face prior information; the three kinds of face prior information are shown in fig. 5.
Step 4.1: input the coarse super-resolution result I_SR1 obtained in step 2 into the prior information extraction network. In general, the larger the convolution kernel, the larger the receptive field and the better the obtained global features, so 128 convolution kernels of 7x7 are used to convolve I_SR1, followed by normalization and ReLU operations, yielding 128 feature maps of size 64x64, as shown in formula 4:
p = PriorEstimate(I_SR1) (4)
Step 4.2: construct a stacked hourglass network for prior information extraction. Inspired by the recent success of stacked hourglass networks in human pose estimation, a stacked hourglass network is adopted to extract the parsing map, stacking 4 hourglass networks to extract the face parsing map. Since the parsing map is two-dimensional, all features in the prior estimation network except the last layer are shared between the two tasks. To effectively merge features across scales and retain spatial information at different scales, the stacked hourglass network uses skip connections between symmetrical layers. The resulting features are post-processed by a 1x1 convolution layer. Finally, the shared features are connected to two separate 1x1 convolution layers to generate a landmark heat map and a parsing map.
Step 4.3: input the feature maps obtained in step 4.1 into the stacked hourglass network to obtain, after processing, a 128-channel face parsing map p of size 128x64x64.
Step 5: input the feature map f obtained in step 3 and the parsing map p obtained in step 4 into the feature fusion network to fuse the parsing map with the feature map, obtaining the fused feature map f_Fusion. The feature fusion module is shown in fig. 4. Specifically:
Input the 64-channel feature map f obtained in step 3.3 and the 128-channel face parsing map p obtained in step 4.3 into the feature fusion network, obtaining a fused feature map f_Fusion of size 64x64x11; each of the 11 channels corresponds to one face component, namely face skin, left eyebrow, right eyebrow, left eye, right eye, left ear, right ear, nose, mouth, upper lip and lower lip, 11 face components in total, as shown in fig. 6.
Step 5.1: construct the feature fusion network, which mainly comprises three parts: the first part is a 1x1 convolution that reduces the dimensionality of the face parsing map; the second part is the attention module CBAM, which weights the feature maps through a channel attention mechanism and a spatial attention mechanism to obtain feature maps describing the 11 different face components; the third part combines the feature maps describing the different face components with the corresponding parsing maps and averages them to obtain the final fused feature map f_Fusion.
Step 5.2: use 11 convolution kernels of 1x1 to reduce the 128-channel face parsing map p obtained in step 4.3 to 11 channels, obtaining p_j, where j ranges from 1 to 11 and each p_j represents the parsing map corresponding to one face component, so that each component can be constrained by the pixel-by-pixel parsing loss defined in step 7.
Step 5.3: in existing approaches, the facial structure may not be fully exploited, because the features of different face components are usually extracted by one shared network, so prior information specific to different face components may be ignored by the network. Different facial regions should therefore be restored separately for better performance. Hence, the feature map is processed by the attention mechanism to obtain a weighted feature map for each face component, and the results are then concatenated.
An attention module is formed by a channel attention mechanism and a spatial attention mechanism in series. The 64-channel feature map obtained in step 3.3 is input into the channel attention module for weighting: the importance of each feature channel is obtained automatically through learning, and each channel is multiplied by its weight. The channel-attended feature map is then input into the spatial attention module, which learns in a similar way the importance of different spatial positions in each feature and multiplies different spatial positions by different weights, enhancing useful features and suppressing features unimportant to the current task.
Step 5.4: execute step 5.3 in a loop 11 times, weighting the feature maps corresponding to the 11 face components respectively to obtain the attention-processed features f_j of size 64x64x64, where j ranges from 1 to 11; each f_j is then cascaded with the parsing map of the corresponding face component.
Step 5.5: perform a weighted-average operation on the face parsing map p_j obtained in step 5.2 and the attention-processed feature map f_j with the corresponding subscript obtained in step 5.4, obtaining the fused feature map f_Fusion^j of size 64x64x1, as shown in formula 5:
f_Fusion^j = Mean(Cbam(f_j) ⊙ p_j) (5)
where f_Fusion^j denotes the fused feature of the jth channel, Mean denotes the cross-channel averaging operation, Cbam denotes the attention weighting applied to f_j, and ⊙ denotes element-by-element multiplication.
Step 5.6: cascade the fused feature maps f_Fusion^j obtained in step 5.5 to obtain the output f_Fusion of the feature fusion network, of size 64x64x11, as shown in formula 6:
f_Fusion = cat(f_Fusion^1, f_Fusion^2, ..., f_Fusion^11) (6)
where f_Fusion denotes the output of the feature fusion network, cat denotes the concatenation operation, and f_Fusion^j denotes the fusion feature corresponding to the jth face component; j from 1 to 11 indexes the 11 face components.
Step 6: input the feature map f_Fusion obtained in step 5 into the decoder network for decoding to obtain the final super-resolution result I_SR. Specifically:
Input the fused feature map f_Fusion obtained in step 5.6 into the decoder for decoding. The decoder is similar in structure to the encoder and is likewise built from residual blocks, except that a deconvolution layer for up-sampling is added after the network's convolution, normalization and ReLU activation. A final 3x3 convolution yields the result I_SR. Meanwhile, to better exploit the rich low-frequency image information contained in the shallow features of I_LR, skip connections splice the low-resolution image I_LR and the coarsely super-resolved image I_SR1 with the output of the feature fusion module, so that the low-frequency information is passed directly to the end of the module, achieving a better reconstruction.
Step 7: input the I_SR1 obtained in step 2 and the original image into a pixel-by-pixel loss function to obtain l_1; input the parsing map p obtained in step 4 and the ground-truth parsing map p̃ of the original data set into a pixel-by-pixel loss function to obtain l_2; input the final result I_SR obtained in step 6 and the original image into a pixel-by-pixel loss function to obtain l_3; add the above losses to obtain L_total. Iterate continuously to minimize the loss function, finally producing a trained super-resolution network model;
the step 7 is specifically that,
Step 7.1: define the joint loss function, as shown in formula 7:
L_total = (1/N) Σ_{i=1}^{N} ( ‖hr^(i) − I_SR^(i)‖² + ‖hr^(i) − I_SR1^(i)‖² + λ‖p̃^(i) − p^(i)‖² ) (7)
where each loss term adopts the mean-square-error loss, N denotes the number of images in the training set, hr^(i) denotes the high-resolution image corresponding to the ith low-resolution image, I_SR1^(i) denotes the result of the ith image after coarse super-resolution, p̃^(i) denotes the ground-truth parsing map corresponding to the ith image, p^(i) denotes the parsing map obtained from the ith image by the prior information estimation network, I_SR^(i) denotes the final super-resolution result of the ith image, and λ weights the parsing loss.
Step 7.2: input the I_SR1 output in step 2, the original image hr, the ground-truth parsing map p̃, the parsing map p extracted by the network, and the final result I_SR into the pixel-by-pixel loss functions; the high-resolution image is generated through pixel-by-pixel loss processing while the loss function is continuously minimized by iteration, finally producing the super-resolution network model.
Step 7.3: iterate step 7.2 continuously, and take the set of weight parameters minimizing the joint loss function L_total as the trained model parameters, obtaining the trained super-resolution network model.
Step 8: set the hyper-parameters of the super-resolution network model, input the test-set images preprocessed in step 1 into the model, and finally generate a high-resolution face image with clear detail and texture through residual network processing and loss-minimizing iteration.
The step 8 is specifically that,
The model is trained using the RMSprop algorithm, with an initial learning rate of 2.5×10^-4 and a mini-batch size of 14; λ is set to 0.8 empirically. Training runs with a batch size of 8, and the learning rate of 10^-3 is halved every 1.2×10^5 iterations;
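As a sketch, the optimizer and halving schedule described above could be set up as follows; model stands in for the full network, and the use of StepLR is an illustrative assumption.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the full network
opt = torch.optim.RMSprop(model.parameters(), lr=2.5e-4)
# halve the learning rate every 1.2e5 iterations
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=120000, gamma=0.5)
# inside the training loop, per iteration:
#   opt.zero_grad(); loss.backward(); opt.step(); sched.step()
```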
Input the test-set data preprocessed in step 1 into the model generated in step 7.3, and finally generate the super-resolved high-definition face image through residual network processing and joint-loss-minimizing iteration.