Disclosure of Invention
The invention aims to provide a face super-resolution method based on prior information and an attention fusion mechanism, which addresses the insufficient use of face prior information in existing methods and effectively improves the quality of face image super-resolution reconstruction, as measured by PSNR and SSIM.
The technical scheme adopted by the invention is a face super-resolution method based on prior information and an attention fusion mechanism, implemented according to the following steps:
step 1: construct an original image data set and perform data enhancement; input the data-enhanced face images into a degradation model to obtain a low-resolution image data set; apply bicubic up-sampling to each low-resolution image to obtain images of the same size as the high-resolution images, which form the low-resolution data set; finally, divide the data set into a training set and a test set;
step 2: input the image obtained in step 1 into a coarse super-resolution network for processing to obtain the coarsely super-resolved image I_SR1;
step 3: input the training-set image I_SR1 obtained in step 2 into an encoder network for feature extraction to obtain a feature map f;
step 4: input the image I_SR1 obtained in step 2 into a prior information extraction network to extract prior information and obtain a parsing map p, where the prior information extraction network consists of a ResNet and a stacked hourglass network;
step 5: input the feature map f obtained in step 3 and the parsing map p obtained in step 4 into a feature fusion network, which fuses the parsing map with the feature map to obtain a fused feature map f_Fusion;
step 6: input the feature map f_Fusion obtained in step 5 into a decoder network for decoding to obtain the final super-resolution result I_SR;
step 7: input the I_SR1 obtained in step 2 and the original image into a pixel-by-pixel loss function to obtain l_1; input the parsing map p obtained in step 4 and the ground-truth parsing map p̃ of the original data set into a pixel-by-pixel loss function to obtain l_2; input the final result I_SR obtained in step 6 and the original image into a pixel-by-pixel loss function to obtain l_3; add the above losses to obtain L_total. Iterate continuously to minimize the loss function, finally producing a trained super-resolution network model;
and step 8: set the hyper-parameters of the super-resolution network model, input the test-set images preprocessed in step 1 into the model, and finally generate a high-resolution face image with clear detail and texture through residual network processing and loss-minimizing iteration.
The present invention is also characterized in that,
the step 1 specifically comprises the following steps:
step 1.1: download the CelebAMask-HQ data set, which contains a total of 30000 high-definition 1024x1024 face images, and crop the images to 128x128 with the resize function of Matlab as the original image size, reducing the computation cost.
Step 1.2: mirror-flip all images in the data set to obtain 60000 face images, yielding the data-enhanced face data set.
Step 1.3: perform degradation on the data set obtained in step 1.2 by inputting all images into a previously prepared degradation model to generate the corresponding low-resolution face images, simulating the real-world degradation process.
The degradation function is particularly complex, and super-resolution is difficult, because many factors in a real environment (including blur, noise, etc.) reduce image resolution. Therefore, existing super-resolution research simplifies the degradation process to consider only blurring, down-sampling and noise, as shown in formula 1:
I_LR = (I_HR ⊗ k)↓_s + n (1)
where I_HR denotes the high-resolution face image, k denotes the blur kernel convolved with the high-resolution face image, ↓_s denotes the down-sampling operation with down-sampling factor s, and n denotes noise. The degradation process can thus be described as blurring the high-resolution face image, down-sampling the blurred image by a factor of 8, and adding noise to the result, yielding a degraded low-resolution face image of size 16x16.
Step 1.4: apply bicubic up-sampling to the low-resolution face image obtained in step 1.3 to obtain a low-resolution face image I_LR consistent in size with the original image, i.e. 128x128.
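To make the degradation pipeline of steps 1.3-1.4 concrete, the following is a minimal Python sketch assuming a Gaussian blur kernel and additive Gaussian noise; the patent does not specify the exact kernel or noise level, so sigma and noise_std are illustrative assumptions.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter

def degrade(hr, sigma=1.6, scale=8, noise_std=5.0):
    """128x128 HR face image -> 16x16 degraded LR -> 128x128 bicubic I_LR."""
    # blur: convolve the HR image with a (here: Gaussian) blur kernel k
    blurred = gaussian_filter(np.asarray(hr, dtype=np.float32), sigma=(sigma, sigma, 0))
    # 8x down-sampling
    lr = Image.fromarray(blurred.clip(0, 255).astype(np.uint8)).resize(
        (hr.width // scale, hr.height // scale), Image.BICUBIC)
    # additive noise n
    noisy = np.asarray(lr, dtype=np.float32) + np.random.normal(
        0.0, noise_std, (lr.height, lr.width, 3))
    lr16 = Image.fromarray(noisy.clip(0, 255).astype(np.uint8))  # 16x16 LR image
    # bicubic up-sampling back to the original 128x128 size (step 1.4) -> I_LR
    return lr16.resize((hr.width, hr.height), Image.BICUBIC)
```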
Step 1.5: divide the data set from step 1.4 into a training set, a validation set and a test set in a 6:2:2 ratio, with 36000 images in the training set and 12000 images each in the validation set and the test set.
The step 2 specifically comprises the following steps:
Perform coarse super-resolution on the low-resolution face image I_LR obtained in step 1.5, i.e. feed I_LR into the CoarseSRNet network to obtain I_SR1, as shown in formula 2:
I_SR1 = CoarseSRNet(I_LR) (2)
where I_LR denotes the bicubically up-sampled low-resolution image and CoarseSRNet denotes the coarse super-resolution network employed.
The CoarseSRNet network in step 2 uses 3x3 convolution kernels with a ReLU activation function; 64 filters generate 64 feature maps, and a final 3x3 convolution yields the coarse super-resolution result I_SR1, whose size remains 128x128.
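A minimal PyTorch sketch of a CoarseSRNet along these lines is given below: 3x3 convolutions with ReLU, 64 filters producing 64 feature maps, and a final 3x3 convolution back to 3 channels. The exact number of intermediate layers is not given (the Examples describe it as a simplified SRCNN), so num_layers is an assumption.

```python
import torch
import torch.nn as nn

class CoarseSRNet(nn.Module):
    def __init__(self, num_layers=3):
        super().__init__()
        layers = [nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(64, 3, 3, padding=1)]    # final 3x3 conv -> I_SR1
        self.body = nn.Sequential(*layers)

    def forward(self, i_lr):                          # i_lr: (N, 3, 128, 128)
        return self.body(i_lr)                        # I_SR1 keeps the 128x128 size
```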
The step 3 specifically comprises the following steps:
As shown in formula 3:
f = Encoder(I_SR1) (3)
Step 3.1: input the I_SR1 obtained in step 2 into the feature extraction network, which uses an encoder structure. The encoder uses 64 convolution kernels of 3x3 with stride 2, followed by a batch normalization operation, down-sampling the input image I_SR1 to 64x64 and obtaining a 64-channel feature map of size 64x64, realizing the mapping from image space to feature space.
Step 3.2: combine an attention mechanism with residual blocks to form a residual attention network for feature extraction. Input the feature map obtained in step 3.1 into the residual attention network to extract deep features, obtaining a multi-channel feature map.
Step 3.3: input the feature map obtained in step 3.2 into a 3x3 convolution layer; after convolution, normalization and a Tanh activation function, the extracted feature map f is obtained, with 64 channels and size 64x64.
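The encoder of steps 3.1-3.3 could be sketched as follows; the residual attention body is shown here with plain residual blocks for brevity (a channel-attention RAB variant is sketched in the Examples), and the block count of 12 follows step 3.2.2.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Encoder(nn.Module):
    def __init__(self, num_blocks=12):
        super().__init__()
        self.down = nn.Sequential(                        # 128x128 -> 64x64
            nn.Conv2d(3, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResBlock() for _ in range(num_blocks)])
        self.tail = nn.Sequential(                        # conv + norm + Tanh -> f
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.Tanh())

    def forward(self, i_sr1):                             # (N, 3, 128, 128)
        return self.tail(self.blocks(self.down(i_sr1)))  # f: (N, 64, 64, 64)
```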
The step 4 is specifically that,
As shown in formula 4:
p = PriorEstimate(I_SR1) (4)
Step 4.1: input the coarse super-resolution result I_SR1 obtained in step 2 into the prior information extraction network; convolve I_SR1 with 128 convolution kernels of size 7x7, then apply normalization and ReLU operations to obtain 128 feature maps of size 64x64;
Step 4.2: construct a stacked hourglass network for prior information extraction, stacking 4 hourglass networks to extract the face parsing map. To effectively merge features across scales and retain spatial information at different scales, the stacked hourglass network uses skip connections between symmetrical layers. The resulting features are post-processed by a 1x1 convolution layer. Finally, the shared features are connected to two separate 1x1 convolution layers to generate a landmark heat map and a parsing map.
Step 4.3: input the feature maps obtained in step 4.1 into the stacked hourglass network to obtain, after processing, a 128-channel face parsing map p of size 128x64x64.
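A heavily simplified sketch of the prior estimation network of steps 4.1-4.3 follows: a 7x7, 128-filter head, a stack of 4 hourglass modules with skip connections at symmetric levels, shared 1x1 post-processing, and two separate 1x1 heads for the landmark heat map and the 128-channel parsing map. The hourglass internals and the number of landmarks (68) are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    """One recursive hourglass level with a skip connection at the symmetric layer."""
    def __init__(self, depth=4, ch=128):
        super().__init__()
        self.skip = nn.Conv2d(ch, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, ch, 3, padding=1)
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        low = F.max_pool2d(self.down(x), 2)          # go down one scale
        low = self.inner(low)
        return self.skip(x) + F.interpolate(self.up(low), scale_factor=2)

class PriorEstimate(nn.Module):
    def __init__(self, num_landmarks=68, num_classes=128):
        super().__init__()
        self.head = nn.Sequential(                   # 7x7 conv, 128 filters: 128x128 -> 64x64
            nn.Conv2d(3, 128, 7, stride=2, padding=3),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True))
        self.hourglasses = nn.Sequential(*[Hourglass() for _ in range(4)])
        self.post = nn.Conv2d(128, 128, 1)           # shared 1x1 post-processing
        self.to_heatmap = nn.Conv2d(128, num_landmarks, 1)
        self.to_parsing = nn.Conv2d(128, num_classes, 1)

    def forward(self, i_sr1):
        feat = self.post(self.hourglasses(self.head(i_sr1)))
        # landmark heat map and 128-channel parsing map p at 64x64
        return self.to_heatmap(feat), self.to_parsing(feat)
```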
The step 5 specifically comprises the following steps:
Input the 64-channel feature map f obtained in step 3.3 and the 128-channel face parsing map p obtained in step 4.3 into the feature fusion network, fusing the parsing map with the feature map to obtain a fused feature map f_Fusion of size 64x64x11. Each of the 11 channels corresponds to one face component, namely face skin, left eyebrow, right eyebrow, left eye, right eye, left ear, right ear, nose, mouth, upper lip and lower lip, 11 face components in total.
Step 5.1: construct the feature fusion network, which mainly comprises three parts: the first part is a 1x1 convolution that reduces the dimensionality of the face parsing map; the second part is the attention module CBAM, which weights the feature maps through a channel attention mechanism and a spatial attention mechanism to obtain feature maps describing the 11 different face components; the third part combines the feature maps describing the different face components with the corresponding parsing maps and averages them to obtain the final fused feature map f_Fusion.
Step 5.2: use 11 convolution kernels of 1x1 to reduce the 128-channel face parsing map p obtained in step 4.3 to 11 channels, obtaining p_j, where j ranges from 1 to 11 and each p_j represents the parsing map corresponding to one face component, so that each component can be constrained by the pixel-by-pixel parsing loss defined in step 7.
Step 5.3: process the feature map with the attention mechanism to obtain a weighted feature map for each face component, then concatenate the results.
An attention module is formed by a channel attention mechanism and a spatial attention mechanism in series; the importance of different spatial positions and different channels in each feature is obtained automatically through learning, and multiplying them by different weights enhances useful features and suppresses features unimportant to the current task.
Step 5.4: execute step 5.3 in a loop 11 times, weighting the feature maps corresponding to the 11 face components respectively to obtain the attention-processed features f_j of size 64x64x64, where j ranges from 1 to 11; each f_j is then cascaded with the parsing map of the corresponding face component.
Step 5.5: perform a weighted-average operation on the face parsing map p_j obtained in step 5.2 and the attention-processed feature map f_j with the corresponding subscript obtained in step 5.4, obtaining the fused feature map f_Fusion^j of size 64x64x1, as shown in formula 5:
f_Fusion^j = Mean(Cbam(f_j) ⊙ p_j) (5)
where f_Fusion^j denotes the fused feature of the jth channel, Mean denotes the cross-channel averaging operation, Cbam denotes the attention weighting applied to f_j, and ⊙ denotes element-by-element multiplication.
Step 5.6: cascade the fused feature maps f_Fusion^j obtained in step 5.5 to obtain the output f_Fusion of the feature fusion network, of size 64x64x11, as shown in formula 6:
f_Fusion = cat(f_Fusion^1, f_Fusion^2, ..., f_Fusion^11) (6)
where f_Fusion denotes the output of the feature fusion network, cat denotes the concatenation operation, and f_Fusion^j denotes the fusion feature corresponding to the jth face component; j from 1 to 11 indexes the 11 face components.
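The fusion of formulas 5 and 6 could be sketched as follows: a 1x1 convolution reduces the 128-channel parsing map p to 11 component maps p_j; for each component, the CBAM-weighted features Cbam(f_j) are multiplied element-wise by p_j and averaged across channels (formula 5), and the 11 single-channel results are concatenated (formula 6). The CBAM block below is a standard channel-plus-spatial attention implementation; the exact variant used is an assumption.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, ch=64, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(                  # channel attention weights
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True), nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(                  # spatial attention weights
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)
        s = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1)
        return x * self.spatial(s)

class FeatureFusion(nn.Module):
    def __init__(self, num_components=11):
        super().__init__()
        self.reduce = nn.Conv2d(128, num_components, 1)   # p (128 ch) -> p_j (11 ch)
        self.cbams = nn.ModuleList([CBAM(64) for _ in range(num_components)])

    def forward(self, f, p):                # f: (N,64,64,64), p: (N,128,64,64)
        p = self.reduce(p)
        fused = [self.cbams[j](f).mul(p[:, j:j + 1]).mean(1, keepdim=True)  # formula 5
                 for j in range(p.size(1))]
        return torch.cat(fused, dim=1)      # f_Fusion: (N,11,64,64), formula 6
```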
The step 6 is specifically that,
Input the fused feature map f_Fusion obtained in step 5.6 into the decoder for decoding; a deconvolution layer for up-sampling is added after the network's convolution, normalization and ReLU activation. A final 3x3 convolution yields the result I_SR. Meanwhile, skip connections splice the low-resolution image I_LR and the coarsely super-resolved image I_SR1 with the output of the feature fusion module, achieving a better reconstruction.
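A minimal sketch of such a decoder is given below; where exactly I_LR and I_SR1 are spliced into the decoder is not specified, so resizing both to the 64x64 feature resolution before concatenation is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, fusion_ch=11):
        super().__init__()
        in_ch = fusion_ch + 3 + 3                      # f_Fusion + I_LR + I_SR1
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1),  # 64x64 -> 128x128
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1))            # final 3x3 conv -> I_SR

    def forward(self, f_fusion, i_lr, i_sr1):
        # skip connections: resize the two images to the feature resolution and splice
        skips = [F.interpolate(x, size=f_fusion.shape[-2:], mode='bilinear',
                               align_corners=False) for x in (i_lr, i_sr1)]
        return self.body(torch.cat([f_fusion, *skips], dim=1))
```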
The step 7 is specifically that,
Step 7.1: define the joint loss function, as shown in formula 7:
L_total = (1/N) Σ_{i=1}^{N} ( ‖hr^(i) − I_SR^(i)‖² + ‖hr^(i) − I_SR1^(i)‖² + λ‖p̃^(i) − p^(i)‖² ) (7)
where each loss term adopts the mean-square-error loss, N denotes the number of images in the training set, hr^(i) denotes the high-resolution image corresponding to the ith low-resolution image, I_SR1^(i) denotes the result of the ith image after coarse super-resolution, p̃^(i) denotes the ground-truth parsing map corresponding to the ith image, p^(i) denotes the parsing map obtained from the ith image by the prior information estimation network, I_SR^(i) denotes the final super-resolution result of the ith image, and λ weights the parsing loss.
Step 7.2: input the I_SR1 output in step 2, the original image hr, the ground-truth parsing map p̃, the parsing map p extracted by the network, and the final result I_SR into the pixel-by-pixel loss functions; the high-resolution image is generated through pixel-by-pixel loss processing while the loss function is continuously minimized by iteration.
Step 7.3: iterate step 7.2 continuously, and take the set of weight parameters minimizing the joint loss function L_total as the trained model parameters, obtaining the trained super-resolution network model.
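A minimal sketch of the joint loss of formula 7, assuming all three terms are mean-square-error losses and that λ weights only the parsing-map term (λ is set to 0.8 in the Examples):

```python
import torch
import torch.nn.functional as F

def joint_loss(i_sr, i_sr1, p, hr, p_gt, lam=0.8):
    l1 = F.mse_loss(i_sr1, hr)    # coarse result vs. original image
    l2 = F.mse_loss(p, p_gt)      # estimated parsing map vs. ground-truth parsing map
    l3 = F.mse_loss(i_sr, hr)     # final result vs. original image
    return l1 + lam * l2 + l3     # L_total
```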
The step 8 is specifically that,
Train the model with the RMSprop algorithm, input the test-set data preprocessed in step 1 into the model generated in step 7.3, and finally generate the super-resolved high-definition face image through residual network processing and joint-loss-minimizing iteration.
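Putting the pieces together, a minimal training-loop sketch for the whole pipeline might look as follows, assuming the sub-network sketches given earlier, the joint_loss sketched in step 7, and a DataLoader yielding (I_LR, hr, p̃) triples; all names are illustrative.

```python
import torch

def train(coarse, encoder, prior, fusion, decoder, loader, epochs=10):
    params = [q for m in (coarse, encoder, prior, fusion, decoder)
              for q in m.parameters()]
    opt = torch.optim.RMSprop(params, lr=2.5e-4)   # RMSprop, as in the Examples
    for _ in range(epochs):
        for i_lr, hr, p_gt in loader:
            i_sr1 = coarse(i_lr)                   # step 2: coarse super-resolution
            f = encoder(i_sr1)                     # step 3: feature map f
            _, p = prior(i_sr1)                    # step 4 (landmark heat map unused here)
            i_sr = decoder(fusion(f, p), i_lr, i_sr1)   # steps 5-6
            loss = joint_loss(i_sr, i_sr1, p, hr, p_gt) # step 7, formula 7
            opt.zero_grad(); loss.backward(); opt.step()
```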
The invention has the beneficial effects that:
(1) The method of the invention introduces a channel attention mechanism into the residual blocks for feature extraction, so that the network learns purposefully, adaptively adjusts feature-channel information, strengthens the representation ability of the features, and helps recover more details such as contour and texture.
(2) The method of the invention fuses the feature map and the parsing map with an attention mechanism, fusing the parsing maps and feature maps corresponding to different face components separately. This strengthens the guiding role of the parsing map in face image super-resolution, exploits the extracted useful features more effectively, and suppresses useless features. The network can allocate computing resources accurately according to the weights, improving reconstruction efficiency and enhancing the reconstruction effect.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a face super-resolution method based on prior information and an attention fusion mechanism, which is implemented according to the following steps:
Step 1: construct an original image data set and perform data enhancement; input the data-enhanced face images into a degradation model to obtain a low-resolution image data set; apply bicubic up-sampling to each low-resolution image to obtain images of the same size as the high-resolution images, which form the low-resolution data set; divide the data set into a training set and a test set.
Step 2: input the image obtained in step 1 into the coarse super-resolution network for processing to obtain the coarsely super-resolved image I_SR1;
The step 2 specifically comprises the following steps: perform coarse super-resolution on the low-resolution face image I_LR obtained in step 1.5, i.e. feed I_LR into the CoarseSRNet network to obtain I_SR1, as shown in formula 2:
I_SR1 = CoarseSRNet(I_LR) (2)
where I_LR denotes the bicubically up-sampled low-resolution image and CoarseSRNet denotes the coarse super-resolution network employed, which performs the coarse super-resolution processing on the LR image.
Step 3: input the training-set image I_SR1 obtained in step 2 into the encoder network for feature extraction to obtain the feature map f, as shown in formula 3:
f = Encoder(I_SR1) (3)
the method specifically comprises the following steps:
Step 3.1: input the I_SR1 obtained in step 2 into the feature extraction network, which uses an encoder structure. Considering the computational cost, the parsing map is down-sampled to 64x64. To keep the feature sizes consistent, the encoder uses 64 convolution kernels of 3x3 with stride 2, followed by a batch normalization operation, down-sampling the input image I_SR1 to 64x64 and obtaining a 64-channel feature map of size 64x64, realizing the mapping from image space to feature space.
Step 3.2: combine an attention mechanism with residual blocks to form a residual attention network for feature extraction. Input the feature map obtained in step 3.1 into the residual attention network to extract deep features, obtaining a multi-channel feature map.
Step 3.3: input the feature map obtained in step 3.2 into a 3x3 convolution layer; after convolution, normalization and a Tanh activation function, the extracted feature map f is obtained, with 64 channels and size 64x64.
Step 4: input the image I_SR1 obtained in step 2 into the prior information extraction network to extract prior information and obtain the parsing map p, where the prior information extraction network consists of a ResNet and a stacked hourglass network, specifically as follows:
As shown in formula 4:
p = PriorEstimate(I_SR1) (4)
Step 4.1: input the coarse super-resolution result I_SR1 obtained in step 2 into the prior information extraction network; convolve I_SR1 with 128 convolution kernels of size 7x7, then apply normalization and ReLU operations to obtain 128 feature maps of size 64x64;
Step 4.2: adopt a stacked hourglass network to extract the parsing map, stacking 4 hourglass networks to extract the face parsing map. The resulting features are post-processed by a 1x1 convolution layer. Finally, the shared features are connected to two separate 1x1 convolution layers to generate a landmark heat map and a parsing map.
Step 4.3: input the feature maps obtained in step 4.1 into the stacked hourglass network to obtain, after processing, a 128-channel face parsing map p of size 128x64x64.
Step 5: input the feature map f obtained in step 3 and the parsing map p obtained in step 4 into the feature fusion network to fuse the parsing map with the feature map, obtaining the fused feature map f_Fusion. The method specifically comprises the following steps:
Input the 64-channel feature map f obtained in step 3.3 and the 128-channel face parsing map p obtained in step 4.3 into the feature fusion network, obtaining a fused feature map f_Fusion of size 64x64x11; each of the 11 channels corresponds to one face component, namely face skin, left eyebrow, right eyebrow, left eye, right eye, left ear, right ear, nose, mouth, upper lip and lower lip, 11 face components in total.
Step 5.1: construct the feature fusion network, which mainly comprises three parts: the first part is a 1x1 convolution that reduces the dimensionality of the face parsing map; the second part is the attention module CBAM, which weights the feature maps through a channel attention mechanism and a spatial attention mechanism to obtain feature maps describing the 11 different face components; the third part combines the feature maps describing the different face components with the corresponding parsing maps and averages them to obtain the final fused feature map f_Fusion.
Step 5.2: use 11 convolution kernels of 1x1 to reduce the 128-channel face parsing map p obtained in step 4.3 to 11 channels, obtaining p_j, where j ranges from 1 to 11 and each p_j represents the parsing map corresponding to one face component, so that each component can be constrained by the pixel-by-pixel parsing loss defined in step 7.
Step 5.3: process the feature map with the attention mechanism to obtain a weighted feature map for each face component, then concatenate the results.
An attention module is formed by a channel attention mechanism and a spatial attention mechanism in series. The 64-channel feature map obtained in step 3.3 is input into the channel attention module for weighting: the importance of each feature channel is obtained automatically through learning, and each channel is multiplied by its weight. The channel-attended feature map is then input into the spatial attention module, which learns in a similar way the importance of different spatial positions in each feature and multiplies different spatial positions by different weights, enhancing useful features and suppressing features unimportant to the current task.
Step 5.4: execute step 5.3 in a loop 11 times, weighting the feature maps corresponding to the 11 face components respectively to obtain the attention-processed features f_j of size 64x64x64, where j ranges from 1 to 11; each f_j is then cascaded with the parsing map of the corresponding face component.
Step 5.5: perform a weighted-average operation on the face parsing map p_j obtained in step 5.2 and the attention-processed feature map f_j with the corresponding subscript obtained in step 5.4, obtaining the fused feature map f_Fusion^j of size 64x64x1, as shown in formula 5:
f_Fusion^j = Mean(Cbam(f_j) ⊙ p_j) (5)
where f_Fusion^j denotes the fused feature of the jth channel, Mean denotes the cross-channel averaging operation, Cbam denotes the attention weighting applied to f_j, and ⊙ denotes element-by-element multiplication.
Step 5.6: cascade the fused feature maps f_Fusion^j obtained in step 5.5 to obtain the output f_Fusion of the feature fusion network, of size 64x64x11, as shown in formula 6:
f_Fusion = cat(f_Fusion^1, f_Fusion^2, ..., f_Fusion^11) (6)
where f_Fusion denotes the output of the feature fusion network, cat denotes the concatenation operation, and f_Fusion^j denotes the fusion feature corresponding to the jth face component; j from 1 to 11 indexes the 11 face components.
Step 6: input the feature map f_Fusion obtained in step 5 into the decoder network for decoding to obtain the final super-resolution result I_SR.
Step 7: input the I_SR1 obtained in step 2 and the original image into a pixel-by-pixel loss function to obtain l_1; input the parsing map p obtained in step 4 and the ground-truth parsing map p̃ of the original data set into a pixel-by-pixel loss function to obtain l_2; input the final result I_SR obtained in step 6 and the original image into a pixel-by-pixel loss function to obtain l_3; add the above losses to obtain L_total. Iterate continuously to minimize the loss function, and train to generate the super-resolution network model;
the step 7 is specifically that,
Step 7.1: define the joint loss function, as shown in formula 7:
L_total = (1/N) Σ_{i=1}^{N} ( ‖hr^(i) − I_SR^(i)‖² + ‖hr^(i) − I_SR1^(i)‖² + λ‖p̃^(i) − p^(i)‖² ) (7)
where each loss term adopts the mean-square-error loss, N denotes the number of images in the training set, hr^(i) denotes the high-resolution image corresponding to the ith low-resolution image, I_SR1^(i) denotes the result of the ith image after coarse super-resolution, p̃^(i) denotes the ground-truth parsing map corresponding to the ith image, p^(i) denotes the parsing map obtained from the ith image by the prior information estimation network, I_SR^(i) denotes the final super-resolution result of the ith image, and λ weights the parsing loss.
Step 7.2: input the I_SR1 output in step 2, the original image hr, the ground-truth parsing map p̃, the parsing map p extracted by the network, and the final result I_SR into the pixel-by-pixel loss functions; the high-resolution image is generated through pixel-by-pixel loss processing while the loss function is continuously minimized by iteration, finally producing the super-resolution network model.
Step 7.3: iterate step 7.2 continuously, and take the set of weight parameters minimizing the joint loss function L_total as the trained model parameters, obtaining the trained super-resolution network model.
Step 8: set the hyper-parameters of the super-resolution network model, input the test-set images preprocessed in step 1 into the model, and finally generate a high-resolution face image with clear detail and texture through residual network processing and loss-minimizing iteration.
Examples
A face super-resolution method based on prior information and an attention fusion mechanism, as shown in FIG. 1, is implemented according to the following steps:
Step 1: construct an original image data set and perform data enhancement; input the data-enhanced face images into a degradation model to obtain a low-resolution image data set; apply bicubic up-sampling to each low-resolution image to obtain images of the same size as the high-resolution images, which form the low-resolution data set; divide the data set into a training set and a test set. The method specifically comprises the following steps:
Step 1.1: download the CelebAMask-HQ data set and crop the images to 128x128 with the resize function of Matlab as the original image size.
Step 1.2: mirror-flip all images in the data set for data enhancement.
Step 1.3: input the data set obtained in step 1.2 into the prepared degradation model to generate the corresponding low-resolution face images, simulating the real-world degradation process, as shown in formula 1:
I_LR = (I_HR ⊗ k)↓_s + n (1)
where k denotes the blur kernel convolved with the high-resolution face image I_HR, ↓_s denotes the down-sampling operation with down-sampling factor s, and n denotes noise; the degraded low-resolution face image obtained has size 16x16.
Step 1.4: apply bicubic up-sampling to the low-resolution face image obtained in step 1.3 to obtain a low-resolution face image I_LR consistent in size with the original image, i.e. 128x128.
Step 1.5: divide the data set from step 1.4 into a training set, a validation set and a test set, with 36000 images in the training set and 12000 images each in the validation set and the test set.
Step 2: input the image obtained in step 1 into the coarse super-resolution network for processing to obtain the coarsely super-resolved image I_SR1;
The step 2 specifically comprises the following steps:
To ensure the accuracy of the subsequent feature extraction and prior information extraction, first perform coarse super-resolution on the low-resolution face image I_LR obtained in step 1.5, i.e. feed I_LR into the CoarseSRNet network to obtain I_SR1, as shown in formula 2:
I_SR1 = CoarseSRNet(I_LR) (2)
where I_LR denotes the bicubically up-sampled low-resolution image and CoarseSRNet denotes the coarse super-resolution network employed. CoarseSRNet is a simplified version of the general image super-resolution network SRCNN and performs the coarse super-resolution processing on the LR image.
The CoarseSRNet network in step 2 uses 3x3 convolution kernels with a ReLU activation function; 64 filters generate 64 feature maps, and a final 3x3 convolution yields the coarse super-resolution result I_SR1, whose size remains 128x128.
Step 3: input the training-set image I_SR1 obtained in step 2 into the encoder network for feature extraction to obtain the feature map f, as shown in fig. 2.
The method specifically comprises the following steps:
Step 3.1: input the I_SR1 obtained in step 2 into the feature extraction network, which uses an encoder structure, as shown in formula 3:
f = Encoder(I_SR1) (3)
Considering the computational cost, the parsing map is down-sampled to 64x64. Therefore, to keep the feature sizes consistent, the encoder uses 64 convolution kernels of 3x3 with stride 2, followed by a batch normalization operation, down-sampling the input image I_SR1 to 64x64 and obtaining a 64-channel feature map of size 64x64, realizing the mapping from image space to feature space.
And 3.2, inspiring by a residual error attention module (RCAN), and combining an attention mechanism and a residual error block to form a residual error attention network to extract features. And (4) inputting the feature map obtained in the step (3.1) into a residual error attention network to extract deep features, so as to obtain a multi-channel feature map.
The step 3.2 is specifically as follows:
Step 3.2.1: traditional deep learning methods process channel domains of different importance equally, so that considerable computing resources are wasted on unimportant features. To address this, a residual attention block (RAB) is constructed by introducing a channel attention mechanism into the residual block, so that the network can learn purposefully, extract effective features more efficiently, and suppress useless features. The attention mechanism captures the weight information implied by the channel domain, allocating computing resources more efficiently and accelerating network convergence;
Step 3.2.2: combine 12 residual attention blocks (RABs) to form the residual attention network, as shown in fig. 2.
Step 3.2.3: input the multi-channel feature map obtained in step 3.1 into the residual attention network to extract deep features, as sketched below.
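A minimal sketch of such a residual attention block: a residual branch whose output is re-weighted by squeeze-and-excite style channel attention (as in RCAN) before the skip addition; the reduction ratio of 16 is an assumption.

```python
import torch.nn as nn

class RAB(nn.Module):
    def __init__(self, ch=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.ca = nn.Sequential(                   # channel attention weights
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True), nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        res = self.body(x)
        return x + res * self.ca(res)              # weighted residual + skip connection

# 12 RABs form the residual attention network of step 3.2.2:
residual_attention_net = nn.Sequential(*[RAB() for _ in range(12)])
```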
Step 3.3: input the feature map obtained in step 3.2.3 into a 3x3 convolution layer; after convolution, normalization and a Tanh activation function, the extracted feature map f is obtained, with 64 channels and size 64x64.
Step 4: input the image I_SR1 obtained in step 2 into the prior information extraction network to extract prior information and obtain the parsing map p, as shown in fig. 3, where the prior information extraction network consists of a ResNet and a stacked hourglass network, specifically as follows:
Face prior information mainly includes face landmark images, landmark heat maps, face parsing maps and the like. When the image resolution is too low, the detected face key points are not accurate enough, and the resulting prior information would mislead the guidance of face super-resolution. Therefore, the face parsing map, rather than face key points, is selected as the face prior information; the three kinds of face prior information are shown in fig. 5.
Step 4.1: input the coarse super-resolution result I_SR1 obtained in step 2 into the prior information extraction network. In general, the larger the convolution kernel, the larger the receptive field and the better the obtained global features, so 128 convolution kernels of 7x7 are used to convolve I_SR1, followed by normalization and ReLU operations, yielding 128 feature maps of size 64x64, as shown in formula 4:
p = PriorEstimate(I_SR1) (4)
Step 4.2: construct a stacked hourglass network for prior information extraction. Inspired by the recent success of stacked hourglass networks in human pose estimation, a stacked hourglass network is adopted to extract the parsing map, stacking 4 hourglass networks to extract the face parsing map. Since the parsing map is two-dimensional, all features in the prior estimation network except the last layer are shared between the two tasks. To effectively merge features across scales and retain spatial information at different scales, the stacked hourglass network uses skip connections between symmetrical layers. The resulting features are post-processed by a 1x1 convolution layer. Finally, the shared features are connected to two separate 1x1 convolution layers to generate a landmark heat map and a parsing map.
Step 4.3: input the feature maps obtained in step 4.1 into the stacked hourglass network to obtain, after processing, a 128-channel face parsing map p of size 128x64x64.
Step 5: input the feature map f obtained in step 3 and the parsing map p obtained in step 4 into the feature fusion network to fuse the parsing map with the feature map, obtaining the fused feature map f_Fusion. The feature fusion module is shown in fig. 4. Specifically:
Input the 64-channel feature map f obtained in step 3.3 and the 128-channel face parsing map p obtained in step 4.3 into the feature fusion network, obtaining a fused feature map f_Fusion of size 64x64x11; each of the 11 channels corresponds to one face component, namely face skin, left eyebrow, right eyebrow, left eye, right eye, left ear, right ear, nose, mouth, upper lip and lower lip, 11 face components in total, as shown in fig. 6.
Step 5.1: construct the feature fusion network, which mainly comprises three parts: the first part is a 1x1 convolution that reduces the dimensionality of the face parsing map; the second part is the attention module CBAM, which weights the feature maps through a channel attention mechanism and a spatial attention mechanism to obtain feature maps describing the 11 different face components; the third part combines the feature maps describing the different face components with the corresponding parsing maps and averages them to obtain the final fused feature map f_Fusion.
Step 5.2: use 11 convolution kernels of 1x1 to reduce the 128-channel face parsing map p obtained in step 4.3 to 11 channels, obtaining p_j, where j ranges from 1 to 11 and each p_j represents the parsing map corresponding to one face component, so that each component can be constrained by the pixel-by-pixel parsing loss defined in step 7.
Step 5.3: in existing approaches, the facial structure may not be fully exploited, because the features of different face components are usually extracted by one shared network, so prior information specific to different face components may be ignored by the network. Different facial regions should therefore be restored separately for better performance. Hence, the feature map is processed by the attention mechanism to obtain a weighted feature map for each face component, and the results are then concatenated.
An attention module is formed by a channel attention mechanism and a spatial attention mechanism in series. The 64-channel feature map obtained in step 3.3 is input into the channel attention module for weighting: the importance of each feature channel is obtained automatically through learning, and each channel is multiplied by its weight. The channel-attended feature map is then input into the spatial attention module, which learns in a similar way the importance of different spatial positions in each feature and multiplies different spatial positions by different weights, enhancing useful features and suppressing features unimportant to the current task.
Step 5.4: execute step 5.3 in a loop 11 times, weighting the feature maps corresponding to the 11 face components respectively to obtain the attention-processed features f_j of size 64x64x64, where j ranges from 1 to 11; each f_j is then cascaded with the parsing map of the corresponding face component.
Step 5.5: perform a weighted-average operation on the face parsing map p_j obtained in step 5.2 and the attention-processed feature map f_j with the corresponding subscript obtained in step 5.4, obtaining the fused feature map f_Fusion^j of size 64x64x1, as shown in formula 5:
f_Fusion^j = Mean(Cbam(f_j) ⊙ p_j) (5)
where f_Fusion^j denotes the fused feature of the jth channel, Mean denotes the cross-channel averaging operation, Cbam denotes the attention weighting applied to f_j, and ⊙ denotes element-by-element multiplication.
Step 5.6: cascade the fused feature maps f_Fusion^j obtained in step 5.5 to obtain the output f_Fusion of the feature fusion network, of size 64x64x11, as shown in formula 6:
f_Fusion = cat(f_Fusion^1, f_Fusion^2, ..., f_Fusion^11) (6)
where f_Fusion denotes the output of the feature fusion network, cat denotes the concatenation operation, and f_Fusion^j denotes the fusion feature corresponding to the jth face component; j from 1 to 11 indexes the 11 face components.
Step 6: input the feature map f_Fusion obtained in step 5 into the decoder network for decoding to obtain the final super-resolution result I_SR. Specifically:
Input the fused feature map f_Fusion obtained in step 5.6 into the decoder for decoding. The decoder is similar in structure to the encoder and is likewise built from residual blocks, except that a deconvolution layer for up-sampling is added after the network's convolution, normalization and ReLU activation. A final 3x3 convolution yields the result I_SR. Meanwhile, to better exploit the rich low-frequency image information contained in the shallow features of I_LR, skip connections splice the low-resolution image I_LR and the coarsely super-resolved image I_SR1 with the output of the feature fusion module, so that the low-frequency information is passed directly to the end of the module, achieving a better reconstruction.
Step 7: input the I_SR1 obtained in step 2 and the original image into a pixel-by-pixel loss function to obtain l_1; input the parsing map p obtained in step 4 and the ground-truth parsing map p̃ of the original data set into a pixel-by-pixel loss function to obtain l_2; input the final result I_SR obtained in step 6 and the original image into a pixel-by-pixel loss function to obtain l_3; add the above losses to obtain L_total. Iterate continuously to minimize the loss function, finally producing a trained super-resolution network model;
the step 7 is specifically that,
Step 7.1: define the joint loss function, as shown in formula 7:
L_total = (1/N) Σ_{i=1}^{N} ( ‖hr^(i) − I_SR^(i)‖² + ‖hr^(i) − I_SR1^(i)‖² + λ‖p̃^(i) − p^(i)‖² ) (7)
where each loss term adopts the mean-square-error loss, N denotes the number of images in the training set, hr^(i) denotes the high-resolution image corresponding to the ith low-resolution image, I_SR1^(i) denotes the result of the ith image after coarse super-resolution, p̃^(i) denotes the ground-truth parsing map corresponding to the ith image, p^(i) denotes the parsing map obtained from the ith image by the prior information estimation network, I_SR^(i) denotes the final super-resolution result of the ith image, and λ weights the parsing loss.
Step 7.2: input the I_SR1 output in step 2, the original image hr, the ground-truth parsing map p̃, the parsing map p extracted by the network, and the final result I_SR into the pixel-by-pixel loss functions; the high-resolution image is generated through pixel-by-pixel loss processing while the loss function is continuously minimized by iteration, finally producing the super-resolution network model.
Step 7.3: iterate step 7.2 continuously, and take the set of weight parameters minimizing the joint loss function L_total as the trained model parameters, obtaining the trained super-resolution network model.
Step 8: set the hyper-parameters of the super-resolution network model, input the test-set images preprocessed in step 1 into the model, and finally generate a high-resolution face image with clear detail and texture through residual network processing and loss-minimizing iteration.
The step 8 is specifically that,
The model is trained using the RMSprop algorithm, with an initial learning rate of 2.5×10^-4 and a mini-batch size of 14; λ is set to 0.8 empirically. Training runs with a batch size of 8, and the learning rate of 10^-3 is halved every 1.2×10^5 iterations;
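As a sketch, the optimizer and halving schedule described above could be set up as follows; model stands in for the full network, and the use of StepLR is an illustrative assumption.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the full network
opt = torch.optim.RMSprop(model.parameters(), lr=2.5e-4)
# halve the learning rate every 1.2e5 iterations
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=120000, gamma=0.5)
# inside the training loop, per iteration:
#   opt.zero_grad(); loss.backward(); opt.step(); sched.step()
```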
Input the test-set data preprocessed in step 1 into the model generated in step 7.3, and finally generate the super-resolved high-definition face image through residual network processing and joint-loss-minimizing iteration.