CN114820310A - Semantic feature-based face super-resolution reconstruction method and system - Google Patents
- Publication number: CN114820310A
- Application number: CN202210426417.9A
- Authority: CN (China)
- Prior art keywords: face, layer, resolution, reconstruction, network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T3/4053 — Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/088 — Non-supervised learning, e.g. competitive learning
- G06T3/4046 — Scaling of whole images or parts thereof using neural networks
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V40/162 — Face detection, localisation or normalisation using pixel segmentation or colour matching
- G06V40/171 — Local features and components; facial parts; geometrical relationships
Abstract
The invention discloses a semantic feature-based face super-resolution reconstruction method and system. The method comprises the following steps: step one, a degradation stage, synthesizing the high-resolution faces of the training set into low-resolution face images containing complex noise and blur distributions; step two, a generation stage, coarsely amplifying the low-resolution images to the target resolution and then integrating semantic and general features to form the super-resolution face reconstruction result. The system comprises a synthesis module, an amplification module and an integration module. The invention improves the generalization capability of the FSR network model under multiple degradation modes and enhances the perceptual quality of the SR faces reconstructed under those modes.
Description
Technical Field
The invention relates to a face super-resolution reconstruction method and system based on semantic features.
Background
Super-resolution reconstruction (SR) is an important area of image quality enhancement research. The technique recovers texture information from a low-resolution (LR) image to form a high-resolution (HR) image. Face super-resolution reconstruction (FSR) is an application branch of SR technology that aims to recover a high-resolution face morphology from a low-resolution face. FSR technology can assist related high-level vision tasks, such as face recognition, face correction, security video analysis and other biometric applications.
Deep learning techniques have been widely explored for face super-resolution reconstruction. Building on research applying convolutional neural networks to SR reconstruction of natural images, FSR methods have also advanced. Existing FSR methods fall into two major categories, supervised and unsupervised. Supervised FSR methods learn from paired LR-HR face data sets: an LR face from the training set passes through a convolutional network that enhances its features and outputs a reconstructed SR face, supervised and optimized against the corresponding HR face in the training set. Unsupervised FSR methods use data augmentation or richer network functional relationships, require no paired LR-HR data set, and can complete learning from single face images alone, reducing the dependence of model training on data compared with supervised methods.
In addition, compared with natural images, face images contain unique prior information, such as semantic maps, facial landmarks, edges and other biometric features, and this prior information plays a distinctive role in some FSR methods. Some existing methods fuse such feature information into the FSR reconstruction network as input, where it guides face reconstruction. Other methods predict the relevant prior information and establish constraints during the face reconstruction process, strengthening the expression of the priors in the FSR network model. Prior information specific to face images can improve the directivity of an FSR method during face reconstruction and improve its reconstruction of the facial structure and related attributes.
In the prior art, supervised methods need paired LR-HR face data sets to support model training; the LR face is obtained from a real face by a fixed down-sampling method, and this single-degradation assumption cannot meet face reconstruction requirements under multiple degradation modes. In unsupervised learning, without a real reference face, the reconstructed SR face may be deformed or distorted, affecting the visual result. Meanwhile, under multiple degradation modes, some prior information such as facial landmarks and edges becomes inaccurate or unavailable, so FSR models guided by it reconstruct poorly. Selecting reasonable face prior information that efficiently enhances FSR reconstruction performance under multiple degradation modes is an urgent problem in the field.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a face super-resolution reconstruction method and system based on semantic features.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a face super-resolution reconstruction method based on semantic features comprises the following steps:
step one, a quality degradation stage: synthesizing the high-resolution face of the training set into a low-resolution face image containing complex noise and fuzzy distribution;
step two, a generation stage: the generation stage comprises a coarse reconstruction process and a fine reconstruction process; the coarse reconstruction process roughly amplifies the low-resolution face image to the target resolution, and the fine reconstruction process extracts semantic features from the coarse result, integrates the semantic features with the general features under a semantic attention module, and finally forms the super-resolution face reconstruction result as the output of the network.
Preferably, the degradation phase comprises the steps of:
step S11, the high-resolution face image HR in the training set first passes through a residual network group comprising a plurality of residual network blocks with pooling layers between adjacent blocks; the pooling layers perform down-sampling, and once the image has been down-sampled to 4x4, a structure formed by sub-pixel layers and associated residual network blocks up-samples it back to 16x16; the result is finally integrated into a three-channel image as the degraded low-resolution face image LR_deg;
step S12, to ensure that the identity characteristics of LR_deg are not disturbed, a content loss function L_pix-LR is established between LR_deg and the standard image LR_bic obtained by interpolated down-sampling: L_pix-LR = ||LR_deg - LR_bic||_2;
step S13, introducing complex noise and blur information in the degradation stage so that the content distribution of the synthesized LR_deg is closer to that of faces captured in real environments; a generative adversarial network is introduced, in which an LR discriminator D_LR distinguishes the face LR_real captured in a real environment from the synthesized LR_deg, and the LR adversarial loss function L_adv-LR established by this generative adversarial process is as follows: L_adv-LR = -σ[D_LR(LR_deg)·log(D_LR(LR_real)) + (1 - D_LR(LR_deg))·log(1 - D_LR(LR_real))].
preferably, the residual network block includes a convolutional layer a, a convolutional layer B, a regularization layer C, an activation layer D, a convolutional layer E, a regularization layer F, and an activation layer G, the convolutional layer a is connected to the regularization layer C through the convolutional layer B, the regularization layer C is connected to the convolutional layer E through the activation layer D, and the convolutional layer E is connected to the activation layer G through the regularization layer F.
Preferably, the LR discriminator D_LR comprises an activation layer H, a convolutional layer I, a spectral regularization layer J, an activation layer K, a convolutional layer L, a spectral regularization layer M and an activation layer N; the activation layer H is connected to the spectral regularization layer J through the convolutional layer I, the spectral regularization layer J is connected to the convolutional layer L through the activation layer K, and the convolutional layer L is connected to the activation layer N through the spectral regularization layer M.
Preferably, in the coarse reconstruction process LR_deg passes through a coarse reconstruction network that outputs a coarsely reconstructed SR face SR_Coarse at the target resolution, and a content loss function L_pix-CoarseSR is established: L_pix-CoarseSR = ||SR_Coarse - HR||_2; a semantic loss function L_seg-CoarseSR is then designed based on SR_Coarse: L_seg-CoarseSR = ||Seg(SR_Coarse) - Seg(HR)||_2. Seg(·) is a pre-trained face semantic segmentation network that outputs face semantic segmentation predictions over 19 channels, the channels corresponding to different face components including the eyes, nose, eyebrows, hair and head.
Preferably, in the fine reconstruction process, the general features and the semantic features of the rough reconstructed face are integrated and enhanced through a fine reconstruction network, and finally a face super-resolution reconstruction result is output.
Preferably, the fine reconstruction network is composed of a convolutional layer M, a residual network group, a semantic feature attention network, and an integrated convolutional layer.
Preferably, the fine reconstruction process comprises the following steps:
step A21, after general features are extracted from the input SR_Coarse by a general convolutional layer, they are fused with the corresponding semantic feature channels and enter a semantic feature attention network together, producing mixed features containing attention;
step A22, the mixed features containing attention enter a residual network group to deepen the feature representation, then enter an integration convolutional layer and are converted into a 3-channel image as the face reconstruction result SR_Fine finally output by the generation stage;
step A23, constraints are established on the SR_Fine output by the generation stage; first, a content loss function L_pix-FineSR is established between SR_Fine and the standard image HR in the same training set: L_pix-FineSR = ||SR_Fine - HR||_2;
step A24, the reconstructed SR_Fine passes through an SR discriminator D_SR, which integrates its features and outputs a tensor of size 16x16 whose values, in the range 0 to 1, represent D_SR's judgment of the reconstruction quality of the corresponding regions; the closer a value is to 1, the better the perceptual quality of the reconstruction. The SR adversarial loss function L_adv-SR of the SR face is thus established as follows:
L_adv-SR = -σ[D_SR(SR_Fine)·log(D_SR(HR)) + (1 - D_SR(SR_Fine))·log(1 - D_SR(HR))].
The semantic feature-based face super-resolution reconstruction system comprises a synthesis module, an amplification module and an integration module, the synthesis module being connected to the integration module through the amplification module; the synthesis module synthesizes a face into a low-resolution face image containing complex noise and blur distributions, the amplification module roughly amplifies the low-resolution face image to the target resolution, and the integration module integrates the semantic features and general features of the image to form the super-resolution face reconstruction result.
The invention has the following beneficial effects. The framework of the invention is an unsupervised model in which the face degradation stage and generation stage are trained jointly, which resolves the limitation of paired data sets in training and improves the generalization of the model under multiple degradation modes. In the degradation stage, a degradation network is designed to learn the image degradation process, so that the synthesized LR faces contain rich noise and blur information, removing the data-set limitation. In the generation stage, face semantic features are integrated in a simple-to-complex manner, a channel attention mechanism strengthens the expression of the semantic features in the deep reconstruction network, and this reasonable face prior information enhances the visual quality of the reconstructed faces, alleviating the tendency of face component contours to distort and deform under unsupervised learning. Experiments show that the method performs excellently in face super-resolution reconstruction under multiple degradations, with a clear and accurate visual perception effect. By combining an unsupervised learning framework with semantic feature prior information, the semantic feature-based face super-resolution reconstruction method for multiple degradation modes improves the generalization capability of the FSR network model under multiple degradation modes while enhancing the perceptual quality of the SR faces it reconstructs.
Drawings
FIG. 1 is a flowchart of the process of the present invention;
FIG. 2 is a schematic diagram of a residual block;
FIG. 3 is a schematic structural diagram of the LR discriminator D_LR;
FIG. 4 is a schematic diagram of a fine reconstruction network;
fig. 5 is a block diagram of a system.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings of the specification:
as shown in fig. 1, a face super-resolution reconstruction method based on semantic features includes the following steps:
step one, a quality degradation stage: synthesizing the high-resolution face of the training set into a low-resolution face image containing complex noise and fuzzy distribution;
step two, a generation stage: the generation stage comprises a coarse reconstruction process and a fine reconstruction process; the coarse reconstruction process roughly amplifies the low-resolution face image to the target resolution, and the fine reconstruction process extracts semantic features from the coarse result, integrates the semantic features with the general features under a semantic attention module, and finally forms the super-resolution face reconstruction result as the output of the network.
In the degradation stage, the high-resolution faces of the training set are synthesized into low-resolution faces containing complex noise and blur distributions as the input of the generation stage. In the generation stage, the low-resolution image is roughly amplified to the target resolution, semantic features are extracted from the coarse result, the semantic features and general features are then integrated under a semantic attention module, and finally a super-resolution face reconstruction result is formed as the output of the network.
The degradation phase comprises the following steps:
s11, the high-resolution face image HR in the training set firstly passes through a residual error network group, the residual error network group comprises a plurality of residual error network blocks and a plurality of pooling layers, the pooling layers are positioned between two adjacent residual error network blocks, the pooling layers have a down-sampling function, when the high-resolution face image HR is down-sampled to 4x4, the image is up-sampled and restored to 16x16 size by a structure formed by a sub-pixel layer and related residual error network blocks, and finally the high-resolution face image HR is integrated into a three-channel image serving as a down-samplingLow resolution face image LR after quality deg ;
step S12, to ensure that the identity characteristics of LR_deg are not disturbed, a content loss function L_pix-LR is established between LR_deg and the standard image LR_bic obtained by interpolated down-sampling: L_pix-LR = ||LR_deg - LR_bic||_2;
step S13, complex noise and blur information are introduced in the degradation stage so that the content distribution of the synthesized LR_deg is closer to that of faces captured in real environments. A generative adversarial network is therefore introduced, in which an LR discriminator D_LR distinguishes the face LR_real captured in a real environment from the synthesized LR_deg. In the degradation stage, the synthesized LR_deg passes through D_LR, which extracts integrated features and outputs a single-channel tensor of size 16x16. The values in the tensor range from 0 to 1 and represent the quality distribution of the low-resolution image region at the corresponding position; the closer a value is to 1, the better the degradation quality of the corresponding region. The LR adversarial loss function L_adv-LR established by this generative adversarial process is as follows: L_adv-LR = -σ[D_LR(LR_deg)·log(D_LR(LR_real)) + (1 - D_LR(LR_deg))·log(1 - D_LR(LR_real))].
the LR output in the degradation stage under the combined action of the content loss function and the LR counter-loss function deg The distribution of (a) is more complex and diverse than a single interpolation downsampling. The synthesized low-resolution face image is used as the input of a subsequent generation stage, and the generalization and reconstruction capability of the FSR model under the multi-degradation mode can be effectively improved.
As shown in fig. 2, the residual network block includes a convolutional layer A1, a convolutional layer B2, a regularization layer C3, an activation layer D4, a convolutional layer E5, a regularization layer F6 and an activation layer G7, where the convolutional layer A1 is connected to the regularization layer C3 through the convolutional layer B2, the regularization layer C3 is connected to the convolutional layer E5 through the activation layer D4, and the convolutional layer E5 is connected to the activation layer G7 through the regularization layer F6.
As shown in fig. 3, the LR discriminator D_LR comprises an activation layer H8, a convolutional layer I9, a spectral regularization layer J10, an activation layer K11, a convolutional layer L12, a spectral regularization layer M13 and an activation layer N14; the activation layer H8 is connected to the spectral regularization layer J10 through the convolutional layer I9, the spectral regularization layer J10 is connected to the convolutional layer L12 through the activation layer K11, and the convolutional layer L12 is connected to the activation layer N14 through the spectral regularization layer M13.
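The spectral regularization layers in D_LR correspond to spectral normalization, which constrains a layer's largest singular value. The following NumPy sketch (a hedged illustration, not the patent's implementation) estimates that value by power iteration and rescales a toy weight matrix:

```python
import numpy as np

def spectral_normalize(w: np.ndarray, n_iter: int = 50) -> np.ndarray:
    """Divide a weight matrix by its largest singular value,
    estimated with power iteration (as spectral regularization does)."""
    u = np.random.RandomState(0).randn(w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v  # estimated largest singular value
    return w / sigma

w = np.array([[3.0, 0.0],   # toy 2x2 weight matrix, singular values 3 and 1
              [0.0, 1.0]])
w_sn = spectral_normalize(w)
print(np.linalg.norm(w_sn, 2))  # spectral norm of the normalized matrix: ~1.0
```

After normalization the matrix's spectral norm is 1, which bounds the discriminator's Lipschitz constant and stabilizes adversarial training.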
The coarse reconstruction process takes LR_deg through a coarse reconstruction network and outputs a coarsely reconstructed SR face SR_Coarse at the target resolution. The coarse reconstruction network consists of 11 residual network blocks divided into two groups in an 8:3 ratio, and the tail of each group of residual blocks contains a bilinear interpolation up-sampling layer. For the output SR_Coarse, a content loss function L_pix-CoarseSR is established: L_pix-CoarseSR = ||SR_Coarse - HR||_2. A semantic loss function L_seg-CoarseSR is then designed based on SR_Coarse: L_seg-CoarseSR = ||Seg(SR_Coarse) - Seg(HR)||_2.
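The content losses above are plain L2 distances between images. A minimal pure-Python example with hypothetical pixel values (the numbers are not data from the patent):

```python
import math

def l2_loss(a, b):
    """Content loss ||a - b||_2 between two flattened images."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 4-pixel images standing in for SR_Coarse and HR (values hypothetical).
sr_coarse = [0.2, 0.4, 0.6, 0.8]
hr        = [0.1, 0.4, 0.5, 1.0]
loss = l2_loss(sr_coarse, hr)
print(loss)  # sqrt(0.01 + 0 + 0.01 + 0.04) = sqrt(0.06)
```

The same function applied to Seg(SR_Coarse) and Seg(HR) channel maps would give the semantic loss L_seg-CoarseSR.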
Seg(·) is a pre-trained face semantic segmentation network that outputs face semantic segmentation predictions over 19 channels, the channels corresponding to different face components including the eyes, nose, eyebrows, hair and head. The semantic segmentation predicted from SR_Coarse serves as semantic prior information in the subsequent fine reconstruction stage.
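A 19-channel semantic prediction such as Seg(·) produces is typically reduced to a per-pixel label map by taking the highest-scoring channel. A hedged NumPy sketch with random toy scores (the 4x4 grid and the scores themselves are illustrative assumptions):

```python
import numpy as np

# Toy 19-channel semantic score map over a 4x4 pixel grid; each channel
# stands for one face component (eyes, nose, eyebrows, hair, head, ...).
rng = np.random.RandomState(42)
scores = rng.rand(19, 4, 4)

# Per-pixel semantic label: index of the channel with the highest score.
labels = scores.argmax(axis=0)

print(labels.shape)  # (4, 4) label map with values in 0..18
```

In the fine reconstruction stage the per-channel maps themselves, not the collapsed labels, are what gets fused with the general features.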
In the fine reconstruction process, the general features and the semantic features of the rough reconstructed face are integrated and enhanced through a fine reconstruction network, and finally a face super-resolution reconstruction result is output.
As shown in fig. 4, the fine reconstruction network consists of convolutional layer M15, residual network group 16, semantic feature attention network 17, and integrated convolutional layer 18.
In the fine reconstruction process, the fine reconstruction network integrates and enhances the general and semantic features of the coarsely reconstructed face and finally outputs the face super-resolution reconstruction result. As shown in the corresponding area of FIG. 1, the fine reconstruction network consists of a general convolutional layer, residual network groups, semantic feature attention networks and an integration convolutional layer (a convolutional layer that integrates the information of all channels); one semantic feature attention network and four residual network blocks together form a group with an overall residual connection. After general features are extracted from the input SR_Coarse by the general convolutional layer, they are fused with the corresponding semantic feature channels and enter the semantic feature attention network we designed. After the two kinds of face features input to the network are fused, a channel attention operation produces mixed features containing attention. The attention mechanism effectively distinguishes important face features and strengthens the influence of the semantic features on the general features in the reconstruction network. The mixed features then enter a residual network group to deepen the feature representation. After enhancement by the residual network groups and semantic feature attention networks, the mixed features enter the integration convolutional layer and are converted into a 3-channel image as the face reconstruction result SR_Fine finally output by the generation stage.
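The channel attention operation at the heart of the semantic feature attention network can be sketched in squeeze-and-excitation style. The patent does not specify the gating layers, so the sigmoid over pooled channel means below is an assumption for illustration:

```python
import numpy as np

def channel_attention(features: np.ndarray) -> np.ndarray:
    """Reweight each channel of a (C, H, W) mixed general+semantic feature
    map by a sigmoid gate computed from its global average (a sketch; the
    patent's exact excitation layers are not given)."""
    squeeze = features.mean(axis=(1, 2))       # global average pool -> (C,)
    gate = 1.0 / (1.0 + np.exp(-squeeze))      # sigmoid gating weights
    return features * gate[:, None, None]      # per-channel reweighting

feats = np.ones((8, 4, 4))                     # toy 8-channel feature map
out = channel_attention(feats)
print(out.shape)  # (8, 4, 4), each channel scaled by its attention weight
```

Channels whose pooled response is strong receive gates near 1 and pass through largely unchanged, while weak channels are suppressed, which is how the semantic channels can come to dominate the mixed representation.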
The fine reconstruction process comprises the following steps:
Step A21: after general features are extracted from the input SR_Coarse by a general convolutional layer, they are fused with the corresponding semantic feature channels and enter the semantic feature attention network together, producing mixed features containing attention.
Step A22: the mixed features containing attention enter a residual network group to deepen the feature representation, then enter the integration convolutional layer and are converted into a 3-channel image as the face reconstruction result SR_Fine finally output by the generation stage.
Step A23: constraints are established on the SR_Fine output by the generation stage; first, a content loss function L_pix-FineSR is established between SR_Fine and the standard image HR in the same training set: L_pix-FineSR = ||SR_Fine - HR||_2.
Step A24: the reconstructed SR_Fine passes through an SR discriminator D_SR, which integrates its features and outputs a tensor of size 16x16 whose values represent D_SR's judgment of the reconstruction quality of the corresponding regions; the closer a value is to 1, the better the perceptual quality of the reconstruction. The SR adversarial loss function L_adv-SR of the SR face is thus established as follows: L_adv-SR = -σ[D_SR(SR_Fine)·log(D_SR(HR)) + (1 - D_SR(SR_Fine))·log(1 - D_SR(HR))].
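With toy scalar discriminator scores standing in for the 16x16 output tensor, the cross-entropy form of the adversarial loss given above can be evaluated directly. All values below are hypothetical, and the scalar stand-in is an assumption for illustration:

```python
import math

def adv_loss(d_fake: float, d_real: float) -> float:
    """Cross-entropy style adversarial loss in the form stated above:
    -[d_fake * log(d_real) + (1 - d_fake) * log(1 - d_real)],
    with scalar scores standing in for the 16x16 discriminator tensor."""
    return -(d_fake * math.log(d_real) + (1 - d_fake) * math.log(1 - d_real))

# Toy discriminator scores in (0, 1): quality judgments for SR_Fine and HR.
loss = adv_loss(d_fake=0.3, d_real=0.8)
print(loss)
```

In training, this value would be averaged over all 16x16 positions of the discriminator output rather than computed from a single pair of scalars.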
As shown in fig. 5, the system implementing the semantic feature-based face super-resolution reconstruction method of the above steps comprises a synthesis module 61, an amplification module 62 and an integration module 63, the synthesis module 61 being connected to the integration module 63 through the amplification module 62; the synthesis module synthesizes a face into a low-resolution face image containing complex noise and blur distributions, the amplification module roughly amplifies the low-resolution face image to the target resolution, and the integration module integrates the semantic features and general features of the image to form the super-resolution face reconstruction result.
Under the combined action of the content loss function and the SR adversarial loss function, the final reconstruction result SR_Fine output by the generation stage is optimized in both content and perceptual quality. In summary, the innovation of the invention is as follows: the semantic feature-based face super-resolution reconstruction method for multiple degradation modes combines an unsupervised learning framework with semantic feature prior information, improving the generalization capability of the FSR network model under multiple degradation modes while enhancing the perceptual quality of the SR faces reconstructed under those modes.
The innovation of the invention lies in introducing face semantic features within an unsupervised learning paradigm, which enhances the generalization capability of the FSR network model when reconstructing under multiple degradation modes; the semantic features introduced as prior information in the unsupervised method remain stable and perform well across degradation modes. The unsupervised face super-resolution reconstruction network of the invention improves generalization, adapts to face reconstruction requirements under multiple degradation modes, and introduces face semantic features as a prior to improve visual perceptual quality.
By using an unsupervised network model and introducing face semantic features as prior information, the invention accomplishes the face super-resolution reconstruction task under multiple degradation modes and removes the limitation of supervised methods, which must use paired LR-HR face data sets. A network model trained under the framework of the invention can adapt to face super-resolution reconstruction requirements under various degradation conditions and improves the visual quality of the reconstructed faces. The network model is applicable to assisting video security detection, small-size face recognition, face verification, and other related high-level vision tasks in real environments.
It should be noted that the above describes only one specific embodiment of the present invention. The invention is clearly not limited to the embodiment described above; many variations that a person skilled in the art can derive or deduce directly from the disclosure of the invention are all considered to fall within the scope of the invention.
Claims (9)
1. A face super-resolution reconstruction method based on semantic features is characterized by comprising the following steps:
step one, a quality degradation stage: synthesizing from the high-resolution faces of the training set a low-resolution face image containing complex noise and blur distributions;
step two, a generation stage: the generation stage comprises a coarse reconstruction process and a fine reconstruction process; the coarse reconstruction process coarsely amplifies the low-resolution face image to the target resolution; the fine reconstruction process extracts semantic features from the result of the coarse reconstruction process, integrates the semantic features and general features in a semantic attention module, and finally forms the super-resolution face reconstruction result as the output of the network.
2. The semantic feature-based face super-resolution reconstruction method according to claim 1, wherein the quality degradation stage comprises the following steps:
step S11, the high-resolution face image HR in the training set first passes through a residual network group, the residual network group comprising a plurality of residual network blocks and a plurality of pooling layers, each pooling layer located between two adjacent residual network blocks and providing a down-sampling function; after the high-resolution face image HR has been down-sampled to 4x4, the image is up-sampled back to 16x16 by a structure formed by a sub-pixel layer and associated residual network blocks, and finally integrated into a three-channel image as the degraded low-resolution face image LR_deg;
step S12, to ensure that the identity characteristics of LR_deg are not disturbed, a content loss function L_pix-LR is established between LR_deg and the standard image LR_bic obtained by interpolation down-sampling: L_pix-LR = ||LR_deg - LR_bic||_2;
step S13, to introduce complex noise and blur information in the degradation stage so that the content distribution of the synthesized LR_deg is closer to faces captured in real environments, a generative adversarial network is introduced; an LR discriminator D_LR judges the synthesized LR_deg against faces LR_real captured in real environments, and the LR adversarial loss function L_adv-LR established by this adversarial process is as follows:
L_adv-LR = -σ[D_LR(LR_deg) log(D_LR(LR_real)) + (1 - D_LR(LR_deg)) log(1 - D_LR(LR_real))]
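The sub-pixel up-sampling layer mentioned in step S11 can be sketched in plain Python. This is a toy, stdlib-only rearrangement with nested lists standing in for tensors; the function name and list layout are illustrative, though the channel-to-phase mapping follows the usual sub-pixel convolution convention:

```python
def pixel_shuffle(channels, r):
    # Sub-pixel up-sampling: C*r*r channels of size HxW are rearranged
    # into C channels of size (H*r)x(W*r); each output pixel takes its
    # value from the channel matching its sub-pixel phase (i % r, j % r).
    c_out = len(channels) // (r * r)
    h, w = len(channels[0]), len(channels[0][0])
    out = []
    for c in range(c_out):
        plane = [[0.0] * (w * r) for _ in range(h * r)]
        for i in range(h * r):
            for j in range(w * r):
                ch = c * r * r + (i % r) * r + (j % r)
                plane[i][j] = channels[ch][i // r][j // r]
        out.append(plane)
    return out
```

Applied twice with r = 2, such a rearrangement takes a 4x4 feature map back up to 16x16, as in step S11.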
3. the semantic-feature-based face super-resolution reconstruction method according to claim 2, wherein the residual network block comprises a convolutional layer A (1), a convolutional layer B (2), a regularization layer C (3), an activation layer D (4), a convolutional layer E (5), a regularization layer F (6) and an activation layer G (7), the convolutional layer A (1) is connected with the regularization layer C (3) through the convolutional layer B (2), the regularization layer C (3) is connected with the convolutional layer E (5) through the activation layer D (4), and the convolutional layer E (5) is connected with the activation layer G (7) through the regularization layer F (6).
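The layer ordering in claim 3 can be sketched with placeholder scalar functions standing in for the real layers. This is an illustrative sketch only: the placeholder arithmetic is arbitrary, and the final skip connection is an assumption (standard for residual blocks, but not spelled out in the claim):

```python
def conv(x):
    return 2.0 * x          # placeholder for a convolutional layer

def norm(x):
    return x / 2.0          # placeholder for a regularization layer

def act(x):
    return max(0.0, x)      # placeholder for an activation layer (ReLU-like)

def residual_block(x):
    # conv A (1) -> conv B (2) -> regularization C (3) -> activation D (4)
    y = act(norm(conv(conv(x))))
    # -> conv E (5) -> regularization F (6) -> activation G (7)
    y = act(norm(conv(y)))
    # assumed skip connection: add the block input back onto the output
    return x + y
```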
4. The semantic feature-based face super-resolution reconstruction method of claim 2, wherein the LR discriminator D_LR comprises an activation layer H (8), a convolutional layer I (9), a spectral normalization layer G (10), an activation layer K (11), a convolutional layer L (12), a spectral normalization layer M (13) and an activation layer N (14); the activation layer H (8) is connected with the spectral normalization layer G (10) through the convolutional layer I (9), the spectral normalization layer G (10) is connected with the convolutional layer L (12) through the activation layer K (11), and the convolutional layer L (12) is connected with the activation layer N (14) through the spectral normalization layer M (13).
5. The semantic feature-based face super-resolution reconstruction method of claim 1, wherein in the coarse reconstruction process, LR_deg passes through a coarse reconstruction network to output a coarsely reconstructed SR face SR_Coarse at the target resolution, and a corresponding content loss function L_pix-CoarseSR is established: L_pix-CoarseSR = ||SR_Coarse - HR||_2; a semantic loss function L_seg-CoarseSR is then designed based on SR_Coarse: L_seg-CoarseSR = ||Seg(SR_Coarse) - Seg(HR)||_2, where Seg(·) is a pre-trained face semantic segmentation network that outputs face semantic segmentation predictions over 19 channels, the channels corresponding to different face components including the eyes, nose, eyebrows, hair and head.
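The semantic loss in claim 5 compares segmentation predictions channel by channel. A minimal sketch in plain Python, where each argument is a list of 19 per-component channels and each channel a flat list of scores; the function name is illustrative, and Seg(·) is assumed to have been applied already:

```python
import math

def semantic_loss(seg_sr, seg_hr):
    # ||Seg(SR_Coarse) - Seg(HR)||_2 over all semantic channels:
    # accumulate squared differences per channel, then take the root
    squared = 0.0
    for ch_sr, ch_hr in zip(seg_sr, seg_hr):
        squared += sum((a - b) ** 2 for a, b in zip(ch_sr, ch_hr))
    return math.sqrt(squared)
```

Identical segmentation maps give a loss of zero, so the coarse network is pushed toward placing facial components where the ground-truth HR face has them.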
6. The semantic feature-based face super-resolution reconstruction method of claim 5, wherein in the fine reconstruction process, the general features and semantic features of the coarsely reconstructed face are integrated and enhanced through a fine reconstruction network, which finally outputs the face super-resolution reconstruction result.
7. The super-resolution face reconstruction method based on semantic features of claim 6, wherein the fine reconstruction network is composed of a convolutional layer M (15), a residual network group (16), a semantic feature attention network (17) and an integrated convolutional layer (18).
8. The semantic feature-based face super-resolution reconstruction method according to claim 7, wherein the fine reconstruction process comprises the following steps:
step A21, after general features are extracted from the input SR_Coarse by a general convolutional layer, they are fused with the corresponding semantic feature channels and passed together into the semantic feature attention network, yielding mixed features containing attention;
step A22, the mixed features containing attention enter the residual network group to deepen the feature expression, then enter the integrated convolutional layer, and are converted into a 3-channel image as the face reconstruction result SR_Fine finally output by the generation stage;
step A23, constraints are established on the output SR_Fine of the generation stage: first, a content loss function L_pix-FineSR is established between SR_Fine and the standard image HR in the same training set: L_pix-FineSR = ||SR_Fine - HR||_2;
step A24, the reconstructed SR_Fine is passed through an SR discriminator D_SR, which integrates its features and outputs a 16x16 tensor whose values range from 0 to 1 and represent the judgment of D_SR on the reconstruction quality of the corresponding regions; the closer a value is to 1, the better the perceptual reconstruction quality; the SR adversarial loss function L_adv-SR of the SR face is thereby established as follows:
L_adv-SR = -σ[D_SR(SR_Fine) log(D_SR(HR)) + (1 - D_SR(SR_Fine)) log(1 - D_SR(HR))].
9. A system implementing the semantic feature-based face super-resolution reconstruction method according to claim 1, characterized in that the system comprises a synthesis module (61), an amplification module (62), and an integration module (63), the synthesis module (61) being connected to the integration module (63) through the amplification module (62); the synthesis module synthesizes from a face a low-resolution face image containing complex noise and blur distributions, the amplification module coarsely amplifies the low-resolution face image to the target resolution, and the integration module integrates the semantic features and general features of the image to form the super-resolution face reconstruction result.
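The module chain in claim 9 can be sketched as a simple composition. This toy sketch uses dicts standing in for images and placeholder functions for the three modules; all names are illustrative, and the 8x scale factor is an arbitrary assumption:

```python
def synthesis_module(hr_face):
    # synthesize a degraded low-resolution face from an HR face
    return {"res": hr_face["res"] // 8, "stage": "LR_deg"}

def amplification_module(lr_face):
    # coarsely amplify the LR face to the target resolution
    return {"res": lr_face["res"] * 8, "stage": "SR_Coarse"}

def integration_module(sr_coarse):
    # integrate semantic and general features into the final SR face
    return {"res": sr_coarse["res"], "stage": "SR_Fine"}

hr = {"res": 128, "stage": "HR"}
sr = integration_module(amplification_module(synthesis_module(hr)))
```

The composition mirrors the claim: the synthesis module feeds the integration module only through the amplification module.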
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210426417.9A CN114820310A (en) | 2022-04-21 | 2022-04-21 | Semantic feature-based face super-resolution reconstruction method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114820310A true CN114820310A (en) | 2022-07-29 |
Family
ID=82505319
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210426417.9A Pending CN114820310A (en) | 2022-04-21 | 2022-04-21 | Semantic feature-based face super-resolution reconstruction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114820310A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115953296A (*) | 2022-12-09 | 2023-04-11 | 中山大学·深圳 | Face super-resolution reconstruction method and system based on combination of Transformer and convolutional neural network |
CN115953296B (*) | 2022-12-09 | 2024-04-05 | 中山大学·深圳 | Face super-resolution reconstruction method and system based on combination of Transformer and convolutional neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112287940A (en) | Semantic segmentation method of attention mechanism based on deep learning | |
CN110969124B (en) | Two-dimensional human body posture estimation method and system based on lightweight multi-branch network | |
CN110348330B (en) | Face pose virtual view generation method based on VAE-ACGAN | |
CN111179167B (en) | Image super-resolution method based on multi-stage attention enhancement network | |
CN112819910B (en) | Hyperspectral image reconstruction method based on double-ghost attention machine mechanism network | |
CN113283444B (en) | Heterogeneous image migration method based on generation countermeasure network | |
CN113658040A (en) | Face super-resolution method based on prior information and attention fusion mechanism | |
CN111476133B (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN110188667B (en) | Face rectification method based on three-party confrontation generation network | |
CN112070668A (en) | Image super-resolution method based on deep learning and edge enhancement | |
CN109903373A (en) | A kind of high quality human face generating method based on multiple dimensioned residual error network | |
CN114820310A (en) | Semantic feature-based face super-resolution reconstruction method and system | |
CN110992374A (en) | Hair refined segmentation method and system based on deep learning | |
CN113935435A (en) | Multi-modal emotion recognition method based on space-time feature fusion | |
CN115631107A (en) | Edge-guided single image noise removal | |
CN115526777A (en) | Blind over-separation network establishing method, blind over-separation method and storage medium | |
CN113379606A (en) | Face super-resolution method based on pre-training generation model | |
CN116664435A (en) | Face restoration method based on multi-scale face analysis map integration | |
CN114187668B (en) | Face silence living body detection method and device based on positive sample training | |
CN116309213A (en) | High-real-time multi-source image fusion method based on generation countermeasure network | |
CN116258627A (en) | Super-resolution recovery system and method for extremely-degraded face image | |
CN115424337A (en) | Iris image restoration system based on priori guidance | |
CN116266336A (en) | Video super-resolution reconstruction method, device, computing equipment and storage medium | |
CN115131414A (en) | Unmanned aerial vehicle image alignment method based on deep learning, electronic equipment and storage medium | |
CN111950496B (en) | Mask person identity recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||