CN116523985B

CN116523985B - Structure and texture feature guided double-encoder image restoration method

Info

Publication number: CN116523985B
Application number: CN202310501736.6A
Authority: CN
Inventors: 张家骏; 廉敬; 刘津颖; 刘冀钊; 张怀堃; 董子龙; 郑礼; 汤春阳
Original assignee: Lanzhou Jiaotong University
Current assignee: Lanzhou Jiaotong University
Priority date: 2023-05-06
Filing date: 2023-05-06
Publication date: 2024-01-02
Anticipated expiration: 2043-05-06
Also published as: CN116523985A

Abstract

The invention discloses a double-encoder image restoration method guided by structural and texture features, belongs to the technical field of image restoration, and provides a double-encoder coarse restoration network guided by the structural features and the texture features and a fine restoration network based on a long-short period attention mechanism and a multi-scale receptive field, so that the combined restoration of the structure and the texture of a visual field defect image is realized.

Description

Structure and texture feature guided double-encoder image restoration method

Technical Field

The invention relates to the technical field of image restoration, in particular to a double-encoder image restoration method guided by structure and texture features.

Background

The purpose of the visual field defect image restoration is to restore the mask area in the digital image, fill the mask area with reasonable and vivid content and correct context semantics, restore the panorama and improve the picture texture. It is an important task in computer vision that can be used as an image editing tool to remove unwanted objects and restore defective images, and early image restoration methods have been mainly diffusion-based and block-based. Diffusion-based methods utilize thermal diffusion equations in physics to propagate information around the area to be repaired into the repair area through partial differential equations and variational principles. This method is only applicable to small scale defect repair in images. The block-based method comprises the steps of firstly selecting a pixel point from the boundary of a region to be repaired, taking the pixel point as a center, selecting a texture block with proper size according to the texture characteristics of an image, and then searching a texture matching block closest to the region to be repaired around the region to be repaired to replace the texture block.

However, when the key region and the important structure are defective, the method is not applicable any more, and with the continuous development of the deep learning technology, the Convolutional Neural Network (CNN) and the generated countermeasure network (GAN) based restoration method are widely applied, so that an effective tool is provided for image restoration. The existing image restoration method generally adopts a coder-decoder to extract the structure, texture and context semantics of the image, and then completes the reasonable restoration task on the visual sense of the defective image by means of generating an countermeasure network.

While existing methods can generate realistic and semantically trusted structures and textures of content within the mask region, typically either a single codec is used for repair, or two codecs are used for repair separately, ignoring the association between image structure and texture, which results in insufficient or unmatched structural expression of the texture of the image. The image generation process lacks guidance for joint extraction of image structures and texture features. Therefore, the invention provides a double-encoder coarse repair network guided by structural features and texture features and a fine repair network based on a long-short-period attention mechanism and a multi-scale receptive field, which realize the joint repair of the image structure and texture of the visual field defect.

Disclosure of Invention

The present invention aims to solve the above-mentioned problems, and to provide a structure and texture feature guided dual encoder image restoration method.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: the double-encoder image restoration method comprises a coarse restoration network and a fine restoration network, wherein the implementation steps of the coarse restoration network and the fine restoration network are as follows:

s1: the defect image to be repaired and the binary mask (mask=1) image are taken as the input of a structural encoder and a texture encoder together, and after the two encoders extract image features layer by layer, the structural distribution feature data and the texture distribution feature data are fitted by two Gaussian distribution functions N (0, 1) which are input externally.

S2: the structural feature space and the texture feature space are mapped to the potential space by the cross-semantic attention module, and the decoder recovers the image mask region from the potential space random samples.

S3: in the process of feature extraction, pyramid fusion is carried out on features extracted by the structural encoder and the texture encoder, and the fused feature space is used for guiding image restoration of a decoder to obtain a rough restoration result.

S4: using fused images I _fuse And mask image M together as input to the fine repair network.

S5: removing artifacts of a fused image mask region of a thin repair network by using a residual gate convolution network, and extracting image characteristic information;

s6: 3 different receptive fields are designed, useful local features and structural features are automatically filtered, interference of useless detail features in an image is effectively reduced, and the sizes of the 3 different receptive fields are respectively 3×3, 5×5 and 7×7;

s7: and a long-short term attention module is added to solve the problems of fuzzy areas and inconsistent context semantics in the image. In the long-period attention module, the attention weight matrix can be used for linking the decoding characteristic control space context, acquiring the coding characteristic of the fine repair network and completing the repair of the mask area by combining the network decoding characteristic and the coding characteristic.

S8: the decoder captures remote features by linking remote spatial contexts, maintains global semantic consistency of the image, selects finer granularity features and valid semantic features of the encoded features according to local features of the repair image, and gradually reconstructs the image with short-term and long-term attention scores to obtain a fine repair image with high resolution characteristics.

Further, the fitting of the two distributions in S1 adopts a Kullback-Leibler Divergence (KL) divergence;

the KL divergence is used for regularizing the learned importance sampling function, and the sampling function is constrained on a potential prior;

the potential a priori distribution is defined as a gaussian distribution, where the KL regularization of the structural and texture encoders is as follows:

wherein I is _m Representing a damaged image; z represents the potential space, which is the compressed data space corresponding to the structural features and the texture features, and the similar data points are smaller in distance in the potential space; q _ψ Andimportance sampling functions of image structure distribution and texture distribution respectively; n (0,I) represents a Gaussian distribution; l (L) ^S _KL A KL divergence loss function representing a structural feature; l (L) ^T _KL KL divergence loss function representing texture features.

Further, the cross-semantic attention module in S2 is placed after the dual encoder module, the structural encoder feature space F _S And texture encoder feature space F _T The two feature spaces are mapped to potential spaces by a 1x1 convolution filter. The cross-semantic attention module calculates the attention of the two feature spaces to obtain the attention scores of the two feature spaces.

Wherein the method comprises the steps of

In the formulas (3) - (4), β _j,i Indicating the extent to which the model focuses on the ith location when synthesizing the jth region. N represents the number of roughly restored image pixels; s is(s) _ij Is Q in the cross-attention module ^T And K multiplied by each other; q (F) _T ) Representing image texture features; k (F) _s ) Representing the image structural features. The calculation formula of the output O of the cross semantic attention module can be finally obtained as follows:

O＝αF _ST +F _S (6)

wherein the method comprises the steps of

V(F _S )＝W _v F _S (7)

In the formulas (5) - (7), F _ST Represents an attention score; v (F) _S ) Representing structural features of the image to be calculated, W _v Is a 1x1 convolution filter, alpha is a balance F _ST And F _S The initial value of the learnable scale parameter of the weight is set to 0; the cross-semantic attention network begins with learning the correlation of structural and textural features and finally extends to learning the interdependence and association of structural and textural features from feature maps.

Further, the coarse restoration result of the image in S3 is reconstructed pixel by using a Mean Absolute Error (MAE) distance:

in the formulas (8) - (9), I ^C _out Representing a rough restoration result of the image; i _g Representing a gold standard image (group Truth image); m represents a binary mask image; l (L) ^C _hole A reconstruction loss function representing a defective image mask region; l (L) ^C _valid Representing the reconstruction loss function of the non-masked areas of the defective image. Thus, pixel-by-pixel reconstruction loss L ^C _r The method comprises the following steps:

in the formula (10), lambda _rec To reconstruct the loss balance factor, this value is set to 20. Furthermore, for the coarse repair network in fig. 1, the LSGAN method is adopted ^[9] Compared with the traditional GAN loss function, the method can enable the network training to be more stable, and the generated image is more natural, and is defined as follows:

in the formulas (11) - (12), D represents a discriminator of the GAN network; l (L) _D Representing a countering loss function of the GAN network arbiter; e (E) _{Ig～pdata(Ig)} A probability density function representing a gold standard image; l (L) _G Representing a countermeasures loss function of the GAN network generator;representing the probability density function of the coarsely reconstructed image.

In summary, the total loss of the coarse repair network is defined as:

further, the fused image I in S4 _fuse The formula of (c) is defined as follows:

I _fuse ＝I _{out_m} +(1-M)*I _g (14)

in equation (14), the image I is fused _fuse Mask area I for coarsely restored image _{out_m} ＝M×I ^C _out Sum-gold standard image I _g Is included in the image data.

Further, the attention weight matrix beta in S7 _j,i The calculation formula of (2) is as follows:

wherein the method comprises the steps of

s _ij ＝Q(f _di ) ^T K(f _dj ) (16)

In the formulas (15) - (16), β _j,i Indicating the extent to which the model focuses on the ith location when synthesizing the jth region. N represents the number of pixels of the fine repair image; f (f) _dj Representing decoding characteristics; s is(s) _ij Is long and isShort term attention module Q ^T And K multiplied by each other; k (f) _dj ) Representing the input information corresponding to the decoding characteristics. Q (f) _di ) ^T Representing query vectors corresponding to decoded features

Q(f _di ) ^T ＝(f _di ) ^T ＝W _q f _di (17)

In the formula (17), W _q Is a 1x1 convolution filter. The self-attention layer formula of the long-short period attention module is expressed as follows:

V _D (f _dj ) Representing input information corresponding to decoding characteristics to be calculated; in order to realize the combination of the fine granularity characteristic of the encoder and the characteristic of the decoder, the encoder layer and the decoder layer in the global refinement network are connected by adopting jump connection, and the characteristic of the encoder layer and the attention weight matrix beta are used for realizing the combination of the fine granularity characteristic of the encoder and the characteristic of the decoder _j,i Score calculation of (2) to obtain remote spatial context characteristics, long-short distance attention layer output F _out The calculation formula of (2) is as follows:

in the formula (19), V _E (f _ei ) Representing input information corresponding to the coding feature to be calculated. The calculation formula of the output O of the whole long-short period attention module is as follows:

O＝γ(1-M)F _out +Mf _e (20)

wherein f _e Coding features representing a remote space; m represents a binary mask; gamma is equilibrium F _out And f _e A learnable scale parameter of the weights in between.

Further, setting the first training target of the refinement network in the step S7 as a reconstruction loss L ^R _r As with the reconstruction loss setting in the coarse repair network, the MAE was used for pixel-by-pixel reconstruction:

in the formulas (21) - (22), I ^R _out Representing a thin repair result of the image; l (L) ^R _hole A reconstruction loss function representing a fused image mask region; l (L) ^R _valid Representing the reconstruction loss function of the non-masked regions of the fused image. The invention also adds in the perception loss ^[10] And loss of style ^[11] And performing feature extraction on the image by using the trained VGG-16 network, and calculating the loss of the two in the spatial features. Perception loss L ^R _per The definition is as follows:

in formula (23), F _i And (5) representing the i-th layer characteristic diagram in the pretrained VGG-16 network. Style loss L ^R _syle The definition is as follows:

wherein G is _i A gram matrix is represented, representing the covariance matrix between features and the correlation between each feature. In summary, the overall loss L of the global refinement network _R The method comprises the following steps:

wherein lambda is _rec 、λ _p And lambda (lambda) _s Are balance factors.

Compared with the prior art, the invention has the following beneficial effects:

(1) The dual encoder coarsely repairs a model framework of network extraction structural features and texture features;

(2) A method and a technical route for a dual-encoder coarse restoration network to guide a decoder to carry out image restoration;

(3) A fine repair network model architecture based on a long-short-term attention mechanism and a multi-scale receptive field is connected with algorithm parameter setting of a remote space context.

Drawings

FIG. 1 is a flow chart of a dual encoder generated image restoration method of the present invention;

FIG. 2 is a schematic diagram of a cross-semantic attention module of the present invention;

FIG. 3 is a schematic diagram of a long-short term attention module according to the present invention;

fig. 4 is a visual effect contrast chart of six image restoration methods of the present invention.

Detailed Description

The invention is further described in connection with the following detailed description, in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the invention easy to understand.

The technical scheme of the invention is that a flow chart of a double-encoder image restoration method is shown in figure 1. The dual encoder image restoration method includes two steps: a coarse repair network implementation step and a fine repair network implementation step. The training targets of the coarse restoration network comprise image data characteristic distribution regularization, image reconstruction loss and network countermeasure loss.

The rough repair network comprises the following steps:

(1) the defect image to be repaired and the binary mask (mask=1) image are taken as the input of a structural encoder and a texture encoder together, and after the two encoders extract image features layer by layer, the structural distribution feature data and the texture distribution feature data are fitted by two Gaussian distribution functions N (0, 1) which are input externally. The fit of both distributions uses a Kullback-Leibler Divergence (KL) divergence. The KL divergence is used to regularize the learned importance sampling function, constraining the sampling function to a potential prior. The potential a priori distribution is defined as a gaussian distribution, where the KL regularization of the structural and texture encoders is as follows:

(2) The structural feature space and the texture feature space are mapped to the potential space by a cross-semantic attention module (cross-semantic attention module is shown in fig. 2) from which the encoder randomly samples the restored image mask region.

In FIG. 2, the cross-semantic attention module is placed after the dual encoder module, structural encoder feature space F _S And texture encoder feature space F _T The two feature spaces are mapped to potential spaces by a 1x1 convolution filter. The cross-semantic attention module calculates the attention of the two feature spaces to obtain the attention scores of the two feature spaces.

Wherein the method comprises the steps of

s _ij ＝Q(F _T ) ^T K(F _S ) (4)

O＝αF _ST +F _S (6)

wherein the method comprises the steps of

V(F _S )＝W _v F _S (7)

(3) In the process of extracting the characteristics, the characteristics extracted by the structure encoder and the texture encoder are subjected to gold word

Tower type fusion, wherein the fused feature space is used for guiding the image restoration of the decoder to obtain a rough restoration result. For the rough restoration result of the image, the invention adopts the Mean Absolute Error (MAE) distance to reconstruct pixel by pixel:

In summary, the total loss of the coarse repair network is defined as:

in formula (13), lambda _KL Represents the KL divergence loss balance factor, which is set to 20. After the rough repair process is completed, rough repair result I ^C _out The masking region of the image can be restored because of the design of gate convolution in the coarse restoration network, artifacts caused by the masking region are eliminated, but two problems still exist:

(1) the image mask area is still blurred after being repaired;

(2) the content of the image after the completion is consistent with the whole lack of semantics, and the context semantics are inconsistent.

In order to solve the problems, the invention designs a global fine restoration network, which adopts a multi-scale feature extraction mode and a long-period attention module to eliminate a fuzzy region in an image, unifies global semantics, and improves the resolution of an image mask region and the consistency of the global semantics.

The implementation steps of the double-encoder image fine restoration adopt the following algorithm:

(1) using fused images I _fuse And mask image M together as input to the fine repair network. Fusion image I _fuse The formula of (c) is defined as follows:

I _fuse ＝I _{out_m} +(1-M)*I _g (14)

(2) Removing artifacts of the fused image mask region of the fine repair network using a residual gate convolution network, extracting image feature information (in fig. 1, the residual gate convolution network is represented by blue rectangular blocks);

(3) 3 different receptive fields are designed, useful local features and structural features are automatically filtered, interference of useless detail features in an image is effectively reduced, and the sizes of the 3 different receptive fields are respectively 3×3, 5×5 and 7×7;

(4) adding a long-short term attention module to solve the problem of blurred areas and up-down in the imageText semantic inconsistency problem (long and short term attention module red rectangular block is shown in fig. 1, long and short term attention module is shown in fig. 3). In the long-period attention module, the attention weight matrix can be used for linking the decoding characteristic control space context, acquiring the coding characteristic of the fine repair network and completing the repair of the mask area by combining the network decoding characteristic and the coding characteristic. Attention weight matrix beta _j,i The calculation formula of (2) is as follows:

wherein the method comprises the steps of

s _ij ＝Q(f _di ) ^T K(f _dj ) (16)

In the formulas (15) - (16), β _j,i Indicating the extent to which the model focuses on the ith location when synthesizing the jth region. N represents the number of pixels of the fine repair image; f (f) _dj Representing decoding characteristics; s is(s) _ij Is Q in the long-short period attention module ^T And K multiplied by each other; k (f) _dj ) Representing the input information corresponding to the decoding characteristics. Q (f) _di ) ^T Representing query vectors corresponding to decoded features

Q(f _di ) ^T ＝(f _di ) ^T ＝W _q f _di (17)

V _D (f _dj ) Representing input information corresponding to decoding characteristics to be calculated; in order to realize the combination of the fine granularity characteristic of the encoder and the characteristic of the decoder, the encoder layer and the decoder layer in the global refinement network are connected by adopting jump connection, and the characteristic of the encoder layer and the attention weight matrix beta are used for realizing the combination of the fine granularity characteristic of the encoder and the characteristic of the decoder _j,i Score calculation of (a) to obtain remote space context characteristics, long and short distancesAttention layer output F _out The calculation formula of (2) is as follows:

O＝γ(1-M)F _out +Mf _e (20)

wherein f _e Coding features representing the far space (represented in fig. 3 by orange matrix blocks); m represents a binary mask; gamma is equilibrium F _out And f _e A learnable scale parameter of the weights in between.

(5) The decoder captures the remote features by linking the remote spatial contexts (in the fine repair network of fig. 1, the context links are connected with orange solid lines), maintains global semantic consistency of the image, selects finer granularity features and valid semantic features of the decoded features according to the environment of the repair image, and gradually reconstructs the image with short-term and long-term attention scores, obtaining a fine repair image with high resolution characteristics.

The first training goal of the refinement network is to reconstruct the loss L ^R _r As with the reconstruction loss setting in the coarse repair network, the MAE was used for pixel-by-pixel reconstruction:

in the formulas (21) - (22), I ^R _out Representing a thin repair result of the image; l (L) ^R _hole A reconstruction loss function representing a fused image mask region (as shown in fig. 1); l (L) ^R _valid Representing the re-masking of non-masked regions of a fused imageAnd constructing a loss function. The invention also adds in the perception loss ^[10] And loss of style ^[11] And performing feature extraction on the image by using the trained VGG-16 network, and calculating the loss of the two in the spatial features. Perception loss L ^R _per The definition is as follows:

wherein lambda is _rec 、λ _p And lambda (lambda) _s Are balance factors.

Comparing the experimental data results:

the color images used in the experiments of the present invention were all from the CelebA-HQ dataset ^[12] . The high resolution dataset of the CelebA-HQ dataset contained 30000 face images, randomly selecting 27000 images for training and 300 images for testing.

The superior performance of the method of the present invention was verified by comparison with five other representative algorithms. The five comparison algorithms comprise GC algorithm, PIC algorithm, MEDFE algorithm, RFR algorithm and MADF algorithm. For the image quality evaluation index, several common indexes in the image restoration task are adopted: l1 error, peak signal-to-noise ratio (PSNR), structural Similarity (SSIM), fuv Lei Xiete distance (FID), and learning perceived image block similarity (LPIPS). The experimental results of the CelebA-HQ dataset are shown in Table 1 (best evaluation results are shown in bold font, second best evaluation results are shown in underline).

Table 1: celebA-HQ dataset experimental result comparison chart

As can be seen from the color image restoration results in Table 1, in the restoration process of which the mask area is the image center area, the evaluation results of the evaluation index L1 and the LPIPS are the best in six algorithms, which indicates that the error between the experimental image obtained by the method and the pixel value of the gold standard image (group Truth image) is the smallest in all the experimental methods, and the image restoration effect is the best. The evaluation results of the evaluation indexes PSNR and SSIM of the method are inferior to those obtained by the comparison algorithm GC in all experimental methods, which shows that the experimental images obtained by the method have stronger consistency with the gold standard images (group Truth images). The evaluation results of the evaluation index FID of the method are ranked second in all comparison algorithms, which shows that the experimental image obtained by the method has stronger correlation with the gold standard image.

In the restoration process of the mask area being the image random area, the evaluation results of the method evaluation indexes L1, SSIM and FID are the best of six algorithms, the error between the pixel values of the experimental image obtained by the method and the pixel values of the gold standard image (group trunk image) is the smallest in all the experimental methods, and the consistency and the relevance are the strongest, which indicates that the comprehensive restoration effect of the image is the best.

Fig. 4 shows a visual effect comparison of six image restoration methods. The experimental gallery is the CelebA-HQ dataset. Wherein the first line of images is the experimental result that the mask area is the central area of the image. The second line of images is the experimental result of the mask area being a random area. Further, the first column image is a gold standard image (group Truth image); the second column of images are defect images; the third column of images are repair images obtained by the MEDFE algorithm; the fourth column of images are repair images obtained by a GC algorithm; the fifth column of images are repair images obtained by an MADF algorithm; the sixth column of images are repair images obtained by an RFR algorithm; the seventh column of images are repair images obtained by PIC algorithm; the eighth column of images is the repair image obtained by the method.

In CelebA-HQ dataset experiments, MADF and the method of the invention can accurately repair and display the characteristics of human face eyes, nose, mouth, hair and the like. For large-area mask areas, the filling effect of other algorithms is poor, and the filling effect is mainly represented by facial feature blurring and rough texture. Compared with the repairing image obtained by the method, the facial features of the person are clearer and more natural, and the visual effect of human eyes is better.

The experimental environment used pyrach 1.8.0,python 3.6.13,GPU was NVIDIA GeForce RTX 3090. The experimental network contained 14M trainable parameters using Orthogonal Initialization and Adam optimization algorithms. The fixed learning rate of the network training is gamma=10 ^-4 . The balance factor is empirically set to lambda _rec ＝20、λ _kl ＝20、λ _p ＝0.05、λs＝100。

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although the present disclosure describes embodiments, not every embodiment is provided with a separate embodiment, and that this description is provided for clarity only, and that the disclosure is not limited to the embodiments described in detail below, and that the embodiments described in the examples may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims

1. The double-encoder image restoration method guided by the structure and the texture features is characterized by comprising a coarse restoration network and a fine restoration network, wherein the implementation steps of the coarse restoration network and the fine restoration network are as follows:

s1: the defect image to be repaired and the binary mask image are taken as the input of a structure encoder and a texture encoder together, after the two encoders extract image features layer by layer, two Gaussian distribution functions N (0, 1) input from the outside fit the structure distribution feature data and the texture distribution feature data;

s2: the structural feature space and the texture feature space are mapped to the potential space through a cross semantic attention module, and the encoder restores an image mask area from the potential space random sampling;

s3: in the process of feature extraction, pyramid fusion is carried out on features extracted by a structural encoder and a texture encoder, and the fused feature space is used for guiding image restoration of a decoder to obtain a rough restoration result;

s4: using fused images I _fuse And mask image M together as an input to the fine repair network;

s7: adding a long-term attention module and a short-term attention module, solving the problem of fuzzy area and inconsistent context semantics in the image, wherein in the long-term attention module, an attention weight matrix can be used for linking with the context of a decoding feature control space, acquiring coding features of a fine repair network, combining network decoding features and coding features, and finishing the repair of a mask area;

the attention weight matrix beta in the S7 _j,i The calculation formula of (2) is as follows:

wherein the method comprises the steps of

s _ij ＝Q(f _di ) ^T K(f _dj ) (16)

In the formulas (15) - (16), β _j,i Representing the extent to which the model focuses on the ith location when synthesizing the jth region; n represents the number of pixels of the fine repair image; f (f) _dj Representing decoding characteristics; s is(s) _ij Is Q in the long-short period attention module ^T And K multiplied by each other; k (f) _dj ) Input information corresponding to the decoding characteristics is represented; q (f) _di ) ^T Representing query vectors corresponding to decoded features

Q(f _di ) ^T ＝(f _di ) ^T ＝W _q f _di (17)

In the formula (17), W _q Is a 1x1 convolution filter, and the self-attention layer formula of the short-period attention module is expressed as follows:

in the formula (19), V _E (f _ei ) The input information corresponding to the coding feature to be calculated is represented, and the calculation formula of the output O of the whole long-short-period attention module is as follows:

O＝γ(1-M)F _out +Mf _e (20)

wherein f _e Coding features representing a remote space; m represents a binary mask; gamma is equilibrium F _out And f _e A learnable scale parameter of the inter-weight;

s8: the decoder captures remote features by linking remote spatial contexts, maintains global semantic consistency of the image, selects finer granularity features and valid semantic features of the decoded features according to the environment of the repair image, and gradually reconstructs the image with short-term and long-term attention scores, obtaining a fine repair image with high resolution characteristics.

2. A structure and texture feature guided dual encoder image restoration method according to claim 1, wherein the fitting of the two distributions in S1 uses Kullback-Leibler Divergence (KL) divergence;

wherein I is _m Representing a damaged image; z represents potential space, itThe compressed data space corresponding to the structural features and the texture features is formed, and the distance between similar data points in the potential space is smaller; q _ψ Andimportance sampling functions of image structure distribution and texture distribution respectively; n (0,I) represents a Gaussian distribution; l (L) ^S _KL A KL divergence loss function representing a structural feature; l (L) ^T _KL KL divergence loss function representing texture features.

3. A structure and texture feature guided dual encoder image restoration method according to claim 1, characterized in that the cross semantic attention module in S2 is placed after the dual encoder module, the structure encoder feature space F _S And texture encoder feature space F _T Mapping the two feature spaces to potential spaces by a convolution filter of 1x 1; the cross semantic attention module evaluates the attention of the two feature spaces to obtain the attention scores of the two feature spaces:

wherein the method comprises the steps of

s _ij ＝Q(F _T ) ^T K(F _S ) (4)

In the formulas (3) - (4), β _j,i Representing the degree to which the model focuses on the ith location in synthesizing the jth region, N representing the number of roughly restored image pixels; s is(s) _ij Is Q in the cross-attention module ^T And K multiplied by each other; q (F) _T ) Representing image texture features; k (F) _s ) The calculation formula for representing the image structural characteristics and finally obtaining the output O of the cross semantic attention module is as follows:

O＝αF _ST +F _S (6)

wherein the method comprises the steps of

V(F _S )＝W _v F _S (7)

4. A structure and texture feature guided dual encoder image restoration method according to claim 1, characterized in that the coarse restoration of the image in S3 is a pixel-by-pixel reconstruction using Mean Absolute Error (MAE) distance:

in the formulas (8) - (9), I ^C _out Representing a rough restoration result of the image; i _g Representing a gold standard image; m represents a binary mask image; l (L) ^C _hole A reconstruction loss function representing a defective image mask region; l (L) ^C _valid Reconstruction loss function representing non-masked areas of a defective image, therefore, pixel-by-pixel reconstruction loss L ^C _r The method comprises the following steps:

in the formula (10), lambda _rec To reconstruct the loss balance factor, the value is set to 20, and in addition, for a coarse repair network, the LSGAN method is used to set the loss function, and compared with the traditional GAN loss function, the method can make network training more stable, and the generated image is more natural, and is defined as follows:

in the formulas (11) - (12), D represents a discriminator of the GAN network; l (L) _D Representing a countering loss function of the GAN network arbiter; e (E) _{Ig～pdata(Ig)} A probability density function representing a gold standard image; l (L) _G Representing a countermeasures loss function of the GAN network generator;a probability density function representing the coarsely reconstructed image;

in summary, the total loss of the coarse repair network is defined as:

5. a structure and texture feature guided dual encoder image restoration method according to claim 1, wherein said fused image I in S4 _fuse The formula of (c) is defined as follows:

I _fuse ＝I _{out_m} +(1-M)*I _g (14)

6. A structure and texture feature guided dual encoder image restoration method according to claim 1, characterized in that the first training goal of the refinement network in S7 is set to reconstruct the loss L ^R _r As with the reconstruction loss setting in the coarse repair network, the MAE was used for pixel-by-pixel reconstruction:

in the formulas (21) - (22), I ^R _out Representing a thin repair result of the image; l (L) ^R _hole A reconstruction loss function representing a fused image mask region; l (L) ^R _valid Representing a reconstruction loss function of a non-mask region of the fused image, adding perception loss and style loss, extracting features of the image by using a trained VGG-16 network, and calculating the loss of the two features in a spatial feature; perception loss L ^R _per The definition is as follows:

in formula (23), F _i Representing an ith layer characteristic diagram in a pre-trained VGG-16 network; style loss L ^R _syle The definition is as follows:

wherein G is _i Representing a gram matrix representing a covariance matrix between features and correlations between each feature; in combination with the above-mentioned,total loss L of global refinement network _R The method comprises the following steps:

wherein lambda is _rec 、λ _p And lambda (lambda) _s Are balance factors.