CN115035170B - Image restoration method based on global texture and structure - Google Patents


Info

Publication number: CN115035170B
Application number: CN202210535815.4A
Authority: CN (China)
Prior art keywords: attention, texture, layer, block, filling
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN115035170A (en)
Inventors: 王杨, 刘海鹏, 汪萌
Assignee (current and original): Hefei University of Technology
Application filed by Hefei University of Technology; priority date and filing date: 2022-05-17
Publication of application CN115035170A; application granted and published as CN115035170B

Classifications

    • G06T 7/41: Analysis of texture based on statistical description of texture (G: Physics; G06: Computing; G06T: Image data processing or generation, in general; G06T 7/00: Image analysis; G06T 7/40: Analysis of texture)
    • G06N 3/088: Non-supervised learning, e.g. competitive learning (G06N: Computing arrangements based on specific computational models; G06N 3/02: Neural networks; G06N 3/08: Learning methods)
    • G06T 5/77: Retouching; Inpainting; Scratch removal (G06T 5/00: Image enhancement or restoration)
    • G06T 2207/20081: Training; Learning (G06T 2207/00: Indexing scheme for image analysis or image enhancement; G06T 2207/20: Special algorithmic details)
    • G06T 2207/20084: Artificial neural networks [ANN]


Abstract

The invention discloses an image restoration method based on global texture and structure, relating to the field of image processing, and comprising the following steps: inputting an image to be repaired and obtaining a texture reference set for it; filling each subsequent occluded block conditioned on the known region and the occluded blocks that have already been coarsely filled, and placing each newly coarse-filled block into the condition set so that it keeps helping subsequent filling, which specifically includes: selecting reference vectors from the texture reference set to repair the coarse-filled blocks, and calculating the attention scores between the texture reference set and the coarse-filled blocks; and reconstructing the coarse-filled blocks with a bridging attention module and the attention scores, obtaining a corpus after multi-layer construction, and selecting the most strongly associated candidate block from the corpus to obtain the final repair output. The repair output obtained by the method is more semantically coherent.

Description

Image restoration method based on global texture and structure
Technical Field
The invention relates to the field of image processing, in particular to an image restoration method based on global texture and structure.
Background
Image restoration is a technique for recovering the occluded region of an image, and supports applications such as image editing and restoration. Early diffusion-based and patch-based methods can only repair smaller occluded regions with simple pixel-level color information, failing to capture the high-level semantics of the region to be repaired. To address this problem, much attention has turned to deep models, in which convolutional neural network (CNN) based models follow an encoder-decoder architecture to learn high-level semantic information. However, the local inductive prior of CNNs means the filled content only receives information from a bounded known region within a local spatial extent around the masked region.
To solve this problem, attention-mechanism-based models have been proposed. Specifically, the occluded region, expressed in blocks, is first filled with coarse content that serves as queries over all known small blocks in the image, and candidate blocks with larger scores are then selected as replacements. Notably, PEN-Net provides a cross-layer attention module that calculates attention scores on the deep feature maps, performs block replacement on the shallow feature maps according to those scores, and finally obtains the repair output through upsampling. Although it considers all known tiles in the entire image, each known tile is weighed independently against the occluded region; this strategy can mislead the model into covering an occluded tile with only the one dominant known tile holding the largest attention score, resulting in undesirable repair output.
Similar to the attention-based approach, Transformer-based models also consider information from all known regions. Instead of attending over a patch pool, they operate at the pixel level: each pixel of the occluded region is used as a query to excite the pixels of the known region to be reconstructed, which are then projected into a color vocabulary to select the most relevant colors for repair. The repaired pixels are added to the known pixel pool and the process repeats in a predefined order until all pixels are repaired. Technically, BAT and ICT propose decoder Transformers that capture a structure prior at the pixel level through a dense attention module and project it into a visual color corpus to select the corresponding color. On the one hand, they explore all known areas rather than a limited known region, and are therefore superior to the attention models; on the other hand, the pixel level does not capture semantics as well as the patch level, and is therefore inferior to them. Furthermore, the attention scores are obtained using only location information, far from the texture semantic level. Finally, computing over a large number of pixels with a Transformer incurs a quadratic computational burden due to the self-attention module.
From a texture and structure standpoint, the above methods can be broadly divided into two categories: pure texture methods, and structure-texture based methods. Pure texture methods, such as the CNN-based and attention-based models, rely heavily on known texture information to recover the masked region, but ignoring structure may prevent reasonable textures from being recovered; worse, the texture information used for repair comes only from a bounded known area rather than the entire image, so the semantic correlation between textures across the global image cannot be captured well. In contrast, structure-texture based methods aim to generate better texture semantics for the occluded region under the guidance of structure constraints, with texture restoration then performed by an upsampling network. In summary, their core problem is how to fill the masked area with structure information.
EdgeConnect restores edge information as structure information through CNNs, based on the edge map and the black-and-white occlusion map. The repaired edge map is combined with the occluded real image containing texture information, and the occluded region is restored through a codec model. EII adopts a CNN model to reconstruct the occluded area of the black-and-white image as a structural constraint, on the basis of which color information is propagated through the image as a texture stream via multi-scale learning. MEDFE follows an encoder-decoder architecture, where the encoder aims to equalize structural features from the deep layers of the CNN and texture features from its shallow layers through a channel and spatial equalization process, which is then fed to the decoder to generate the completed image. Although structural information is intuitively captured, the information of all known blocks cannot be utilized; the result is therefore a "pseudo global structure" that, compared with a Transformer model, may mislead toward non-ideal texture recovery. CTSDG recently proposed that structure and texture can guide each other through a two-stream architecture based on U-Net variants. However, it may use local textures to guide the global structure, thereby creating blurring artifacts. On this basis, how to generate global texture and structure information so that the semantics of the whole image are well utilized, and how to match the two kinds of global information, is highly beneficial to image restoration.
Disclosure of Invention
In view of the above, the present invention provides an image restoration method based on global texture and structure to solve the problems existing in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an image restoration method based on global texture and structure comprises the following steps:
inputting an image to be repaired, and obtaining a texture reference set of the image to be repaired;
filling each subsequent occluded block conditioned on the known region and the occluded blocks that have already been coarsely filled, and placing each newly coarse-filled block into the condition set so that it keeps helping subsequent filling, which specifically comprises the following steps:
selecting reference vectors from the texture reference set to repair the coarse-filled blocks, and calculating the attention scores between the texture reference set and the coarse-filled blocks;
and reconstructing the coarse-filled blocks with the bridging attention module and the attention scores, obtaining a corpus after multi-layer construction, and selecting the most strongly associated candidate block from the corpus to obtain the final repair output.
Optionally, the bridging attention module is calculated as:

$$\delta_z = \mathcal{B}(\hat m_z^c, O_{t-1}, \hat R) = \mathrm{softmax}\!\Big(\frac{(\hat m_z^c W_c^Q)(O_{t-1} W_c^K)^{\top}}{\sqrt{d_c}}\Big)\,\mathrm{softmax}\!\Big(\frac{(O_{t-1} W_r^Q)(\hat R W_r^K)^{\top}}{\sqrt{d_r}}\Big)$$

where $\mathcal{B}(\cdot)$ represents the bridging attention module, $W_c^Q, W_c^K, W_r^Q, W_r^K$ are learnable linear mapping matrices, $d_c, d_r$ are dimensions, and $\hat R$ is the texture reference set; the coarse structure information $\hat m_z^c$ serves as a query against the known block set $O_{t-1}$, and each value in the known block set in turn serves as a query against $\hat R$, so that the coarse structure information can finally be reconstructed.
Optionally, the attention score is calculated as:

$$s_{z,i} = \frac{(\hat m_z^c W_i^Q)(\hat r_T^i W_i^K)^{\top}}{\sqrt{d_i}}$$

where $s_{z,i}$ is the directly calculated attention score between the coarse-filled block $\hat m_z^c$ and the texture reference $\hat r_T^i$; $W_i^Q, W_i^K$ are learnable linear mapping matrices and $d_i$ is a dimension.
Optionally, the candidate block association probability is calculated as:

$$p(\hat m_z) = \frac{\big\| \lambda\, s_z^M + (1-\lambda)\, \delta_z^M \big\|_1}{\sum_{z'=1}^{N_C} \big\| \lambda\, s_{z'}^M + (1-\lambda)\, \delta_{z'}^M \big\|_1}$$

where $O_{t-1}$ represents the known region and $\hat R$ is the texture reference set; $s_z^M$ is the directly calculated attention score between $\hat m_z^c$ and $\hat R$ at the M-th layer, $\delta_z^M$ is the attention score between $\hat m_z^c$ and $\hat R$ obtained with the bridging attention module at the M-th layer, and $\lambda$ is a mixing weight. $\|\cdot\|_1$ is obtained by adding all attention scores related to the texture references, thereby helping to reconstruct each $\hat m_z$ and obtain its corresponding probability; $N_C$ is the number of elements in the corpus. The most relevant candidate is selected as the result $m_t$ of round t by choosing the candidate with the maximum sum of attention scores computed with $\|\cdot\|_1$.
Optionally, the coarse-filled block is calculated as:

$$\hat m_k = \mathrm{softmax}\!\Big(\frac{(m_k W_m^Q)(P_{k-1} W_m^K)^{\top}}{\sqrt{d_m}}\Big)\, P_{k-1} W_m^V$$

where $d_m$ is a dimension and $W_m^Q, W_m^K, W_m^V$ are learnable linear mapping matrices; the block is reconstructed by the attention mechanism over the set $P_{k-1}$ formed by the non-occluded blocks and the remaining coarse-filled blocks, and the newly coarse-filled block $\hat m_k$ is finally added to further compose the set $P_k$.
Optionally, a Transformer-based encoder structure acquires the texture reference set of the image to be repaired, wherein the Transformer encoder structure comprises N layers, each layer having a multi-headed self-attention MSA and a feed-forward network FFN.
Optionally, for encoder layer $l$ of the Transformer there is:

$$\hat E_T^l = \mathrm{MSA}(\mathrm{LN}(E_T^l)) + E_T^l$$

$$E_T^{l+1} = \mathrm{FFN}(\mathrm{LN}(\hat E_T^l)) + \hat E_T^l$$

where $E_T^l$ denotes the input of layer $l$, $\hat E_T^l$ denotes the intermediate result of layer $l$, $E_T^{l+1}$ denotes the input of layer $l+1$, LN(·) denotes layer normalization, and FFN(·) consists of two fully connected layers, each layer in turn consisting of the two sub-layers; the procedure is that MSA(·) reconstructs each $r_T$, capturing global semantic association through the multi-headed self-attention module, after which the two fully connected layers convert the result into the input of layer $l+1$, through to the end of the last layer.
Optionally, the multi-head attention mechanism of layer $l$ is calculated as:

$$\mathrm{MSA}(E_T^l) = W_l\,[\mathrm{head}_1; \dots; \mathrm{head}_h], \qquad \mathrm{head}_j = \mathrm{softmax}\!\Big(\frac{(E_T^l W_j^Q)(E_T^l W_j^K)^{\top}}{\sqrt{d_l}}\Big)\, E_T^l W_j^V$$

where h is the number of heads, $d_l$ is a dimension, $W_j^Q, W_j^K, W_j^V$ ($1 \le j \le h$) are three learnable mapping matrices, and $W_l$ represents a learnable fully connected layer fusing the outputs from different heads; after the encoder layers, each texture feature vector $r_T$ is reconstructed into a reference vector $\hat r_T$, which are pooled into the texture reference set $\hat R$.
An overall loss function is also included and minimized to train the overall Transformer model:

$$\mathcal{L} = \lambda_r \mathcal{L}_{rec} + \lambda_p \mathcal{L}_{per} + \lambda_s \mathcal{L}_{sty}$$

where $\mathcal{L}_{rec}$ represents the reconstruction loss, $\mathcal{L}_{per}$ the perceptual loss, and $\mathcal{L}_{sty}$ the style loss, setting $\lambda_r = 10$, $\lambda_p = 0.1$ and $\lambda_s = 250$; finally $I'_{out}$ is upsampled to the final result $I_{out}$ by an upsampling operation.
Compared with the prior art, the invention discloses an image restoration method based on global texture and structure, which has the following beneficial effects:
1. A Transformer model is proposed comprising an encoder and a decoder, wherein the goal of the encoder module is to capture the semantic dependencies of the whole image in the texture references, thereby obtaining a global texture reference set; a coarse-filling attention module is designed, and all known image blocks are used to fill the masked area to obtain global structure information.
2. In order for the decoder to have the ability to combine the advantages of both the global texture reference and the structural information, a structure-texture matching attention module is configured on the decoder in an intuitive attention transfer fashion that dynamically creates an adaptive block vocabulary for blocks filled in occlusion regions through a probability diffusion process.
3. To reduce the computational burden, several training techniques are disclosed to overcome GPU memory overhead while achieving state-of-the-art performance on typical benchmarks.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is an overall block diagram of the present invention;
FIG. 2 is a schematic view of a rough filled occlusion region of the present invention;
FIG. 3 is a diagram of the overall architecture of the Transformer decoder according to the present invention;
FIG. 4 is a schematic diagram of a bridge module according to the present invention;
FIG. 5 is a diagram illustrating the bridged attention score delta update of the present invention;
FIG. 6 is a comparative graph of the results of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses an image restoration method based on global texture and structure, which adopts a Transformer model pairing an encoder with a decoder to capture, from the texture side, the global semantic correlations of all blocks in the whole image. The encoder encodes the correlations among the texture information of all blocks with full self-attention, where each small block is extracted by CNNs as one point on a feature map representing its semantics. In this way, the texture information of each point is expressed as a texture reference vector (hereinafter, texture reference) that serves as a query for reconstructing all other texture references. In other words, each texture reference encodes a different attention score for its semantic relevance to all other textures in the full image, thereby producing a global texture reference. The goal of the Transformer decoder is to render all occluded blocks with all texture references. To this end, this embodiment develops a coarse-filling attention module that initially fills all occluded blocks using all known blocks. Rather than relying on their inaccurate coarse texture information, this embodiment uses all known patches in the image for the global structure information they provide. Combined with the texture references carrying global semantic relevance, this embodiment proposes a new structure-texture matching attention module built over all known tiles, where the structure information of each occluded tile is used as a query against all known tiles, and each known tile is in turn used as a query against all texture references. In this way, the best of both worlds can be exploited in a bridging fashion, and an adaptive block vocabulary consisting of repaired tiles progressively covers all occluded tiles through a probability diffusion process. The overall model is shown in FIG. 1.
The method specifically comprises the following steps:
inputting an image to be repaired, and obtaining a texture reference set of the image to be repaired;
filling each subsequent occluded block conditioned on the known region and the occluded blocks that have already been coarsely filled, and placing each newly coarse-filled block into the condition set so that it keeps helping subsequent filling, which specifically comprises the following steps:
selecting reference vectors from the texture reference set to repair the coarse-filled blocks, and calculating the attention scores between the texture reference set and the coarse-filled blocks;
and reconstructing the coarse-filled blocks with the bridging attention module and the attention scores, obtaining a corpus after multi-layer construction, and selecting the most strongly associated candidate block from the corpus to obtain the final repair output.
The purpose of image restoration is: given an input image $I_{gt}$ and an occlusion mask $M$ of the same size (where $M$ takes values 0 and 1), element-wise multiplication yields the occluded picture $I_m = I_{gt} \odot M$, and the task is the process of repairing $I_m$ to obtain a complete picture $I_{out}$.
To capture the texture semantic relevance of the entire image, an explicit texture representation for each tile must be learned. Specifically, high-level semantic feature maps can be generated by a typical CNN, ResNet50, where each point of the feature map corresponds to a block of texture information of the original image $I_m$. Clearly, if the feature map is very large, e.g. 32×32, the shallow layers cannot capture high-level semantics; if the network is deep, the feature map shrinks to 8×8, meaning each point carries excessive semantics, so the texture information of one block becomes mixed with that of other blocks. An intermediate feature map size is therefore chosen for balance. Let the dimension of each feature point be C, i.e. the ResNet50 output dimension 2048; it is then mapped to a low-dimensional vector representation $r_T$ of dimension d = 256, computed by applying 256 1×1 convolutions over the 2048-channel feature maps while recovering the spatial form of the feature map. To preserve spatial order information, a corresponding position embedding is added for each point on the feature map, thereby forming the final encoder input $E_T$.
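For illustration only, the following is a minimal sketch of this texture-feature extraction, assuming a ResNet50 backbone whose final 2048-channel feature map is projected to 256 channels with a 1×1 convolution, flattened, and given learned position embeddings; the class name, grid size and weight tag are assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn
import torchvision

class TextureTokenizer(nn.Module):
    """Hypothetical sketch: CNN feature map -> position-embedded texture tokens E_T."""
    def __init__(self, d_model=256, grid=16):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # keep everything up to (and including) the final 2048-channel feature map
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)   # 2048 -> 256 via 1x1 conv
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, d_model))  # learned positions

    def forward(self, img):                        # img: (B, 3, H, W)
        fmap = self.proj(self.cnn(img))            # (B, 256, h, w)
        tokens = fmap.flatten(2).transpose(1, 2)   # (B, h*w, 256): one token per point
        return tokens + self.pos[:, : tokens.size(1)]  # E_T, the encoder input
```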
$E_T$ can now be used to compute texture correlation throughout the picture via self-attention. The Transformer-based encoder structure includes N layers, each layer having a multi-headed self-attention (MSA) and a feed-forward network (FFN). For layer $l$ there is:

$$\hat E_T^l = \mathrm{MSA}(\mathrm{LN}(E_T^l)) + E_T^l \quad (1)$$

$$E_T^{l+1} = \mathrm{FFN}(\mathrm{LN}(\hat E_T^l)) + \hat E_T^l \quad (2)$$

where $E_T^l$ denotes the input of layer $l$, $\hat E_T^l$ the intermediate result of layer $l$, and $E_T^{l+1}$ the input of layer $l+1$; LN(·) denotes layer normalization and FFN(·) consists of two fully connected (FC) layers, each layer in turn consisting of the two sub-layers. The main process is MSA(·) reconstructing each $r_T$, capturing global semantic association through the multi-headed self-attention module, after which the two fully connected layers convert the result into the input of layer $l+1$, through to the end of the last layer. Residual connections are used around each sub-layer. The multi-head attention mechanism of layer $l$ is calculated as:

$$\mathrm{MSA}(E_T^l) = W_l\,[\mathrm{head}_1; \dots; \mathrm{head}_h], \qquad \mathrm{head}_j = \mathrm{softmax}\!\Big(\frac{(E_T^l W_j^Q)(E_T^l W_j^K)^{\top}}{\sqrt{d_l}}\Big)\, E_T^l W_j^V \quad (3)$$

where h is the number of heads, $d_l$ is a dimension, $W_j^Q, W_j^K, W_j^V$ ($1 \le j \le h$) are three learnable mapping matrices, and $W_l$ represents a learnable fully connected layer fusing the outputs from different heads. After the encoder layers, each texture feature vector $r_T$ can be reconstructed into a reference vector $\hat r_T^i$, so they can be assembled into the reference set $\hat R$. It is easy to see that each $\hat r_T^i$ encodes dependencies on all other global textures, where the texture dependencies differ at different locations.
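A minimal pre-LN encoder layer consistent with Eqs. 1-3 might look as follows, using PyTorch's built-in multi-head attention as a stand-in for MSA(·); the layer sizes are assumptions.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one encoder layer: E^ = MSA(LN(E)) + E, then E' = FFN(LN(E^)) + E^."""
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                 # FFN(.): two fully connected layers
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, e):                          # e: (B, N_tokens, d_model)
        x = self.ln1(e)
        e = e + self.msa(x, x, x, need_weights=False)[0]  # residual around MSA
        return e + self.ffn(self.ln2(e))                  # residual around FFN
```

Stacking N such layers and collecting the outputs yields the texture reference set $\hat R$.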
Beyond obtaining the texture reference set $\hat R$, how to repair occluded tiles using these features must also be expressed. Unlike existing pixel-level decoding Transformers, the block size needs to be considered so as to better match semantics against $\hat R$. $I_m$ is downsampled to obtain a low-resolution image $I'_m$, which enhances the corresponding global structure information and yields a suitable block size; $I'_m$ is expanded into a 2D block sequence, each block of size P×P, with $N_0$ the number of blocks. The blocks are then flattened and mapped by a learnable linear mapping matrix to $d_D$ dimensions, taken as the block dimension. Whether a small block is known or not, an additional spatial position embedding is added to the expanded block to preserve spatial order, as sketched below.
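The block expansion could be sketched as follows, assuming bilinear downsampling, a patch size P = 8, and a block dimension $d_D$ = 384 (the 384-dimensional block space mentioned later); both values are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbed(nn.Module):
    """Sketch: downsample I_m, split into P x P blocks, map to d_D dims, add positions."""
    def __init__(self, patch=8, d_block=384, n_patches=1024, in_ch=3):
        super().__init__()
        self.patch = patch
        self.lin = nn.Linear(in_ch * patch * patch, d_block)         # learnable linear map
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_block))  # spatial positions

    def forward(self, img_m, size=(256, 256)):
        x = F.interpolate(img_m, size=size, mode="bilinear")       # I'_m, low-resolution
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)                      # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)   # (B, N0, C*p*p)
        return self.lin(x) + self.pos[:, : x.size(1)]              # 2D block sequence
```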
Before discussing the association of $\hat R$ with the occluded region, coarse information based on the known region must be obtained to fill the occluded region. Unlike previous works that rely solely on the local induction prior of CNNs to fill coarse content from nearby known patches, a global filling attention mechanism is proposed to fill coarse content with all known blocks in the image. For ease of understanding, FIG. 2 illustrates coarse-filling the k-th occluded block $m_k$, with the non-occluded blocks and the first k-1 blocks already coarse-filled. Specifically, to fill the coarse content, all occluded blocks are first sorted in ascending order of occlusion ratio; they are reconstructed one by one through the attention mechanism using the set $P_{k-1}$ formed by the non-occluded blocks and the previously coarse-filled blocks $\hat m_1, \dots, \hat m_{k-1}$, and the newly coarse-filled block $\hat m_k$ is finally added to further compose $P_k$. That is, the coarse-filled block $\hat m_k$ reconstructs $m_k$ over $P_{k-1}$ using the attention mechanism:

$$\hat m_k = \mathrm{softmax}\!\Big(\frac{(m_k W_m^Q)(P_{k-1} W_m^K)^{\top}}{\sqrt{d_m}}\Big)\, P_{k-1} W_m^V \quad (4)$$

where $d_m$ is a dimension and $W_m^Q, W_m^K, W_m^V$ are three learnable linear mapping matrices (a sketch follows). We now discuss how to select suitable reference vectors $\hat r_T$ from $\hat R$ to repair each coarse-filled block. Note that $\hat R$ is formed using the non-occluded areas of the entire image: it captures global texture information well and is more accurate than the coarse content. However, the structure information of the blocks repaired from all non-occluded blocks is superior, and the downsampling operation enhances it further. Stimulated by this, besides directly using $\hat R$ to reconstruct the coarse-filled blocks, it is also proposed to use the non-occluded blocks, which contain both ideal texture and structure, as a bridging module to better match $\hat R$ with the coarse structure information. Details are shown in FIG. 4.
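A hedged sketch of the global coarse-filling attention of Eq. 4: occluded blocks are filled in ascending order of occlusion ratio, each reconstructed by attending over the growing pool $P_{k-1}$ and then appended to it; all names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def coarse_fill(blocks, occ_ratio, Wq, Wk, Wv):
    """blocks: (N0, d) block embeddings; occ_ratio: (N0,) per-block occlusion ratio."""
    known = [i for i in range(len(blocks)) if float(occ_ratio[i]) == 0.0]
    occluded = sorted((i for i in range(len(blocks)) if float(occ_ratio[i]) > 0.0),
                      key=lambda i: float(occ_ratio[i]))   # ascending occlusion ratio
    pool = blocks[known]                                   # P_0: the non-occluded blocks
    filled = blocks.clone()
    d_m = Wk.size(1)
    for k in occluded:
        q = blocks[k] @ Wq                                 # query from occluded block m_k
        att = F.softmax(q @ (pool @ Wk).T / d_m ** 0.5, dim=-1)
        filled[k] = att @ (pool @ Wv)                      # coarse-filled block (Eq. 4)
        pool = torch.cat([pool, filled[k: k + 1]], dim=0)  # P_k = P_{k-1} plus new block
    return filled
```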
For the M-layer decoder, each layer contains two sub-layers: a structure-texture matching attention (STMA) module, and an FFN containing two fully connected layers which, as in the encoder, converts the result of the attention mechanism into the input of the (l+1)-th layer, through to the end of the M-th layer; residual connections are likewise used around each sub-layer. For the coarse-filled blocks $m^{c,l}$ at layer $l$, the computation is:

$$\hat m^{c,l} = \mathrm{STMA}(\mathrm{LN}(m^{c,l})) + m^{c,l}, \qquad m^{c,l+1} = \mathrm{FFN}(\mathrm{LN}(\hat m^{c,l})) + \hat m^{c,l}$$

where STMA(·) represents the structure-texture matching attention, which obtains the attention scores between each coarse-filled block and the texture reference set $\hat R$ through the known block set $O_{t-1}$ that includes the repaired blocks. The direct attention score of STMA(·) at layer $l$ is calculated as follows:

$$s_{z,i} = \frac{(\hat m_z^c W_i^Q)(\hat r_T^i W_i^K)^{\top}}{\sqrt{d_i}} \quad (5)$$
where $s_{z,i}$ directly calculates the attention score between the coarse-filled block $\hat m_z^c$ and the texture reference $\hat r_T^i$; $W_i^Q, W_i^K$ are learnable linear mapping matrices and $d_i$ is a dimension. As mentioned earlier, directly matching the coarse structure information against $\hat R$ gives poor results, so the bridging attention module based on the non-occluded blocks is proposed, thereby indirectly using $\hat R$ to reconstruct $\hat m_z^c$. At layer $l$ it is calculated as follows:

$$\delta_z = \mathcal{B}(\hat m_z^c, O_{t-1}, \hat R) = \mathrm{softmax}\!\Big(\frac{(\hat m_z^c W_c^Q)(O_{t-1} W_c^K)^{\top}}{\sqrt{d_c}}\Big)\, \mathrm{softmax}\!\Big(\frac{(O_{t-1} W_r^Q)(\hat R W_r^K)^{\top}}{\sqrt{d_r}}\Big) \quad (6)$$
where $\mathcal{B}(\cdot)$ represents the bridging attention module, $W_c^Q, W_c^K, W_r^Q, W_r^K$ are learnable linear mapping matrices, and $d_c, d_r$ are dimensions. Eq. 6 implies a distraction operation: $\hat m_z^c$ is taken as a query against the known block set $O_{t-1}$, and each value in $O_{t-1}$ is in turn taken as a query against $\hat R$, so that $\hat m_z^c$ is finally reconstructed from the texture references. The known blocks themselves are not reconstructed from $\hat R$, since a known block is obviously an ideal real value and needs no reconstruction. Combining Eqs. 5 and 6:

$$\hat m_z = \Big(\lambda\,\mathrm{softmax}(s_z) + (1-\lambda)\,\delta_z\Big)\, \hat R\, W^V \quad (7)$$
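The bridging attention of Eq. 6 can be sketched as two chained softmax attentions, one from the coarse block to the known blocks and one from the known blocks to the texture references; every matrix and shape here is an assumption.

```python
import torch
import torch.nn.functional as F

def bridged_scores(m_c, O, R, Wq_c, Wk_c, Wq_r, Wk_r):
    """m_c: (d,) coarse block; O: (n_o, d) known blocks; R: (n_r, d) texture references."""
    d_c, d_r = Wk_c.size(1), Wk_r.size(1)
    a1 = F.softmax((m_c @ Wq_c) @ (O @ Wk_c).T / d_c ** 0.5, dim=-1)  # block -> known
    a2 = F.softmax((O @ Wq_r) @ (R @ Wk_r).T / d_r ** 0.5, dim=-1)    # known -> references
    return a1 @ a2   # delta: bridged attention over the texture references, shape (n_r,)
```

Mixing these bridged scores with the direct scores of Eq. 5 (weight lambda) and applying a value projection of $\hat R$ gives the reconstruction of Eq. 7.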
where $W^V$ and the matrices above are learnable linear mapping matrices. A block-level decoder corpus is thus needed to pick each repaired block. Specifically, every coarse-filled patch $\hat m_z^c$ is reconstructed by Eq. 7 and changed into $\hat m_z$, and these converge to form the corpus $m_C$; the candidate block with the strongest association is selected from the corpus, i.e., the final repair output with the highest selection probability, computed by Eq. 8.
With the 256-dimensional vectors output by the last (M-th) layer, the entries $s_z^M$ and $\delta_z^M$ represent the attention scores between the z-th coarse-filled block $\hat m_z^c$ and the i-th texture reference $\hat r_T^i$, calculated by Eqs. 5 and 6 respectively. $\|\cdot\|_1$ is obtained by adding all attention score entries over the texture references, thereby helping to reconstruct all $\hat m_z$ and obtain $m_C$ with its corresponding probabilities; $N_C$ is the number of elements in $m_C$. The most relevant candidate is selected as the result $m_t$ of round t:

$$p(\hat m_z) = \frac{\big\| \lambda\, s_z^M + (1-\lambda)\, \delta_z^M \big\|_1}{\sum_{z'=1}^{N_C} \big\| \lambda\, s_{z'}^M + (1-\lambda)\, \delta_{z'}^M \big\|_1}, \qquad m_t = \arg\max_{\hat m_z \in m_C} p(\hat m_z) \quad (8)$$

That is, for the different candidates $\hat m_z$, the sum of their attention scores is computed, and Eq. 8 applies the probability diffusion process to expand the known block set $O_{t-1}$ to $O_t$, which further helps pick the candidate repair result of round t+1, ending when all regions are repaired. The decoder block corpus $m_C$ is adaptively constructed based on the repair results of the coarse-filled blocks and is dynamically updated. The overall architecture of the Transformer decoder is shown in FIG. 3.
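Round-t selection per Eq. 8 then reduces to mixing the direct and bridged score vectors and taking the candidate with the largest summed score; this sketch assumes the scores are non-negative (post-softmax), so the sum equals the L1 norm, and all names are illustrative.

```python
import torch

def select_candidate(direct, bridged, lam=0.5):
    """direct, bridged: (N_C, n_r) attention scores of each candidate over the references."""
    score = (lam * direct + (1.0 - lam) * bridged).sum(dim=-1)  # ||.||_1 over references
    return int(torch.argmax(score))   # index of the round-t repair result m_t
```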
Computational efficiency: one may be concerned about the computational complexity caused by the attention module at each iteration. As shown in FIG. 5, the attention score maps between the coarse-filled block set, the known block set and the texture reference set are saved after computation, which proves effective. When a coarse-filled block is turned into a repaired block, the attention score maps between the different sets need not be recomputed: only the row of the score map corresponding to the repaired block is removed, i.e., the coarse-filled block shown in FIG. 5 (a) and (b), and the attention scores between the new repaired block and the texture reference set are appended in FIG. 5 (c). In addition, after most candidate blocks in the corpus $m_C$ have been repaired, the probability diffusion process of Eq. 8 need not be repeated; in particular, blocks containing only a small coarse-filled portion can simply be averaged from the content of the surrounding area, reducing the complexity of the attention mechanism.
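A minimal sketch of that score-map bookkeeping, assuming the maps of FIG. 5 are stored as dense tensors: the repaired block's row is dropped from the coarse-block maps (a), (b), and one new row (repaired block vs. texture references) is appended to map (c).

```python
import torch

def update_score_maps(S_cr, S_ck, S_rt, new_row, z):
    """S_cr/S_ck: coarse-vs-references / coarse-vs-known maps; S_rt: repaired-vs-references."""
    keep = [i for i in range(S_cr.size(0)) if i != z]
    S_cr, S_ck = S_cr[keep], S_ck[keep]             # remove the repaired block's row
    S_rt = torch.cat([S_rt, new_row[None]], dim=0)  # append its new attention row
    return S_cr, S_ck, S_rt
```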
After all the occluded blocks are repaired, a set of reconstructed vectors is obtained, each vector lying in a 384-dimensional feature space; this set must be further restored into an RGB image $I'_{out}$. Following previous work, several typical loss functions are selected to measure the reconstruction error between the repaired picture $I'_{out}$ and the downsampled real picture $I'_{gt}$: the reconstruction loss $\mathcal{L}_{rec}$, the perceptual loss $\mathcal{L}_{per}$, the style loss $\mathcal{L}_{sty}$, and the adversarial loss $\mathcal{L}_{adv}$. Afterwards, an adversarial neural network trained with $\mathcal{L}_{adv}$ is used to upsample $I'_{out}$ to $I_{out}$; thus $\mathcal{L}_{adv}$ does not appear in the training process of the Transformer. The satisfactory $I'_{out}$ is obtained by training as follows:
reconstruction loss: by using l 1 Loss-measured downsampled true value I' gt And model repair results I' out Differences between pixels:
perceptual loss: to simulate human perception of image quality, a perception loss is calculated by defining a distance metric between an activation profile of a pre-trained network to a repair output and a true value, with:
wherein phi is i Is obtained from the ith layer of VGG and has a size N i =C i *H i *W i Is phi of the characteristic diagram of (1) i Representing the results of ReLu1_1, reLu2_1, reLu3_1, reLu4_1 and ReLu5_1.
Style loss: the activation maps of Eq. 10 are further used to compute a style loss, which measures the differences between the covariances of the activation maps and mitigates "checkerboard" artifacts. Given the VGG j-th layer activation map, the style loss is:

$$\mathcal{L}_{sty} = \sum_j \big\| G_j^{\phi}(I'_{gt}) - G_j^{\phi}(I'_{out}) \big\|_1 \quad (11)$$

where $G_j^{\phi}$ is the Gram matrix of the selected activation map.
Overall loss: based on the above, the overall loss function shown in Eq. 12 is finally obtained and minimized to train the overall Transformer model:

$$\mathcal{L} = \lambda_r \mathcal{L}_{rec} + \lambda_p \mathcal{L}_{per} + \lambda_s \mathcal{L}_{sty} \quad (12)$$

In this embodiment we set $\lambda_r = 10$, $\lambda_p = 0.1$ and $\lambda_s = 250$; finally, $I'_{out}$ is upsampled to the final result $I_{out}$ by the upsampling operation.
The method proposed in this embodiment is implemented in Python and PyTorch. Training uses the AdamW optimizer; the learning rates of the Transformer and the feature extractor are set to 10^-4 and 10^-5 respectively, with weight decay 10^-4. All Transformer weights are initialized with Xavier initialization; ResNet50 is pre-trained on ImageNet as provided in torchvision, with the batch normalization layers fixed. We also increase feature resolution by dilating the convolutions of the final stage and removing the stride from the first convolution of that stage. Both the Transformer encoder and decoder comprise four layers. The network is trained on 256×256 images containing irregular occlusions. We run experiments on three common datasets with different characteristics: Paris StreetView (PSV), CelebA-HQ and Places2. We train the Transformers on 2 NVIDIA 2080TI GPUs with batch size 32 for PSV, and on 4 NVIDIA 2080TI GPUs with batch size 64 for CelebA-HQ and Places2.
Our proposed method and recent techniques are evaluated quantitatively according to four metrics: 1) $\ell_1$ error; 2) peak signal-to-noise ratio (PSNR); 3) structural similarity index (SSIM); and 4) FID. $\ell_1$, PSNR and SSIM compare pixel-level, low-level differences between the generated image and the ground truth; FID evaluates perceptual quality by measuring the feature distribution distance between the generated and real images. Irregular masked areas are verified at different scales relative to the image size.
Quantitative comparison: we compare our method with the latest work: 1) pure-texture CNN methods: GC and PIC; 2) attention-based methods: HiFill; 3) structure-texture based methods: MEDFE, EC, CTSDG, EII; and decoder-Transformer based methods: ICT, BAT. As Table 1 shows, our method achieves smaller $\ell_1$ error and FID scores, and larger PSNR and SSIM, than previous methods. In particular, the small FID score verifies the advantage of the global texture reference and structural feature representation. Since GC and PIC are texture-only filling methods, they fill occluded areas only from a bounded known region, while HiFill computes the similarity between each coarsely filled block and all known blocks independently, misleading the occluded block to be covered by only one explicit known region. Although MEDFE intuitively captures structural information, it fails to utilize the information of all known patches; similar limitations apply to EC, CTSDG and EII. BAT and ICT recover masked regions at the pixel level and do not capture global texture semantics well. Our method is therefore superior to the other methods.
Qualitative comparison: to further clarify the observations, fig. 6 shows the visualization of all methods on three data sets. It can be seen that our approach yields a more semantically consistent repair output based on surrounding known regions.
User study: we further conducted a user study on the datasets PSV, CelebA-HQ and Places2. Specifically, we randomly extracted 20 test images from each dataset and invited 10 volunteers in total to select the most realistic image among the repair results generated by the proposed method and some of the most recent methods. As shown in the last column of Table 1, our method's results far exceed the state of the art.
This embodiment introduces a global treatment of texture and structure information in image restoration. Technically, a Transformer model combining an encoder and a decoder is proposed: the encoder aims to acquire the global texture semantic relevance of the whole image, while the decoder module restores the masked area, its key feature being that global texture and structure information are well matched by a structure-texture matching attention module. An adaptive block vocabulary is built that progressively covers all coarsely filled blocks through a probability diffusion process. Experimental results on benchmarks verify the advantages of our model over the most advanced work.
TABLE 1
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. An image restoration method based on global texture and structure is characterized by comprising the following steps:
inputting an image to be repaired, and obtaining a texture reference set of the image to be repaired;
filling each subsequent occluded block conditioned on the known region and the occluded blocks that have already been coarsely filled, and placing each newly coarse-filled block into the condition set so that it keeps helping subsequent filling, which specifically comprises the following steps:
selecting reference vectors from the texture reference set to repair the coarse-filled blocks, and calculating the attention scores between the texture reference set and the coarse-filled blocks;
reconstructing the coarse-filled blocks with the bridging attention module and the attention scores, obtaining a corpus after multi-layer construction, and selecting the most strongly associated candidate block from the corpus to obtain the final repair output;
the calculation formula of the bridging attention module is as follows:
wherein the method comprises the steps ofRepresenting a bridging attention module,>is a linear mapping matrix, d c ,d r Is the dimension of the film,is a texture reference set; information->As query to the set of known blocks +.>Performing attention calculations, knowing the block set +.>Is used as a query to and +.>Attention calculation is performed so as to reconstruct the coarse structure information +.>
the attention score is calculated as:

$$s_{z,i} = \frac{(\hat m_z^c W_i^Q)(\hat r_T^i W_i^K)^{\top}}{\sqrt{d_i}}$$

where $s_{z,i}$ is the directly calculated attention score between the coarse-filled block $\hat m_z^c$ and the texture reference $\hat r_T^i$; $W_i^Q, W_i^K$ are learnable linear mapping matrices and $d_i$ is a dimension;
the candidate block association probability is calculated as follows:
wherein O is t-1 A set of known blocks is represented and,is a texture reference set; />Direct calculation for the Mth layer +.>And->Attention score between->Use of bridging attention module for the Mth layer->And->The attention score between the two points of interest, the lambda is the weight of the sample, I.I 1 Is obtained by adding all attention scores related to the texture reference, thereby helping to reconstruct +.>Obtain corresponding->N C Is the number of elements in the series, and selects the candidate +.>As a result of round t->By selecting at +.>Upper utilization I.I 1 The sum of the calculated maximum attention scores;
the calculation formula of the rough filling block is as follows:
wherein d m Is the dimension of the film,and->Is a learnable linear mapping matrix; by means of the attention mechanism using the non-occluded block and the remaining coarsely filled blocks +.>The formed set P k-1 Finally adding the rough filled block ++>To further compose a set P k
acquiring the texture reference set of the image to be repaired with a Transformer-based encoder structure, wherein the Transformer-based encoder structure comprises N layers, each layer having a multi-headed self-attention MSA and a feed-forward network FFN;
for the encoder layer I of the transducer there is:
wherein the method comprises the steps ofInput representing layer I, < >>Representing the intermediate result of layer i +.>Representing the input of layer l+1, LN (-) represents layer normalization, FFN (-) consists of two fully connected layers, each layer in turn consisting of two sublayers; the procedure reconstructs each r for MSA () T The global semantic association is captured by a multi-headed self-attention module, which then converts the two fully connected layers into an input of the l+1 layer directing the end of the last layer.
2. The global texture and structure based image restoration method according to claim 1, wherein the multi-head attention mechanism of layer $l$ is calculated as:

$$\mathrm{MSA}(E_T^l) = W_l\,[\mathrm{head}_1; \dots; \mathrm{head}_h], \qquad \mathrm{head}_j = \mathrm{softmax}\!\Big(\frac{(E_T^l W_j^Q)(E_T^l W_j^K)^{\top}}{\sqrt{d_l}}\Big)\, E_T^l W_j^V$$

where h is the number of heads, $d_l$ is a dimension, $W_j^Q, W_j^K, W_j^V$ ($1 \le j \le h$) are three mapping matrices, and $W_l$ represents a connection layer fusing the outputs from different heads; after the encoder layers, each texture feature vector $r_T$ is reconstructed into a reference vector $\hat r_T$, which are pooled into the texture reference set $\hat R$.
3. The global texture and structure based image restoration method according to claim 1, further comprising an overall loss function, minimized to train the overall Transformer model:

$$\mathcal{L} = \lambda_r \mathcal{L}_{rec} + \lambda_p \mathcal{L}_{per} + \lambda_s \mathcal{L}_{sty}$$

where $\mathcal{L}_{rec}$ represents the reconstruction loss, $\mathcal{L}_{per}$ the perceptual loss, and $\mathcal{L}_{sty}$ the style loss, setting $\lambda_r = 10$, $\lambda_p = 0.1$ and $\lambda_s = 250$; finally $I'_{out}$ is upsampled to the final result $I_{out}$ by an upsampling operation.
CN202210535815.4A 2022-05-17 2022-05-17 Image restoration method based on global texture and structure Active CN115035170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210535815.4A CN115035170B (en) 2022-05-17 2022-05-17 Image restoration method based on global texture and structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210535815.4A CN115035170B (en) 2022-05-17 2022-05-17 Image restoration method based on global texture and structure

Publications (2)

Publication Number Publication Date
CN115035170A CN115035170A (en) 2022-09-09
CN115035170B true CN115035170B (en) 2024-03-05

Family

ID=83121173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210535815.4A Active CN115035170B (en) 2022-05-17 2022-05-17 Image restoration method based on global texture and structure

Country Status (1)

Country Link
CN (1) CN115035170B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908205B (en) * 2023-02-21 2023-05-30 成都信息工程大学 Image restoration method, device, electronic equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11657230B2 (en) * 2020-06-12 2023-05-23 Adobe Inc. Referring image segmentation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6830707B1 (en) * 2020-01-23 2021-02-17 同済大学 (Tongji University) Person re-identification method that combines random batch mask and multi-scale expression learning
CN113469906A (en) * 2021-06-24 2021-10-01 湖南大学 Cross-layer global and local perception network method for image restoration

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shao Hang; Wang Yongxiong. Generative high-resolution image inpainting based on parallel adversarial and multi-condition fusion. Pattern Recognition and Artificial Intelligence, 2020, (No. 04), full text. *

Also Published As

Publication number Publication date
CN115035170A (en) 2022-09-09


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant