CN115035170A

CN115035170A - Image restoration method based on global texture and structure

Info

Publication number: CN115035170A
Application number: CN202210535815.4A
Authority: CN
Inventors: 王杨; 刘海鹏; 汪萌
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2022-05-17
Filing date: 2022-05-17
Publication date: 2022-09-09
Anticipated expiration: 2042-05-17
Also published as: CN115035170B

Abstract

The invention discloses an image restoration method based on global texture and structure, which relates to the field of image processing and comprises the following steps: inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired; the method comprises the following steps of filling subsequent shielding blocks by using a known area and the shielding blocks which are already filled roughly as conditions, putting the new shielding blocks into the conditions after the new shielding blocks are filled roughly, and continuing to help the subsequent filling, wherein the method specifically comprises the following steps: selecting a reference vector from the texture reference set, repairing the rough filling block, and calculating the attention score between the texture reference set and the rough filling block; and reconstructing the rough filling blocks by using the bridging attention module and the attention fraction, obtaining a corpus set after multilayer construction, and selecting the candidate block with the strongest correlation from the corpus set to obtain the final restoration output. The repair output obtained by the method is semantically more coherent.

Description

Image restoration method based on global texture and structure

Technical Field

The invention relates to the field of image processing, in particular to an image restoration method based on global texture and structure.

Background

The image restoration is a technology for restoring the shielded area of the image, and supports various applications such as image editing and restoration. Pioneering diffusion-based and tile-based approaches can only repair smaller sized masked regions with simple pixel-level color information, and cannot capture the high-level semantics of the repaired region. To address this problem, a great deal of attention has been directed to deep models, where models based on Convolutional Neural Networks (CNNs) learn high-level semantic information following the encoder-decoder architecture. However, the partial induction priors of CNNs only receive fill information of a bounded known region within the local spatial extent of the masking region.

To solve this problem, attention-based models have been proposed. In particular, the occlusion region expressed in units of blocks is first filled with coarse content as a query for all known small blocks in the image, and then a candidate block with a higher score is selected for replacement. It is noted that PENNet proposes a cross-layer attention module, which calculates attention scores of deep feature maps, performs block replacement on the bottom-layer feature maps according to the attention scores, and finally obtains the output of the repair result through upsampling. Although it considers all known patches in the whole image, each known patch is considered independently on the occlusion region, which strategy can mislead occlusion patches to be embedded by only one known dominant patch with the largest attention score, resulting in an unsatisfactory repair output.

Similar to the attention-based approach, the transform-based model also considers information from all known regions. Rather than focusing on the patch pool, it is based on the pixel level, where each pixel of the occlusion region de-energizes a pixel of a known region as a query to be reconstructed, then projected further into a color vocabulary library to select the most relevant color for repair. The repaired pixels are then added to the pool of known pixels and the process repeats until all pixels are repaired in the predefined order. Technically, BAT and ICT propose a decoder converter that captures pixel-level structure priors through a dense attention mechanism module and projects them into a visual color corpus to select the corresponding color. On the one hand, it explores all known regions, rather than determining only limited known regions, and is therefore superior to the attention model; on the other hand, the pixel level does not capture semantics as well as the patch level, and is therefore inferior to the attention model. Furthermore, the attention score is obtained using only the location information, far from the texture semantic level. Furthermore, the transform model computes a large number of pixels, which may lead to computational burden due to the secondary complexity of the self-attention module.

Still further from a texture and structure perspective, the above methods can be essentially divided into two categories: one is a pure texture approach and the other is a structure-texture based approach. Since pure texture methods, such as CNNs-based and attention-based models, rely heavily on known texture information to recover the masked regions, ignoring the structure may result in reasonable textures that cannot be recovered; worse yet, the texture information used for the repair comes only from the bounded known area, not the entire image, and thus does not capture the semantic correlation between textures in the global image well. In contrast, the structure-texture based approach aims at generating better texture semantics for occlusion regions guided by structural constraints. Then, texture recovery is performed through a different upsampling network. In summary, their core problem is how to fill the mask area with structure information.

EdgeConnect restores the edge information to the structural information based on the edge map and the occlusion black and white map through CNNs. And combining the repaired edge image with the shielded real image containing the texture information, and recovering the shielded area through a codec model. The EII adopts a CNNs model to reconstruct the shielding area of a black-and-white image as structural constraint, and on the basis, color information is used as a texture flow to be transmitted in the image through multi-scale learning. The MEDFE follows an encoder-decoder architecture, where the goal of the encoder is to equalize structural features from deep layers of CNNs with texture features from shallow layers of CNNs, through a channel and spatial equalization process, and then as input feedback to the decoder to generate a complement image. Although structural information can be captured intuitively, the information of all known blocks is not utilized, and therefore is called a "pseudo-global structure", which may mislead to non-ideal texture recovery compared to the transform model. CTSDG recently proposed that a structure and texture can be guided to each other by a two-stream structure based on U-Net variants. However, it may use local textures to guide global structures, thereby creating blurring artifacts. On the basis, how to generate the global texture and structure information can well utilize the semantics of the whole image, and how to match the two types of global information is very beneficial to image restoration.

Disclosure of Invention

In view of this, the present invention provides an image inpainting method based on global texture and structure, so as to solve the problems existing in the background art.

In order to achieve the purpose, the invention adopts the following technical scheme:

an image restoration method based on global texture and structure comprises the following steps:

inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired;

filling subsequent shielding blocks by using the known area and the shielding blocks which are already coarsely filled as conditions, putting the new shielding blocks into the conditions after the coarse filling of the new shielding blocks, and continuing to help the subsequent filling, specifically comprising the following steps:

selecting a reference vector from the texture reference set, repairing the rough filling block, and calculating the attention score between the texture reference set and the rough filling block;

and reconstructing the rough filling blocks by using the bridging attention module and the attention fraction, obtaining a corpus set after multilayer construction, and selecting the candidate block with the strongest correlation from the corpus set to obtain the final restoration output.

Optionally, the calculation formula of the bridge attention module is as follows:

wherein

Representing a bridging attention modeThe number of the blocks is such that,

is a learnable linear mapping matrix, d _c ，d _r Is the dimension of the optical fiber, and the dimension,

is a texture reference set; utilizing coarse structural information

As a query to contact a set of known blocks

Performing an attention calculation, knowing the set of blocks

Each value in (a) is as a query to and

attention calculations are performed so that the coarse structure information can be reconstructed finally

Optionally, the attention score calculation formula is as follows:

wherein the content of the first and second substances,

is a direct calculation

And

attention score of;

is a learnable linear mapping matrix, d _i Is a dimension.

Optionally, the candidate block association probability is calculated as follows:

wherein, O _t-1 A known area is represented by a known area,

is a texture reference set;

direct calculation for Mth layer

And with

The point of attention in between is,

utilizing a bridging attention module for an Mth layer

And

inter attention score, λ represents weight, | · | | non-calculation ₁ Is obtained by adding all attention scores related to the texture reference to assist in the reconstruction

To obtain corresponding

N _C Is the number of the middle elements. Picking the most relevant candidate

As a result of the t-th round

By selecting at

Utilize | | · | live in ₁ The sum of the calculated maximum attention scores.

Optionally, the coarse padding block calculation formula is as follows:

wherein, d _m Is the dimension of the optical fiber, and the dimension,

and

is a learnable linear mapping matrix; utilizing unoccluded tiles and remaining coarsely filled tiles by attention mechanism

Formed set P _k-1 Finally, add the coarsely filled blocks

To further form a set P _k 。

Optionally, the texture reference set of the image to be repaired is obtained based on a Transformer encoder structure, where the Transformer encoder structure includes N layers, and each layer has a multi-head self-attention MSA and a feed-forward network FFN.

Optionally, for the first layer of the transform encoder, there are:

wherein

An input of the l-th layer is represented,

an intermediate result of the l-th layer is shown,

represents the input representing layer l +1, LN (-) represents the layer normalization, FFN (-) consists of two fully connected layers, each layer in turn consisting of two sublayers; the process reconstructs each r for MSA (·) _T The global semantic association is captured by a multi-headed self-attention module, which two fully connected layers then convert to an input of l +1 layer directing the final layer to end.

Optionally, the formula for calculating the multi-head attention mechanism of the ith layer is as follows:

where h is the number of multiple heads, d _l Is the dimension of the optical fiber, and the dimension,

is 3 learnable mapping matrixes, j is more than or equal to 1 and less than or equal to h, W ^l Representing a learnable fully connected layer, fusing outputs from different heads; after passing through the encoder layer, each texture feature vector r is reconstructed _T As a reference vector

Thus assembled as a texture reference set

An overall loss function is also included and minimized to train the overall Transformer model:

wherein, the first and the second end of the pipe are connected with each other,

which is indicative of a loss of the reconstruction,

the loss of perception is indicated by the presence of,

expressing style loss, set λ _r ＝10；λ _p 0.1 and λ _s 250 and finally by an upsample operation I' _out Up-sampling to final result I _out 。

Compared with the prior art, the image restoration method based on the global texture and the structure has the following beneficial effects that:

1. a Transformer model comprising an encoder and a decoder is provided, wherein the aim of the encoder module is to capture semantic correlation of the whole image in texture reference, so as to obtain a global texture reference set; a rough filling attention module is designed, and the masking area is filled by using all known image blocks to obtain global structure information.

2. In order to make the decoder have the capability of combining the advantages of the two worlds of global texture reference and structure information, a structure-texture matching attention module is configured on the decoder in an intuitive attention transfer mode, and the module dynamically establishes an adaptive block vocabulary for the filled blocks on the occlusion region through a probability diffusion process.

3. To reduce the computational burden, several training techniques are disclosed to overcome the memory overhead of gpu while achieving the most advanced performance in typical benchmarking.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is an overall block diagram of the present invention;

FIG. 2 is a schematic diagram of a rough filled occlusion region according to the present invention;

FIG. 3 is a diagram of the overall architecture of the transform decoder of the present invention;

FIG. 4 is a schematic diagram of a bridge module according to the present invention;

FIG. 5 is a schematic diagram of bridge attention score increment update of the present invention

FIG. 6 is a graph comparing the results of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses an image restoration method based on global texture and structure, which adopts a transform model matched with an encoder and a decoder in order to well capture the global semantic correlation of all blocks in the whole image from the texture, wherein the encoder encodes the correlation of texture information of all blocks in a whole self-attention module, and each small block is extracted by CNNs to be a point on a characteristic graph so as to represent the semantics. In this way, the texture information of each point is represented as a texture reference vector (hereinafter simply referred to as a texture reference) as a query for reconstructing all other texture references. In other words, each texture reference encodes a different attention score for semantic relevance to all other textures in the full graph, resulting in a global texture reference. The goal of the transform decoder is to draw all occlusion blocks by all texture references. To this end, the present embodiment develops a coarse fill attention module that initially fills all of the occlusion blocks by using all of the known blocks. This embodiment is more prone to use all known patches in the image to obtain their global structure information, as opposed to their inaccurate coarse texture information. In conjunction with all texture references with global semantic relevance, the present embodiment proposes a new structure-texture matching attention module consisting of all known patches, wherein the structure information of each occlusion patch is used as a query to process all known patches, and each known patch is used as a query to process all texture references. In this way, the best match in both worlds can be achieved by using this transition and an adaptive block vocabulary consisting of repaired tiles to progressively cover all the occluded blocks through a process of probability diffusion. The overall model is shown in fig. 1.

The method specifically comprises the following steps:

filling subsequent shielding blocks by using the known area and the shielding blocks which are already coarsely filled as conditions, putting the new shielding blocks into the conditions after the new shielding blocks are coarsely filled, and continuing to help the subsequent filling, wherein the method specifically comprises the following steps:

The purpose of image restoration is to input images

And an occlusion picture M of the same size, (where the value of M is not 0, i.e. 1.) both get the occlusion picture I using elemental multiplication _m ＝I _gt An image is repaired to obtain a complete image I _out The process of (1).

To capture the texture semantic relevance of the entire image, an explicit texture representation of each patch needs to be learned. In particular, a high-level semantic feature map may be generated by a typical CNNs network, ResNet50

Each point in the feature map is I _m Corresponding to a piece of texture information of the original image. Obviously, if the feature size is as large as 32 × 32, it will result in shallow layers not capturing high-level semantics; if the network is deep and the size of the feature map becomes 8 x 8, it means that each point of the feature map carries too much semantics, resulting in the texture information of one block being mixed with the texture information of the other blocks. Thereby, for balancing, setting

Let the dimension value of each feature point be C, i.e. the output dimension of ResNet50Degree 2048. it is then mapped to a low-dimensional vector representation r _T Dimension d of _E The calculation method is to merge feature maps of 2048 channels, perform 256 convolution processes with 1 × 1, and restore the feature maps to 256 again

In the form of (1). In order to preserve spatial order information, each point on the feature map is added with a corresponding position embedding

Thus forming the final input form E for the encoder _T 。

Ready to use E _T Calculating texture correlations is self-attentive by calculation throughout the picture. The encoder structure based on Transformer comprises N layers, each layer has a multi-headed self attention (MSA) and feed-forward network (FFN). For the l-th layer there are:

wherein

An input of the l-th layer is represented,

an intermediate result of the l-th layer is shown,

representing the input representing the l +1 th layer, LN (-) represents the layer normalization, and FFN (-) consists of two Fully Connected (FC) layers, each in turn consisting of two sublayers. The main process is to reconstruct each r by MSA (-) _T The global semantic association is captured by a multi-headed self-attention module, which two fully connected layers then convert to an input of l +1 layer directing the final layer to end. Where residual connections are used around each sub-layer. The formula for calculating the multi-head attention mechanism of the first layer is as follows:

is 3 learnable mapping matrixes, j is more than or equal to 1 and less than or equal to h, W ^l Representing a learnable fully connected layer, fusing the outputs from different heads. After passing through the encoder layer, each texture feature vector r can be reconstructed _T As a reference vector

So that they can be collected as a reference set

It is easy to see that each

All other global texture correlations are encoded, where the texture correlations are different at different locations.

Except that a texture reference set is acquired

It is also crucial to use these features to express how to repair occlusion patches. Unlike existing pixel-level decode-converters, the block size needs to be considered for semantic better matching

Will I _m Down-sampling to obtain low-definition images

So as to enhance the corresponding global structure information and obtain a proper block size, I' _m Spread into a 2D sequence of blocks

Each block size is P, N ₀ Is the number of blocks, then the blocks are flattened and mapped to d by a learnable linear mapping matrix _D The dimension is the dimension of the block. Additional spatial locality embedding, whether for known or unknown patches

Is added to the expanded block to preserve spatial order.

In the discussion of

Before associating with the occlusion region, coarse information is obtained based on the known region to fill the occlusion region. Unlike previous regions that relied solely on local inductive priors of CNNs to fill in the coarse content with known patches, a global fill attention mechanism was proposed to fill in the coarse content with all known blocks in the image. For ease of understanding, it is illustrated in FIG. 2 for the k-th block m _k A coarse fill is performed with the unmasked blocks and the first k-1 blocks coarsely filled blocks. Specifically, first, all occluded blocks are sorted in ascending order by the proportion of the occlusion to fill in coarse content, and they are reconstructed with unoccluded blocks and the remaining coarse-filled blocks by the attention mechanism

Formed set P _k-1 Finally, add the coarsely filled blocks

To further form P _k . Coarse filling block

That is at P _k-1 Reconstruction of m by means of upper attention _k As a result, the calculation formula is:

wherein d is _m Is the dimension of the optical fiber, and the dimension,

and

is 3 learnable linear mapping matrices. Now discuss how to get from

In which a suitable one is selected

To repair each block

It is observed that

Or also

Is formed using the non-occluded areas of the entire image. It is clear that the following description of the preferred embodiments,

can capture global texture information well and fill in more coarse

The texture information of (2) is more accurate. However, repaired by all unoccluded blocks

The structure information of (a) is more excellent and the down-sampling operation is further enhanced. Actuated by this, except for direct use by attention mechanisms

De-reconstruction

It is also proposed to use the unoccluded block containing both the desired texture and structure as a bridging module for better matching

And

the details are shown in figure 4.

For an M-layer decoder, each layer contains two sublayers, a texture matching attention (STMA) module and an FFN function containing a 2-layer fully-connected layer, for converting the results of the attention mechanism to the input of the (l +1) -th layer, ending with the M-th layer as with the encoder. The residual form is also sampled for concatenation. For the

The calculation method at the l-th layer is as follows:

wherein STMA (-) represents the structure texture matching attention by including the repaired block

Known set of blocks of _t-1 To obtain

And with

Attention score of (1). STMA (-) of layer l is calculated by

Wherein

Is a direct calculation

And

attention score of (1).

Is a learnable linear mapping matrix, d _i Is a dimension. As mentioned previously, the asperity structure information is provided

And

it is not good to match directly, so it is proposed that the bridging attention module is based on the non-occluded blocks, thereby indirectly utilizing

De-reconstruction

The way the layer l is calculated is as follows:

wherein

A bridge-attention module is represented that,

is a learnable linear mapping matrix, d _c ，d _r Is a dimension. Where equation 6 implies an attention-shifting operation. By passing through

As a query to

The calculation of the attention is carried out,

each value in (a) is as a query to and

attention calculations are performed so that the final reconstruction is possible

Has not passed through reconstruction

Re-reconstruction

Since it is clear that the known block is a very ideal real value, it does not need to be written

The reconstruction is performed, as with equations 5 and 6:

wherein

And

is a learnable linear mapping matrix. Thus, a block-level decoder corpus is also needed to pick out the repaired blocks for each block. In particular, each coarsely filled patch

Reconstructed from equation 7 and after the Mth layer, becomes

Thus collecting to form corpus m ^C Selecting the candidate block with the strongest association from the corpus

Namely, the final repair output with the highest probability is calculated and selected through the formula 8.

Wherein the content of the first and second substances,

is a 256-dimensional vector output through the last Mth layer, the ith entry

And

coarse filling block for representing z-th block

And ith texture reference

The resulting attention scores are calculated by equations 5 and 6, respectively. I | · | purple wind ₁ Is to put all texture entries

Are added to help reconstruct all

To obtain m ^c In (b) correspond to

N _c Is m ^C The number of the elements in (B). Picking the most relevant candidate

As the result of round t-

By selecting at

Utilize | · | non-conducting phosphor ₁ The sum of the calculated maximum attention scores. I.e. for the differences

By using

The sum of the calculated attention scores. And using equation 8 to set O of known blocks through a probability diffusion process _t-1 Is expanded to O _t Thereby further helping to select the candidate repair result of the t +1 th round, and finally ending with all the areas being repaired. Decoder block corpus m ^C Is based on the adaptation of the repair result of the coarse filler blocksConstructed and dynamically updated. The overall architecture of the transform decoder is shown in fig. 3.

Calculating efficiency: one may be concerned about the computational complexity incurred by the attention module for each iteration. As shown in fig. 5, which demonstrates its effectiveness, a map of the attention scores between the set of coarse filler blocks, the set of known blocks, and the texture reference set is saved after computation. When a coarsely filled block is restored to a repaired block, the attention score map between different sets need not be recalculated, only the row of the score map corresponding to the repaired block, i.e., the coarsely filled block shown in fig. 5(a) and (b), needs to be removed, and the attention scores between the new repaired block and the texture reference set are also supplemented in fig. 5 (c). In addition, corpus m is repaired ^C After most of the candidate blocks, the probability diffusion process in equation 8 need not be repeated, especially for those blocks containing only few coarse fills, only averaging by the content of the surrounding area is needed in order to reduce the complexity of the attention mechanism

And m ^K 。

After all the occluded blocks are repaired, a reconstructed vector set is obtained, each vector is in 384-dimensional feature space, and the vectors need to be further restored to be an RGB image

Based on past work, several typical loss functions are selected to measure the repair picture I' _out And downsampled real picture I' _gt Reconstruction errors in between. Such as reconstruction loss

Loss of perception

Loss of style

And pairResistance to loss

After that, by using

Trained antagonistic neural network to' _out Up sampling to

Therefore, the temperature of the molten metal is controlled,

will not occur during training of the Transformer, training results in satisfactory I' _out The process of (2) is as follows:

reconstruction loss: by means of ₁ Loss-weighted downsampled true value I' _gt And model repair result I' _out Difference between pixels:

loss of perception: to simulate human perception of image quality, the perceptual loss is calculated by defining a distance metric between the activation profile of a pre-trained network to the restoration output and the true value, as follows:

wherein phi _i Is obtained from the ith layer of VGG and has a size of N _i ＝C _i *H _i *W _i Characteristic diagram of (phi) _i Represents the results of ReLu1_1, ReLu2_1, ReLu3_1, ReLu4_1 and ReLu5_ 1.

Style loss: the activation signature described above in equation 10 is further used to calculate a style loss to measure the difference between the covariances of the activation signatures and to mitigate "checkerboard" artifacts. Given the VGG level j activation feature map, the style loss formula is as follows:

wherein

Is a Gram matrix selected as the activation map.

Overall loss: based on the above, the overall loss function shown in equation 12 can be finally obtained and minimized to train the overall Transformer model:

in this example, we set λ _r ＝10；λ _p 0.1 and λ _s 250 and finally by an upsample operation I' _out Up-sampling to final result I _out 。

The method proposed in this example is implemented in Python and pytorech. Training with an AdamW optimizer; learning rates of the Transformer and the feature extractor are respectively set as 10e-4 and 10e-5, and weight attenuation is 10 e-4. All Transformer weights are initialized with xavieriit, with ResNet50 pre-trained with Imagenet in torchbios, and fixed batch normalization layer. We also improve feature resolution by enlarging the hole values of the last stage convolution and removing the step size from the first convolution of that stage. Both the transform encoder and decoder include four layers. The network was trained using 256 × 256 size images containing irregular occlusions, and we performed experiments on three common datasets with different features: paris StreetView (PSV), CelebA-HQ and Places 2. We trained the Transformer using 2 NVIDIA 2080TI GPUs in batch size 32 for PSV and 4 NVIDIA 2080TI GPUs in batch size 64 for CelebA-HQ and Places 2.

Quantitative evaluation of our proposed method and latest technology based on four evaluation indexes 1) L ₁ An error; 2) peak signal-to-noise ratio (PSNR); 3) index of structural similarity(SSIM) and 4) FID. L is a radical of an alcohol ₁ PSNR and SSIM are used to compare the low-level differences at the pixel level between the generated image and the true value. The FID evaluates the perception result by measuring the feature distribution distance between the generated image and the real image. The irregular masked areas in the image are verified at different scales across the image size.

Quantitative comparison: we compare our approach with the latest approach: 1) CNNs pure texture method: GC and PIC; 2) attention-based methods: HiFill; 3) structure-texture based approach: MEDFE, EC, CTSDG, EII and decode converter based methods: ICT and BAT. As can be seen from Table 1, our process has a smaller L than the previous process ₁ Error and FID scores, larger PSNR and SSIM. In particular, a small FID score verifies the advantages of the global texture reference and the structural feature representation. Since GC and PIC are texture-only methods, they fill the occluded region only with a known bounded region. HiFill calculates the similarity between each coarsely filled block and all known blocks independently, misleading that an occluded block is covered by only one explicit known region. Although the MEDFE intuitively captures the structure information, it fails to utilize the information of all known patches. Similar limitations apply to EC, CTSDG and EII. BAT and ICT recover the masked regions at the pixel level according to existing principles, and do not capture global texture semantics well. Our method is therefore superior to other methods.

And (4) qualitative comparison: to further clarify the observations, fig. 6 shows the visualization of all the methods on the three data sets. It can be seen that the repair output from our method is semantically more coherent based on the surrounding known regions.

User study: we further conducted a user survey on the data sets PSV, CelebA-HQ and Places 2. Specifically, we randomly drawn 20 test images from each dataset, inviting a total of 10 volunteers to select the most realistic image from the repair results produced by the proposed method and some of the latest methods. As shown in the last column of table 1, the results of our method far exceeded the state of the art.

The embodiment introduces a global idea in texture and structure information of image restoration. In the art, a transform model is proposed in which an encoder and a decoder are combined. The encoder aims to obtain the global texture semantic correlation of the whole image, and the decoder module recovers the covered area. An adaptive block vocabulary is built, and all the coarsely filled blocks are gradually covered through a probability diffusion process. The experimental results of the benchmark tests verify the advantages of our model over the most advanced work.

TABLE 1

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An image restoration method based on global texture and structure is characterized by comprising the following steps:

2. The method according to claim 1, wherein the calculation formula of the bridge attention module is as follows:

wherein

A bridge-attention module is represented that,

is a texture reference set; utilizing coarse structural information

As a query to contact a set of known blocks

Performing an attention calculation, knowing the set of blocks

As a query to and

3. The method according to claim 1, wherein the attention score calculation formula is as follows:

wherein the content of the first and second substances,

is a direct calculation

And

attention scores of;

is a learnable linear mapping matrix, d _i Is a dimension.

4. A method for global texture and structure based image inpainting as claimed in claim 1, wherein the candidate block association probability is calculated as follows:

wherein, O _t-1 A set of known blocks is represented by a representation,

is a texture reference set;

direct calculation for Mth layer

And

the point of attention in between is,

utilizing a bridging attention module for an Mth layer

And

inter attention score, λ represents weight, | · | | non-calculation ₁ Is obtained by adding all attention scores related to the texture reference to aid in the reconstruction

To obtain the corresponding

N _C Is the number of the elements in the list. Picking the most relevant candidate

As a result of the t-th round

By selecting at

Utilize | · | non-conducting phosphor ₁ The sum of the calculated maximum attention scores.

5. The method according to claim 1, wherein the coarse padding block is calculated according to the following formula:

wherein, d _m Is the dimension of the optical fiber, and the dimension,

and

Formed set P _k-1 Finally, add the coarsely filled blocks

To further form a set P _k 。

6. The method of claim 1, wherein a transform-based encoder structure obtains a texture reference set of the image to be restored, wherein the transform-based encoder structure comprises N layers, and each layer has a multi-headed self-attention MSA and a feed-forward network FFN.

7. The global texture and structure-based image inpainting method as claimed in claim 6, wherein for the transform encoder layer I:

wherein

An input of the l-th layer is represented,

the intermediate result of the l-th layer is shown,

8. The method of claim 7, wherein the formula for calculating the l-th layer multi-head attention mechanism is as follows:

is 3 learnable mapping matrixes, j is more than or equal to 1 and less than or equal to h, W ^l Representing a learnable connected layer, fusing outputs from different heads; after passing through the encoder layer, each texture feature vector r is reconstructed _T As a reference vector

Thus assembled as a texture reference set

9. The method of claim 1, further comprising a global loss function, and minimizing the global loss function to train a global transform model:

wherein the content of the first and second substances,

which is indicative of a loss of the reconstruction,

the loss of perception is indicated by the presence of,

expressing style loss, set λ _r ＝10；λ _p 0.1 and λ _s 250 and finally I 'by an upsampling operation' _out Up-sampling to a final result I _out 。