CN115035170A - Image restoration method based on global texture and structure - Google Patents

Image restoration method based on global texture and structure Download PDF

Info

Publication number
CN115035170A
CN115035170A CN202210535815.4A CN202210535815A CN115035170A CN 115035170 A CN115035170 A CN 115035170A CN 202210535815 A CN202210535815 A CN 202210535815A CN 115035170 A CN115035170 A CN 115035170A
Authority
CN
China
Prior art keywords
attention
texture
layer
blocks
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210535815.4A
Other languages
Chinese (zh)
Other versions
CN115035170B (en
Inventor
王杨
刘海鹏
汪萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202210535815.4A priority Critical patent/CN115035170B/en
Publication of CN115035170A publication Critical patent/CN115035170A/en
Application granted granted Critical
Publication of CN115035170B publication Critical patent/CN115035170B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/40Analysis of texture
    • G06T7/41Analysis of texture based on statistical description of texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • G06T5/77
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention discloses an image restoration method based on global texture and structure, which relates to the field of image processing and comprises the following steps: inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired; the method comprises the following steps of filling subsequent shielding blocks by using a known area and the shielding blocks which are already filled roughly as conditions, putting the new shielding blocks into the conditions after the new shielding blocks are filled roughly, and continuing to help the subsequent filling, wherein the method specifically comprises the following steps: selecting a reference vector from the texture reference set, repairing the rough filling block, and calculating the attention score between the texture reference set and the rough filling block; and reconstructing the rough filling blocks by using the bridging attention module and the attention fraction, obtaining a corpus set after multilayer construction, and selecting the candidate block with the strongest correlation from the corpus set to obtain the final restoration output. The repair output obtained by the method is semantically more coherent.

Description

Image restoration method based on global texture and structure
Technical Field
The invention relates to the field of image processing, in particular to an image restoration method based on global texture and structure.
Background
The image restoration is a technology for restoring the shielded area of the image, and supports various applications such as image editing and restoration. Pioneering diffusion-based and tile-based approaches can only repair smaller sized masked regions with simple pixel-level color information, and cannot capture the high-level semantics of the repaired region. To address this problem, a great deal of attention has been directed to deep models, where models based on Convolutional Neural Networks (CNNs) learn high-level semantic information following the encoder-decoder architecture. However, the partial induction priors of CNNs only receive fill information of a bounded known region within the local spatial extent of the masking region.
To solve this problem, attention-based models have been proposed. In particular, the occlusion region expressed in units of blocks is first filled with coarse content as a query for all known small blocks in the image, and then a candidate block with a higher score is selected for replacement. It is noted that PENNet proposes a cross-layer attention module, which calculates attention scores of deep feature maps, performs block replacement on the bottom-layer feature maps according to the attention scores, and finally obtains the output of the repair result through upsampling. Although it considers all known patches in the whole image, each known patch is considered independently on the occlusion region, which strategy can mislead occlusion patches to be embedded by only one known dominant patch with the largest attention score, resulting in an unsatisfactory repair output.
Similar to the attention-based approach, the transform-based model also considers information from all known regions. Rather than focusing on the patch pool, it is based on the pixel level, where each pixel of the occlusion region de-energizes a pixel of a known region as a query to be reconstructed, then projected further into a color vocabulary library to select the most relevant color for repair. The repaired pixels are then added to the pool of known pixels and the process repeats until all pixels are repaired in the predefined order. Technically, BAT and ICT propose a decoder converter that captures pixel-level structure priors through a dense attention mechanism module and projects them into a visual color corpus to select the corresponding color. On the one hand, it explores all known regions, rather than determining only limited known regions, and is therefore superior to the attention model; on the other hand, the pixel level does not capture semantics as well as the patch level, and is therefore inferior to the attention model. Furthermore, the attention score is obtained using only the location information, far from the texture semantic level. Furthermore, the transform model computes a large number of pixels, which may lead to computational burden due to the secondary complexity of the self-attention module.
Still further from a texture and structure perspective, the above methods can be essentially divided into two categories: one is a pure texture approach and the other is a structure-texture based approach. Since pure texture methods, such as CNNs-based and attention-based models, rely heavily on known texture information to recover the masked regions, ignoring the structure may result in reasonable textures that cannot be recovered; worse yet, the texture information used for the repair comes only from the bounded known area, not the entire image, and thus does not capture the semantic correlation between textures in the global image well. In contrast, the structure-texture based approach aims at generating better texture semantics for occlusion regions guided by structural constraints. Then, texture recovery is performed through a different upsampling network. In summary, their core problem is how to fill the mask area with structure information.
EdgeConnect restores the edge information to the structural information based on the edge map and the occlusion black and white map through CNNs. And combining the repaired edge image with the shielded real image containing the texture information, and recovering the shielded area through a codec model. The EII adopts a CNNs model to reconstruct the shielding area of a black-and-white image as structural constraint, and on the basis, color information is used as a texture flow to be transmitted in the image through multi-scale learning. The MEDFE follows an encoder-decoder architecture, where the goal of the encoder is to equalize structural features from deep layers of CNNs with texture features from shallow layers of CNNs, through a channel and spatial equalization process, and then as input feedback to the decoder to generate a complement image. Although structural information can be captured intuitively, the information of all known blocks is not utilized, and therefore is called a "pseudo-global structure", which may mislead to non-ideal texture recovery compared to the transform model. CTSDG recently proposed that a structure and texture can be guided to each other by a two-stream structure based on U-Net variants. However, it may use local textures to guide global structures, thereby creating blurring artifacts. On the basis, how to generate the global texture and structure information can well utilize the semantics of the whole image, and how to match the two types of global information is very beneficial to image restoration.
Disclosure of Invention
In view of this, the present invention provides an image inpainting method based on global texture and structure, so as to solve the problems existing in the background art.
In order to achieve the purpose, the invention adopts the following technical scheme:
an image restoration method based on global texture and structure comprises the following steps:
inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired;
filling subsequent shielding blocks by using the known area and the shielding blocks which are already coarsely filled as conditions, putting the new shielding blocks into the conditions after the coarse filling of the new shielding blocks, and continuing to help the subsequent filling, specifically comprising the following steps:
selecting a reference vector from the texture reference set, repairing the rough filling block, and calculating the attention score between the texture reference set and the rough filling block;
and reconstructing the rough filling blocks by using the bridging attention module and the attention fraction, obtaining a corpus set after multilayer construction, and selecting the candidate block with the strongest correlation from the corpus set to obtain the final restoration output.
Optionally, the calculation formula of the bridge attention module is as follows:
Figure BDA0003648172940000031
wherein
Figure BDA0003648172940000032
Representing a bridging attention modeThe number of the blocks is such that,
Figure BDA0003648172940000033
is a learnable linear mapping matrix, d c ,d r Is the dimension of the optical fiber, and the dimension,
Figure BDA0003648172940000034
is a texture reference set; utilizing coarse structural information
Figure BDA0003648172940000035
As a query to contact a set of known blocks
Figure BDA0003648172940000041
Performing an attention calculation, knowing the set of blocks
Figure BDA0003648172940000042
Each value in (a) is as a query to and
Figure BDA0003648172940000043
attention calculations are performed so that the coarse structure information can be reconstructed finally
Figure BDA0003648172940000044
Optionally, the attention score calculation formula is as follows:
Figure BDA0003648172940000045
wherein the content of the first and second substances,
Figure BDA0003648172940000046
is a direct calculation
Figure BDA0003648172940000047
And
Figure BDA0003648172940000048
attention score of;
Figure BDA0003648172940000049
is a learnable linear mapping matrix, d i Is a dimension.
Optionally, the candidate block association probability is calculated as follows:
Figure BDA00036481729400000410
Figure BDA00036481729400000411
wherein, O t-1 A known area is represented by a known area,
Figure BDA00036481729400000412
is a texture reference set;
Figure BDA00036481729400000429
direct calculation for Mth layer
Figure BDA00036481729400000413
And with
Figure BDA00036481729400000414
The point of attention in between is,
Figure BDA00036481729400000415
utilizing a bridging attention module for an Mth layer
Figure BDA00036481729400000416
And
Figure BDA00036481729400000417
inter attention score, λ represents weight, | · | | non-calculation 1 Is obtained by adding all attention scores related to the texture reference to assist in the reconstruction
Figure BDA00036481729400000418
To obtain corresponding
Figure BDA00036481729400000419
N C Is the number of the middle elements. Picking the most relevant candidate
Figure BDA00036481729400000420
As a result of the t-th round
Figure BDA00036481729400000421
By selecting at
Figure BDA00036481729400000422
Utilize | | · | live in 1 The sum of the calculated maximum attention scores.
Optionally, the coarse padding block calculation formula is as follows:
Figure BDA00036481729400000423
Figure BDA00036481729400000424
wherein, d m Is the dimension of the optical fiber, and the dimension,
Figure BDA00036481729400000425
and
Figure BDA00036481729400000426
is a learnable linear mapping matrix; utilizing unoccluded tiles and remaining coarsely filled tiles by attention mechanism
Figure BDA00036481729400000427
Formed set P k-1 Finally, add the coarsely filled blocks
Figure BDA00036481729400000428
To further form a set P k
Optionally, the texture reference set of the image to be repaired is obtained based on a Transformer encoder structure, where the Transformer encoder structure includes N layers, and each layer has a multi-head self-attention MSA and a feed-forward network FFN.
Optionally, for the first layer of the transform encoder, there are:
Figure BDA0003648172940000051
Figure BDA0003648172940000052
wherein
Figure BDA0003648172940000053
An input of the l-th layer is represented,
Figure BDA0003648172940000054
an intermediate result of the l-th layer is shown,
Figure BDA0003648172940000055
represents the input representing layer l +1, LN (-) represents the layer normalization, FFN (-) consists of two fully connected layers, each layer in turn consisting of two sublayers; the process reconstructs each r for MSA (·) T The global semantic association is captured by a multi-headed self-attention module, which two fully connected layers then convert to an input of l +1 layer directing the final layer to end.
Optionally, the formula for calculating the multi-head attention mechanism of the ith layer is as follows:
Figure BDA0003648172940000056
Figure BDA0003648172940000057
where h is the number of multiple heads, d l Is the dimension of the optical fiber, and the dimension,
Figure BDA0003648172940000058
is 3 learnable mapping matrixes, j is more than or equal to 1 and less than or equal to h, W l Representing a learnable fully connected layer, fusing outputs from different heads; after passing through the encoder layer, each texture feature vector r is reconstructed T As a reference vector
Figure BDA0003648172940000059
Thus assembled as a texture reference set
Figure BDA00036481729400000510
An overall loss function is also included and minimized to train the overall Transformer model:
Figure BDA00036481729400000511
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA00036481729400000512
which is indicative of a loss of the reconstruction,
Figure BDA00036481729400000513
the loss of perception is indicated by the presence of,
Figure BDA00036481729400000514
expressing style loss, set λ r =10;λ p 0.1 and λ s 250 and finally by an upsample operation I' out Up-sampling to final result I out
Compared with the prior art, the image restoration method based on the global texture and the structure has the following beneficial effects that:
1. a Transformer model comprising an encoder and a decoder is provided, wherein the aim of the encoder module is to capture semantic correlation of the whole image in texture reference, so as to obtain a global texture reference set; a rough filling attention module is designed, and the masking area is filled by using all known image blocks to obtain global structure information.
2. In order to make the decoder have the capability of combining the advantages of the two worlds of global texture reference and structure information, a structure-texture matching attention module is configured on the decoder in an intuitive attention transfer mode, and the module dynamically establishes an adaptive block vocabulary for the filled blocks on the occlusion region through a probability diffusion process.
3. To reduce the computational burden, several training techniques are disclosed to overcome the memory overhead of gpu while achieving the most advanced performance in typical benchmarking.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is an overall block diagram of the present invention;
FIG. 2 is a schematic diagram of a rough filled occlusion region according to the present invention;
FIG. 3 is a diagram of the overall architecture of the transform decoder of the present invention;
FIG. 4 is a schematic diagram of a bridge module according to the present invention;
FIG. 5 is a schematic diagram of bridge attention score increment update of the present invention
FIG. 6 is a graph comparing the results of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses an image restoration method based on global texture and structure, which adopts a transform model matched with an encoder and a decoder in order to well capture the global semantic correlation of all blocks in the whole image from the texture, wherein the encoder encodes the correlation of texture information of all blocks in a whole self-attention module, and each small block is extracted by CNNs to be a point on a characteristic graph so as to represent the semantics. In this way, the texture information of each point is represented as a texture reference vector (hereinafter simply referred to as a texture reference) as a query for reconstructing all other texture references. In other words, each texture reference encodes a different attention score for semantic relevance to all other textures in the full graph, resulting in a global texture reference. The goal of the transform decoder is to draw all occlusion blocks by all texture references. To this end, the present embodiment develops a coarse fill attention module that initially fills all of the occlusion blocks by using all of the known blocks. This embodiment is more prone to use all known patches in the image to obtain their global structure information, as opposed to their inaccurate coarse texture information. In conjunction with all texture references with global semantic relevance, the present embodiment proposes a new structure-texture matching attention module consisting of all known patches, wherein the structure information of each occlusion patch is used as a query to process all known patches, and each known patch is used as a query to process all texture references. In this way, the best match in both worlds can be achieved by using this transition and an adaptive block vocabulary consisting of repaired tiles to progressively cover all the occluded blocks through a process of probability diffusion. The overall model is shown in fig. 1.
The method specifically comprises the following steps:
inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired;
filling subsequent shielding blocks by using the known area and the shielding blocks which are already coarsely filled as conditions, putting the new shielding blocks into the conditions after the new shielding blocks are coarsely filled, and continuing to help the subsequent filling, wherein the method specifically comprises the following steps:
selecting a reference vector from the texture reference set, repairing the rough filling block, and calculating the attention score between the texture reference set and the rough filling block;
and reconstructing the rough filling blocks by using the bridging attention module and the attention fraction, obtaining a corpus set after multilayer construction, and selecting the candidate block with the strongest correlation from the corpus set to obtain the final restoration output.
The purpose of image restoration is to input images
Figure BDA0003648172940000081
And an occlusion picture M of the same size, (where the value of M is not 0, i.e. 1.) both get the occlusion picture I using elemental multiplication m =I gt An image is repaired to obtain a complete image I out The process of (1).
To capture the texture semantic relevance of the entire image, an explicit texture representation of each patch needs to be learned. In particular, a high-level semantic feature map may be generated by a typical CNNs network, ResNet50
Figure BDA0003648172940000082
Figure BDA0003648172940000083
Each point in the feature map is I m Corresponding to a piece of texture information of the original image. Obviously, if the feature size is as large as 32 × 32, it will result in shallow layers not capturing high-level semantics; if the network is deep and the size of the feature map becomes 8 x 8, it means that each point of the feature map carries too much semantics, resulting in the texture information of one block being mixed with the texture information of the other blocks. Thereby, for balancing, setting
Figure BDA0003648172940000084
Figure BDA0003648172940000085
Let the dimension value of each feature point be C, i.e. the output dimension of ResNet50Degree 2048. it is then mapped to a low-dimensional vector representation r T Dimension d of E The calculation method is to merge feature maps of 2048 channels, perform 256 convolution processes with 1 × 1, and restore the feature maps to 256 again
Figure BDA0003648172940000086
In the form of (1). In order to preserve spatial order information, each point on the feature map is added with a corresponding position embedding
Figure BDA0003648172940000087
Thus forming the final input form E for the encoder T
Ready to use E T Calculating texture correlations is self-attentive by calculation throughout the picture. The encoder structure based on Transformer comprises N layers, each layer has a multi-headed self attention (MSA) and feed-forward network (FFN). For the l-th layer there are:
Figure BDA0003648172940000088
wherein
Figure BDA0003648172940000089
An input of the l-th layer is represented,
Figure BDA00036481729400000810
an intermediate result of the l-th layer is shown,
Figure BDA00036481729400000811
representing the input representing the l +1 th layer, LN (-) represents the layer normalization, and FFN (-) consists of two Fully Connected (FC) layers, each in turn consisting of two sublayers. The main process is to reconstruct each r by MSA (-) T The global semantic association is captured by a multi-headed self-attention module, which two fully connected layers then convert to an input of l +1 layer directing the final layer to end. Where residual connections are used around each sub-layer. The formula for calculating the multi-head attention mechanism of the first layer is as follows:
Figure BDA0003648172940000091
where h is the number of multiple heads, d l Is the dimension of the optical fiber, and the dimension,
Figure BDA0003648172940000092
is 3 learnable mapping matrixes, j is more than or equal to 1 and less than or equal to h, W l Representing a learnable fully connected layer, fusing the outputs from different heads. After passing through the encoder layer, each texture feature vector r can be reconstructed T As a reference vector
Figure BDA0003648172940000093
So that they can be collected as a reference set
Figure BDA0003648172940000094
It is easy to see that each
Figure BDA0003648172940000095
All other global texture correlations are encoded, where the texture correlations are different at different locations.
Except that a texture reference set is acquired
Figure BDA0003648172940000096
It is also crucial to use these features to express how to repair occlusion patches. Unlike existing pixel-level decode-converters, the block size needs to be considered for semantic better matching
Figure BDA0003648172940000097
Will I m Down-sampling to obtain low-definition images
Figure BDA0003648172940000098
So as to enhance the corresponding global structure information and obtain a proper block size, I' m Spread into a 2D sequence of blocks
Figure BDA0003648172940000099
Figure BDA00036481729400000910
Each block size is P, N 0 Is the number of blocks, then the blocks are flattened and mapped to d by a learnable linear mapping matrix D The dimension is the dimension of the block. Additional spatial locality embedding, whether for known or unknown patches
Figure BDA00036481729400000911
Is added to the expanded block to preserve spatial order.
In the discussion of
Figure BDA00036481729400000912
Before associating with the occlusion region, coarse information is obtained based on the known region to fill the occlusion region. Unlike previous regions that relied solely on local inductive priors of CNNs to fill in the coarse content with known patches, a global fill attention mechanism was proposed to fill in the coarse content with all known blocks in the image. For ease of understanding, it is illustrated in FIG. 2 for the k-th block m k A coarse fill is performed with the unmasked blocks and the first k-1 blocks coarsely filled blocks. Specifically, first, all occluded blocks are sorted in ascending order by the proportion of the occlusion to fill in coarse content, and they are reconstructed with unoccluded blocks and the remaining coarse-filled blocks by the attention mechanism
Figure BDA00036481729400000913
Figure BDA0003648172940000101
Formed set P k-1 Finally, add the coarsely filled blocks
Figure BDA0003648172940000102
To further form P k . Coarse filling block
Figure BDA0003648172940000103
That is at P k-1 Reconstruction of m by means of upper attention k As a result, the calculation formula is:
Figure BDA0003648172940000104
wherein d is m Is the dimension of the optical fiber, and the dimension,
Figure BDA0003648172940000105
and
Figure BDA0003648172940000106
is 3 learnable linear mapping matrices. Now discuss how to get from
Figure BDA0003648172940000107
In which a suitable one is selected
Figure BDA0003648172940000108
To repair each block
Figure BDA0003648172940000109
It is observed that
Figure BDA00036481729400001010
Or also
Figure BDA00036481729400001011
Is formed using the non-occluded areas of the entire image. It is clear that the following description of the preferred embodiments,
Figure BDA00036481729400001012
can capture global texture information well and fill in more coarse
Figure BDA00036481729400001013
The texture information of (2) is more accurate. However, repaired by all unoccluded blocks
Figure BDA00036481729400001014
The structure information of (a) is more excellent and the down-sampling operation is further enhanced. Actuated by this, except for direct use by attention mechanisms
Figure BDA00036481729400001015
De-reconstruction
Figure BDA00036481729400001016
It is also proposed to use the unoccluded block containing both the desired texture and structure as a bridging module for better matching
Figure BDA00036481729400001017
And
Figure BDA00036481729400001018
the details are shown in figure 4.
For an M-layer decoder, each layer contains two sublayers, a texture matching attention (STMA) module and an FFN function containing a 2-layer fully-connected layer, for converting the results of the attention mechanism to the input of the (l +1) -th layer, ending with the M-th layer as with the encoder. The residual form is also sampled for concatenation. For the
Figure BDA00036481729400001019
The calculation method at the l-th layer is as follows:
Figure BDA00036481729400001020
wherein STMA (-) represents the structure texture matching attention by including the repaired block
Figure BDA00036481729400001021
Figure BDA00036481729400001022
Known set of blocks of t-1 To obtain
Figure BDA00036481729400001023
And with
Figure BDA00036481729400001024
Attention score of (1). STMA (-) of layer l is calculated by
Figure BDA00036481729400001025
Wherein
Figure BDA00036481729400001026
Is a direct calculation
Figure BDA00036481729400001027
And
Figure BDA00036481729400001028
attention score of (1).
Figure BDA00036481729400001029
Is a learnable linear mapping matrix, d i Is a dimension. As mentioned previously, the asperity structure information is provided
Figure BDA00036481729400001030
And
Figure BDA00036481729400001031
it is not good to match directly, so it is proposed that the bridging attention module is based on the non-occluded blocks, thereby indirectly utilizing
Figure BDA0003648172940000111
De-reconstruction
Figure BDA0003648172940000112
The way the layer l is calculated is as follows:
Figure BDA0003648172940000113
wherein
Figure BDA0003648172940000114
A bridge-attention module is represented that,
Figure BDA0003648172940000115
is a learnable linear mapping matrix, d c ,d r Is a dimension. Where equation 6 implies an attention-shifting operation. By passing through
Figure BDA0003648172940000116
As a query to
Figure BDA0003648172940000117
The calculation of the attention is carried out,
Figure BDA0003648172940000118
each value in (a) is as a query to and
Figure BDA0003648172940000119
attention calculations are performed so that the final reconstruction is possible
Figure BDA00036481729400001110
Has not passed through reconstruction
Figure BDA00036481729400001111
Re-reconstruction
Figure BDA00036481729400001112
Since it is clear that the known block is a very ideal real value, it does not need to be written
Figure BDA00036481729400001113
The reconstruction is performed, as with equations 5 and 6:
Figure BDA00036481729400001114
wherein
Figure BDA00036481729400001115
And
Figure BDA00036481729400001116
is a learnable linear mapping matrix. Thus, a block-level decoder corpus is also needed to pick out the repaired blocks for each block. In particular, each coarsely filled patch
Figure BDA00036481729400001117
Reconstructed from equation 7 and after the Mth layer, becomes
Figure BDA00036481729400001118
Thus collecting to form corpus m C Selecting the candidate block with the strongest association from the corpus
Figure BDA00036481729400001119
Namely, the final repair output with the highest probability is calculated and selected through the formula 8.
Figure BDA00036481729400001120
Figure BDA00036481729400001121
Wherein the content of the first and second substances,
Figure BDA00036481729400001122
Figure BDA00036481729400001123
is a 256-dimensional vector output through the last Mth layer, the ith entry
Figure BDA00036481729400001124
And
Figure BDA00036481729400001125
coarse filling block for representing z-th block
Figure BDA00036481729400001126
And ith texture reference
Figure BDA00036481729400001127
The resulting attention scores are calculated by equations 5 and 6, respectively. I | · | purple wind 1 Is to put all texture entries
Figure BDA00036481729400001128
Are added to help reconstruct all
Figure BDA00036481729400001129
To obtain m c In (b) correspond to
Figure BDA00036481729400001130
N c Is m C The number of the elements in (B). Picking the most relevant candidate
Figure BDA00036481729400001131
As the result of round t-
Figure BDA00036481729400001132
By selecting at
Figure BDA00036481729400001133
Utilize | · | non-conducting phosphor 1 The sum of the calculated maximum attention scores. I.e. for the differences
Figure BDA00036481729400001134
By using
Figure BDA0003648172940000121
The sum of the calculated attention scores. And using equation 8 to set O of known blocks through a probability diffusion process t-1 Is expanded to O t Thereby further helping to select the candidate repair result of the t +1 th round, and finally ending with all the areas being repaired. Decoder block corpus m C Is based on the adaptation of the repair result of the coarse filler blocksConstructed and dynamically updated. The overall architecture of the transform decoder is shown in fig. 3.
Calculating efficiency: one may be concerned about the computational complexity incurred by the attention module for each iteration. As shown in fig. 5, which demonstrates its effectiveness, a map of the attention scores between the set of coarse filler blocks, the set of known blocks, and the texture reference set is saved after computation. When a coarsely filled block is restored to a repaired block, the attention score map between different sets need not be recalculated, only the row of the score map corresponding to the repaired block, i.e., the coarsely filled block shown in fig. 5(a) and (b), needs to be removed, and the attention scores between the new repaired block and the texture reference set are also supplemented in fig. 5 (c). In addition, corpus m is repaired C After most of the candidate blocks, the probability diffusion process in equation 8 need not be repeated, especially for those blocks containing only few coarse fills, only averaging by the content of the surrounding area is needed in order to reduce the complexity of the attention mechanism
Figure BDA0003648172940000122
And m K
After all the occluded blocks are repaired, a reconstructed vector set is obtained, each vector is in 384-dimensional feature space, and the vectors need to be further restored to be an RGB image
Figure BDA0003648172940000123
Based on past work, several typical loss functions are selected to measure the repair picture I' out And downsampled real picture I' gt Reconstruction errors in between. Such as reconstruction loss
Figure BDA0003648172940000124
Loss of perception
Figure BDA0003648172940000125
Loss of style
Figure BDA0003648172940000126
And pairResistance to loss
Figure BDA0003648172940000127
After that, by using
Figure BDA0003648172940000128
Trained antagonistic neural network to' out Up sampling to
Figure BDA0003648172940000129
Therefore, the temperature of the molten metal is controlled,
Figure BDA00036481729400001210
will not occur during training of the Transformer, training results in satisfactory I' out The process of (2) is as follows:
reconstruction loss: by means of 1 Loss-weighted downsampled true value I' gt And model repair result I' out Difference between pixels:
Figure BDA00036481729400001211
loss of perception: to simulate human perception of image quality, the perceptual loss is calculated by defining a distance metric between the activation profile of a pre-trained network to the restoration output and the true value, as follows:
Figure BDA0003648172940000131
wherein phi i Is obtained from the ith layer of VGG and has a size of N i =C i *H i *W i Characteristic diagram of (phi) i Represents the results of ReLu1_1, ReLu2_1, ReLu3_1, ReLu4_1 and ReLu5_ 1.
Style loss: the activation signature described above in equation 10 is further used to calculate a style loss to measure the difference between the covariances of the activation signatures and to mitigate "checkerboard" artifacts. Given the VGG level j activation feature map, the style loss formula is as follows:
Figure BDA0003648172940000132
wherein
Figure BDA0003648172940000133
Is a Gram matrix selected as the activation map.
Overall loss: based on the above, the overall loss function shown in equation 12 can be finally obtained and minimized to train the overall Transformer model:
Figure BDA0003648172940000134
in this example, we set λ r =10;λ p 0.1 and λ s 250 and finally by an upsample operation I' out Up-sampling to final result I out
The method proposed in this example is implemented in Python and pytorech. Training with an AdamW optimizer; learning rates of the Transformer and the feature extractor are respectively set as 10e-4 and 10e-5, and weight attenuation is 10 e-4. All Transformer weights are initialized with xavieriit, with ResNet50 pre-trained with Imagenet in torchbios, and fixed batch normalization layer. We also improve feature resolution by enlarging the hole values of the last stage convolution and removing the step size from the first convolution of that stage. Both the transform encoder and decoder include four layers. The network was trained using 256 × 256 size images containing irregular occlusions, and we performed experiments on three common datasets with different features: paris StreetView (PSV), CelebA-HQ and Places 2. We trained the Transformer using 2 NVIDIA 2080TI GPUs in batch size 32 for PSV and 4 NVIDIA 2080TI GPUs in batch size 64 for CelebA-HQ and Places 2.
Quantitative evaluation of our proposed method and latest technology based on four evaluation indexes 1) L 1 An error; 2) peak signal-to-noise ratio (PSNR); 3) index of structural similarity(SSIM) and 4) FID. L is a radical of an alcohol 1 PSNR and SSIM are used to compare the low-level differences at the pixel level between the generated image and the true value. The FID evaluates the perception result by measuring the feature distribution distance between the generated image and the real image. The irregular masked areas in the image are verified at different scales across the image size.
Quantitative comparison: we compare our approach with the latest approach: 1) CNNs pure texture method: GC and PIC; 2) attention-based methods: HiFill; 3) structure-texture based approach: MEDFE, EC, CTSDG, EII and decode converter based methods: ICT and BAT. As can be seen from Table 1, our process has a smaller L than the previous process 1 Error and FID scores, larger PSNR and SSIM. In particular, a small FID score verifies the advantages of the global texture reference and the structural feature representation. Since GC and PIC are texture-only methods, they fill the occluded region only with a known bounded region. HiFill calculates the similarity between each coarsely filled block and all known blocks independently, misleading that an occluded block is covered by only one explicit known region. Although the MEDFE intuitively captures the structure information, it fails to utilize the information of all known patches. Similar limitations apply to EC, CTSDG and EII. BAT and ICT recover the masked regions at the pixel level according to existing principles, and do not capture global texture semantics well. Our method is therefore superior to other methods.
And (4) qualitative comparison: to further clarify the observations, fig. 6 shows the visualization of all the methods on the three data sets. It can be seen that the repair output from our method is semantically more coherent based on the surrounding known regions.
User study: we further conducted a user survey on the data sets PSV, CelebA-HQ and Places 2. Specifically, we randomly drawn 20 test images from each dataset, inviting a total of 10 volunteers to select the most realistic image from the repair results produced by the proposed method and some of the latest methods. As shown in the last column of table 1, the results of our method far exceeded the state of the art.
The embodiment introduces a global idea in texture and structure information of image restoration. In the art, a transform model is proposed in which an encoder and a decoder are combined. The encoder aims to obtain the global texture semantic correlation of the whole image, and the decoder module recovers the covered area. An adaptive block vocabulary is built, and all the coarsely filled blocks are gradually covered through a probability diffusion process. The experimental results of the benchmark tests verify the advantages of our model over the most advanced work.
TABLE 1
Figure BDA0003648172940000151
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. An image restoration method based on global texture and structure is characterized by comprising the following steps:
inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired;
filling subsequent shielding blocks by using the known area and the shielding blocks which are already coarsely filled as conditions, putting the new shielding blocks into the conditions after the new shielding blocks are coarsely filled, and continuing to help the subsequent filling, wherein the method specifically comprises the following steps:
selecting a reference vector from the texture reference set, repairing the rough filling block, and calculating the attention score between the texture reference set and the rough filling block;
and reconstructing the rough filling blocks by using the bridging attention module and the attention fraction, obtaining a corpus set after multilayer construction, and selecting the candidate block with the strongest correlation from the corpus set to obtain the final restoration output.
2. The method according to claim 1, wherein the calculation formula of the bridge attention module is as follows:
Figure FDA0003648172930000011
wherein
Figure FDA0003648172930000012
A bridge-attention module is represented that,
Figure FDA0003648172930000013
is a learnable linear mapping matrix, d c ,d r Is the dimension of the optical fiber, and the dimension,
Figure FDA0003648172930000014
is a texture reference set; utilizing coarse structural information
Figure FDA0003648172930000015
As a query to contact a set of known blocks
Figure FDA0003648172930000016
Performing an attention calculation, knowing the set of blocks
Figure FDA0003648172930000017
As a query to and
Figure FDA0003648172930000018
attention calculations are performed so that the coarse structure information can be reconstructed finally
Figure FDA0003648172930000019
3. The method according to claim 1, wherein the attention score calculation formula is as follows:
Figure FDA00036481729300000110
wherein the content of the first and second substances,
Figure FDA00036481729300000111
is a direct calculation
Figure FDA00036481729300000112
And
Figure FDA00036481729300000113
attention scores of;
Figure FDA00036481729300000114
is a learnable linear mapping matrix, d i Is a dimension.
4. A method for global texture and structure based image inpainting as claimed in claim 1, wherein the candidate block association probability is calculated as follows:
Figure FDA0003648172930000021
Figure FDA0003648172930000022
wherein, O t-1 A set of known blocks is represented by a representation,
Figure FDA0003648172930000023
is a texture reference set;
Figure FDA0003648172930000024
direct calculation for Mth layer
Figure FDA0003648172930000025
And
Figure FDA0003648172930000026
the point of attention in between is,
Figure FDA0003648172930000027
utilizing a bridging attention module for an Mth layer
Figure FDA0003648172930000028
And
Figure FDA0003648172930000029
inter attention score, λ represents weight, | · | | non-calculation 1 Is obtained by adding all attention scores related to the texture reference to aid in the reconstruction
Figure FDA00036481729300000210
To obtain the corresponding
Figure FDA00036481729300000211
N C Is the number of the elements in the list. Picking the most relevant candidate
Figure FDA00036481729300000212
As a result of the t-th round
Figure FDA00036481729300000213
By selecting at
Figure FDA00036481729300000214
Utilize | · | non-conducting phosphor 1 The sum of the calculated maximum attention scores.
5. The method according to claim 1, wherein the coarse padding block is calculated according to the following formula:
Figure FDA00036481729300000215
Figure FDA00036481729300000216
wherein, d m Is the dimension of the optical fiber, and the dimension,
Figure FDA00036481729300000217
and
Figure FDA00036481729300000218
is a learnable linear mapping matrix; utilizing unoccluded tiles and remaining coarsely filled tiles by attention mechanism
Figure FDA00036481729300000219
Formed set P k-1 Finally, add the coarsely filled blocks
Figure FDA00036481729300000220
To further form a set P k
6. The method of claim 1, wherein a transform-based encoder structure obtains a texture reference set of the image to be restored, wherein the transform-based encoder structure comprises N layers, and each layer has a multi-headed self-attention MSA and a feed-forward network FFN.
7. The global texture and structure-based image inpainting method as claimed in claim 6, wherein for the transform encoder layer I:
Figure FDA0003648172930000031
Figure FDA0003648172930000032
wherein
Figure FDA0003648172930000033
An input of the l-th layer is represented,
Figure FDA0003648172930000034
the intermediate result of the l-th layer is shown,
Figure FDA0003648172930000035
represents the input representing layer l +1, LN (-) represents the layer normalization, FFN (-) consists of two fully connected layers, each layer in turn consisting of two sublayers; the process reconstructs each r for MSA (·) T The global semantic association is captured by a multi-headed self-attention module, which two fully connected layers then convert to an input of l +1 layer directing the final layer to end.
8. The method of claim 7, wherein the formula for calculating the l-th layer multi-head attention mechanism is as follows:
Figure FDA0003648172930000036
Figure FDA0003648172930000037
where h is the number of multiple heads, d l Is the dimension of the optical fiber, and the dimension,
Figure FDA0003648172930000038
is 3 learnable mapping matrixes, j is more than or equal to 1 and less than or equal to h, W l Representing a learnable connected layer, fusing outputs from different heads; after passing through the encoder layer, each texture feature vector r is reconstructed T As a reference vector
Figure FDA0003648172930000039
Thus assembled as a texture reference set
Figure FDA00036481729300000310
9. The method of claim 1, further comprising a global loss function, and minimizing the global loss function to train a global transform model:
Figure FDA00036481729300000311
wherein the content of the first and second substances,
Figure FDA00036481729300000312
which is indicative of a loss of the reconstruction,
Figure FDA00036481729300000313
the loss of perception is indicated by the presence of,
Figure FDA00036481729300000314
expressing style loss, set λ r =10;λ p 0.1 and λ s 250 and finally I 'by an upsampling operation' out Up-sampling to a final result I out
CN202210535815.4A 2022-05-17 2022-05-17 Image restoration method based on global texture and structure Active CN115035170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210535815.4A CN115035170B (en) 2022-05-17 2022-05-17 Image restoration method based on global texture and structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210535815.4A CN115035170B (en) 2022-05-17 2022-05-17 Image restoration method based on global texture and structure

Publications (2)

Publication Number Publication Date
CN115035170A true CN115035170A (en) 2022-09-09
CN115035170B CN115035170B (en) 2024-03-05

Family

ID=83121173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210535815.4A Active CN115035170B (en) 2022-05-17 2022-05-17 Image restoration method based on global texture and structure

Country Status (1)

Country Link
CN (1) CN115035170B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908205A (en) * 2023-02-21 2023-04-04 成都信息工程大学 Image restoration method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6830707B1 (en) * 2020-01-23 2021-02-17 同▲済▼大学 Person re-identification method that combines random batch mask and multi-scale expression learning
CN113469906A (en) * 2021-06-24 2021-10-01 湖南大学 Cross-layer global and local perception network method for image restoration
US20210390700A1 (en) * 2020-06-12 2021-12-16 Adobe Inc. Referring image segmentation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6830707B1 (en) * 2020-01-23 2021-02-17 同▲済▼大学 Person re-identification method that combines random batch mask and multi-scale expression learning
US20210390700A1 (en) * 2020-06-12 2021-12-16 Adobe Inc. Referring image segmentation
CN113469906A (en) * 2021-06-24 2021-10-01 湖南大学 Cross-layer global and local perception network method for image restoration

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邵杭;王永雄;: "基于并行对抗与多条件融合的生成式高分辨率图像修复", 模式识别与人工智能, no. 04, 15 April 2020 (2020-04-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908205A (en) * 2023-02-21 2023-04-04 成都信息工程大学 Image restoration method and device, electronic equipment and storage medium
CN115908205B (en) * 2023-02-21 2023-05-30 成都信息工程大学 Image restoration method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115035170B (en) 2024-03-05

Similar Documents

Publication Publication Date Title
CN111784602B (en) Method for generating countermeasure network for image restoration
CN109087273B (en) Image restoration method, storage medium and system based on enhanced neural network
CN114463209B (en) Image restoration method based on deep multi-feature collaborative learning
CN114627006B (en) Progressive image restoration method based on depth decoupling network
CN114445292A (en) Multi-stage progressive underwater image enhancement method
CN113538234A (en) Remote sensing image super-resolution reconstruction method based on lightweight generation model
CN115170915A (en) Infrared and visible light image fusion method based on end-to-end attention network
CN116485741A (en) No-reference image quality evaluation method, system, electronic equipment and storage medium
CN110874575A (en) Face image processing method and related equipment
CN116757986A (en) Infrared and visible light image fusion method and device
CN112163998A (en) Single-image super-resolution analysis method matched with natural degradation conditions
CN116739899A (en) Image super-resolution reconstruction method based on SAUGAN network
CN115035170A (en) Image restoration method based on global texture and structure
CN113379606B (en) Face super-resolution method based on pre-training generation model
CN112686830B (en) Super-resolution method of single depth map based on image decomposition
Yu et al. MagConv: Mask-guided convolution for image inpainting
CN116109510A (en) Face image restoration method based on structure and texture dual generation
CN115578638A (en) Method for constructing multi-level feature interactive defogging network based on U-Net
CN114862696A (en) Facial image restoration method based on contour and semantic guidance
Kumar et al. Underwater Image Enhancement using deep learning
CN114359180A (en) Virtual reality-oriented image quality evaluation method
CN117196981B (en) Bidirectional information flow method based on texture and structure reconciliation
CN116523985B (en) Structure and texture feature guided double-encoder image restoration method
CN113688694B (en) Method and device for improving video definition based on unpaired learning
Peng et al. RAUNE-Net: A Residual and Attention-Driven Underwater Image Enhancement Method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant