CN115035170A - Image restoration method based on global texture and structure - Google Patents
Image restoration method based on global texture and structure
- Publication number
- CN115035170A CN115035170A CN202210535815.4A CN202210535815A CN115035170A CN 115035170 A CN115035170 A CN 115035170A CN 202210535815 A CN202210535815 A CN 202210535815A CN 115035170 A CN115035170 A CN 115035170A
- Authority
- CN
- China
- Prior art keywords
- attention
- texture
- layer
- blocks
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 51
- 239000013598 vector Substances 0.000 claims abstract description 16
- 238000010276 construction Methods 0.000 claims abstract description 4
- 238000004364 calculation method Methods 0.000 claims description 22
- 238000013507 mapping Methods 0.000 claims description 14
- 230000008569 process Effects 0.000 claims description 12
- 239000011159 matrix material Substances 0.000 claims description 11
- 230000007246 mechanism Effects 0.000 claims description 7
- 230000006870 function Effects 0.000 claims description 6
- 238000005070 sampling Methods 0.000 claims description 6
- 230000008447 perception Effects 0.000 claims description 5
- 238000010606 normalization Methods 0.000 claims description 4
- 230000008439 repair process Effects 0.000 abstract description 15
- 230000001427 coherent effect Effects 0.000 abstract description 2
- 238000012545 processing Methods 0.000 abstract description 2
- 238000013527 convolutional neural network Methods 0.000 description 12
- 238000013459 approach Methods 0.000 description 8
- 238000010586 diagram Methods 0.000 description 6
- 238000009792 diffusion process Methods 0.000 description 6
- 230000004913 activation Effects 0.000 description 5
- 238000012549 training Methods 0.000 description 4
- 230000003044 adaptive effect Effects 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 239000000945 filler Substances 0.000 description 2
- 230000001788 irregular Effects 0.000 description 2
- 230000000873 masking effect Effects 0.000 description 2
- 238000011084 recovery Methods 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 238000012935 Averaging Methods 0.000 description 1
- 230000006978 adaptation Effects 0.000 description 1
- 230000003042 adversarial effect Effects 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000001174 ascending effect Effects 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000006698 induction Effects 0.000 description 1
- 230000001939 inductive effect Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000011158 quantitative evaluation Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/40—Analysis of texture
- G06T7/41—Analysis of texture based on statistical description of texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G06T5/77—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses an image restoration method based on global texture and structure, relating to the field of image processing and comprising the following steps: inputting an image to be repaired and acquiring its texture reference set; filling subsequent occluded blocks conditioned on the known region and the occluded blocks that have already been coarsely filled, each newly coarse-filled block being added to the condition set to help fill the blocks that follow, which specifically comprises: selecting reference vectors from the texture reference set to repair each coarsely filled block, and computing the attention scores between the texture reference set and the coarsely filled block; and reconstructing the coarsely filled blocks using the bridging attention module and the attention scores, obtaining a corpus after multi-layer construction, and selecting the most strongly associated candidate block from the corpus to obtain the final restoration output. The restoration output obtained by this method is semantically more coherent.
Description
Technical Field
The invention relates to the field of image processing, in particular to an image restoration method based on global texture and structure.
Background
Image restoration is a technique for recovering the occluded regions of an image, and supports applications such as image editing and restoration. Pioneering diffusion-based and patch-based approaches can only repair small masked regions using simple pixel-level color information, and cannot capture the high-level semantics of the repaired region. To address this problem, much attention has turned to deep models, where models based on Convolutional Neural Networks (CNNs) learn high-level semantic information following the encoder-decoder architecture. However, the local inductive priors of CNNs mean that the masked region only receives filling information from a bounded known region within its local spatial extent.
To solve this problem, attention-based models have been proposed. In particular, the occluded region, expressed in units of blocks, is first filled with coarse content, which serves as a query against all known patches in the image; a candidate patch with a higher score is then selected for replacement. Notably, PENNet proposes a cross-layer attention module, which computes attention scores on deep feature maps, performs patch replacement on the bottom-layer feature maps according to those scores, and finally obtains the restoration output through upsampling. Although it considers all known patches in the whole image, each known patch acts on the occluded region independently; this strategy can mislead an occluded patch into being embedded by a single dominant known patch with the largest attention score, resulting in unsatisfactory repair output.
Similar to the attention-based approach, Transformer-based models also consider information from all known regions. Rather than operating on a patch pool, they work at the pixel level: each pixel of the occluded region attends to pixels of the known region as a query to be reconstructed, and is then further projected into a color vocabulary to select the most relevant color for repair. The repaired pixel is then added to the pool of known pixels, and the process repeats in a predefined order until all pixels are repaired. Technically, BAT and ICT propose a Transformer decoder that captures pixel-level structural priors through a dense attention module and projects them into a visual color corpus to select the corresponding color. On one hand, this explores all known regions rather than a limited known neighborhood, and is therefore superior to the attention models; on the other hand, the pixel level does not capture semantics as well as the patch level, and is therefore inferior to them. Furthermore, the attention scores are obtained using only location information, far from the texture-semantic level. Finally, the Transformer model computes over a large number of pixels, which leads to a heavy computational burden due to the quadratic complexity of the self-attention module.
From a texture-and-structure perspective, the above methods can essentially be divided into two categories: pure-texture methods and structure-texture methods. Pure-texture methods, such as CNNs-based and attention-based models, rely heavily on known texture information to recover the masked regions, so ignoring structure may prevent reasonable textures from being recovered; worse, the texture information used for the repair comes only from a bounded known area, not the entire image, and thus does not capture the semantic correlation between textures across the global image. In contrast, structure-texture methods aim to generate better texture semantics for occluded regions guided by structural constraints, after which texture recovery is performed through a separate upsampling network. In summary, their core problem is how to fill the masked area with structure information.
EdgeConnect restores edge information as structural information through CNNs, based on an edge map and a black-and-white occlusion map. The repaired edge image is then combined with the occluded real image containing texture information, and the occluded area is recovered through a codec model. EII adopts a CNNs model to reconstruct the occluded area of a black-and-white image as a structural constraint; on this basis, color information is propagated through the image as a texture stream via multi-scale learning. MEDFE follows an encoder-decoder architecture, where the encoder equalizes structural features from the deep layers of CNNs with texture features from the shallow layers through a channel and spatial equalization process, which is then fed to the decoder to generate a completed image. Although structural information can be captured intuitively, the information of all known blocks is not utilized; the result is therefore called a "pseudo-global structure", which may mislead toward non-ideal texture recovery compared with the Transformer model. CTSDG recently proposed that structure and texture can guide each other through a two-stream architecture based on U-Net variants. However, it may use local textures to guide the global structure, thereby creating blurring artifacts. On this basis, generating global texture and structure information that fully exploits the semantics of the whole image, and matching these two types of global information, is very beneficial to image restoration.
Disclosure of Invention
In view of this, the present invention provides an image inpainting method based on global texture and structure, so as to solve the problems existing in the background art.
In order to achieve the purpose, the invention adopts the following technical scheme:
an image restoration method based on global texture and structure comprises the following steps:
inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired;
filling subsequent occluded blocks conditioned on the known region and the occluded blocks that have already been coarsely filled, each newly coarse-filled block being added to the condition set to help fill the blocks that follow, specifically comprising the following steps:
selecting reference vectors from the texture reference set to repair the coarsely filled block, and computing the attention scores between the texture reference set and the coarsely filled block;
and reconstructing the coarsely filled blocks using the bridging attention module and the attention scores, obtaining a corpus after multi-layer construction, and selecting the most strongly associated candidate block from the corpus to obtain the final restoration output.
Optionally, the bridging attention module is calculated as follows:

BA(r_c) = softmax((r_c W_q)(O W_k)^T / √d_c) · softmax((O W'_q)(R_T W'_k)^T / √d_r) · (R_T W_v)

wherein BA(·) represents the bridging attention module; W_q, W_k, W'_q, W'_k, W_v are learnable linear mapping matrices; d_c, d_r are dimensions; and R_T is the texture reference set. The coarse structural information r_c is used as a query to perform an attention calculation against the set of known blocks O, and each value in the known set O in turn serves as a query against R_T for a second attention calculation, so that the coarse structural information can finally be reconstructed.
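The two-hop bridging attention described in this claim can be sketched as follows. This is a minimal illustrative NumPy sketch, not the patented implementation: the projection matrices are randomly initialised stand-ins for the learned mappings, and the function name `bridged_attention` is assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bridged_attention(R_c, O, R_T, d_c, d_r, rng):
    """R_c: coarse blocks, O: known blocks, R_T: texture references
    (all shaped (n, d)). Random matrices stand in for learned ones."""
    d = R_c.shape[1]
    Wq1, Wk1 = rng.standard_normal((d, d_c)), rng.standard_normal((d, d_c))
    Wq2, Wk2 = rng.standard_normal((d, d_r)), rng.standard_normal((d, d_r))
    Wv = rng.standard_normal((d, d))
    # Hop 1: coarse structural queries attend over the known blocks.
    A1 = softmax((R_c @ Wq1) @ (O @ Wk1).T / np.sqrt(d_c))
    # Hop 2: each known block in turn attends over the texture references.
    A2 = softmax((O @ Wq2) @ (R_T @ Wk2).T / np.sqrt(d_r))
    # Composing the two hops reconstructs the coarse blocks from R_T.
    return A1 @ A2 @ (R_T @ Wv)
```

The composition A1 @ A2 is the "attention transfer": the unoccluded blocks act as an intermediary between coarse structure and global texture.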
Optionally, the attention score calculation formula is as follows:

s(r_c, R_T) = softmax((r_c W_q)(R_T W_k)^T / √d_i)

wherein s(r_c, R_T) directly calculates the attention score between the coarse block r_c and the texture reference set R_T; W_q, W_k are learnable linear mapping matrices, and d_i is a dimension.
Optionally, the candidate block association probability is calculated as follows:

m̂_t = argmax_z ‖ s^M(r_z, R_T) + λ · BA^M(r_z, O_{t-1}, R_T) ‖_1

wherein O_{t-1} represents the known area and R_T is the texture reference set; s^M(·) is the M-th layer's directly calculated attention score between the coarse block and R_T; BA^M(·) is the M-th layer's attention score between them obtained via the bridging attention module; λ represents a weight; and ‖·‖_1 adds all attention scores related to the texture references to assist the reconstruction, giving each candidate in the corpus its associated score, where N_C is the number of elements in the corpus. The most relevant candidate, i.e. the one with the maximum sum of attention scores under ‖·‖_1, is picked as the result of the t-th round.
Optionally, the coarse filling block calculation formula is as follows:

r_{m_k} = softmax((m_k W_q)(P_{k-1} W_k)^T / √d_m) · (P_{k-1} W_v)

wherein d_m is a dimension and W_q, W_k, W_v are learnable linear mapping matrices. Through the attention mechanism, the set P_{k-1}, formed from the unoccluded blocks and the previously coarse-filled blocks, is used to fill block m_k; the coarsely filled block r_{m_k} is then added to further form the set P_k.
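The sequential coarse-filling loop, where each filled block immediately joins the condition set, can be sketched as below. This is an illustrative NumPy sketch with assumed names (`coarse_fill`, `occlusion_ratio`) and random stand-in projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coarse_fill(known, occluded, occlusion_ratio, d_m, rng):
    """known: (n, d) unoccluded blocks; occluded: (m, d) blocks to fill;
    occlusion_ratio: per-block occlusion proportion used for ordering."""
    d = known.shape[1]
    Wq, Wk = rng.standard_normal((d, d_m)), rng.standard_normal((d, d_m))
    Wv = rng.standard_normal((d, d))
    P = [b for b in known]                 # P_0: the unoccluded blocks
    filled = {}
    # Fill in ascending order of occlusion proportion.
    for k in np.argsort(occlusion_ratio):
        Pk = np.stack(P)
        att = softmax((occluded[k] @ Wq) @ (Pk @ Wk).T / np.sqrt(d_m))
        r = att @ (Pk @ Wv)                # coarse content for block m_k
        filled[int(k)] = r
        P.append(r)                        # P_k = P_{k-1} plus the new block
    return filled
```

Because every iteration attends over all of P, each coarse fill draws on all known blocks in the image, not just a local neighborhood.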
Optionally, the texture reference set of the image to be repaired is obtained based on a Transformer encoder structure, where the Transformer encoder comprises N layers, each layer having multi-head self-attention (MSA) and a feed-forward network (FFN).
Optionally, for the l-th layer of the Transformer encoder:

E'_l = LN(MSA(E_l) + E_l),  E_{l+1} = LN(FFN(E'_l) + E'_l)

wherein E_l represents the input of the l-th layer, E'_l represents the intermediate result of the l-th layer, E_{l+1} represents the input of the (l+1)-th layer, LN(·) represents layer normalization, and FFN(·) consists of two fully connected layers, so each layer in turn consists of two sublayers. MSA(·) reconstructs each r_T, capturing global semantic association through the multi-head self-attention module; the two fully connected layers then convert the result into the input of layer l+1, until the final layer.
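The per-layer update above (sublayer, residual, then layer norm) can be sketched as follows; `msa` and `ffn` are passed in as callables, so the sketch shows only the residual/normalization wiring described in the claim.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_layer(E, msa, ffn):
    """One encoder layer: MSA sublayer then FFN sublayer, each wrapped
    in a residual connection followed by layer normalization."""
    E_mid = layer_norm(msa(E) + E)          # intermediate result of layer l
    return layer_norm(ffn(E_mid) + E_mid)   # input to layer l+1
```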
Optionally, the multi-head attention mechanism of the l-th layer is calculated as follows:

MSA(E_l) = Concat(head_1, …, head_h) W_l,  head_j = softmax((E_l W^j_q)(E_l W^j_k)^T / √(d_l/h)) · (E_l W^j_v)

where h is the number of heads, d_l is a dimension, W^j_q, W^j_k, W^j_v are 3 learnable mapping matrices with 1 ≤ j ≤ h, and W_l represents a learnable fully connected layer that fuses the outputs from the different heads. After the encoder layers, each texture feature vector r_T is reconstructed as a reference vector r̂_T, and these are assembled into the texture reference set R_T.
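A minimal NumPy sketch of the head-split, per-head attention, and fusing step; the argument names and the head-slicing scheme are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(E, h, Wq, Wk, Wv, Wl):
    """E: (N, d_l) token sequence; h heads of width d_l/h; Wl fuses heads."""
    N, d = E.shape
    dh = d // h
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    heads = []
    for j in range(h):
        sl = slice(j * dh, (j + 1) * dh)
        att = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(dh))  # per-head scores
        heads.append(att @ V[:, sl])
    return np.concatenate(heads, axis=1) @ Wl  # W_l fuses the head outputs
```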
An overall loss function is also included and minimized to train the overall Transformer model:

L = λ_r L_rec + λ_p L_per + λ_s L_sty

wherein L_rec denotes the reconstruction loss, L_per denotes the perceptual loss, and L_sty denotes the style loss, with λ_r = 10, λ_p = 0.1 and λ_s = 250; finally, an upsampling operation upsamples I'_out to the final result I_out.
Compared with the prior art, the image restoration method based on global texture and structure has the following beneficial effects:
1. A Transformer model comprising an encoder and a decoder is provided, wherein the encoder module aims to capture the semantic correlation of the whole image among texture references, yielding a global texture reference set; a coarse filling attention module is designed, which fills the masked area using all known image blocks to obtain global structure information.
2. To give the decoder the ability to combine the advantages of both worlds, global texture references and structure information, a structure-texture matching attention module is configured on the decoder via an intuitive attention transfer; this module dynamically establishes an adaptive block vocabulary for the filled blocks on the occluded region through a probability diffusion process.
3. To reduce the computational burden, several training techniques are disclosed to overcome GPU memory overhead while achieving state-of-the-art performance on typical benchmarks.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is an overall block diagram of the present invention;
FIG. 2 is a schematic diagram of a rough filled occlusion region according to the present invention;
FIG. 3 is a diagram of the overall architecture of the Transformer decoder of the present invention;
FIG. 4 is a schematic diagram of a bridge module according to the present invention;
FIG. 5 is a schematic diagram of the incremental bridge attention score update of the present invention;
FIG. 6 is a graph comparing the results of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses an image restoration method based on global texture and structure. To capture, from the texture side, the global semantic correlation of all blocks in the whole image, a Transformer model with matched encoder and decoder is adopted. The encoder encodes the correlation of the texture information of all blocks in a full self-attention module, each small block being extracted by CNNs as one point on a feature map to represent its semantics. In this way, the texture information of each point is represented as a texture reference vector (hereinafter simply referred to as a texture reference) that serves as a query for reconstructing all other texture references. In other words, each texture reference encodes a different attention score for its semantic relevance to all other textures in the full image, resulting in a global texture reference. The goal of the Transformer decoder is to repair all occluded blocks using all texture references. To this end, the present embodiment develops a coarse filling attention module that initially fills all occluded blocks using all known blocks. The embodiment relies on these coarsely filled blocks for their global structure information rather than their inaccurate coarse texture information. Combining this with all texture references carrying global semantic relevance, the embodiment proposes a new structure-texture matching attention module over all known patches, in which the structure information of each occluded patch is used as a query against all known patches, and each known patch in turn is used as a query against all texture references. In this way, the best match between the two worlds can be achieved through this transfer, and an adaptive block vocabulary consisting of repaired blocks progressively covers all occluded blocks through a probability diffusion process. The overall model is shown in FIG. 1.
The method specifically comprises the following steps:
inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired;
filling subsequent occluded blocks conditioned on the known region and the occluded blocks that have already been coarsely filled, each newly coarse-filled block being added to the condition set to help fill the blocks that follow, specifically comprising the following steps:
selecting reference vectors from the texture reference set to repair the coarsely filled block, and computing the attention scores between the texture reference set and the coarsely filled block;
and reconstructing the coarsely filled blocks using the bridging attention module and the attention scores, obtaining a corpus after multi-layer construction, and selecting the most strongly associated candidate block from the corpus to obtain the final restoration output.
The purpose of image restoration is, given an input image I_gt and a binary occlusion mask M of the same size (with values 0 or 1), to obtain the occluded image I_m = I_gt ⊙ M by element-wise multiplication, and then to repair it into a complete image I_out.
To capture the texture semantic relevance of the entire image, an explicit texture representation of each patch needs to be learned. In particular, a high-level semantic feature map can be generated by a typical CNNs network, ResNet50; each point in the feature map corresponds to a piece of texture information of the original image I_m. Obviously, if the feature map is as large as 32 × 32, the shallow layers will not capture high-level semantics; if the network is deep and the feature map shrinks to 8 × 8, each point of the feature map carries too much semantics, so the texture information of one block becomes mixed with that of other blocks. The feature map size is therefore set as a balance between the two. Let the dimension of each feature point be C, i.e. the output dimension of ResNet50, 2048; it is then mapped to a low-dimensional vector representation r_T of dimension d_E = 256 by applying a 1 × 1 convolution with 256 output channels over the 2048-channel feature map. To preserve spatial order information, a corresponding position embedding is added to each point on the feature map, forming the final input E_T for the encoder.
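The projection and position-embedding step above can be sketched as follows. The names `build_encoder_input`, `W_proj` and `pos_embed` are illustrative assumptions; a matrix product over flattened spatial points is equivalent to the 1 × 1 convolution described in the text.

```python
import numpy as np

def build_encoder_input(feat, W_proj, pos_embed):
    """feat: (C, H, W) ResNet50 feature map (C = 2048 in the text);
    W_proj: (C, 256), acting as a 1x1 convolution over the channels;
    pos_embed: (H*W, 256) learned spatial position embeddings."""
    C, H, W = feat.shape
    tokens = feat.reshape(C, H * W).T @ W_proj  # one 256-d texture vector per point
    return tokens + pos_embed                   # E_T, the encoder input
```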
E_T is then used to calculate texture correlations by self-attention over the whole picture. The Transformer-based encoder structure comprises N layers, each with multi-head self-attention (MSA) and a feed-forward network (FFN). For the l-th layer:

E'_l = LN(MSA(E_l) + E_l),  E_{l+1} = LN(FFN(E'_l) + E'_l)

wherein E_l represents the input of the l-th layer, E'_l the intermediate result of the l-th layer, and E_{l+1} the input of the (l+1)-th layer; LN(·) represents layer normalization, and FFN(·) consists of two Fully Connected (FC) layers, so each layer in turn consists of two sublayers. The main process is that MSA(·) reconstructs each r_T, capturing global semantic association through the multi-head self-attention module, after which the two fully connected layers convert the result into the input of layer l+1, until the final layer. Residual connections are used around each sublayer. The multi-head attention mechanism of the l-th layer is calculated as follows:
MSA(E_l) = Concat(head_1, …, head_h) W_l,  head_j = softmax((E_l W^j_q)(E_l W^j_k)^T / √(d_l/h)) · (E_l W^j_v)

where h is the number of heads, d_l is a dimension, W^j_q, W^j_k, W^j_v are 3 learnable mapping matrices with 1 ≤ j ≤ h, and W_l represents a learnable fully connected layer fusing the outputs from the different heads. After the encoder layers, each texture feature vector r_T can be reconstructed as a reference vector r̂_T, and these are collected as the reference set R_T. It is easy to see that each r̂_T encodes global correlations with all other textures, and that these correlations differ at different locations.
Besides acquiring the texture reference set R_T, it is also crucial to express how these features repair occluded patches. Unlike existing pixel-level Transformer decoders, the block size needs to be considered for better semantic matching. I_m is down-sampled to obtain a low-resolution image I'_m, which enhances the corresponding global structure information and yields a suitable block size. I'_m is then unfolded into a 2D sequence of blocks, each of size P × P, with N_0 the number of blocks; the blocks are flattened and mapped to dimension d_D by a learnable linear mapping matrix. An additional spatial position embedding, whether for known or unknown patches, is added to each unfolded block to preserve spatial order.
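The unfold-flatten-embed step for the decoder input can be sketched as below; `patchify` and `embed_patches` are assumed names, and the reshape/transpose trick is one standard way to extract non-overlapping blocks.

```python
import numpy as np

def patchify(img, P):
    """Unfold a (C, H, W) image into a 2D sequence of flattened
    non-overlapping P x P blocks; N0 = (H//P) * (W//P) blocks result."""
    C, H, W = img.shape
    return (img.reshape(C, H // P, P, W // P, P)
               .transpose(1, 3, 0, 2, 4)
               .reshape(-1, C * P * P))

def embed_patches(patches, W_map, pos_embed):
    """Map each flattened block to d_D dims with a learnable linear
    mapping and add a spatial position embedding to preserve order."""
    return patches @ W_map + pos_embed
```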
Before discussing how R_T associates with the occluded region, coarse information is obtained from the known region to fill the occluded region. Unlike previous work that relies solely on the local inductive priors of CNNs to fill coarse content from nearby known patches, a global filling attention mechanism is proposed to fill the coarse content with all known blocks in the image. For ease of understanding, FIG. 2 illustrates, for the k-th block m_k, a coarse fill performed with the unmasked blocks and the first k−1 coarsely filled blocks. Specifically, all occluded blocks are first sorted in ascending order of their occlusion proportion; each is then reconstructed, via the attention mechanism, from the set P_{k-1} formed by the unoccluded blocks and the blocks already coarsely filled, and the newly coarse-filled block is finally added to form P_k. The coarsely filled block r_{m_k} is the result of reconstructing m_k by attention over P_{k-1}, calculated as:

r_{m_k} = softmax((m_k W_q)(P_{k-1} W_k)^T / √d_m) · (P_{k-1} W_v)

wherein d_m is a dimension and W_q, W_k, W_v are 3 learnable linear mapping matrices. Now consider how to select a suitable reference from R_T to repair each block. Observe that both R_T and the coarse blocks are formed using the unoccluded areas of the entire image. Clearly, R_T captures global texture information well, and its texture information is more accurate than that of the coarse fill. However, the structure information of the coarse blocks, repaired from all unoccluded blocks and further enhanced by the down-sampling operation, is superior. Motivated by this, besides directly reconstructing the coarse blocks from R_T by the attention mechanism, it is also proposed to use the unoccluded blocks, which contain both the desired texture and structure, as a bridging module for better matching structure and texture; the details are shown in FIG. 4.
For an M-layer decoder, each layer contains two sublayers: a structure-texture matching attention (STMA) module, and an FFN of two fully connected layers that converts the result of the attention mechanism into the input of the (l+1)-th layer, ending with the M-th layer as in the encoder. Residual connections are likewise used around each sublayer. STMA(·) obtains, through the set of known blocks O_{t-1} that includes the repaired blocks, the attention score between each coarsely filled block and the texture reference set R_T. The STMA of layer l first computes the direct score:

s_l = softmax((r_c W_q)(R_T W_k)^T / √d_i)   (5)

wherein s_l directly calculates the attention score between the coarse block r_c and R_T; W_q, W_k are learnable linear mapping matrices and d_i is a dimension. As mentioned previously, the coarse structural information r_c does not match R_T well directly, so the bridging attention module is proposed: based on the unoccluded blocks, it indirectly reconstructs r_c from R_T. For layer l it is calculated as:

BA(r_c) = softmax((r_c W'_q)(O_{t-1} W'_k)^T / √d_c) · softmax((O_{t-1} W''_q)(R_T W''_k)^T / √d_r) · (R_T W_v)   (6)

wherein BA(·) represents the bridging attention module, W'_q, W'_k, W''_q, W''_k, W_v are learnable linear mapping matrices, and d_c, d_r are dimensions. Equation 6 implies an attention-transfer operation: r_c serves as a query for an attention calculation against O_{t-1}, and each value in O_{t-1} in turn serves as a query against R_T, so that r_c can finally be reconstructed. The known blocks themselves are not re-reconstructed through R_T, since a known block is already an ideal real value; they are processed analogously to equations 5 and 6, with their own learnable linear mapping matrices, giving equation 7. A block-level decoder corpus is then needed to pick out the repaired block for each occluded block. In particular, each coarsely filled patch is reconstructed by equation 7 and, after the M-th layer, becomes a candidate; these candidates are collected to form the corpus m_C, and the candidate block with the strongest association, i.e. the final repair output with the highest probability, is selected from the corpus through equation 8.
In equation 8, the score of the z-th candidate is ‖s^M_z + λ · BA^M_z‖_1, where the output of the last (M-th) layer is a 256-dimensional vector; the i-th entries of s^M_z and BA^M_z are the attention scores between the z-th coarsely filled block and the i-th texture reference r̂_i, calculated by equations 5 and 6 respectively. ‖·‖_1 adds all texture-reference attention entries to help reconstruct each candidate, yielding the corresponding score in m_C, where N_C is the number of elements in m_C. The most relevant candidate, i.e. the one with the maximum sum of attention scores under ‖·‖_1, is picked as the result of the t-th round. Using equation 8, the set of known blocks O_{t-1} is expanded to O_t through a probability diffusion process, which further helps select the candidate repair result of round t+1, ending when all regions have been repaired. The decoder block corpus m_C is thus constructed adaptively from the repair results of the coarsely filled blocks and dynamically updated. The overall architecture of the Transformer decoder is shown in FIG. 3.
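The per-round candidate selection, summing direct and bridged attention scores over the texture references and taking the argmax, can be sketched as follows; `select_candidate` and the score-matrix layout are illustrative assumptions.

```python
import numpy as np

def select_candidate(direct_scores, bridged_scores, lam=1.0):
    """direct_scores / bridged_scores: (N_C, N_ref) last-layer attention
    maps between each candidate block and the texture references,
    computed directly and via the bridging module. The L1 sum over
    references ranks the candidates; the argmax is this round's block."""
    combined = direct_scores + lam * bridged_scores
    totals = np.abs(combined).sum(axis=1)  # ||.||_1 over texture references
    return int(np.argmax(totals))
```

Each round, the selected candidate joins the known set, so later rounds are conditioned on earlier repairs (the probability diffusion process).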
Computational efficiency: one may be concerned about the computational complexity incurred by the attention module at each iteration. As shown in FIG. 5, which demonstrates its effectiveness, the attention score maps between the coarsely filled block set, the known block set and the texture reference set are saved after computation. When a coarsely filled block is promoted to a repaired block, the attention score maps between the different sets need not be recomputed; only the row of the score map corresponding to the repaired block needs to be removed, i.e. the coarsely filled block shown in FIG. 5(a) and (b), and the attention scores between the newly repaired block and the texture reference set are supplemented as in FIG. 5(c). In addition, after most candidate blocks of the corpus m_C have been repaired, the probability diffusion process in equation 8 need not be repeated; in particular, for those blocks containing only a few coarse fills, averaging over the content of the surrounding area suffices, which reduces the complexity of the attention mechanism.
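The incremental score-map bookkeeping described here can be sketched as a small cache: promoting a repaired block deletes one row from the coarse-set map and appends its scores to the known-set map. The class name and method are illustrative assumptions.

```python
import numpy as np

class ScoreCache:
    """Keep the attention score maps between the coarse set, the known
    set and the texture references; promoting a repaired block moves one
    row instead of recomputing whole maps (the update of FIG. 5)."""
    def __init__(self, coarse_to_ref):
        self.coarse_to_ref = coarse_to_ref                      # (N_coarse, N_ref)
        self.known_to_ref = np.empty((0, coarse_to_ref.shape[1]))

    def promote(self, idx, repaired_row):
        # Remove the repaired block's row from the coarse-set map ...
        self.coarse_to_ref = np.delete(self.coarse_to_ref, idx, axis=0)
        # ... and supplement its scores against the texture references.
        self.known_to_ref = np.vstack([self.known_to_ref, repaired_row])
```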
After all occluded blocks have been repaired, a set of reconstructed vectors is obtained, each lying in a 384-dimensional feature space; these vectors must be further decoded into an RGB image. Following prior work, several typical loss functions are selected to measure the reconstruction error between the repaired picture I'_out and the downsampled real picture I'_gt: a reconstruction loss, a perceptual loss, a style loss, and an adversarial loss. Afterwards, a separately trained adversarial neural network upsamples I'_out to full resolution, so this upsampling network is not involved in training the Transformer. The process of training to obtain a satisfactory I'_out is as follows:
Reconstruction loss: the ℓ1 loss weights the pixel-wise difference between the downsampled ground truth I'_gt and the model's repair result I'_out:
Perceptual loss: to simulate the human perception of image quality, the perceptual loss is computed as a distance between the activation feature maps that a pre-trained network produces for the restoration output and for the ground truth, as follows:
wherein φ_i is the feature map obtained from the i-th layer of VGG, of size N_i = C_i × H_i × W_i; the φ_i are taken from the ReLU1_1, ReLU2_1, ReLU3_1, ReLU4_1 and ReLU5_1 activations.
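A minimal sketch of the perceptual loss of Equation 10, assuming the VGG activation maps have already been extracted (the feature-extraction step is omitted here; list ordering and names are assumptions):

```python
import numpy as np

def perceptual_loss(feats_out, feats_gt):
    """L_perc = sum_i ||phi_i(out) - phi_i(gt)||_1 / N_i, where each
    phi_i is a (C_i, H_i, W_i) activation map, e.g. the ReLU1_1..ReLU5_1
    outputs of a pre-trained VGG applied to output and ground truth."""
    total = 0.0
    for fo, fg in zip(feats_out, feats_gt):
        total += np.abs(fo - fg).sum() / fo.size  # fo.size == C_i * H_i * W_i
    return total
```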
Style loss: the activation feature maps of Equation 10 are further used to compute a style loss, which measures the difference between the covariances of the activation maps and mitigates "checkerboard" artifacts. Given the activation feature map of VGG layer j, the style loss is formulated as follows:
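The covariance comparison above is conventionally implemented with normalized Gram matrices; a minimal sketch (the normalization constant is an assumption, as the exact formula is not reproduced in this text):

```python
import numpy as np

def gram(feat):
    """Normalized Gram (covariance-style) matrix of a (C, H, W) activation map."""
    c, h, w = feat.shape
    m = feat.reshape(c, h * w)
    return m @ m.T / (c * h * w)

def style_loss(feat_out, feat_gt):
    """l1 difference between the Gram matrices of the output and the ground
    truth for one VGG layer; in practice this is summed over several layers."""
    return np.abs(gram(feat_out) - gram(feat_gt)).mean()
```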
Overall loss: based on the above, the overall loss function shown in Equation 12 is finally obtained and minimized to train the full Transformer model:
In this example we set λ_r = 10, λ_p = 0.1 and λ_s = 250, and finally an upsampling operation maps I'_out to the final result I_out.
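With the weights quoted above, the overall objective of Equation 12 reduces to a weighted sum; the sketch below assumes the adversarial term enters with unit weight, which the text does not state explicitly:

```python
def overall_loss(l_rec, l_perc, l_style, l_adv,
                 lam_r=10.0, lam_p=0.1, lam_s=250.0):
    """Weighted sum of the four losses (Equation 12) with the weights
    lambda_r = 10, lambda_p = 0.1, lambda_s = 250 from the text; the
    unit weight on the adversarial loss is an assumption."""
    return lam_r * l_rec + lam_p * l_perc + lam_s * l_style + l_adv
```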
The method proposed in this example is implemented in Python and PyTorch. Training uses the AdamW optimizer; the learning rates of the Transformer and the feature extractor are set to 10e-4 and 10e-5, respectively, with a weight decay of 10e-4. All Transformer weights are initialized with Xavier initialization; ResNet50 is pre-trained on ImageNet (from torchvision) with its batch normalization layers frozen. We also increase the feature resolution by enlarging the dilation values of the last-stage convolutions and removing the stride from the first convolution of that stage. Both the Transformer encoder and decoder comprise four layers. The network is trained on 256 × 256 images containing irregular occlusions, and we experiment on three common datasets with different characteristics: Paris StreetView (PSV), CelebA-HQ and Places2. We trained the Transformer with a batch size of 32 on 2 NVIDIA 2080 Ti GPUs for PSV, and with a batch size of 64 on 4 NVIDIA 2080 Ti GPUs for CelebA-HQ and Places2.
We quantitatively evaluate our proposed method against the state of the art on four metrics: 1) L1 error; 2) peak signal-to-noise ratio (PSNR); 3) the structural similarity index (SSIM); and 4) FID. L1, PSNR and SSIM compare the low-level, pixel-wise differences between the generated image and the ground truth, while FID evaluates the perceptual result by measuring the distance between the feature distributions of generated and real images. Irregularly masked regions are evaluated at different scales relative to the image size.
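For reference, the pixel-level metrics can be computed as below (standard definitions, not taken from the patent; SSIM and FID require a library implementation and are omitted):

```python
import numpy as np

def l1_error(img, ref):
    """Mean absolute pixel difference between generated image and ground truth."""
    return np.mean(np.abs(img.astype(np.float64) - ref.astype(np.float64)))

def psnr(img, ref, peak=255.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```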
Quantitative comparison: we compare our approach with the latest methods: 1) pure-texture CNN methods: GC and PIC; 2) attention-based methods: HiFill; 3) structure-texture methods: MEDFE, EC, CTSDG and EII; and 4) Transformer-decoder-based methods: ICT and BAT. As Table 1 shows, our method achieves smaller L1 error and FID scores and larger PSNR and SSIM than previous methods. In particular, the small FID score verifies the advantages of the global texture reference and the structural feature representation. Since GC and PIC are texture-only methods, they fill the occluded region from a limited known region only. HiFill computes the similarity between each coarse-filled block and all known blocks independently, which misleadingly assumes that an occluded block is covered by a single explicit known region. Although MEDFE intuitively captures structure information, it fails to exploit the information of all known patches; similar limitations apply to EC, CTSDG and EII. BAT and ICT recover the masked regions at the pixel level conditioned on previously recovered content, and do not capture global texture semantics well. Our method is therefore superior to the others.
Qualitative comparison: to further support these observations, fig. 6 visualizes the results of all methods on the three datasets. It can be seen that the repair output of our method is semantically more coherent with the surrounding known regions.
User study: we further conducted a user study on the PSV, CelebA-HQ and Places2 datasets. Specifically, we randomly drew 20 test images from each dataset and invited a total of 10 volunteers to select the most realistic image among the repair results produced by the proposed method and several state-of-the-art methods. As shown in the last column of Table 1, the results of our method far exceed the state of the art.
This embodiment introduces a global perspective on the texture and structure information used in image restoration. A Transformer model combining an encoder and a decoder is proposed: the encoder captures the global texture-semantic correlations of the whole image, while the decoder module restores the covered regions. An adaptive block vocabulary is built, and all coarse-filled blocks are progressively recovered through a probability diffusion process. Experimental results on benchmark tests verify the advantages of our model over the most advanced work.
TABLE 1
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. An image restoration method based on global texture and structure is characterized by comprising the following steps:
inputting an image to be repaired, and acquiring a texture reference set of the image to be repaired;
filling subsequent occluded blocks conditioned on the known region and the occluded blocks that have already been coarsely filled, and, once a new occluded block has been coarsely filled, adding it to the conditions to help fill the subsequent ones; this specifically comprises:
selecting reference vectors from the texture reference set to repair the coarsely filled blocks, and calculating the attention scores between the texture reference set and the coarsely filled blocks;
and reconstructing the coarsely filled blocks using the bridged attention module and the attention scores, obtaining a corpus after multi-layer construction, and selecting the most strongly correlated candidate block from the corpus to obtain the final restoration output.
2. The method according to claim 1, wherein the calculation formula of the bridged attention module is as follows:
wherein the first operator denotes the bridged attention module, the accompanying terms are learnable linear mapping matrices, d_c and d_r are dimensions, and r is the texture reference set; the coarse structural information m_C is used as the query to perform attention over the set of known blocks m_K, and the set of known blocks m_K is in turn used as the query to perform attention over r, so that the coarse structural information can finally be reconstructed.
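One plausible reading of the bridged attention of this claim is sketched below; the learnable mapping matrices are omitted, and all function and argument names are assumptions for illustration, not the patented formula.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def bridged_attention(m_c, m_k, r):
    """m_c: (Nc, d) coarse structural queries; m_k: (Nk, d) known blocks;
    r: (Nr, d) texture references. The known blocks first attend to the
    texture references, then the coarse queries attend to the known blocks,
    reading out texture-enriched values: m_c is 'bridged' to r via m_k."""
    bridge = attention(m_k, r, r)       # known blocks query the texture refs
    return attention(m_c, m_k, bridge)  # coarse info queries the known blocks
```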
4. A method for global texture and structure based image inpainting as claimed in claim 1, wherein the candidate block association probability is calculated as follows:
wherein O_{t-1} represents the set of known blocks and r is the texture reference set; the first term is the attention score computed directly by the Mth layer, the second is the attention score computed by the Mth layer through the bridged attention module, and λ represents a weight; ‖·‖₁ adds up all attention scores related to the texture references to aid the reconstruction, yielding the corresponding entries, and N_C is the number of elements. The most relevant candidate is picked as the result of round t by selecting the candidate with the largest ‖·‖₁ sum of attention scores.
5. The method according to claim 1, wherein the coarsely filled block is calculated according to the following formula:
6. The method of claim 1, wherein a Transformer-based encoder structure obtains the texture reference set of the image to be restored, the Transformer-based encoder comprising N layers, each layer having a multi-head self-attention (MSA) module and a feed-forward network (FFN).
7. The global texture and structure-based image inpainting method as claimed in claim 6, wherein for the l-th Transformer encoder layer:
wherein the terms denote the input of the l-th layer, the intermediate result of the l-th layer, and the input of the (l+1)-th layer, respectively; LN(·) denotes layer normalization, and FFN(·) consists of two fully connected layers, so that each layer in turn comprises two sublayers. In this process each r_T is reconstructed: MSA(·), the multi-head self-attention module, captures the global semantic associations, after which the two fully connected layers convert the result into the input of layer l+1, up to the final layer.
8. The method of claim 7, wherein the formula for calculating the l-th layer multi-head attention mechanism is as follows:
where h is the number of heads, d_l is the dimension, the three learnable mapping matrices (the query, key and value projections) are indexed by 1 ≤ j ≤ h, and W_l denotes a learnable fully connected layer that fuses the outputs of the different heads; after the encoder layers, each texture feature vector r_T is reconstructed as a reference vector, and these are assembled into the texture reference set.
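The multi-head mechanism of this claim can be sketched as below; the per-head slicing scheme and the identity of the projection matrices are assumptions (the claim's exact formula is not reproduced in this text):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def msa(x, wq, wk, wv, wo, h):
    """Multi-head self-attention.
    x: (n, d) token features; wq/wk/wv: (d, d) learnable query/key/value
    mappings; wo: (d, d) fully connected layer W_l fusing the h heads."""
    n, d = x.shape
    dh = d // h                                  # per-head dimension
    q, k, v = x @ wq, x @ wk, x @ wv
    heads = []
    for j in range(h):
        s = slice(j * dh, (j + 1) * dh)          # slice out head j
        a = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))
        heads.append(a @ v[:, s])
    return np.concatenate(heads, axis=1) @ wo    # fuse heads with W_l
```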
9. The method of claim 1, further comprising an overall loss function, which is minimized to train the overall Transformer model:
wherein the terms denote the reconstruction loss, the perceptual loss and the style loss, respectively; λ_r = 10, λ_p = 0.1 and λ_s = 250 are set, and finally I'_out is upsampled by an upsampling operation to the final result I_out.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210535815.4A CN115035170B (en) | 2022-05-17 | 2022-05-17 | Image restoration method based on global texture and structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115035170A true CN115035170A (en) | 2022-09-09 |
CN115035170B CN115035170B (en) | 2024-03-05 |
Family
ID=83121173
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115908205A (en) * | 2023-02-21 | 2023-04-04 | 成都信息工程大学 | Image restoration method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6830707B1 * | 2020-01-23 | 2021-02-17 | Tongji University | Person re-identification method combining random batch mask and multi-scale representation learning
CN113469906A (en) * | 2021-06-24 | 2021-10-01 | 湖南大学 | Cross-layer global and local perception network method for image restoration |
US20210390700A1 (en) * | 2020-06-12 | 2021-12-16 | Adobe Inc. | Referring image segmentation |
Non-Patent Citations (1)
Title |
---|
SHAO Hang; WANG Yongxiong: "Generative High-Resolution Image Inpainting Based on Parallel Adversary and Multi-Condition Fusion", Pattern Recognition and Artificial Intelligence, no. 04, 15 April 2020 (2020-04-15) *
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||