CN118014894A - Image restoration method, device, equipment and readable storage medium based on combination of edge priors and attention mechanisms

Publication number: CN118014894A
Application number: CN202311365706.3A
Authority: CN (China); original language: Chinese (zh)
Inventors: 樊瑶 (Fan Yao), 王浩 (Wang Hao), 陈浩 (Chen Hao)
Applicant: Xizang Minzu University
Legal status: Pending
Abstract

The application provides an image restoration method, device and equipment based on the combination of an edge prior and an attention mechanism, together with a readable storage medium, belonging to the technical field of image restoration. In the edge prediction stage, the model uses an efficient Transformer-based edge prediction (TEP) module which, compared with previous methods, better recovers the edge structure of the defect area while reducing computational cost. In the second stage, the application proposes a multi-scale fusion attention (MFA) module that extracts valid features at multiple feature scales and then fills the hole layer by layer, from deep semantics to shallow details, to enhance local pixel continuity. The application compares the proposed method qualitatively and quantitatively with other advanced methods on the CelebA, Facade and Places2 datasets, corrupted using the NVIDIA mask dataset. Experimental results show that the method performs well in repairing complex large holes.

Description

Image restoration method, device, equipment and readable storage medium based on combination of edge priors and attention mechanisms
Technical Field
The invention relates to the technical field of image restoration, and in particular to an image restoration method, device, equipment and readable storage medium based on the combination of edge priors and attention mechanisms.
Background
Image inpainting is a challenging computer vision task that aims to construct visually plausible and semantically reasonable content for missing regions. It has wide real-world applications, such as repairing damaged photographs, removing unwanted objects, and super-resolution. Early diffusion-based methods gradually propagate pixel information from around the damaged area in the image and synthesize new textures to fill the holes. Such methods are suitable for repairing small damaged areas such as cracks, but the result becomes increasingly blurred as the damaged area grows. Exemplar-based approaches, in turn, synthesize textures by searching the intact regions of the image for matching similar sample blocks and copying them to the corresponding locations. These methods tend to produce high-quality textures, but may repair with erroneous structures and semantics.
With the vigorous development of deep learning in image processing tasks, researchers at home and abroad have introduced different deep learning techniques to address these problems. Methods based on convolutional neural networks (CNNs) follow the encoder-decoder architecture and the adversarial idea of generative adversarial networks (GANs), learning high-level semantic features to reconstruct missing regions. However, the locality inductive bias and limited receptive field of convolution operations make it difficult for such models to learn globally semantically consistent textures. Attention-based methods search the feature space of the known region for the feature blocks most similar to the mask region, realizing long-distance feature block matching; but for images with larger missing regions, the attention mechanism cannot provide enough information for repair, so parts of the result show blurred texture details and more artifacts. Compared with CNN-based methods, Transformer-based solutions reconstruct a low-resolution image using the superior long-range correlation and global feature extraction capability of the Transformer, then feed the reconstructed low-resolution image to a CNN-based up-sampler to recover texture details. However, such methods ignore the importance of the overall image structure, causing problems such as inconsistent boundaries and missing semantics, while model training and inference incur high computation and storage costs. In addition, some methods use edge, gradient or semantic information for structural restoration. For example, Nazeri et al. use the Canny operator to extract edge information of the defective area, and the refinement network uses the reconstructed edge map as structural information to guide content restoration, improving the repair of structural details to a certain extent. Xiong et al. instead use predicted foreground contours to guide image completion, solving the problem of holes overlapping foreground objects and ensuring the rationality of the filled content. However, repair methods based on structural constraints have certain limitations, mainly because they ignore two important factors: (1) owing to the spatial invariance and local inductive prior of convolutional neural networks, they perform poorly at understanding global structure, leading to poor edge repair; (2) many factors affect the success of texture synthesis in image restoration; besides attending to structural details, it is also important that the model effectively exploit long-distance features to capture rich contextual information.
Many methods assist image restoration by adding prior knowledge of the image structure, showing more detailed and more reasonable results. Liao et al. improved on the Context Encoder (CE) and proposed an edge-aware context encoder that predicts image edge structures to facilitate scene structure and context learning, but significant blurring and artifacts remain around the repaired defect areas. Cao et al. learn a sketch tensor space consisting of edges, lines and junction points using a codec structure, while introducing gated convolution and an attention module to improve local detail in a cost-efficient manner; however, this design is not suitable for structural repair of scenes such as faces. Furthermore, some methods combine image structure and texture details to guide image completion. Liao et al. designed interacting semantic segmentation guidance and evaluation mechanisms to iteratively update semantic information and repair images, but accurate semantic information is difficult to obtain for images with complex backgrounds. Guo et al. share information between texture generation and structure prediction and further enhance global consistency by fusing attention modules with learnable parameters. However, such coupled methods often lack explicit structural details when restoring natural images with irregular defects. In short, the above structure-constrained image restoration methods ignore the positive effects of distant features when dealing with large-area irregular defects; once the reconstructed structure is missing or wrong, the repair quality deteriorates markedly.
In view of this, although deep-learning-based image restoration has made great progress in reconstructing damaged areas, for missing images with large holes the repair results often suffer from structural distortion and texture blurring. New ideas and solutions are therefore needed to address these problems.
For details of the above publications, refer to the following references.
[1] Z. Wan, B. Zhang, D. Chen, P. Zhang, D. Chen et al., "Bringing old photos back to life," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 2747-2757, 2020.
[2] C. Barnes, E. Shechtman, A. Finkelstein and D. B. Goldman, "PatchMatch: A randomized correspondence algorithm for structural image editing," ACM Trans. Graph., vol. 28, no. 3, pp. 24:1-24:11, 2009.
[3] F. Yang, H. Yang, J. Fu, H. Lu and B. Guo, "Learning texture transformer network for image super-resolution," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 5791-5800, 2020.
[4] M. Bertalmio, G. Sapiro, V. Caselles and C. Ballester, "Image inpainting," in Proc. of the 27th Annual Conf. on Computer Graphics and Interactive Techniques, New York, NY, USA, pp. 417-424, 2000.
[5] T. F. Chan and J. Shen, "Nontexture inpainting by curvature-driven diffusions," Journal of Visual Communication and Image Representation, vol. 12, no. 4, pp. 436-449, 2001.
[6] N. Komodakis and G. Tziritas, "Image completion using efficient belief propagation via priority scheduling and dynamic pruning," IEEE Transactions on Image Processing, vol. 16, no. 11, pp. 2649-2661, 2007.
[7] A. Criminisi, P. Pérez and K. Toyama, "Region filling and object removal by exemplar-based image inpainting," IEEE Transactions on Image Processing, vol. 13, no. 9, pp. 1200-1212, 2004.
[8] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 2536-2544, 2016.
[9] S. Iizuka, E. Simo-Serra and H. Ishikawa, "Globally and locally consistent image completion," ACM Trans. Graph., vol. 36, no. 4, pp. 107:1-107:14, 2017.
[10] Y. Wang, X. Tao, X. Qi, X. Shen and J. Jia, "Image inpainting via generative multi-column convolutional neural networks," in Proc. of Advances in Neural Information Processing Systems, Montréal, QC, Canada, pp. 331-340, 2018.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu and D. Warde-Farley, "Generative adversarial nets," in Proc. of Advances in Neural Information Processing Systems, Montreal, QC, Canada, pp. 2676-2680, 2014.
[12] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu et al., "Generative image inpainting with contextual attention," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 5505-5514, 2018.
[13] M. C. Sagong, Y. G. Shin, S. W. Kim, S. Park and S. J. Ko, "Pepsi: Fast image inpainting with parallel decoding network," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 11360-11368, 2019.
[14] Z. Wan, J. Zhang, D. Chen and J. Liao, "High-fidelity pluralistic image completion with transformers," in Proc. of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, pp. 4692-4701, 2021.
[15] Y. Yu, F. Zhan, R. Wu, J. Pan and K. Cui, "Diverse image inpainting with bidirectional and autoregressive transformers," in Proc. of the 29th ACM International Conf. on Multimedia, Chengdu, Sichuan, China, pp. 69-78, 2021.
[16] X. Guo, H. Yang and D. Huang, "Image inpainting via conditional texture and structure dual generation," in Proc. of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, pp. 14134-14143, 2021.
[17] J. Yang, Z. Qi and Y. Shi, "Learning to incorporate structure knowledge for image inpainting," in Proc. of the AAAI Conference on Artificial Intelligence, New York, NY, USA, pp. 12605-12612, 2020.
[18] Y. Song, C. Yang, Y. Shen, P. Wang, Q. Huang et al., "Spg-net: Segmentation prediction and guidance network for image inpainting," arXiv:1805.03356, 2018. [Online]. Available: https://arxiv.org/abs/1805.03356.
[19] L. Liao, J. Xiao, Z. Wang, C. W. Lin and S. I. Satoh, "Guidance and evaluation: Semantic-aware image inpainting for mixed scenes," in Proc. of European Conference on Computer Vision, Glasgow, Scotland, UK, pp. 683-700, 2020.
[20] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi and M. Ebrahimi, "EdgeConnect: Generative image inpainting with adversarial edge learning," arXiv:1901.00212, 2019. [Online]. Available: http://arxiv.org/abs/1901.00212.
[21] W. Xiong, J. Yu, Z. Lin, J. Yang, X. Lu et al., "Foreground-aware image inpainting," in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 5840-5848, 2019.
[22] Z. Liu, P. Luo, X. Wang and X. Tang, "Deep learning face attributes in the wild," in Proc. of the IEEE International Conference on Computer Vision, Santiago, Chile, pp. 3730-3738, 2015.
[23] R. Tyleček and R. Šára, "Spatial pattern templates for recognition of objects with regular structure," in Proc. of German Conference on Pattern Recognition, Saarbrücken, Germany, pp. 364-374, 2013.
[24] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452-1464, 2018.
[25] L. Liao, R. Hu, J. Xiao and Z. Wang, "Edge-aware context encoder for image inpainting," in Proc. of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, pp. 3156-3160, 2018.
Disclosure of Invention
In view of the above technical problems, the application introduces an edge prediction module in the edge restoration stage and adopts an image restoration method combining edge priors and an attention mechanism (EPAM). The method divides the repair task into two stages, edge prediction and image restoration, and exploits multi-level long-distance features, thereby enhancing local pixel continuity and significantly improving the restoration quality of the image.
When forming the new idea, the inventors' team started from the Transformer framework itself, which was originally proposed as a sequence-to-sequence model for machine translation and was later developed and applied to computer vision tasks such as object detection, video processing and image processing [28]. In recent years, the excellent performance of the Transformer model has attracted researchers to use it to address image restoration problems. Wan et al. were the first to use a Transformer for image restoration, employing bidirectional attention and a BERT-like masked language model objective to achieve diverse appearance reconstruction of low-resolution images. Zheng et al. designed a mask-aware Transformer content inference model that extracts tokens with a restrictive convolutional neural network module; the Transformer captures global context with alternately weighted self-attention layers, reducing the proximity-dominance effects that lead to semantically incoherent results. However, this approach does not understand and imagine high-level semantic content. Transformer-based methods are mainly used for image reconstruction to obtain a low-resolution repaired image, and they suffer from long inference times.
It should be noted that, from the above analysis of the disclosed related art, the inventors derived that, for the structural reconstruction problem, the edge map can not only provide accurate structural information of the image, but is also more resistant to interference factors such as noise than smooth images and semantic images.
Therefore, the inventors propose a novel TEP module that reconstructs the overall edge structure of the image using the capabilities of the Transformer, overcoming the problems of CNN-based methods and obtaining superior edge prediction performance. Furthermore, aiming at the problems of indistinct texture details and inconsistent boundaries in repair results, the application designs an MFA module. Overall, the EPAM model performs structure prediction and texture generation in a decoupled manner, improving the representational capability of the model and ensuring the consistency of the overall structure and detailed texture of the restored image.
Notably, based on the above analysis and new ideas, the inventors finally designed the solution along two main lines and proposed an image restoration framework (EPAM) combining edge priors with an attention mechanism. Inspired by the excellent performance of the Transformer in the computer vision field, the application predicts the edge structure of the defective area with a Transformer architecture. Then, in the image restoration stage, the application uses the effective edge structure to constrain the image content and reconstructs images that are visually reasonable with clear texture.
Based on the above, the invention finally designs the following specific technical scheme:
More specifically, the first aspect of the present invention provides an image restoration method based on the combination of edge priors and attention mechanisms, comprising an edge prediction stage and an image restoration stage; wherein,
Edge prediction stage
acquiring a gray-scale defect image and incomplete edge information, and predicting reasonable edge contours in the defect area through a Transformer framework combining axial attention and standard self-attention to obtain an edge prediction graph;
Image inpainting stage
taking the edge prediction graph as a structural prior, combining it with the defective RGB image, and synthesizing texture details in the locally closed regions surrounded by edges through a multi-scale fusion attention module to complete image restoration.
As a specific implementation of the method according to the first aspect of the present invention, the edge prediction stage is implemented by an edge prediction network; the image restoration stage is realized through an image restoration network; wherein,
The edge prediction network includes an edge generator $G_1$ and an edge discriminator $D_1$. With $G_1(\cdot)$ representing the edge generator operation, the edge prediction graph is expressed as:

$E_{pred} = G_1(\tilde{I}_{gs}, \tilde{E}, I_M)$  (1)

The image inpainting network includes an image generator $G_2$ and an image discriminator $D_2$. With $G_2(\cdot)$ representing the image generator operation, the predicted image is expressed as:

$I_{pred} = G_2(\tilde{I}, E_{comp})$  (2)

The repair output with the same size as the original is obtained as:

$I_{comp} = \tilde{I} + I_{pred} \odot I_M$  (3)

where $I_t$, $I_{gs}$ and $E_t$ respectively denote the original image, the corresponding gray-scale image and the edge structure map; the incomplete image is expressed as $\tilde{I} = I_t \odot (1 - I_M)$, the gray-scale of the incomplete image as $\tilde{I}_{gs} = I_{gs} \odot (1 - I_M)$, and the defect edges as $\tilde{E} = E_t \odot (1 - I_M)$, where $\odot$ denotes the element-wise product; $E_{comp} = \tilde{E} + E_{pred} \odot I_M$ denotes the synthesized edge prediction graph.
In the above technical solution, in order to specify the edge restoration means, it is further defined that the edge generator is based on a self-encoder structure, and the edge prediction graph is produced by passing the given image features through an encoding stage (encoder data compression), a feature reconstruction stage (bottleneck-layer feature reconstruction) and a decoding stage (decoder decompression).
As a specific embodiment, the encoding stage, the feature reconstruction stage and the decoding stage respectively include the following steps:
In the encoding stage, the encoder first applies a 7×7 convolution with reflection padding 3 and stride 1, adjusting the given image features to a size of 256×256×64; it then applies three consecutive convolutions with stride 2 and kernel size 4×4 to obtain shallow output features of size 32×32×256;
in the feature reconstruction stage, eight Transformer blocks based on an axial attention mechanism are stacked to form the information bottleneck layer, enhancing the representation of feature information and the capture of global structural information, completing the missing edge information, and yielding reconstructed features of size 32×32×256;
in the decoding stage, the features are up-sampled to 256×256×64 by three transposed convolutions with kernel size 4×4, zero padding 1 and stride 2; a single 7×7 convolution with reflection padding 3 and stride 1 then adjusts the output to 256×256×1, yielding the edge prediction graph.
Further, the edge discriminator comprises a convolution stage and a sample output stage, wherein,
in the convolution stage, a PatchGAN-style stack of five convolution layers is used, with strides of 2, 2, 2, 1 and 1 and kernel size 4×4;
in the sample output stage, the input image is computed into a single-channel feature map of size 30×30 through the five-layer convolution, and a Sigmoid function maps the output into scalars in the range [0, 1], effectively judging whether input samples are real or fake and yielding the edge repair result.
More specifically, the second aspect of the present invention provides a product, specifically an image restoration device based on the combination of edge priors and attention mechanisms, comprising an edge prediction module and an image restoration module, wherein:
the edge prediction module is used for acquiring the gray-scale defect image and incomplete edge information, and predicting reasonable edge contours in the defect area through a Transformer framework combining axial attention and standard self-attention to obtain an edge prediction graph;
and the image restoration module is used for taking the edge prediction graph as a structural prior, combining it with the defective RGB image, and synthesizing texture details in the locally closed regions surrounded by edges through the multi-scale fusion attention module to complete image restoration.
As a specific implementation manner of the product of the second aspect, the edge prediction module includes:
the encoding unit, which first applies a 7×7 convolution with reflection padding 3 and stride 1 to adjust the given image features to a size of 256×256×64, and then applies three consecutive convolutions with stride 2 and kernel size 4×4 to obtain shallow output features of size 32×32×256;
the feature reconstruction unit, which stacks eight Transformer structures based on an axial attention mechanism to form the information bottleneck layer, enhancing the representation of feature information and the capture of global structural information, completing the missing edge information and obtaining reconstructed features of size 32×32×256;
the decoding unit, which up-samples the features to 256×256×64 with three transposed convolutions of kernel size 4×4, zero padding 1 and stride 2, and then adjusts the output to 256×256×1 with a 7×7 convolution of reflection padding 3 and stride 1 to obtain the edge prediction map.
In the above technical solution, the edge prediction module further includes:
the convolution unit, which takes a PatchGAN-style convolution stack as its framework, with strides of 2, 2, 2, 1 and 1 and kernel size 4×4;
and the sample output unit, which computes the input image into a single-channel feature map of size 30×30 through the five-layer convolution, maps the output into scalars in the range [0, 1] with a Sigmoid function, effectively judges whether input samples are real or fake, and yields the edge restoration result.
More specifically, a third aspect of the present invention provides an image restoration device, including at least one processor and a memory, where the memory stores computer instructions, and the processor is configured to execute the computer instructions stored in the memory, so as to implement the steps of the image restoration method that combines the edge priors and the attention mechanisms.
More specifically, a fourth aspect of the present invention provides a computer program stored on a computer readable recording medium, the computer program being configured to perform the steps of the above-described image restoration method by combining edge priors with an attention mechanism.
Compared with the prior art, the invention has at least the following technical effects:
The contributions of the application are summarized below:
(1) In the edge prediction network, the application provides an efficient Transformer-based edge prediction (TEP) module. Existing CNN-based methods struggle to accurately recover the edges of the missing region, whereas the TEP module achieves more accurate and comprehensive structure recovery. In the TEP module, the application also uses axial attention with relative position encoding, improving position awareness, significantly reducing the model complexity of the edge prediction network, and balancing performance against efficiency.
(2) In the image restoration network, the application designs a multi-scale fusion attention (MFA) module. Specifically, the module aggregates contextual feature information at different levels using dilated convolutions with different rates, while applying an efficient channel attention mechanism to reduce the influence of redundant features. In addition, an attention transfer network is introduced so that the model fully fuses shallow texture details with deep semantic features, avoiding unreasonable or contradictory regions in the generated image. The resulting image restoration network can complete the reconstruction of texture details under the guidance of the edge structure.
(3) The application performs comparison experiments between the EPAM model and existing advanced methods. Experimental results on the CelebA, Facade and Places2 datasets show that the EPAM model delivers competitive restoration results in both qualitative and quantitative evaluations.
Drawings
Fig. 1 is the overall network structure, in which the upper half is the edge prediction network and the lower half is the image restoration network.
Fig. 2 is a block diagram of the Transformer-based edge prediction module.
Fig. 3 is a graph comparing differences between unidirectional, bidirectional and axial attention.
Fig. 4 is a schematic diagram of a multi-scale fused attention module.
Fig. 5 is a two-stage training loss diagram of the EPAM model.
FIG. 6 is a qualitative comparison of repairing irregular holes and central regular holes on the Facade dataset.
FIG. 7 is a qualitative comparison of repairing irregular holes on Places2 dataset.
FIG. 8 is a visual comparison of the proposed method with other structure-based methods on the CelebA, Facade and Places2 datasets: (a) the input corrupted image; (b), (c) and (d) are the edge structures generated by EC, CTSDG and the proposed method, respectively; (e), (f) and (g) are the corresponding repair results of EC, CTSDG and the proposed method, respectively; (h) Ground Truth.
Fig. 9 is a matrix of attention scores for ATNs at different scales.
Fig. 10 is an ATN feature map.
Fig. 11 is an analysis of different configurations of the proposed method.
Fig. 12 is a graph of accuracy and precision with and without axial attention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution of the embodiments of the present invention will be clearly and completely described below, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Based on the foregoing analysis of current research, the invention provides a novel image restoration framework.
The application provides an image restoration model combining structure priors and attention mechanisms; the overall architecture is shown in FIG. 1. The proposed EPAM consists of two sub-networks: the upper half is the edge prediction network and the lower half is the image restoration network.
As can be seen from the figure, the model consists of two generative adversarial networks in series; wherein,
the output of the first-stage generator serves as the input of the second-stage generator, and the two-stage network as a whole forms an end-to-end repair model. In the first stage, the edge prediction network predicts reasonable edge contours in the defect area using the gray-scale defect image and the incomplete edge information. The second stage then uses the edge prediction map as a structural prior, combined with the defective RGB image, to synthesize appropriate texture details within the locally closed regions surrounded by edges, finally completing the repair task.
In the edge prediction stage, a self-encoder serves as the generator of the edge prediction network, and the PatchGAN architecture serves as its discriminator. Because convolutional neural networks struggle to acquire long-range information, only surrounding pixel information is used to reconstruct the defect region, and the structural information at the center of the defect is difficult to recover. To solve this problem, the application proposes the TEP module and embeds it in the information bottleneck of the self-encoder. Notably, the TEP module does not use deeply stacked convolution layers; instead, it uses a Transformer-based architecture to create equal flow opportunities for all visible pixels, obtaining an expressive global structure. In addition, the application introduces relative position encoding and axial attention blocks in the TEP module, improving spatial relationships and reducing memory overhead. In the image restoration stage, a multi-scale fusion attention (MFA) module is proposed to address the color difference, blurring and boundary distortion in images repaired by existing attention-based models. The application introduces serial MFA modules after the encoder. The MFA module captures deep features with different receptive fields using dilated convolutions with different dilation rates, better integrating global context information with local detail information. The application then constructs an attention transfer network (ATN) on the four feature maps of different scales, enhancing local pixel continuity, capturing long-range dependencies and significantly improving the restoration quality of the image.
The edge prediction network comprises a generator $G_1$ and a discriminator $D_1$; likewise, the image restoration network consists of a generator $G_2$ and a discriminator $D_2$. $I_t$, $I_{gs}$ and $E_t$ represent the original image, the corresponding gray-scale map and the edge structure map, respectively. In the binary mask $I_M$, a value of 1 indicates a hole-region pixel and a value of 0 indicates other pixels. The incomplete image is then represented as $\tilde{I} = I_t \odot (1 - I_M)$, the gray-scale of the incomplete image as $\tilde{I}_{gs} = I_{gs} \odot (1 - I_M)$, and the defect edges as $\tilde{E} = E_t \odot (1 - I_M)$, where $\odot$ denotes the element-wise product. With $G_1(\cdot)$ representing the edge generator operation, the edge prediction graph is expressed as:

$E_{pred} = G_1(\tilde{I}_{gs}, \tilde{E}, I_M)$  (1)

The synthesized edge prediction graph combines known and predicted edges, $E_{comp} = \tilde{E} + E_{pred} \odot I_M$. With $G_2(\cdot)$ representing the image generator operation, the predicted image is expressed as:

$I_{pred} = G_2(\tilde{I}, E_{comp})$  (2)

Finally, the repair output with the same size as the original is obtained:

$I_{comp} = \tilde{I} + I_{pred} \odot I_M$  (3)
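For illustration only, the masking and composition operations of equations (1)-(3) can be sketched in PyTorch as follows; `g1` and `g2` stand for the edge and image generators described below, and concatenating their inputs along the channel axis is an interface assumption rather than the patent's stated implementation.

```python
# A minimal PyTorch sketch of equations (1)-(3); g1 and g2 are placeholders
# for the edge and image generators, i_m is a binary mask with 1 = hole.
import torch

def inpaint(g1, g2, i_t, i_gs, e_t, i_m):
    """i_t: (B,3,H,W) image; i_gs: (B,1,H,W) gray; e_t: (B,1,H,W) edges;
    i_m: (B,1,H,W) binary mask with 1 marking hole pixels."""
    i_in = i_t * (1 - i_m)       # incomplete image
    i_gs_in = i_gs * (1 - i_m)   # incomplete gray-scale image
    e_in = e_t * (1 - i_m)       # defect edges
    e_pred = g1(torch.cat([i_gs_in, e_in, i_m], dim=1))  # Eq. (1)
    e_comp = e_in + e_pred * i_m                         # synthesized edge map
    i_pred = g2(torch.cat([i_in, e_comp], dim=1))        # Eq. (2)
    return i_in + i_pred * i_m                           # Eq. (3)
```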
The two-stage design process is described in detail below.
1. Of the two stages, the application first designs the framework of the edge prediction stage, as follows:
1) Transformer-based edge prediction module design
The Transformer architecture was originally designed for natural language processing (NLP) tasks and is based entirely on self-attention, enabling direct modeling of long-range dependencies between input sequences. Recently, researchers have applied it to computer vision tasks with significant success. Inspired by ViT, a Transformer decoder is introduced into the TEP module, and edge information is then reconstructed from the shallow features output by the encoder.
As shown in FIG. 2, with height H, width W and channel number C, let $X \in \mathbb{R}^{C \times H \times W}$ denote the input features of the TEP module, of size 32×32×256. The shape of the input is first adjusted with a View operation to obtain $X' \in \mathbb{R}^{D \times C}$ (D = H×W). The processed input $X'$, with the embedded position code PE, is fed into the Transformer decoder; the position-encoded output feature Y is:

$Y = X' + PE$  (4)
In order to significantly reduce the computation and memory consumed by the self-attention layer on feature maps, the application uses both an axial attention module and a standard attention module in the TEP module.
It should be noted that the axial attention module may be implemented by reshaping the tensor along the width and height axes and then applying dot-product self-attention along each axis separately.
As shown in FIG. 3, unidirectional attention attends only to the context before the token. Bidirectional attention attends to all positions before and after the token, but its computational complexity is $O(n^2)$. Axial attention attends to the available context along the row and column directions of the token (i.e., the available information before and after the token), making the model more efficient: its computational complexity is only $O(n\sqrt{n})$.
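The following is a minimal PyTorch sketch of the axial attention idea, assuming single-head dot-product attention and omitting the relative position encoding for brevity; it only illustrates how reshaping restricts attention to one spatial axis at a time, which is what yields the reduced cost described above.

```python
# A minimal sketch of axial self-attention: folding (B, H, W, C) so that
# attention runs independently along one spatial axis reduces the cost from
# O((HW)^2) per layer toward the O(n*sqrt(n)) regime.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxialAttention(nn.Module):
    def __init__(self, dim, axis):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.axis = axis  # 'h': attend along height; 'w': attend along width

    def forward(self, x):  # x: (B, H, W, C)
        b, h, w, c = x.shape
        if self.axis == 'h':
            seq = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
        else:
            seq = x.reshape(b * h, w, c)
        q, k, v = self.to_qkv(seq).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)
        out = attn @ v
        if self.axis == 'h':
            return out.reshape(b, w, h, c).permute(0, 2, 1, 3)
        return out.reshape(b, h, w, c)
```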
Thus, for stable training, the application applies layer normalization to feature Y before it enters the axial attention module. In addition, a learnable relative position encoding (RPE) is provided for the module to improve spatial relationships and the accuracy of the repair. The axial attention scores along the width axis and the height axis can be expressed as:

$s^w_{ij} = \frac{(y_{wi}W_{wq})(y_{wj}W_{wk} + r^w_{ij})^{\top}}{\sqrt{d}}, \qquad s^h_{ij} = \frac{(y_{hi}W_{hq})(y_{hj}W_{hk} + r^h_{ij})^{\top}}{\sqrt{d}}$  (5)

where $y_{hi}$, $y_{hj}$ denote the feature vectors of columns i and j along the height axis of feature Y (and $y_{wi}$, $y_{wj}$ the corresponding vectors along the width axis); $W_{wq}$, $W_{wk}$, $W_{hq}$, $W_{hk}$ are the weight matrices of the queries and keys on the width and height axes; $r^w_{ij}$ denotes the relative position encoding matrix between width positions i and j, and $r^h_{ij}$ that between height positions i and j. The scaling factor $\sqrt{d}$ keeps the gradients more stable during back-propagation. The attention weights are then obtained by a Softmax operation, and the axial attention output $\hat{Y}$ is computed from:

$\hat{Y}_w = \mathrm{Softmax}(s^w)(YW_{wv}), \qquad \hat{Y}_h = \mathrm{Softmax}(s^h)(YW_{hv})$  (6)

where $W_{wv}$ and $W_{hv}$ are the corresponding value weight matrices.
Then $\hat{Y}$ is layer-normalized to a standard normal distribution, accelerating the training and convergence of the model; it is fed into a feed-forward network, and a residual connection is added to prevent network degradation and keep the matrix dimensions consistent. This process is expressed as:

$Y' = \mathrm{MLP}(\mathrm{LN}(\hat{Y})) + \hat{Y}$  (7)

where MLP and LN denote the multi-layer perceptron and layer normalization, respectively. The MLP consists of two linear mapping layers (FC), an activation function and a residual dropout layer: the first FC converts the D-dimensional features into 4D-dimensional features, and the second FC converts them back to D dimensions. GELU is adopted as the nonlinear activation in between; finally, the residual dropout layer performs regularization to prevent overfitting.
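A hedged sketch of one such block built around equation (7) is given below, assuming a pre-norm residual wiring; `attn` may be the axial module above or a standard attention layer.

```python
# A sketch of one TEP-style block around equation (7), assuming pre-norm
# residual wiring; the 4x hidden expansion, GELU and residual dropout follow
# the description above.
import torch.nn as nn

class TEPBlock(nn.Module):
    def __init__(self, dim, attn, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = attn
        self.mlp = nn.Sequential(           # FC: D -> 4D, GELU, FC: 4D -> D
            nn.Linear(dim, dim * 4), nn.GELU(),
            nn.Linear(dim * 4, dim), nn.Dropout(dropout))

    def forward(self, y):
        y = y + self.attn(self.norm1(y))    # attention with residual connection
        return y + self.mlp(self.norm2(y))  # Eq. (7): MLP(LN(y)) + y
```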
In addition, the application also uses the standard attention module $\mathrm{Attention}_{norm}$ to learn global correlations, repeating the process of formula (7) to obtain the final output $Y_{out}$.
2) Edge generator and discriminator design
The edge generator is based on a self-encoder structure; it completes edge prediction by passing the given image features through encoder data compression, bottleneck-layer feature reconstruction and decoder decompression. The process comprises three stages: the encoding stage, the feature reconstruction stage and the decoding stage.
In the encoding stage, the encoder first uses a 7×7 convolution with reflection padding 3 and stride 1 to obtain rich features, adjusting the input to a size of 256×256×64. Then, three consecutive convolutions with stride 2 and kernel size 4×4 yield shallow features of size 32×32×256. Unlike patch-based embedding methods, these convolution operations inject a beneficial convolutional inductive bias for the TEP module.
The bottleneck layer does not use convolution-based residual blocks; instead, eight Transformer-based TEP modules are stacked to form the information bottleneck layer, enhancing the representation of feature information and the capture of global structural information so as to complete the missing edge information. Unlike convolution, the self-attention mechanism can capture non-local information from the entire feature map, but the cost of computing the similarities is very high. Therefore, in the TEP module, axial attention layers and standard attention layers are used alternately, ensuring that the generator completes image edges consistent with the overall semantics using global context information while balancing the performance and parameter count of the Transformer. Furthermore, by computing attention separately along the height and width axes, the axial attention layer acquires multi-directional features that enhance the directional awareness of the feature map. After the bottleneck layer, reconstructed features with dimensions 32×32×256 are obtained.
The features are up-sampled to 256×256×64 using three transposed convolutions with kernel size 4×4, zero padding 1 and stride 2; a final convolution with kernel size 7×7, reflection padding 3 and stride 1 adjusts the output to 256×256×1, producing the predicted full edge map. In addition, each convolution layer of the edge generator uses instance normalization, which accelerates model convergence and improves the nonlinear expressiveness of the feature extraction module. Each convolution layer is also followed by a ReLU activation to reduce the gradient vanishing phenomenon.
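Under the layer sizes just described, the edge generator could be sketched as follows; the intermediate channel widths (128), the 3-channel input (gray image + defect edges + mask) and the final Sigmoid are assumptions, and `tep_blocks` stands for the eight stacked TEP modules of the bottleneck.

```python
# A sketch of the edge generator dimensions described above (assumptions noted
# in the lead-in); comments give the output size of each group of layers.
import torch.nn as nn

def make_edge_generator(tep_blocks, in_ch=3):
    encoder = nn.Sequential(
        nn.ReflectionPad2d(3), nn.Conv2d(in_ch, 64, 7),        # 256x256x64
        nn.InstanceNorm2d(64), nn.ReLU(True),
        nn.Conv2d(64, 128, 4, stride=2, padding=1),            # 128x128x128
        nn.InstanceNorm2d(128), nn.ReLU(True),
        nn.Conv2d(128, 256, 4, stride=2, padding=1),           # 64x64x256
        nn.InstanceNorm2d(256), nn.ReLU(True),
        nn.Conv2d(256, 256, 4, stride=2, padding=1),           # 32x32x256
        nn.InstanceNorm2d(256), nn.ReLU(True))
    decoder = nn.Sequential(
        nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1),  # 64x64x256
        nn.InstanceNorm2d(256), nn.ReLU(True),
        nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 128x128x128
        nn.InstanceNorm2d(128), nn.ReLU(True),
        nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 256x256x64
        nn.InstanceNorm2d(64), nn.ReLU(True),
        nn.ReflectionPad2d(3), nn.Conv2d(64, 1, 7),            # 256x256x1
        nn.Sigmoid())
    return nn.Sequential(encoder, tep_blocks, decoder)
```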
In order to increase the network's focus on local detail during training, the application selects the PatchGAN architecture as the basic framework of the edge discriminator. The structure consists of five convolution layers with strides of 2, 2, 2, 1 and 1 and kernel size 4×4. Each convolution is followed by spectral normalization and a Leaky-ReLU activation. The input image is computed into a single-channel feature map of size 30×30 by the five-layer convolution. Finally, a Sigmoid function maps the output to scalars in the range [0, 1], effectively judging whether input samples are real or fake and pushing the network to generate high-quality repair results.
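A sketch of such a PatchGAN discriminator is given below; the channel widths 64-512 are conventional assumptions, while the strides (2, 2, 2, 1, 1), 4×4 kernels, spectral normalization, Leaky-ReLU, 30×30 output map and Sigmoid follow the description above.

```python
# A sketch of the PatchGAN edge discriminator: five 4x4 convolutions mapping a
# 256x256 input to a 30x30 patch score map in [0, 1].
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_patchgan(in_ch=1):
    layers, ch = [], in_ch
    for out_ch, stride in [(64, 2), (128, 2), (256, 2), (512, 1), (1, 1)]:
        layers.append(spectral_norm(nn.Conv2d(ch, out_ch, 4, stride, 1)))
        if out_ch != 1:
            layers.append(nn.LeakyReLU(0.2, inplace=True))
        ch = out_ch
    layers.append(nn.Sigmoid())  # scalar real/fake score per patch
    return nn.Sequential(*layers)
```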
2. For the design thought of the two stages, the application designs the framework of the image patching stage, which is as follows:
1) Multi-scale fusion attention module
In the texture synthesis stage, existing methods typically use stacks of convolution layers to extract feature information from shallow details up to high-level semantics. However, when rich shallow spatial structure and texture details are extracted serially with fixed-size convolution kernels, feature loss occurs to different degrees, aggravating the inconsistency of global context information and hindering the capture of global structural information from distant pixels.
Aiming at this problem, the MFA module first applies convolutions with different dilation factors in parallel to extract information at different scales from the input features, then scales the features of each level and feeds them into the ATN module, realizing information transfer between features of different levels, helping the model better understand the details and structure in the image, and improving the quality and accuracy of restoration.
Furthermore, the use of efficient channel attention and pixel attention enables the model to selectively focus on important channels and pixels, reducing unnecessary computation and parameters. The residual structure and skip connections help avoid gradient explosion and network convergence difficulties. See FIG. 4 for details.
Specifically, the application first uses a 1×1 convolution to change the dimensions of the MFA input feature $F_{in}$ from 64×64×256 to 64×64×64. The transformed input features are expressed as:

$F'_{in} = \tau(f_{1\times1}(F_{in}))$

where $f_{1\times1}(\cdot)$ denotes a 1×1 convolution operation and $\tau$ denotes instance normalization followed by a ReLU activation. Multi-level features $F_i$ are then extracted with parallel dilated convolutions, with dilation rates $R_i$ of 8, 4, 2 and 1 (i = 1, 2, 3, 4). Convolutions with smaller dilation rates better perceive texture and position information, while convolutions with larger dilation rates perceive high-level, global feature information. An efficient channel attention (ECA) module is then introduced to effectively capture cross-channel interaction information and reduce the influence of redundant features. The calculation can be expressed as:

$F_i = \mathrm{ECA}(\tau(f^{R_i}_{3\times3}(F'_{in})))$

where $\mathrm{ECA}(\cdot)$ denotes the efficient channel attention operation and $f^{R_i}_{3\times3}(\cdot)$ denotes a 3×3 convolution with dilation rate $R_i$. Since deeper feature maps are typically more compact, the features of each layer are scaled; as shown in FIG. 4, the scaling factors $S_i$ are 1/8, 1/4, 1/2 and 1 in order. The scaling process is defined as:

$F^s_i = \downarrow_{S_i}(F_i)$

where $\downarrow_{S_i}(\cdot)$ denotes a bilinear interpolation down-sampling operation with scaling factor $S_i$. An attention transfer network (ATN) [33] is then introduced, in which higher-level features guide the completion of lower-level features layer by layer, so that the completed content has reasonable semantics and clear texture. The specific operation can be expressed as:

$\bar{F}_i = \mathrm{ATN}(F^s_i, \bar{F}_{i-1}), \quad i = 2, 3, 4, \qquad \bar{F}_1 = F^s_1$
where $\mathrm{ATN}(\cdot)$ denotes the attention transfer operation and $\bar{F}_i$ is the feature map reconstructed by the ATN at layer i. A local residual connection is constructed on the feature map reconstructed by the last ATN to reduce information loss:

$F_{pa} = \mathrm{PA}(\bar{F}_4) + \bar{F}_4$
where $\mathrm{PA}(\cdot)$ denotes the pixel attention operation. In addition, to ensure the consistency of local context feature information, skip connections and 3×3 convolutions are used to fuse the same-size multi-level features into a feature map $F_{ms}$ of size 64×64×128. Finally, the feature maps $F_{ms}$ and $F_{pa}$ are fused into the output:

$F_{out} = f_{3\times3}([F_{ms}, F_{pa}])$

where $[\cdot,\cdot]$ denotes channel concatenation.
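Since FIG. 4 is not reproduced here, the following is only a simplified sketch of the MFA data flow under the stated assumptions; `eca`, `atn` and `pa` are placeholders for the efficient channel attention, attention transfer and pixel attention operations, and the exact fusion wiring may differ from the actual module.

```python
# A simplified sketch of the MFA data flow: 1x1 reduction, parallel dilated
# branches (rates 8, 4, 2, 1) with channel attention, multi-scale ATN guidance
# from coarse to fine, a pixel-attention residual, and skip-connected fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

def _resize(t, size):
    return F.interpolate(t, size=size, mode='bilinear', align_corners=False)

class MFASketch(nn.Module):
    def __init__(self, eca, atn, pa, dim=256, mid=64):
        super().__init__()
        self.reduce = nn.Conv2d(dim, mid, 1)  # 64x64x256 -> 64x64x64
        self.branches = nn.ModuleList(
            [nn.Conv2d(mid, mid, 3, padding=r, dilation=r)  # rates 8, 4, 2, 1
             for r in (8, 4, 2, 1)])
        self.eca, self.atn, self.pa = eca, atn, pa
        self.fuse = nn.Conv2d(mid * 4, dim, 3, padding=1)

    def forward(self, f_in):
        x = F.relu(self.reduce(f_in))
        side = x.shape[-1]
        # parallel dilated branches + channel attention, scaled to 1/8 .. 1
        feats = [_resize(self.eca(F.relu(b(x))), (int(side * s),) * 2)
                 for b, s in zip(self.branches, (0.125, 0.25, 0.5, 1.0))]
        # ATN: deeper (coarser) features guide shallower ones layer by layer
        outs = [feats[0]]
        for f in feats[1:]:
            outs.append(self.atn(_resize(outs[-1], f.shape[-2:]), f))
        top = self.pa(outs[-1]) + outs[-1]    # local residual connection
        merged = torch.cat([_resize(o, x.shape[-2:]) for o in outs[:-1]]
                           + [top], dim=1)
        return self.fuse(merged) + f_in       # skip connection to the input
```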
2) Design of image generator and discriminator
The image generator is improved on the basis of the self-encoder. Since this bottleneck layer does not use a Transformer structure, feeding a large feature map into it does not cause excessive computational complexity, so the encoder down-samples only to feature dimensions of 64×64×256.
To synthesize realistic textures in different regions, the bottleneck layer uses four stacked MFA modules instead of a fully connected layer. Through the cooperation of hierarchical dilated convolution and the attention transfer strategy, the generator can independently synthesize new content with correct semantics even when part of the edge information is missing.
Otherwise, the image generator has the same network structure and parameter settings as the corresponding parts of the edge generator.
It is emphasized that the image discriminator structure is similar to that of the edge prediction discriminator: it still employs a 70×70 PatchGAN network consisting of five convolution layers with kernel size 4×4. The discriminator applies spectral normalization and a Leaky-ReLU activation after each convolution layer. The last layer outputs a two-dimensional matrix of size N×N, where each element corresponds to the real/fake score of a 70×70 region block, and the average of all elements is taken as the discriminator output. Compared with a traditional GAN discriminator, the PatchGAN discriminator judges the authenticity of each patch, paying more attention to texture details and thereby improving the quality of the generated images.
3. Design of loss function
1) Loss of edge prediction network
The edge prediction network is trained with a joint loss, including an adversarial loss and a feature matching loss, to obtain a clear and realistic edge prediction graph. Specifically, the adversarial loss $\mathcal{L}_{adv,1}$ can be written as:

$\mathcal{L}_{adv,1} = \mathbb{E}_{(E_t, I_{gs})}\left[\log D_1(E_t, I_{gs})\right] + \mathbb{E}_{I_{gs}}\left[\log\left(1 - D_1(E_{pred}, I_{gs})\right)\right]$
where $\mathbb{E}$ is the mathematical expectation and $D_1$ is the discriminator function. The feature matching loss $\mathcal{L}_{fm}$ evaluates the quality of the generated edges by measuring the distance between the predicted and original edges in feature space; it can be defined as:

$\mathcal{L}_{fm} = \mathbb{E}\left[\sum_{k=1}^{S} \frac{1}{N_k}\left\| D_1^{(k)}(E_t) - D_1^{(k)}(E_{pred}) \right\|\right]$

where S denotes the total number of layers of $D_1$, $D_1^{(k)}$ is the activation map of the k-th layer of $D_1$, $N_k$ is the number of elements in $D_1^{(k)}$, and $\|\cdot\|$ denotes the Euclidean distance. The joint loss $\mathcal{L}_E$ of the edge prediction network is expressed as:

$\mathcal{L}_E = \lambda_{adv}\mathcal{L}_{adv,1} + \lambda_{fm}\mathcal{L}_{fm}$
where the loss weights $\lambda_{adv}$ and $\lambda_{fm}$ weight the adversarial loss and the feature matching loss, respectively. Following the parameter settings of ICT, $\lambda_{adv}$ and $\lambda_{fm}$ are set to 1 and 15, respectively, in all experiments of the application.
2) Loss of image restoration network
In order for the image restoration results to have reasonable semantic content, consistent structure and clear texture, several loss functions are used herein to train the image restoration network, including adversarial loss, perceptual loss, style loss and reconstruction loss. The adversarial loss $\mathcal{L}_{adv,2}$ is expressed as:

$\mathcal{L}_{adv,2} = \mathbb{E}_{(I_t, E_{comp})}\left[\log D_2(I_t, E_{comp})\right] + \mathbb{E}_{E_{comp}}\left[\log\left(1 - D_2(I_{pred}, E_{comp})\right)\right]$
Next, the application uses a pre-trained VGG-19 network to convert the differences between the pixel values of the generated and real images into differences in feature space, in order to better preserve the high-level semantic information of the image. The perceptual loss $\mathcal{L}_{perc}$ is defined as:

$\mathcal{L}_{perc} = \mathbb{E}\left[\sum_{k} \frac{1}{M_k}\left\| \phi_k(I_t) - \phi_k(I_{pred}) \right\|_1\right]$

where $M_k$ denotes the size of the k-th layer feature map and $\phi_k$ denotes the feature representation of the k-th layer of the loss network.
The style loss is usually computed with a Gram matrix over the feature map differences; it expresses the correlation of style features across different channels, making the repaired image closer in style to the reference image. The style loss $\mathcal{L}_{style}$ can be expressed as:

$\mathcal{L}_{style} = \mathbb{E}\left[\left\| G^{\phi}_k(I_t) - G^{\phi}_k(I_{pred}) \right\|_1\right]$

where $G^{\phi}_k$ is the Gram matrix constructed from the features $\phi_k$. The reconstruction loss $\mathcal{L}_{rec}$ uses the $L_1$ loss to minimize the sum of absolute differences between the output result $I_{pred}$ and the real image $I_t$, ensuring that the overall profile of the result substantially matches the target. The specific calculation is:
$\mathcal{L}_{rec} = \left\| I_{pred} - I_t \right\|_1$  (22)
The overall loss $\mathcal{L}_C$ of the second-stage network is denoted as:

$\mathcal{L}_C = \mathcal{L}_{adv,2} + \alpha_{perc}\mathcal{L}_{perc} + \alpha_{style}\mathcal{L}_{style} + \alpha_{rec}\mathcal{L}_{rec}$
Combining the work of ICT with the experimental tests herein, the application sets the loss weights as: $\alpha_{perc} = 0.1$, $\alpha_{style} = 200$ and $\alpha_{rec} = 0.5$.
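For illustration, the stage-2 losses could be sketched as follows; the choice of VGG-19 layers for $\phi_k$ is an assumption (the text does not specify them), and the adversarial term is computed elsewhere from the discriminator and passed in.

```python
# A hedged sketch of the stage-2 losses; the VGG-19 layer indices are
# illustrative assumptions, and l_adv comes from the discriminator.
import torch
import torch.nn as nn
import torchvision.models as models

class VGGFeatures(nn.Module):
    def __init__(self, layer_ids=(3, 8, 17, 26)):  # assumed ReLU layers
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

def gram(f):  # channel-by-channel correlation (Gram) matrix
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def stage2_loss(vgg, i_pred, i_t, l_adv):
    l_rec = (i_pred - i_t).abs().mean()  # Eq. (22): L1 reconstruction loss
    fp, ft = vgg(i_pred), vgg(i_t)
    l_perc = sum((a - b).abs().mean() for a, b in zip(fp, ft))
    l_style = sum((gram(a) - gram(b)).abs().mean() for a, b in zip(fp, ft))
    # weights from the text: alpha_perc=0.1, alpha_style=200, alpha_rec=0.5
    return l_adv + 0.1 * l_perc + 200.0 * l_style + 0.5 * l_rec
```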
Experimental case analysis
1) Experimental data set
The application was trained and evaluated on three common datasets: CelebA, Facade and Places2. The CelebA dataset contains 202,599 celebrity face pictures and is commonly used in computer vision training and testing experiments involving faces. Approximately 180k images were selected for training and 20k images for testing. For Places2, which contains numerous distinct scene categories such as restaurants, beaches, yards and valleys, the application randomly selected about 225k images of 50 categories for training and 25k images for testing. The Facade dataset focuses mainly on highly structured building walls from different cities around the world.
The application uses 556 of its images for training and the remaining images for testing. For irregular masking, Liu et al. proposed an irregular mask test set containing 12,000 masks, covering a total of six different hole-area ranges.
During the experiments, the application randomly drew masks from test sets of different scale ranges to occlude the images, and all images and irregular masks were resized to 256×256 pixels.
2) Experimental details
The experimental hardware configuration was a single Intel(R) Core i7-11700 CPU, a single NVIDIA GeForce RTX 3090 24GB GPU and 64.0GB RAM; the software environment was Windows 10, PyTorch v1.7.0 and CUDA v11.0. Both stage networks were trained with the Adam optimizer ($\beta_1 = 0.0$, $\beta_2 = 0.9$) and a batch size of 8; the initial learning rates of the generators $G_1$, $G_2$ and the discriminators $D_1$, $D_2$ were $1\times10^{-4}$ and $1\times10^{-5}$, respectively. In the hyper-parameters of the TEP modules of the edge prediction network, the number of multi-head attention heads was set to 8 and the embedding dimension to 256.
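Under these settings, the optimizer setup reduces to a few lines; this sketch simply restates the hyper-parameters listed above.

```python
# A sketch restating the optimizer settings above: Adam with beta1 = 0.0,
# beta2 = 0.9, generator lr 1e-4 and discriminator lr 1e-5.
import torch

def make_optimizers(generator, discriminator):
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-5, betas=(0.0, 0.9))
    return opt_g, opt_d
```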
Training of the EPAM model is divided into three steps:
First, the application trains $G_1$ using the gray-scale image and the edge binary map as training samples. After the loss of $G_1$ balances, the application adjusts the learning rate to $1\times10^{-5}$ and continues training the generator until the model converges, producing predicted edges.
Second, the application synthesizes the complete image edge information detected by the Canny operator with the damaged image as the input of $G_2$, and trains $G_2$ separately. After the loss of the image generator balances, the learning rate is reduced to $1\times10^{-5}$ and training continues until convergence.
Finally, the application concatenates $G_1$ and $G_2$, removes the edge discriminator $D_1$, and continues end-to-end training of generators $G_1$ and $G_2$ at a learning rate of $1\times10^{-6}$ until the model converges.
Referring to FIG. 5, the training loss curves of the proposed model on the Places2 dataset are shown. Throughout training, the application samples the loss value of the latest batch every 5,000 iterations, and the two-stage networks each run 2 million iterations.
In the experiments, since the loss values in the network span a large range, the red area of the graph is enlarged for better visualization.
Fig. 5(a) shows the trend of the different loss functions during edge prediction network training: the adversarial loss of $G_1$ fluctuates in the range (0.5, 1.7) and the adversarial loss of $D_1$ floats around (0.45, 0.75), indicating that $G_1$ and $D_1$ train adversarially against each other and then stabilize. As the number of iterations increases, the feature matching loss is stabilized by constraining the outputs of the intermediate layers of the discriminator.
In Fig. 5(b), the reconstruction, perceptual and style losses gradually decrease as training progresses. This indicates that the gap between the generated and real samples at the feature-map and pixel levels is shrinking and that the quality of the generated samples is gradually increasing.
3) Analysis of experimental results
In order to evaluate the effectiveness of the EPAM model, the application compares it quantitatively and qualitatively with the following advanced repair algorithms: EC, CTSDG, ICT, MAT and PUT. To demonstrate the generalization of the method, irregular masks, central rectangular masks and manually annotated masks are used randomly to occlude samples in the qualitative comparison.
In addition, the application also conducts extensive quantitative comparisons, ablation studies and visual analysis to demonstrate the effectiveness of TEP and MFA.
3.1 Quantitative analysis process and results
In order to objectively and comprehensively analyze the performance of the proposed method against other methods, four commonly used metrics are adopted to evaluate the repair results. Peak signal-to-noise ratio (PSNR) is a common objective measure of image quality, although its assessment can differ from human perception. Structural similarity (SSIM) evaluates the similarity between images using three factors aligned with human perception, namely luminance, contrast and structure; a window size of 51 is used in the calculation.
Furthermore, learned perceptual image patch similarity (LPIPS) quantifies differences between images from a perceptual perspective, better reflecting human perception and understanding of images. Mean absolute error (MAE) is the average of the absolute error between two values. The application compares the evaluation scores of the present method and the advanced methods on datasets with irregular mask ratios of 20-40%, 40-60% and random.
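PSNR and MAE can be computed directly, as in the minimal sketch below for images scaled to [0, 1]; SSIM and LPIPS are typically taken from dedicated packages (e.g. scikit-image, lpips) and are omitted here.

```python
# Minimal PSNR and MAE sketches for image tensors scaled to [0, 1].
import torch

def psnr(pred, target, eps=1e-8):
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / (mse + eps))

def mae(pred, target):
    return torch.mean(torch.abs(pred - target))
```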
As can be seen from Table 1, on the CelebA dataset the proposed model gives the best results on both the PSNR and SSIM metrics; only in some cases do LPIPS and MAE rank outside the top two. This is because, when multi-scale context information and long-distance features of the defective image are acquired with dilated convolutions of different dilation factors, excessive dilation and the zero padding operation cause certain edge artifacts in the repaired image, which affects these metrics to a certain extent.
It should be emphasized that, unlike the other metrics, LPIPS is the Euclidean distance between the feature representations of the repaired and real images at certain intermediate layers of a deep neural network; some similarities cannot be captured by LPIPS, causing deviations at the metric level. On the Facade dataset, the first two metrics of the proposed algorithm are far higher than those of the other algorithms, indicating that the algorithm is better at repairing fine structural details when facing highly structured objects. On the Places2 dataset, although both PUT and MAT are implemented with a Transformer architecture, the performance of the proposed method on each metric is superior overall.
Therefore, the Transformer-based edge prediction strategy and the multi-scale fusion attention strategy proposed by the application have a remarkable effect on improving model performance.
Table 1. Comparison of methods on the CelebA, Facade and Places2 datasets with different mask ratios. The proposed method is compared with EC, CTSDG, ICT, MAT and PUT in terms of objective quantitative results (↓ indicates that a lower value is better, ↑ that a higher value is better).
3.2 Qualitative comparison procedure and results
The application evaluates and compares the EPAM model with existing methods on the CelebA dataset of facial images with similar semantics.
EC is a structure-then-texture repair method; its results show that incomplete structure prediction in the first step compromises the detail repair of the second step: for example, the eyes and lips are not sharp enough and the texture is unclear. CTSDG constrains texture and structure mutually, but does not balance texture and structure generation well, resulting in local boundary artifacts: the mouth of the face image is distorted or missing, and the eyes are not natural enough. ICT uses a Transformer to reconstruct the visual prior and a conventional CNN to fill in texture details, but large-scale down-sampling loses important information, so important semantic content is missing from the results: the repaired eyes are missing or distorted. MAT uses a mask-aware mechanism to handle large missing regions, but its effect on small missing regions is not ideal: the eyes are asymmetric and of inconsistent size. PUT's design with an un-quantized Transformer improves image quality but does not adequately understand semantic features: the repaired eye region is not coordinated with the face.
Compared with the method, the method provided by the application is more powerful in the aspects of understanding global semantics and keeping more lifelike texture details, and can generate face images with consistent structures and colors.
The present application visually analyzes the effect of the method on the Facade dataset, as shown in Fig. 6. EC suffers under extensive rectangular defects: the windows of the building are lost and there is a noticeable color difference in the repair result. From the visualization results of CTSDG, it can be seen that the method cannot handle larger holes, which causes loss of basic components and distortion of the image. ICT restores the occluded region at the pixel level but does not capture the global structural semantics well.
For example, the windows in rows 1 and 2 of Fig. 6(d) and their surrounding contours are irregular. MAT reconstructs structures and textures using remote context, but in row 2 of Fig. 6(e) the reconstructed mask region shows color discontinuities and unreasonable semantic objects. The image synthesized by PUT (Fig. 6(f)) appears to produce reasonable results but in fact contains noticeable artifacts. In contrast, the model provided by the application can infer reasonable structural semantic information, effectively relieve grid artifacts and texture information loss in local areas, and improve perceptual quality.
For example, in rows 1 and 2 of Fig. 6(g), the window edges repaired by the method of the present application have clear outlines and a good visual effect. In addition, the windows predicted for the mask area in row 3 are regularly arranged and semantically reasonable.
In Fig. 7, the present application further evaluates the model on the Places2 dataset, which contains images with diverse semantics. In such challenging scenarios, the structure-texture based methods EC and CTSDG do not understand the global context well, resulting in unreasonable completions.
In contrast, the methods using a Transformer architecture (ICT, MAT, PUT) improve the capture of global context information and can therefore handle complex image repair tasks. However, since the global structure information is not constrained, their repair results still suffer from inconsistent boundaries and missing semantics. As shown in column 6 of Fig. 7, the target structure reconstructed in the occluded region is incoherent.
Compared with these methods, the present application designs a Transformer-based structure repair module that can reconstruct correct semantic content. Based on an accurate appearance prior, the layer-by-layer attention transfer strategy applied in the MFA module effectively combines shallow detail texture information with deep structural semantic information and reduces the loss of long-distance features in a deep network, thereby effectively enhancing the consistency of the global and local features of the image.
For example, rows 1-4 of Fig. 7(g) produce high-quality texture and structural details in the missing regions.
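As a rough illustration of the attention transfer idea (the exact MFA implementation is not reproduced here), the following sketch computes a patch affinity on deep, low-resolution features and reuses the same scores to re-weight the corresponding 2×2 patches of a shallower, higher-resolution map; the function name attention_transfer and both tensor shapes are assumptions.

```python
# Hypothetical sketch of layer-by-layer attention transfer: affinity is
# computed once on deep features, then the same scores re-weight the
# corresponding patches of a shallower map, so long-distance relations
# learned at the semantic level guide the detail-level filling.
import torch
import torch.nn.functional as F

def attention_transfer(deep, shallow):
    """deep: (B, C1, h, w); shallow: (B, C2, 2h, 2w)."""
    B, C1, h, w = deep.shape
    q = F.normalize(deep.flatten(2).transpose(1, 2), dim=-1)  # (B, hw, C1)
    attn = F.softmax(q @ q.transpose(1, 2), dim=-1)           # (B, hw, hw) affinity

    # Cut the shallow map into hw patches of 2x2 and re-weight them with
    # the deep-level affinity, then stitch the patches back together.
    B, C2, H, W = shallow.shape
    patches = F.unfold(shallow, kernel_size=2, stride=2)      # (B, C2*4, hw)
    out = patches @ attn.transpose(1, 2)                      # (B, C2*4, hw)
    return F.fold(out, output_size=(H, W), kernel_size=2, stride=2)

deep = torch.randn(1, 256, 8, 8)
shallow = torch.randn(1, 128, 16, 16)
print(attention_transfer(deep, shallow).shape)  # torch.Size([1, 128, 16, 16])
```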
To further demonstrate the effectiveness of the proposed method, the present application compares the trained model with similar repair schemes (EC and CTSDG).
Fig. 8 shows the edge results and the final repair results of EC, CTSDG and the method of the application on the CelebA, Facade and Places2 datasets. From columns 2 and 3 it can be seen that the edge priors reconstructed by these repair methods do not correctly predict semantic contours such as the building windows and the door of the drum washing machine.
Similarly, from the repair results in columns 5 and 6 it can be found that, although the color details are similar, the original image features are not reproduced in the overall structure. Overall, neither the EC nor the CTSDG model can recover reasonable images of buildings and faces from offset edge priors.
In contrast, the method of the application introduces relative position encoding and a self-attention mechanism, which strengthens the extraction of edge features, allows the core edge information to be repaired (Fig. 8(d)), and can recover target boundaries that match the scene semantics in the central region of a large-scale mask.
In addition, as shown in column 7 of Fig. 8, the repair texture obtained by the proposed method is finer and more realistic. Moreover, when faced with only a partial edge prior, the MFA module fuses global context features and shallow features to guide the model in synthesizing novel content (the checkered structure of the right gate in row 6 of Fig. 8(g)).
4) Visualization of analytical processes and results
To demonstrate the effectiveness of the position-sensitive axial attention applied in the TEP module, the present application visualizes the attention weights of the axial attention layer on the CelebA dataset.
As shown in Fig. 10, block 2 of the TEP module is selected to visualize the heat maps of the 8 heads of column (height-axis) and row (width-axis) attention, respectively.
In order to clearly show the attention focus of each head, the 32×32 attention map is enlarged to 256×256 and then superimposed on the original image to obtain the attention weight heat map. It can be observed that some heads learn to focus on relatively localized areas, while others focus on remote contexts.
For example, in column-wise attention, column heads 1, 5 and 6 focus on relatively localized facial areas, while column heads 2, 3 and 8 cover the entire image. In row-wise attention, row heads 1, 2 and 3 are associated with semantic parts of the face such as the eyes, mouth and nose, while row heads 4, 5 and 7 focus more on the distant row context.
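A minimal sketch of the overlay procedure described above might look as follows, assuming OpenCV, a hypothetical 32×32 single-head attention map attn, and a 256×256 BGR uint8 image img (both variable names are illustrative).

```python
# Upsample a 32x32 attention map to 256x256, color-map it, and blend it
# with the original image to produce the attention weight heat map.
import cv2
import numpy as np

def overlay_attention(img: np.ndarray, attn: np.ndarray, alpha: float = 0.5):
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # normalize to [0, 1]
    attn = cv2.resize(attn, (256, 256), interpolation=cv2.INTER_LINEAR)
    heat = cv2.applyColorMap((attn * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(img, 1 - alpha, heat, alpha, 0)  # img must be uint8 BGR
```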
To verify the effectiveness of the ATN structure applied within the MFA module, the present application visualizes the attention-score heat maps of the ATNs at different scales. The visualization result is shown in Fig. 9, which shows the ATN attention-score heat map at a resolution of 8×8 in the i-th block of the MFA module.
In the experiment, columns 2 to 4 show the attention score matrices from deep to shallow: attention first moves from points to local regions and then gradually expands from local regions to the global field, which effectively enhances the consistency of context features and improves the model's ability to handle features of different scales.
Second, the present application visualizes partial ATN feature maps of the MFA module; specifically, an ATN feature map of size 64×64 is shown for the k-th block of the MFA module. As shown in columns 2 to 5 of Fig. 10, by applying the attention transfer strategy the MFA module obtains feature maps with multi-level semantic information, reducing the information loss or confusion caused by feature-scale variation. The feature maps are gradually reconstructed and optimized from the first block to the fourth block.
5) Ablation experiment process and analysis
The application performs comprehensive ablation experiments on the Facade dataset to analyze the qualitative and quantitative differences between the components of the proposed model, as shown in Fig. 10 and Table 2. The application considers different combinations of the network. These experiments include: (b) removing both the TEP module and the MFA module and replacing them with residual blocks from EC (-TEP-MFA); (c) removing the TEP module and replacing it with 8 residual blocks (-TEP); (d) removing only the axial attention in the TEP module (-AA); (e) removing the MFA module and replacing it with 4 residual blocks (-MFA); and (f) applying the complete TEP module and MFA module (+MFA), i.e. the network structure proposed by the application.
As shown in column (c) of Fig. 10, since a network of residual blocks lacks understanding and analysis of the global structure, this variant has difficulty correctly predicting the edge information of windows in the central region of a large mask. From column (d) of Fig. 10 it can be seen that, when facing a large irregular mask, an edge prediction network with only self-attention does not explicitly encode position information, which limits the model's ability to capture local image structure and leads to poor repair results.
It can also be seen that introducing the MFA module into the image restoration network enhances the relevance of multi-level features and balances attention between the visible content and the generated content, thereby realistically restoring the texture details and color information of the damaged area.
As shown in columns (e) and (f) of Fig. 10, under the same edge prior, the composite image obtained without the MFA module suffers from inconsistent wall colors inside and outside the hole, whereas the result of the application looks natural inside and outside the hole and has the smallest blurred-texture area.
To make the comparison of the ablation experiments more concrete, Table 2 gives a quantitative comparison of the models composed of different components under the PSNR, SSIM, LPIPS and MAE indexes. Comparing rows 3, 4 and 6 of Table 2 shows that combining axial attention with self-attention significantly improves the performance of the model.
Furthermore, Fig. 11 shows accuracy and precision curves with and without axial attention; the red curve (with axial attention) performs better. In the repair network, the present application uses the MFA module to reference distant features at different levels in order to extract contextual information from the input feature map. The quantitative results in row 6 of Table 2 are significantly better than those of the residual blocks (row 5 of Table 2), which shows that the MFA module helps the image restoration generator learn more effective image feature information, thereby improving the texture synthesis effect of the model.
Table 2 Quantitative ablation analysis of the proposed method on the Facade dataset
6) Model complexity experimental process and analysis
For the computational complexity analysis, the application compares the EPAM model with other methods in terms of total parameter count, memory consumption and test time. Here, "Total parameters" is the number of all parameters in the model, including trainable and non-trainable parameters; "Total memory" is the size of the memory space required during model testing; and "Run time" is the repair time for a single image.
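For a PyTorch model, the three quantities could be measured along the following lines; the exact measurement protocol of the experiments is not stated, so this is only an assumed setup (it also presumes a CUDA device).

```python
# Assumed profiling sketch: parameter count, peak test memory, and the
# wall-clock time for repairing a single image.
import time
import torch

def profile(model: torch.nn.Module, x: torch.Tensor):
    total_params = sum(p.numel() for p in model.parameters())  # trainable + frozen
    with torch.no_grad():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        t0 = time.time()
        model(x)
        torch.cuda.synchronize()
        run_time = time.time() - t0                             # seconds per image
        total_mem = torch.cuda.max_memory_allocated() / 2**20   # MiB at test time
    return total_params, total_mem, run_time
```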
As is apparent from Table 3, the proposed method has the fewest parameters and the lowest memory cost among the compared methods that apply a Transformer architecture (ICT, MAT, PUT). The method of the present application also takes less time than all compared methods when repairing a single image.
Table 3 Complexity analysis of different models; lower numbers are better
From the main ideas of the inventor's research, it can be seen that deep-learning-based image restoration methods have recently made great progress in reconstructing damaged areas. However, for images with large missing holes, the repair results often suffer from structural distortion and texture blurring. In this context, the present application combines the advantages of the Transformer and convolution and proposes an image restoration method combining edge priors with an attention mechanism (EPAM). The method divides the repair task into two phases: edge prediction and image restoration.
Specifically, in the edge prediction stage, the application designs a Transformer framework combining axial attention and standard self-attention, which enhances the extraction of global structural features and the position awareness of the model while limiting the complexity of the self-attention operation, thereby achieving accurate prediction of the edge structure of the defect area. Then, in the image restoration stage, the application provides a multi-scale fusion attention module that fully utilizes multi-level long-distance features, enhances local pixel continuity and significantly improves the restoration quality of the image.
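To make the axial attention idea concrete, here is a simplified sketch (relative position encoding and the standard self-attention branch are omitted; the class and shapes are illustrative, not the application's actual TEP code). Instead of full self-attention over an H×W map, which scales as O((HW)^2), attention is applied along each row and then each column, scaling as O(HW·(H+W)).

```python
# Simplified axial attention (cf. Ho et al. [32]): attend along the width
# axis row by row, then along the height axis column by column.
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(B * H, W, C)    # each row is a sequence
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, C)
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, C)    # each column is a sequence
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(B, W, H, C).permute(0, 3, 2, 1)  # back to (B, C, H, W)

x = torch.randn(2, 256, 32, 32)
print(AxialAttention(256)(x).shape)  # torch.Size([2, 256, 32, 32])
```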
The present application conducted comparative experiments on multiple datasets, including CelebA, Places2 and Facade. Quantitative experiments show that, compared with other methods, the PSNR and SSIM indexes of the method are improved by 1.141-3.234 dB and 0.083-0.235 respectively, while the LPIPS and MAE indexes are reduced by 0.0347-0.1753 and 0.0104-0.0402 respectively. Qualitative results show that the method can reconstruct images with complete structural information and clear texture details. Furthermore, the model of the present application excels in parameter count, memory cost and test time.
While the invention has been described in connection with specific embodiments, it will be apparent to those skilled in the art that the description is intended to be illustrative and not limiting in scope. Various modifications and alterations of this invention will occur to those skilled in the art in light of the spirit and principles of this invention, and such modifications and alterations are also within the scope of this invention.
Other prior art references
[26] C. Cao and Y. Fu, "Learning a sketch tensor space for image inpainting of man-made scenes," in Proc. of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, pp. 14509-14518, 2021.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., "Attention is all you need," in Proc. of Advances in Neural Information Processing Systems, Long Beach, CA, USA, pp. 5998-6008, 2017.
[28] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng et al., "Pre-trained image processing transformer," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Nashville, TN, USA, pp. 12299-12310, 2021.
[29] J. Devlin, M. W. Chang, K. Lee and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv:1810.04805, 2018. [Online]. Available: https://arxiv.org/abs/1810.04805.
[30] C. Zheng, T. J. Cham, J. Cai and D. Phung, "Bridging global context interactions for high-fidelity image completion," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp. 11512-11522, 2022.
[31] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485-5551, 2020.
[32] J. Ho, N. Kalchbrenner, D. Weissenborn and T. Salimans, "Axial attention in multidimensional transformers," arXiv:1912.12180, 2019. [Online]. Available: https://arxiv.org/abs/1912.12180.
[33] Y. Zeng, J. Fu, H. Chao and B. Guo, "Learning pyramid-context encoder network for high-quality image inpainting," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 1486-1494, 2019.
[34] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv:2010.11929, 2020. [Online]. Available: https://arxiv.org/abs/2010.11929.
[35] J. L. Ba, J. R. Kiros and G. E. Hinton, "Layer normalization," arXiv:1607.06450, 2016. [Online]. Available: https://arxiv.org/abs/1607.06450.
[36] P. Isola, J. Y. Zhu, T. Zhou and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 1125-1134, 2017.
[37] M. S. M. Sajjadi, B. Scholkopf and M. Hirsch, "EnhanceNet: Single image super-resolution through automated texture synthesis," in Proc. of the IEEE Int. Conference on Computer Vision, Venice, Italy, pp. 4491-4500, 2017.
[38] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo et al., "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 11534-11542, 2020.
[39] M. Chen, S. Zang, Z. Ai, J. Chi, G. Yang et al., "RFA-Net: Residual feature attention network for fine-grained image inpainting," Engineering Applications of Artificial Intelligence, vol. 119, pp. 105814:1-105814:10, 2023.
[40] T. C. Wang, M. Y. Liu, J. Y. Zhu, A. Tao, J. Kautz et al., "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 8798-8807, 2018.
[41] J. Johnson, A. Alahi and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Proc. of European Conference on Computer Vision, Amsterdam, the Netherlands, pp. 694-711, 2016.
[42] L. A. Gatys, A. S. Ecker and M. Bethge, "Image style transfer using convolutional neural networks," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 2414-2423, 2016.
[43] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014. [Online]. Available: https://arxiv.org/abs/1409.1556.
[44] G. Liu, F. A. Reda, K. J. Shih, T. C. Wang, A. Tao et al., "Image inpainting for irregular holes using partial convolutions," in Proc. of European Conference on Computer Vision, Munich, Germany, pp. 85-100, 2018.
[45] W. Li, Z. Lin, K. Zhou, L. Qi, Y. Wang et al., "MAT: Mask-aware transformer for large hole image inpainting," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp. 10758-10768, 2022.
[46] Q. Liu, Z. Tan, D. Chen, Q. Chu, X. Dai et al., "Reduce information loss in transformers for pluralistic image inpainting," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp. 11347-11357, 2022.
[47] R. Zhang, P. Isola, A. A. Efros, E. Shechtman and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 586-595, 2018.

Claims (10)

1. The image restoration method based on the combination of the edge prior and the attention mechanism is characterized by comprising an edge prediction stage and an image inpainting stage; wherein,
Edge prediction stage
Acquiring a gray defect image and incomplete edge information, and predicting reasonable edge contours in the defect area through a Transformer framework combining axial attention and standard self-attention, to obtain an edge prediction graph;
Image inpainting stage
And taking the edge prediction graph as a structure prior, combining it with the defective RGB image, and synthesizing texture details in the local closed areas surrounded by the edges through a multi-scale fusion attention module, to complete the image restoration.
2. The method of claim 1, wherein: the edge prediction stage is realized through an edge prediction network; the image restoration stage is realized through an image restoration network; wherein,
The edge prediction network includes an edge generator G1 and an edge discriminator D1. Denoting the edge generator operation by G1(·), the edge prediction graph is represented as:
E_pred = G1(Ĩ_gs, Ẽ_t, M)
The image inpainting network includes an image generator G2 and an image discriminator D2. Denoting the image generator operation by G2(·), the predicted image is represented as:
I_pred = G2(Ĩ_t, E_comp)
and the repair output with the same size as the original is obtained as:
I_comp = I_t ⊙ (1 − M) + I_pred ⊙ M
where I_t, I_gs and E_t respectively denote the original image, the corresponding grayscale image and the edge structure diagram; the incomplete image is represented as Ĩ_t = I_t ⊙ (1 − M), the grayscale of the incomplete image as Ĩ_gs = I_gs ⊙ (1 − M), and the defect edges as Ẽ_t = E_t ⊙ (1 − M), where M denotes the binary mask (1 for missing pixels) and ⊙ the Hadamard product; E_comp = Ẽ_t + E_pred ⊙ M denotes the synthesized edge prediction graph.
3. The method of claim 2, wherein: the edge generator is based on a self-encoder structure and completes the edge prediction graph for given image features through an encoding stage in which the encoder compresses the data, a feature reconstruction stage in which the bottleneck layer reconstructs the features, and a decoding stage in which the decoder decompresses them.
4. A method as claimed in claim 3, wherein: the encoding stage, the feature reconstruction stage and the decoding stage respectively comprise the following steps:
In the encoding stage, the encoder first applies a 7×7 convolution with reflection padding of 3 and stride 1 to adjust the given image features to a size of 256×256×64, and then obtains shallow output features of size 32×32×256 through three successive convolutions with stride 2 and kernel size 4×4;
In the feature reconstruction stage, eight Transformer blocks based on an axial attention mechanism are stacked to form an information bottleneck layer, which enhances the characterization of feature information and the capture of global structural information and complements the missing edge information, yielding reconstructed features of size 32×32×256;
In the decoding stage, the features are up-sampled to 256×256×64 using 3 transposed convolutions with kernel size 4×4, zero padding 1 and stride 2, and the output is then adjusted to 256×256×1 using 1 convolution with kernel size 7×7, reflection padding 3 and stride 1, obtaining the edge prediction graph.
5. The method of claim 3, wherein the edge discriminator comprises a convolution stage and a sample output stage, wherein,
In the convolution stage, a 5-layer PatchGAN convolution structure is selected, with strides of 2, 2, 2, 1 and 1 respectively and a kernel size of 4×4;
In the sample output stage, the input image is computed through the 5-layer convolution operation into a single-channel feature map of size 30×30, and a Sigmoid function maps the output to a scalar in the range [0,1], so that the authenticity of the input sample can be effectively judged and the edge repair result obtained.
6. The image restoration device based on the combination of the edge prior and the attention mechanism is characterized by comprising an edge prediction module and an image restoration module, wherein:
the edge prediction module is used for acquiring a gray defect image and incomplete edge information, and predicting reasonable edge contours in the defect area through a Transformer framework combining axial attention and standard self-attention, to obtain an edge prediction graph;
and the image restoration module is used for taking the edge prediction graph as a structure prior, combining it with the defective RGB image, and synthesizing texture details in the local closed areas surrounded by the edges through the multi-scale fusion attention module, to complete the image restoration.
7. The apparatus of claim 6, wherein the edge prediction module comprises:
the encoding unit first obtains features through a 7×7 convolution with reflection padding of 3 and stride 1, adjusting the given image features to a size of 256×256×64, and then obtains shallow output features of size 32×32×256 through three successive convolutions with stride 2 and kernel size 4×4;
the feature reconstruction unit enhances the characterization of feature information and the capture of global structural information by stacking eight Transformer structures based on an axial attention mechanism into an information bottleneck layer, complementing the missing edge information and obtaining reconstructed features of size 32×32×256;
and the decoding unit up-samples the features to 256×256×64 using 3 transposed convolutions with kernel size 4×4, zero padding 1 and stride 2, and then adjusts the output to 256×256×1 using 1 convolution with kernel size 7×7, reflection padding 3 and stride 1, to obtain the edge prediction graph.
8. The apparatus of claim 6, wherein the edge prediction module further comprises:
the convolution unit takes a 5-layer PatchGAN convolution as its framework, with strides of 2, 2, 2, 1 and 1 respectively and a kernel size of 4×4;
and the sample output unit computes the input image through the 5-layer convolution operation into a single-channel feature map of size 30×30, and uses a Sigmoid function to map the output to a scalar in the range [0,1], effectively judging the authenticity of the input sample and obtaining the edge repair result.
9. An image restoration device comprising at least one processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory to implement the steps of the image restoration method based on the combination of edge priors and an attention mechanism of any one of claims 1-5.
10. A computer program stored on a computer-readable recording medium, for performing the steps of the image restoration method based on the combination of edge priors and an attention mechanism of any one of claims 1-5.
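As a companion to claims 4 and 5, the following hypothetical PyTorch sketch reproduces only the shape bookkeeping of the edge generator and the PatchGAN edge discriminator; the intermediate channel widths, the single-channel inputs, and the omission of the eight axial-attention bottleneck blocks (and of all normalization/activation layers) are assumptions made for illustration.

```python
# Shape-only sketch of the edge generator encoder/decoder (claim 4) and the
# PatchGAN edge discriminator (claim 5). The axial-attention bottleneck that
# would sit between encoder and decoder at 32x32x256 is omitted.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.ReflectionPad2d(3),
    nn.Conv2d(1, 64, kernel_size=7, stride=1),                 # -> 256x256x64
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),    # -> 128x128
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),   # -> 64x64
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1),   # -> 32x32x256
)

decoder = nn.Sequential(
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),  # -> 64x64
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # -> 128x128
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # -> 256x256x64
    nn.ReflectionPad2d(3),
    nn.Conv2d(64, 1, kernel_size=7, stride=1),                 # -> 256x256x1 edge map
)

# Five 4x4 convolutions reduce a 256x256 input to a 30x30 score map in [0, 1].
discriminator = nn.Sequential(
    nn.Conv2d(1, 64, 4, stride=2, padding=1),     # -> 128x128
    nn.Conv2d(64, 128, 4, stride=2, padding=1),   # -> 64x64
    nn.Conv2d(128, 256, 4, stride=2, padding=1),  # -> 32x32
    nn.Conv2d(256, 512, 4, stride=1, padding=1),  # -> 31x31
    nn.Conv2d(512, 1, 4, stride=1, padding=1),    # -> 30x30
    nn.Sigmoid(),
)

x = torch.randn(1, 1, 256, 256)
print(decoder(encoder(x)).shape, discriminator(x).shape)
# torch.Size([1, 1, 256, 256]) torch.Size([1, 1, 30, 30])
```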
CN202311365706.3A 2023-10-20 2023-10-20 Image restoration method, device, equipment and readable storage medium based on combination of edge priors and attention mechanisms Pending CN118014894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311365706.3A CN118014894A (en) 2023-10-20 2023-10-20 Image restoration method, device, equipment and readable storage medium based on combination of edge priors and attention mechanisms


Publications (1)

Publication Number Publication Date
CN118014894A true CN118014894A (en) 2024-05-10

Family

ID=90951356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311365706.3A Pending CN118014894A (en) 2023-10-20 2023-10-20 Image restoration method, device, equipment and readable storage medium based on combination of edge priors and attention mechanisms

Country Status (1)

Country Link
CN (1) CN118014894A (en)

Similar Documents

Publication Publication Date Title
Chen et al. FFTI: Image inpainting algorithm via features fusion and two-steps inpainting
CN108460746B (en) Image restoration method based on structure and texture layered prediction
CN111539887B (en) Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution
Wang et al. Domain adaptation for underwater image enhancement
CN111861945B (en) Text-guided image restoration method and system
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN114463209A (en) Image restoration method based on deep multi-feature collaborative learning
Chen et al. MICU: Image super-resolution via multi-level information compensation and U-net
CN115018727A (en) Multi-scale image restoration method, storage medium and terminal
CN114387365A (en) Line draft coloring method and device
CN113392711A (en) Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN114881871A (en) Attention-fused single image rain removing method
CN116205962B (en) Monocular depth estimation method and system based on complete context information
Rodriguez-Pardo et al. Seamlessgan: Self-supervised synthesis of tileable texture maps
CN114943656B (en) Face image restoration method and system
CN115170915A (en) Infrared and visible light image fusion method based on end-to-end attention network
Hu et al. Dear-gan: Degradation-aware face restoration with gan prior
CN116563693A (en) Underwater image color restoration method based on lightweight attention mechanism
CN115272437A (en) Image depth estimation method and device based on global and local features
CN116797768A (en) Method and device for reducing reality of panoramic image
Gao A method for face image inpainting based on generative adversarial networks
Liu et al. WSDS-GAN: A weak-strong dual supervised learning method for underwater image enhancement
CN116523985B (en) Structure and texture feature guided double-encoder image restoration method
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
Yu et al. MagConv: Mask-guided convolution for image inpainting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination