CN116523985B - Structure and texture feature guided double-encoder image restoration method - Google Patents
- Publication number
- CN116523985B (application CN202310501736.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- representing
- features
- network
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 64
- 230000008439 repair process Effects 0.000 claims description 55
- 230000006870 function Effects 0.000 claims description 43
- 238000009826 distribution Methods 0.000 claims description 25
- 239000011159 matrix material Substances 0.000 claims description 16
- 238000004364 calculation method Methods 0.000 claims description 15
- 230000009977 dual effect Effects 0.000 claims description 12
- 238000005070 sampling Methods 0.000 claims description 10
- 230000002950 deficient Effects 0.000 claims description 9
- 238000012549 training Methods 0.000 claims description 9
- 238000000605 extraction Methods 0.000 claims description 7
- 230000008447 perception Effects 0.000 claims description 6
- 230000002194 synthesizing effect Effects 0.000 claims description 6
- 230000007547 defect Effects 0.000 claims description 5
- 238000010586 diagram Methods 0.000 claims description 5
- 230000007774 longterm Effects 0.000 claims description 5
- 230000004927 fusion Effects 0.000 claims description 4
- 238000005315 distribution function Methods 0.000 claims description 3
- 239000013598 vector Substances 0.000 claims description 3
- 238000013507 mapping Methods 0.000 claims 1
- 238000011156 evaluation Methods 0.000 description 10
- 230000000694 effects Effects 0.000 description 5
- 238000002474 experimental method Methods 0.000 description 5
- 230000000007 visual effect Effects 0.000 description 4
- 238000009792 diffusion process Methods 0.000 description 3
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 230000001815 facial effect Effects 0.000 description 2
- 230000000873 masking effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000013441 quality evaluation Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/40—Analysis of texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/80—Geometric correction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/54—Extraction of image or video features relating to texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/757—Matching configurations of points or features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a structure and texture feature guided dual-encoder image restoration method, belonging to the technical field of image restoration. It provides a dual-encoder coarse restoration network guided by structural and texture features, together with a fine restoration network based on a long-short term attention mechanism and multi-scale receptive fields, thereby realizing the joint restoration of the structure and texture of defective images.
Description
Technical Field
The invention relates to the technical field of image restoration, in particular to a double-encoder image restoration method guided by structure and texture features.
Background
The purpose of defective-image restoration is to repair the masked region of a digital image: to fill it with reasonable, vivid content that carries correct contextual semantics, restore the complete picture, and improve its texture. Image restoration is an important task in computer vision and can serve as an image editing tool to remove unwanted objects and repair damaged images. Early image restoration methods were mainly diffusion-based and block-based. Diffusion-based methods use the heat diffusion equation from physics to propagate information from around the region to be repaired into that region through partial differential equations and variational principles; this approach is only suitable for repairing small-scale defects. Block-based methods first select a pixel on the boundary of the region to be repaired, take that pixel as the center, choose a texture block of appropriate size according to the texture characteristics of the image, and then search the surroundings of the region for the closest-matching texture block to replace it.
However, when key regions and important structures are missing, these methods no longer apply. With the continuous development of deep learning, restoration methods based on convolutional neural networks (CNN) and generative adversarial networks (GAN) have been widely applied, providing an effective tool for image restoration. Existing image restoration methods generally adopt an encoder-decoder to extract the structure, texture, and context semantics of the image, and then rely on a generative adversarial network to complete a visually plausible restoration of the defective image.
While existing methods can generate realistic and semantically credible structures and textures within the mask region, they typically use either a single codec for restoration or two codecs applied separately, ignoring the association between image structure and texture; this results in insufficient or mismatched expression of the image's texture relative to its structure, and the image generation process lacks guidance from jointly extracted structural and texture features. The invention therefore provides a dual-encoder coarse restoration network guided by structural and texture features, together with a fine restoration network based on a long-short term attention mechanism and multi-scale receptive fields, realizing the joint restoration of the structure and texture of defective images.
Disclosure of Invention
The present invention aims to solve the above-mentioned problems, and to provide a structure and texture feature guided dual encoder image restoration method.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows. The dual-encoder image restoration method comprises a coarse restoration network and a fine restoration network, implemented through the following steps (a schematic code sketch of the coarse-stage pipeline follows the list):
S1: The defective image to be repaired and the binary mask image (mask = 1 in the defective region) are taken together as the input of a structure encoder and a texture encoder. After the two encoders extract image features layer by layer, the structural and texture distribution feature data are each fitted to an externally supplied Gaussian distribution N(0, I).
S2: The structural feature space and the texture feature space are mapped to the latent space by the cross-semantic attention module, and the decoder restores the image mask region from random samples of the latent space.
S3: During feature extraction, the features extracted by the structure encoder and the texture encoder undergo pyramid fusion, and the fused feature space guides the image restoration of the decoder to obtain a coarse restoration result.
S4: The fused image I_fuse and the mask image M are taken together as the input of the fine restoration network.
S5: The artifacts in the mask region of the fused image are removed with a residual gated convolution network, which extracts the image feature information.
S6: Three receptive fields of different sizes (3×3, 5×5, and 7×7) are designed to automatically filter useful local and structural features, effectively reducing the interference of useless detail features in the image.
S7: A long-short term attention module is added to solve the problems of blurred regions and inconsistent context semantics in the image. In the long-short term attention module, the attention weight matrix links the decoding features to control the spatial context, acquires the encoding features of the fine restoration network, and completes the restoration of the mask region by combining the network decoding and encoding features.
S8: The decoder captures remote features by linking remote spatial contexts, maintains the global semantic consistency of the image, selects finer-grained features and valid semantic features from the encoded features according to the local features of the restored image, and gradually reconstructs the image with the short-term and long-term attention scores, obtaining a fine restoration image with high-resolution characteristics.
Further, the fitting of the two distributions in S1 adopts the Kullback-Leibler (KL) divergence;
the KL divergence is used to regularize the learned importance sampling functions, constraining them to a latent prior;
the latent prior distribution is defined as a Gaussian distribution, and the KL regularization terms of the structure and texture encoders are as follows:

L^S_KL = KL(q_ψ(z_S | I_m) ‖ N(0, I))    (1)

L^T_KL = KL(q_φ(z_T | I_m) ‖ N(0, I))    (2)

where I_m denotes the damaged image; z denotes the latent space, i.e., the compressed data space corresponding to the structural and texture features, in which similar data points lie closer together; q_ψ and q_φ are the importance sampling functions of the image structure distribution and texture distribution, respectively; N(0, I) denotes the Gaussian distribution; L^S_KL is the KL divergence loss function of the structural features; and L^T_KL is the KL divergence loss function of the texture features.
Further, the cross-semantic attention module in S2 is placed after the dual-encoder module. The structure encoder feature space F_S and the texture encoder feature space F_T are each mapped to the latent space by a 1×1 convolution filter. The cross-semantic attention module computes attention over the two feature spaces to obtain their attention scores:

β_{j,i} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij})    (3)

where

s_{ij} = Q(F_T)^T K(F_S)    (4)

In formulas (3)-(4), β_{j,i} indicates the extent to which the model attends to the i-th location when synthesizing the j-th region; N denotes the number of pixels of the coarsely restored image; s_{ij} is the product of Q^T and K in the cross-attention module; Q(F_T) denotes the image texture features; and K(F_S) denotes the image structural features. The output O of the cross-semantic attention module is finally computed as:

F_ST = Σ_{i=1}^{N} β_{j,i} V(F_S)    (5)

O = α F_ST + F_S    (6)

where

V(F_S) = W_v F_S    (7)

In formulas (5)-(7), F_ST denotes the attention score; V(F_S) denotes the image structural features to be weighted; W_v is a 1×1 convolution filter; and α is a learnable scale parameter balancing the weights of F_ST and F_S, with initial value 0. The cross-semantic attention network starts by learning the correlation between structural and texture features, and ultimately learns their interdependence and association from the feature maps.
Further, the coarse restoration result in S3 is reconstructed pixel by pixel using the Mean Absolute Error (MAE) distance:

L^C_hole = ||M ⊙ (I^C_out - I_g)||_1    (8)

L^C_valid = ||(1 - M) ⊙ (I^C_out - I_g)||_1    (9)

In formulas (8)-(9), I^C_out denotes the coarse restoration result; I_g denotes the gold-standard image (Ground Truth image); M denotes the binary mask image; L^C_hole is the reconstruction loss function of the defective-image mask region; and L^C_valid is the reconstruction loss function of the non-masked region. The pixel-by-pixel reconstruction loss L^C_r is thus:

L^C_r = λ_rec L^C_hole + L^C_valid    (10)

In formula (10), λ_rec is the reconstruction loss balance factor, set to 20. In addition, for the coarse restoration network in Fig. 1, the LSGAN method [9] is adopted to set the adversarial loss; compared with the conventional GAN loss function, it makes network training more stable and the generated images more natural. It is defined as follows:

L_D = E_{I_g∼p_data(I_g)}[(D(I_g) - 1)^2] + E_{I^C_out∼p(I^C_out)}[D(I^C_out)^2]    (11)

L_G = E_{I^C_out∼p(I^C_out)}[(D(I^C_out) - 1)^2]    (12)

In formulas (11)-(12), D denotes the discriminator of the GAN; L_D is the adversarial loss function of the discriminator; E_{I_g∼p_data(I_g)} denotes the expectation over the distribution of gold-standard images; L_G is the adversarial loss function of the generator; and E_{I^C_out∼p(I^C_out)} denotes the expectation over the distribution of coarsely restored images.

In summary, the total loss of the coarse restoration network is defined as:

L_C = λ_KL (L^S_KL + L^T_KL) + L^C_r + L_G    (13)
further, the fused image I in S4 fuse The formula of (c) is defined as follows:
I fuse =I out_m +(1-M)*I g (14)
in equation (14), the image I is fused fuse Mask area I for coarsely restored image out_m =M×I C out Sum-gold standard image I g Is included in the image data.
Further, the attention weight matrix β_{j,i} in S7 is computed as follows:

β_{j,i} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij})    (15)

where

s_{ij} = Q(f_di)^T K(f_dj)    (16)

In formulas (15)-(16), β_{j,i} indicates the extent to which the model attends to the i-th location when synthesizing the j-th region; N denotes the number of pixels of the fine restoration image; f_dj denotes the decoding features; s_{ij} is the product of Q^T and K in the long-short term attention module; and K(f_dj) denotes the input information corresponding to the decoding features. Q(f_di)^T denotes the query vector corresponding to the decoding features:

Q(f_di)^T = (W_q f_di)^T    (17)

In formula (17), W_q is a 1×1 convolution filter. The self-attention layer of the long-short term attention module is expressed as:

F_self = Σ_{i=1}^{N} β_{j,i} V_D(f_di)    (18)

where V_D(f_dj) denotes the input information corresponding to the decoding features to be weighted. To combine the fine-grained features of the encoder with the features of the decoder, the encoder and decoder layers of the global refinement network are connected with skip connections, and the remote spatial context features are obtained by scoring the encoder-layer features with the attention weight matrix β_{j,i}. The output F_out of the long-short distance attention layer is computed as:

F_out = Σ_{i=1}^{N} β_{j,i} V_E(f_ei)    (19)

In formula (19), V_E(f_ei) denotes the input information corresponding to the encoding features to be weighted. The output O of the whole long-short term attention module is computed as:

O = γ(1 - M) F_out + M f_e    (20)

where f_e denotes the encoding features of the remote space; M denotes the binary mask; and γ is a learnable scale parameter balancing the weights of F_out and f_e.
Further, the first training objective of the refinement network in S7 is set to the reconstruction loss L^R_r. As with the reconstruction loss setting in the coarse restoration network, the MAE is used for pixel-by-pixel reconstruction:

L^R_hole = ||M ⊙ (I^R_out - I_g)||_1    (21)

L^R_valid = ||(1 - M) ⊙ (I^R_out - I_g)||_1    (22)

In formulas (21)-(22), I^R_out denotes the fine restoration result; L^R_hole is the reconstruction loss function of the fused-image mask region; and L^R_valid is the reconstruction loss function of the non-masked region of the fused image. The invention also adds a perceptual loss [10] and a style loss [11]: features are extracted from the image with a pre-trained VGG-16 network, and both losses are computed on the spatial features. The perceptual loss L^R_per is defined as follows:

L^R_per = Σ_i ||F_i(I^R_out) - F_i(I_g)||_1    (23)

In formula (23), F_i denotes the i-th layer feature map of the pre-trained VGG-16 network. The style loss L^R_style is defined as follows:

L^R_style = Σ_i ||G_i(I^R_out) - G_i(I_g)||_1    (24)

where G_i denotes a Gram matrix, the covariance matrix between features that captures the correlation between each pair of features. In summary, the total loss L^R of the global refinement network is:

L^R = λ_rec L^R_r + λ_p L^R_per + λ_s L^R_style    (25)

where λ_rec, λ_p, and λ_s are balance factors.
Compared with the prior art, the invention has the following beneficial effects:
(1) A model framework in which the dual-encoder coarse restoration network extracts structural features and texture features;
(2) A method and technical route in which the dual-encoder coarse restoration network guides the decoder to perform image restoration;
(3) A fine restoration network architecture based on a long-short term attention mechanism and multi-scale receptive fields, with algorithm parameter settings for linking the remote spatial context.
Drawings
FIG. 1 is a flow chart of the dual-encoder image restoration method of the present invention;
FIG. 2 is a schematic diagram of a cross-semantic attention module of the present invention;
FIG. 3 is a schematic diagram of a long-short term attention module according to the present invention;
FIG. 4 is a comparison of the visual effects of six image restoration methods.
Detailed Description
The invention is further described in connection with the following detailed description, in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the invention easy to understand.
Fig. 1 shows the flow chart of the dual-encoder image restoration method of the invention. The method comprises two stages: a coarse restoration network and a fine restoration network. The training targets of the coarse restoration network include regularization of the image feature distributions, the image reconstruction loss, and the network adversarial loss.
The rough repair network comprises the following steps:
(1) The defective image to be repaired and the binary mask image (mask = 1 in the defective region) are taken together as the input of the structure encoder and the texture encoder. After the two encoders extract image features layer by layer, the structural and texture distribution feature data are each fitted to an externally supplied Gaussian distribution N(0, I). The fitting of the two distributions uses the Kullback-Leibler (KL) divergence, which regularizes the learned importance sampling functions by constraining them to a latent prior. The latent prior distribution is defined as a Gaussian distribution, and the KL regularization terms of the structure and texture encoders are as follows:

L^S_KL = KL(q_ψ(z_S | I_m) ‖ N(0, I))    (1)

L^T_KL = KL(q_φ(z_T | I_m) ‖ N(0, I))    (2)

where I_m denotes the damaged image; z denotes the latent space, i.e., the compressed data space corresponding to the structural and texture features, in which similar data points lie closer together; q_ψ and q_φ are the importance sampling functions of the image structure distribution and texture distribution, respectively; N(0, I) denotes the Gaussian distribution; L^S_KL is the KL divergence loss function of the structural features; and L^T_KL is the KL divergence loss function of the texture features.
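A minimal sketch of the KL terms in equations (1)-(2) follows, assuming each encoder head predicts the mean and log-variance of a diagonal Gaussian; the patent does not state this parameterization, so it is an assumption. Under it, the KL to N(0, I) has a closed form and needs no sampling:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    # summed over latent dimensions and averaged over the batch.
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1).mean()

# loss_kl_s = kl_to_standard_normal(mu_s, logvar_s)  # structural branch, eq. (1)
# loss_kl_t = kl_to_standard_normal(mu_t, logvar_t)  # texture branch, eq. (2)
```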
(2) The structural feature space and the texture feature space are mapped to the latent space by the cross-semantic attention module (shown in Fig. 2), and the decoder restores the image mask region from random samples of the latent space.

In Fig. 2, the cross-semantic attention module is placed after the dual-encoder module. The structure encoder feature space F_S and the texture encoder feature space F_T are each mapped to the latent space by a 1×1 convolution filter. The cross-semantic attention module computes attention over the two feature spaces to obtain their attention scores:

β_{j,i} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij})    (3)

where

s_{ij} = Q(F_T)^T K(F_S)    (4)

In formulas (3)-(4), β_{j,i} indicates the extent to which the model attends to the i-th location when synthesizing the j-th region; N denotes the number of pixels of the coarsely restored image; s_{ij} is the product of Q^T and K in the cross-attention module; Q(F_T) denotes the image texture features; and K(F_S) denotes the image structural features. The output O of the cross-semantic attention module is finally computed as:

F_ST = Σ_{i=1}^{N} β_{j,i} V(F_S)    (5)

O = α F_ST + F_S    (6)

where

V(F_S) = W_v F_S    (7)

In formulas (5)-(7), F_ST denotes the attention score; V(F_S) denotes the image structural features to be weighted; W_v is a 1×1 convolution filter; and α is a learnable scale parameter balancing the weights of F_ST and F_S, with initial value 0. The cross-semantic attention network starts by learning the correlation between structural and texture features, and ultimately learns their interdependence and association from the feature maps.
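Equations (3)-(7) correspond to a standard cross-attention computation. The following is a minimal PyTorch sketch, assuming single-head attention over flattened feature maps; the head count and any scaling of s_ij are not specified in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossSemanticAttention(nn.Module):
    """Sketch of eqs. (3)-(7): texture features form the query, structural
    features form the key/value, and a learnable alpha (initialized to 0)
    gates the attended result back onto F_S."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)  # Q(F_T)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)  # K(F_S)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)  # V(F_S) = W_v F_S, eq. (7)
        self.alpha = nn.Parameter(torch.zeros(1))              # balance parameter, eq. (6)

    def forward(self, f_s, f_t):
        b, c, h, w = f_s.shape
        q = self.q(f_t).flatten(2)                 # B x C x N (queries from texture)
        k = self.k(f_s).flatten(2)                 # B x C x N (keys from structure)
        v = self.v(f_s).flatten(2)                 # B x C x N (values from structure)
        s = torch.bmm(q.transpose(1, 2), k)        # s[j, i] = q_j . k_i, eq. (4)
        beta = F.softmax(s, dim=-1)                # softmax over i, eq. (3)
        f_st = torch.bmm(v, beta.transpose(1, 2))  # F_ST[:, j] = sum_i beta[j, i] v_i, eq. (5)
        f_st = f_st.view(b, c, h, w)
        return self.alpha * f_st + f_s             # O = alpha * F_ST + F_S, eq. (6)
```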
(3) During feature extraction, the features extracted by the structure encoder and the texture encoder undergo pyramid fusion, and the fused feature space guides the image restoration of the decoder to obtain a coarse restoration result. The coarse restoration result is reconstructed pixel by pixel using the Mean Absolute Error (MAE) distance:

L^C_hole = ||M ⊙ (I^C_out - I_g)||_1    (8)

L^C_valid = ||(1 - M) ⊙ (I^C_out - I_g)||_1    (9)

In formulas (8)-(9), I^C_out denotes the coarse restoration result; I_g denotes the gold-standard image (Ground Truth image); M denotes the binary mask image; L^C_hole is the reconstruction loss function of the defective-image mask region; and L^C_valid is the reconstruction loss function of the non-masked region. The pixel-by-pixel reconstruction loss L^C_r is thus:

L^C_r = λ_rec L^C_hole + L^C_valid    (10)

In formula (10), λ_rec is the reconstruction loss balance factor, set to 20. In addition, for the coarse restoration network in Fig. 1, the LSGAN method [9] is adopted to set the adversarial loss; compared with the conventional GAN loss function, it makes network training more stable and the generated images more natural. It is defined as follows (a combined loss sketch follows the definitions):

L_D = E_{I_g∼p_data(I_g)}[(D(I_g) - 1)^2] + E_{I^C_out∼p(I^C_out)}[D(I^C_out)^2]    (11)

L_G = E_{I^C_out∼p(I^C_out)}[(D(I^C_out) - 1)^2]    (12)

In formulas (11)-(12), D denotes the discriminator of the GAN; L_D is the adversarial loss function of the discriminator; E_{I_g∼p_data(I_g)} denotes the expectation over the distribution of gold-standard images; L_G is the adversarial loss function of the generator; and E_{I^C_out∼p(I^C_out)} denotes the expectation over the distribution of coarsely restored images.
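The reconstruction and LSGAN terms of equations (8)-(12) can be sketched as follows. The hole/valid weighting of equation (10) reflects the λ_rec = 20 balance factor stated above, and d_real / d_fake stand for discriminator outputs on the gold-standard and coarsely restored images; treat this as an illustrative sketch rather than the patent's exact training code:

```python
import torch

def coarse_losses(i_out, i_gt, mask, d_real, d_fake, lambda_rec=20.0):
    # MAE reconstruction over masked and unmasked pixels, eqs. (8)-(10).
    l_hole = torch.mean(torch.abs(mask * (i_out - i_gt)))
    l_valid = torch.mean(torch.abs((1 - mask) * (i_out - i_gt)))
    l_rec = lambda_rec * l_hole + l_valid
    # LSGAN adversarial terms, eqs. (11)-(12).
    l_d = torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)
    l_g = torch.mean((d_fake - 1.0) ** 2)
    return l_rec, l_d, l_g
```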
In summary, the total loss of the coarse restoration network is defined as:

L_C = λ_KL (L^S_KL + L^T_KL) + L^C_r + L_G    (13)

In formula (13), λ_KL denotes the KL divergence loss balance factor, which is set to 20. After the coarse restoration stage, the coarse result I^C_out can restore the masked region of the image, and the gated convolution design in the coarse restoration network eliminates the artifacts caused by the masked region; however, two problems remain:
(1) the image mask region remains blurred after restoration;
(2) the completed content lacks overall semantic consistency with the image, and the context semantics are inconsistent.
To solve these problems, the invention designs a global fine restoration network, which uses multi-scale feature extraction and a long-short term attention module to eliminate blurred regions in the image and unify the global semantics, improving the resolution of the image mask region and the consistency of the global semantics.
The fine restoration of the dual-encoder image restoration method is implemented by the following algorithm:
(1) The fused image I_fuse and the mask image M are taken together as the input of the fine restoration network. I_fuse is defined as follows (a one-function sketch follows the equation):

I_fuse = I_out_m + (1 - M) ⊙ I_g    (14)

In equation (14), the fused image I_fuse combines the mask region of the coarsely restored image, I_out_m = M ⊙ I^C_out, with the non-masked region of the gold-standard image I_g.
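Equation (14) is a single masked composite; a minimal sketch, with mask = 1 marking the defective region as in S1:

```python
def fuse(i_coarse, i_gt, mask):
    # Eq. (14): keep the coarse prediction inside the mask region
    # and the gold-standard pixels outside it.
    return mask * i_coarse + (1.0 - mask) * i_gt
```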
(2) The artifacts in the mask region of the fused image are removed with a residual gated convolution network, which extracts the image feature information (in Fig. 1, the residual gated convolution network is drawn as blue rectangular blocks);
(3) Three receptive fields of different sizes (3×3, 5×5, and 7×7) are designed to automatically filter useful local and structural features, effectively reducing the interference of useless detail features in the image (see the sketch below);
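A minimal sketch of the three parallel receptive fields in item (3) follows; how the patent merges the three branches is not stated, so the 1×1 fusion convolution here is an assumption:

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """Three parallel convolutions with 3x3, 5x5, and 7x7 receptive
    fields; their outputs are concatenated and fused back to the
    original channel count with an (assumed) 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```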
(4) A long-short term attention module is added to solve the problems of blurred regions and inconsistent context semantics in the image (the long-short term attention module is drawn as a red rectangular block in Fig. 1 and shown in detail in Fig. 3). In the long-short term attention module, the attention weight matrix links the decoding features to control the spatial context, acquires the encoding features of the fine restoration network, and completes the restoration of the mask region by combining the network decoding and encoding features. The attention weight matrix β_{j,i} is computed as follows (a module-level sketch follows the equations):

β_{j,i} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij})    (15)

where

s_{ij} = Q(f_di)^T K(f_dj)    (16)
In formulas (15)-(16), β_{j,i} indicates the extent to which the model attends to the i-th location when synthesizing the j-th region; N denotes the number of pixels of the fine restoration image; f_dj denotes the decoding features; s_{ij} is the product of Q^T and K in the long-short term attention module; and K(f_dj) denotes the input information corresponding to the decoding features. Q(f_di)^T denotes the query vector corresponding to the decoding features:

Q(f_di)^T = (W_q f_di)^T    (17)

In formula (17), W_q is a 1×1 convolution filter. The self-attention layer of the long-short term attention module is expressed as:

F_self = Σ_{i=1}^{N} β_{j,i} V_D(f_di)    (18)

where V_D(f_dj) denotes the input information corresponding to the decoding features to be weighted. To combine the fine-grained features of the encoder with the features of the decoder, the encoder and decoder layers of the global refinement network are connected with skip connections, and the remote spatial context features are obtained by scoring the encoder-layer features with the attention weight matrix β_{j,i}. The output F_out of the long-short distance attention layer is computed as:

F_out = Σ_{i=1}^{N} β_{j,i} V_E(f_ei)    (19)

In formula (19), V_E(f_ei) denotes the input information corresponding to the encoding features to be weighted. The output O of the whole long-short term attention module is computed as:

O = γ(1 - M) F_out + M f_e    (20)

where f_e denotes the encoding features of the remote space (drawn as orange matrix blocks in Fig. 3); M denotes the binary mask; and γ is a learnable scale parameter balancing the weights of F_out and f_e.
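Equations (15)-(20) can be sketched as a single module. This sketch assumes single-head attention, that the decoder-side attention weights β are reused to aggregate the skip-connected encoder values (eq. (19)), and that the mask is resized to the feature resolution; none of these details are spelled out in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LongShortTermAttention(nn.Module):
    """Sketch of eqs. (15)-(20): attention weights from decoder features
    score decoder values (short term, eq. 18) and skip-connected encoder
    values (long term, eq. 19); gamma gates the result under the mask."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)    # W_q, eq. (17)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v_d = nn.Conv2d(channels, channels, 1)  # V_D on decoder features
        self.v_e = nn.Conv2d(channels, channels, 1)  # V_E on encoder features
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, f_d, f_e, mask):
        b, c, h, w = f_d.shape
        q = self.q(f_d).flatten(2)                                  # B x C x N
        k = self.k(f_d).flatten(2)                                  # B x C x N
        beta = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)   # eqs. (15)-(16)
        f_short = torch.bmm(self.v_d(f_d).flatten(2), beta.transpose(1, 2))  # eq. (18)
        f_out = torch.bmm(self.v_e(f_e).flatten(2), beta.transpose(1, 2))    # eq. (19)
        f_out = f_out.view(b, c, h, w)
        m = F.interpolate(mask, size=(h, w))                        # mask at feature resolution
        o = self.gamma * (1 - m) * f_out + m * f_e                  # eq. (20)
        # How the short-term branch is consumed downstream is not specified
        # in the patent; it is returned here for completeness.
        return o, f_short.view(b, c, h, w)
```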
(5) The decoder captures remote features by linking remote spatial contexts (in the fine restoration network of Fig. 1, the context links are drawn as orange solid lines), maintains the global semantic consistency of the image, selects finer-grained features and valid semantic features from the encoded features according to the local features of the restored image, and gradually reconstructs the image with the short-term and long-term attention scores, obtaining a fine restoration image with high-resolution characteristics.
The first training objective of the refinement network is the reconstruction loss L^R_r. As with the reconstruction loss setting in the coarse restoration network, the MAE is used for pixel-by-pixel reconstruction:

L^R_hole = ||M ⊙ (I^R_out - I_g)||_1    (21)

L^R_valid = ||(1 - M) ⊙ (I^R_out - I_g)||_1    (22)

In formulas (21)-(22), I^R_out denotes the fine restoration result; L^R_hole is the reconstruction loss function of the fused-image mask region (as shown in Fig. 1); and L^R_valid is the reconstruction loss function of the non-masked region of the fused image. The invention also adds a perceptual loss [10] and a style loss [11]: features are extracted from the image with a pre-trained VGG-16 network, and both losses are computed on the spatial features. The perceptual loss L^R_per is defined as follows:

L^R_per = Σ_i ||F_i(I^R_out) - F_i(I_g)||_1    (23)

In formula (23), F_i denotes the i-th layer feature map of the pre-trained VGG-16 network. The style loss L^R_style is defined as follows:

L^R_style = Σ_i ||G_i(I^R_out) - G_i(I_g)||_1    (24)

where G_i denotes a Gram matrix, the covariance matrix between features that captures the correlation between each pair of features. In summary, the total loss L^R of the global refinement network is:

L^R = λ_rec L^R_r + λ_p L^R_per + λ_s L^R_style    (25)

where λ_rec, λ_p, and λ_s are balance factors.
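Equations (23)-(24) can be sketched with frozen VGG-16 feature taps. Which VGG-16 layers the patent uses is not stated, so the relu1_2 / relu2_2 / relu3_3 taps below are an assumption:

```python
import torch
import torchvision

class VGGFeatures(torch.nn.Module):
    """Frozen VGG-16 feature taps for eqs. (23)-(24)."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        # Assumed taps: relu1_2, relu2_2, relu3_3.
        self.slices = torch.nn.ModuleList([vgg[:4], vgg[4:9], vgg[9:16]])

    def forward(self, x):
        feats = []
        for s in self.slices:
            x = s(x)
            feats.append(x)
        return feats

def gram(f):
    # Gram matrix G_i: feature-by-feature correlations, normalized.
    b, c, h, w = f.shape
    f = f.flatten(2)                               # B x C x HW
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def perceptual_and_style(vgg, out, gt):
    l_per, l_style = 0.0, 0.0
    for fo, fg in zip(vgg(out), vgg(gt)):
        l_per = l_per + torch.mean(torch.abs(fo - fg))                  # eq. (23)
        l_style = l_style + torch.mean(torch.abs(gram(fo) - gram(fg)))  # eq. (24)
    return l_per, l_style
```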
Comparison of experimental results:
The color images used in the experiments of the invention all come from the CelebA-HQ dataset [12]. The high-resolution CelebA-HQ dataset contains 30,000 face images; 27,000 images were randomly selected for training and 300 images for testing.
The superior performance of the method of the invention was verified by comparison with five other representative algorithms: the GC, PIC, MEDFE, RFR, and MADF algorithms. For image quality evaluation, several indices commonly used in image restoration tasks are adopted: the L1 error, the peak signal-to-noise ratio (PSNR), the structural similarity (SSIM), the Fréchet inception distance (FID), and the learned perceptual image patch similarity (LPIPS). The experimental results on the CelebA-HQ dataset are shown in Table 1 (the best evaluation results are shown in bold, the second best underlined).
Table 1: Comparison of experimental results on the CelebA-HQ dataset
As can be seen from the color image restoration results in Table 1, when the mask area is the central region of the image, the method achieves the best L1 and LPIPS scores among the six algorithms, indicating that of all the tested methods its restored images have the smallest pixel-value error with respect to the gold-standard (Ground Truth) images and the best restoration effect. Its PSNR and SSIM scores are second only to those of the comparison algorithm GC, showing that its restored images remain strongly consistent with the gold-standard images. Its FID score ranks second among all the compared algorithms, showing a strong correlation between its restored images and the gold-standard images.
When the mask area is a random region of the image, the method achieves the best L1, SSIM, and FID scores among the six algorithms: the pixel-value error between its restored images and the gold-standard (Ground Truth) images is the smallest of all the tested methods, and the consistency and correlation are the strongest, indicating the best overall restoration effect.
Fig. 4 compares the visual effects of the six image restoration methods on the CelebA-HQ dataset. The first row of images shows the results when the mask area is the central region of the image; the second row shows the results when the mask area is a random region. The first column contains the gold-standard (Ground Truth) images; the second column, the defective images; the third column, the restoration results of the MEDFE algorithm; the fourth column, of the GC algorithm; the fifth column, of the MADF algorithm; the sixth column, of the RFR algorithm; the seventh column, of the PIC algorithm; and the eighth column, of the proposed method.
In the CelebA-HQ experiments, MADF and the method of the invention can accurately restore facial features such as the eyes, nose, mouth, and hair. For large-area masks, the other algorithms fill poorly, mainly showing blurred facial features and rough textures. By comparison, the restored images obtained by the proposed method show clearer and more natural facial features and a better visual effect.
The experimental environment used PyTorch 1.8.0 and Python 3.6.13, with an NVIDIA GeForce RTX 3090 GPU. The experimental network contains 14M trainable parameters and uses orthogonal initialization and the Adam optimization algorithm. The network is trained with a fixed learning rate of γ = 10^-4. The balance factors are empirically set to λ_rec = 20, λ_kl = 20, λ_p = 0.05, and λ_s = 100.
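For reference, the reported configuration could be reproduced with a setup along these lines; this is a sketch under the stated hyper-parameters, and model is a placeholder for the full network:

```python
import torch

# Orthogonal initialization of convolutional and linear weights,
# as reported in the experimental setup.
def init_weights(m):
    if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
        torch.nn.init.orthogonal_(m.weight)
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)

# model.apply(init_weights)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # fixed learning rate
# lambda_rec, lambda_kl, lambda_p, lambda_s = 20.0, 20.0, 0.05, 100.0
```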
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted for clarity only; the specification should be taken as a whole, and the technical solutions in the various embodiments may be suitably combined to form other embodiments that will be apparent to those skilled in the art.
Claims (6)
1. The double-encoder image restoration method guided by the structure and the texture features is characterized by comprising a coarse restoration network and a fine restoration network, wherein the implementation steps of the coarse restoration network and the fine restoration network are as follows:
S1: the defective image to be repaired and the binary mask image are taken together as the input of a structure encoder and a texture encoder; after the two encoders extract image features layer by layer, the structural and texture distribution feature data are each fitted to an externally supplied Gaussian distribution N(0, I);
S2: the structural feature space and the texture feature space are mapped to the latent space through a cross-semantic attention module, and the decoder restores the image mask region from random samples of the latent space;
S3: in the process of feature extraction, pyramid fusion is carried out on the features extracted by the structure encoder and the texture encoder, and the fused feature space is used for guiding the image restoration of the decoder to obtain a coarse restoration result;
S4: the fused image I_fuse and the mask image M are taken together as the input of the fine restoration network;
S5: the artifacts in the mask region of the fused image are removed with a residual gated convolution network, which extracts the image feature information;
S6: three receptive fields of different sizes (3×3, 5×5, and 7×7) are designed to automatically filter useful local and structural features, effectively reducing the interference of useless detail features in the image;
S7: a long-short term attention module is added to solve the problems of blurred regions and inconsistent context semantics in the image; in the long-short term attention module, the attention weight matrix links the decoding features to control the spatial context, acquires the encoding features of the fine restoration network, and completes the restoration of the mask region by combining the network decoding features and encoding features;
the attention weight matrix β_{j,i} in S7 is computed as follows:

β_{j,i} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij})    (15)

where

s_{ij} = Q(f_di)^T K(f_dj)    (16)

In formulas (15)-(16), β_{j,i} indicates the extent to which the model attends to the i-th location when synthesizing the j-th region; N denotes the number of pixels of the fine restoration image; f_dj denotes the decoding features; s_{ij} is the product of Q^T and K in the long-short term attention module; K(f_dj) denotes the input information corresponding to the decoding features; and Q(f_di)^T denotes the query vector corresponding to the decoding features:

Q(f_di)^T = (W_q f_di)^T    (17)

In formula (17), W_q is a 1×1 convolution filter, and the self-attention layer of the long-short term attention module is expressed as:

F_self = Σ_{i=1}^{N} β_{j,i} V_D(f_di)    (18)

where V_D(f_dj) denotes the input information corresponding to the decoding features to be weighted; to combine the fine-grained features of the encoder with the features of the decoder, the encoder and decoder layers of the global refinement network are connected with skip connections, and the remote spatial context features are obtained by scoring the encoder-layer features with the attention weight matrix β_{j,i}; the output F_out of the long-short distance attention layer is computed as:

F_out = Σ_{i=1}^{N} β_{j,i} V_E(f_ei)    (19)

In formula (19), V_E(f_ei) denotes the input information corresponding to the encoding features to be weighted, and the output O of the whole long-short term attention module is computed as:

O = γ(1 - M) F_out + M f_e    (20)

where f_e denotes the encoding features of the remote space; M denotes the binary mask; and γ is a learnable scale parameter balancing the weights of F_out and f_e;
S8: the decoder captures remote features by linking remote spatial contexts, maintains the global semantic consistency of the image, selects finer-grained features and valid semantic features from the encoded features according to the local features of the restored image, and gradually reconstructs the image with the short-term and long-term attention scores, obtaining a fine restoration image with high-resolution characteristics.
2. A structure and texture feature guided dual-encoder image restoration method according to claim 1, wherein the fitting of the two distributions in S1 uses the Kullback-Leibler (KL) divergence;
the KL divergence is used to regularize the learned importance sampling functions, constraining them to a latent prior;
the latent prior distribution is defined as a Gaussian distribution, and the KL regularization terms of the structure and texture encoders are as follows:

L^S_KL = KL(q_ψ(z_S | I_m) ‖ N(0, I))    (1)

L^T_KL = KL(q_φ(z_T | I_m) ‖ N(0, I))    (2)

where I_m denotes the damaged image; z denotes the latent space, i.e., the compressed data space corresponding to the structural and texture features, in which similar data points lie closer together; q_ψ and q_φ are the importance sampling functions of the image structure distribution and texture distribution, respectively; N(0, I) denotes the Gaussian distribution; L^S_KL is the KL divergence loss function of the structural features; and L^T_KL is the KL divergence loss function of the texture features.
3. A structure and texture feature guided dual-encoder image restoration method according to claim 1, characterized in that the cross-semantic attention module in S2 is placed after the dual-encoder module; the structure encoder feature space F_S and the texture encoder feature space F_T are each mapped to the latent space by a 1×1 convolution filter; the cross-semantic attention module computes attention over the two feature spaces to obtain their attention scores:

β_{j,i} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij})    (3)

where

s_{ij} = Q(F_T)^T K(F_S)    (4)

In formulas (3)-(4), β_{j,i} indicates the extent to which the model attends to the i-th location when synthesizing the j-th region; N denotes the number of pixels of the coarsely restored image; s_{ij} is the product of Q^T and K in the cross-attention module; Q(F_T) denotes the image texture features; and K(F_S) denotes the image structural features; the output O of the cross-semantic attention module is finally computed as:

F_ST = Σ_{i=1}^{N} β_{j,i} V(F_S)    (5)

O = α F_ST + F_S    (6)

where

V(F_S) = W_v F_S    (7)

In formulas (5)-(7), F_ST denotes the attention score; V(F_S) denotes the image structural features to be weighted; W_v is a 1×1 convolution filter; and α is a learnable scale parameter balancing the weights of F_ST and F_S, with initial value 0; the cross-semantic attention network starts by learning the correlation between structural and texture features, and ultimately learns their interdependence and association from the feature maps.
4. A structure and texture feature guided dual-encoder image restoration method according to claim 1, characterized in that the coarse restoration result in S3 is reconstructed pixel by pixel using the Mean Absolute Error (MAE) distance:

L^C_hole = ||M ⊙ (I^C_out - I_g)||_1    (8)

L^C_valid = ||(1 - M) ⊙ (I^C_out - I_g)||_1    (9)

In formulas (8)-(9), I^C_out denotes the coarse restoration result; I_g denotes the gold-standard image; M denotes the binary mask image; L^C_hole is the reconstruction loss function of the defective-image mask region; and L^C_valid is the reconstruction loss function of the non-masked region; the pixel-by-pixel reconstruction loss L^C_r is thus:

L^C_r = λ_rec L^C_hole + L^C_valid    (10)

In formula (10), λ_rec is the reconstruction loss balance factor, set to 20; in addition, for the coarse restoration network, the LSGAN method is used to set the adversarial loss, which, compared with the conventional GAN loss function, makes network training more stable and the generated images more natural; it is defined as follows:

L_D = E_{I_g∼p_data(I_g)}[(D(I_g) - 1)^2] + E_{I^C_out∼p(I^C_out)}[D(I^C_out)^2]    (11)

L_G = E_{I^C_out∼p(I^C_out)}[(D(I^C_out) - 1)^2]    (12)

In formulas (11)-(12), D denotes the discriminator of the GAN; L_D is the adversarial loss function of the discriminator; E_{I_g∼p_data(I_g)} denotes the expectation over the distribution of gold-standard images; L_G is the adversarial loss function of the generator; and E_{I^C_out∼p(I^C_out)} denotes the expectation over the distribution of coarsely restored images;

in summary, the total loss of the coarse restoration network is defined as:

L_C = λ_KL (L^S_KL + L^T_KL) + L^C_r + L_G    (13)
5. A structure and texture feature guided dual-encoder image restoration method according to claim 1, wherein the fused image I_fuse in S4 is defined as follows:

I_fuse = I_out_m + (1 - M) ⊙ I_g    (14)

In equation (14), the fused image I_fuse combines the mask region of the coarsely restored image, I_out_m = M ⊙ I^C_out, with the non-masked region of the gold-standard image I_g.
6. A structure and texture feature guided dual-encoder image restoration method according to claim 1, characterized in that the first training objective of the refinement network in S7 is set to the reconstruction loss L^R_r; as with the reconstruction loss setting in the coarse restoration network, the MAE is used for pixel-by-pixel reconstruction:

L^R_hole = ||M ⊙ (I^R_out - I_g)||_1    (21)

L^R_valid = ||(1 - M) ⊙ (I^R_out - I_g)||_1    (22)

In formulas (21)-(22), I^R_out denotes the fine restoration result; L^R_hole is the reconstruction loss function of the fused-image mask region; and L^R_valid is the reconstruction loss function of the non-masked region of the fused image; a perceptual loss and a style loss are also added: features are extracted from the image with a pre-trained VGG-16 network, and both losses are computed on the spatial features; the perceptual loss L^R_per is defined as follows:

L^R_per = Σ_i ||F_i(I^R_out) - F_i(I_g)||_1    (23)

In formula (23), F_i denotes the i-th layer feature map of the pre-trained VGG-16 network; the style loss L^R_style is defined as follows:

L^R_style = Σ_i ||G_i(I^R_out) - G_i(I_g)||_1    (24)

where G_i denotes a Gram matrix, the covariance matrix between features that captures the correlation between each pair of features; in summary, the total loss L^R of the global refinement network is:

L^R = λ_rec L^R_r + λ_p L^R_per + λ_s L^R_style    (25)

where λ_rec, λ_p, and λ_s are balance factors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310501736.6A CN116523985B (en) | 2023-05-06 | 2023-05-06 | Structure and texture feature guided double-encoder image restoration method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310501736.6A CN116523985B (en) | 2023-05-06 | 2023-05-06 | Structure and texture feature guided double-encoder image restoration method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116523985A CN116523985A (en) | 2023-08-01 |
CN116523985B true CN116523985B (en) | 2024-01-02 |
Family
ID=87402696
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310501736.6A Active CN116523985B (en) | 2023-05-06 | 2023-05-06 | Structure and texture feature guided double-encoder image restoration method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116523985B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117196981B (en) * | 2023-09-08 | 2024-04-26 | 兰州交通大学 | Bidirectional information flow method based on texture and structure reconciliation |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111292265A (en) * | 2020-01-22 | 2020-06-16 | 东华大学 | Image restoration method based on generating type antagonistic neural network |
CN113129234A (en) * | 2021-04-20 | 2021-07-16 | 河南科技学院 | Incomplete image fine repairing method based on intra-field and extra-field feature fusion |
CN113837953A (en) * | 2021-06-11 | 2021-12-24 | 西安工业大学 | Image restoration method based on generation countermeasure network |
WO2022064222A1 (en) * | 2020-09-25 | 2022-03-31 | Panakeia Technologies Limited | A method of processing an image of tissue and a system for processing an image of tissue |
CN114511463A (en) * | 2022-02-11 | 2022-05-17 | 陕西师范大学 | Digital image repairing method, device and equipment and readable storage medium |
CN114973136A (en) * | 2022-05-31 | 2022-08-30 | 河南工业大学 | Scene image recognition method under extreme conditions |
CN115731597A (en) * | 2022-11-24 | 2023-03-03 | 四川轻化工大学 | Automatic segmentation and restoration management platform and method for mask image of face mask |
CN115829880A (en) * | 2022-12-23 | 2023-03-21 | 南京信息工程大学 | Image restoration method based on context structure attention pyramid network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220187841A1 (en) * | 2020-12-10 | 2022-06-16 | AI Incorporated | Method of lightweight simultaneous localization and mapping performed on a real-time computing and battery operated wheeled device |
-
2023
- 2023-05-06 CN CN202310501736.6A patent/CN116523985B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111292265A (en) * | 2020-01-22 | 2020-06-16 | 东华大学 | Image restoration method based on generating type antagonistic neural network |
WO2022064222A1 (en) * | 2020-09-25 | 2022-03-31 | Panakeia Technologies Limited | A method of processing an image of tissue and a system for processing an image of tissue |
CN113129234A (en) * | 2021-04-20 | 2021-07-16 | 河南科技学院 | Incomplete image fine repairing method based on intra-field and extra-field feature fusion |
CN113837953A (en) * | 2021-06-11 | 2021-12-24 | 西安工业大学 | Image restoration method based on generation countermeasure network |
CN114511463A (en) * | 2022-02-11 | 2022-05-17 | 陕西师范大学 | Digital image repairing method, device and equipment and readable storage medium |
CN114973136A (en) * | 2022-05-31 | 2022-08-30 | 河南工业大学 | Scene image recognition method under extreme conditions |
CN115731597A (en) * | 2022-11-24 | 2023-03-03 | 四川轻化工大学 | Automatic segmentation and restoration management platform and method for mask image of face mask |
CN115829880A (en) * | 2022-12-23 | 2023-03-21 | 南京信息工程大学 | Image restoration method based on context structure attention pyramid network |
Non-Patent Citations (4)
Title |
---|
A survey of image inpainting methods; Luo Haiyin; Journal of Frontiers of Computer Science and Technology, Vol. 16, No. 10; full text *
Face image inpainting based on a variational auto-encoder; Zhang Xuefei, Cheng Lechao, Bai Shengli, Zhang Fan, Sun Nongliang, Wang Zhangye; Journal of Computer-Aided Design & Computer Graphics, No. 03; full text *
Application of an enhanced-consistency generative adversarial network to mural restoration; Cao Jianfang, Zhang Zibang, Zhao Aidi, Cui Hongyan, Zhang Qi; Journal of Computer-Aided Design & Computer Graphics, No. 08; full text *
Image inpainting combining dual encoders and adversarial training; Li Jian et al.; Computer Engineering and Applications, Vol. 57, No. 7; full text *
Also Published As
Publication number | Publication date |
---|---|
CN116523985A (en) | 2023-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111784602B (en) | Method for generating countermeasure network for image restoration | |
Xiang et al. | Deep learning for image inpainting: A survey | |
CN112541864A (en) | Image restoration method based on multi-scale generation type confrontation network model | |
CN111787187B (en) | Method, system and terminal for repairing video by utilizing deep convolutional neural network | |
CN109214989A (en) | Single image super resolution ratio reconstruction method based on Orientation Features prediction priori | |
CN111861945A (en) | Text-guided image restoration method and system | |
CN113723174B (en) | Face image super-resolution restoration and reconstruction method and system based on generation countermeasure network | |
CN114943656B (en) | Face image restoration method and system | |
CN114881871A (en) | Attention-fused single image rain removing method | |
CN116523985B (en) | Structure and texture feature guided double-encoder image restoration method | |
CN112801914A (en) | Two-stage image restoration method based on texture structure perception | |
CN115731597A (en) | Automatic segmentation and restoration management platform and method for mask image of face mask | |
CN116310394A (en) | Saliency target detection method and device | |
CN113487512B (en) | Digital image restoration method and device based on edge information guidance | |
Liu et al. | Facial image inpainting using multi-level generative network | |
CN117291803B (en) | PAMGAN lightweight facial super-resolution reconstruction method | |
CN117314778A (en) | Image restoration method introducing text features | |
CN116703750A (en) | Image defogging method and system based on edge attention and multi-order differential loss | |
CN116258632A (en) | Text image super-resolution reconstruction method based on text assistance | |
CN116109510A (en) | Face image restoration method based on structure and texture dual generation | |
Fan et al. | Image inpainting based on structural constraint and multi-scale feature fusion | |
CN116091330A (en) | Image restoration method based on generation countermeasure network | |
CN114862696A (en) | Facial image restoration method based on contour and semantic guidance | |
Bai et al. | Image Inpainting Technique Incorporating Edge Prior and Attention Mechanism. | |
CN115034965A (en) | Super-resolution underwater image enhancement method and system based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |