CN114677311A - Cross-modal image restoration method and device based on attention mechanism - Google Patents
- Publication number: CN114677311A
- Application number: CN202210205553.5A
- Authority
- CN
- China
- Prior art keywords: image, cross, feature, modal, data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T5/00: Image enhancement or restoration
- G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T5/73: Deblurring; Sharpening
- G06T7/529: Depth or shape recovery from texture
- G06F18/22: Matching criteria, e.g. proximity measures
- G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415: Classification based on parametric or probabilistic models, e.g. likelihood ratio
- G06F18/253: Fusion techniques of extracted features
- G06N3/045: Combinations of networks
- G06N3/047: Probabilistic or stochastic networks
- G06N3/08: Learning methods
- G06T2207/10024: Color image
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
Abstract
The invention discloses a cross-modal image restoration method and device based on an attention mechanism. The method comprises the following steps: selecting a multi-modal dataset comprising defective image data, real image data and tactile signals, and dividing it into a training set and a test set; designing an attention-mechanism-based cross-modal image restoration (AGVI) model comprising four modules: learnable feature extraction, feature attention transfer, correlation embedding learning, and cross-modal image restoration; training the AGVI model on the training set to obtain the optimal model structure and network parameters; and performing cross-modal image restoration with the optimal AGVI model using the tactile signals and defective image data in the test set. By introducing an attention mechanism, the method accurately locates the image defect region and uses the key information in the tactile signal to predict and fill that region, achieving high-quality, fine-grained image restoration.
Description
Technical Field
The invention relates to a cross-modal image restoration method and device based on an attention mechanism, and belongs to the technical field of image restoration.
Background
In recent years, with the rapid development of wireless communication and multimedia technologies, multimedia audio-visual services such as ultra-high-definition video and live streaming have largely satisfied users' audio-visual needs. Pursuing richer interaction and more immersive scene experiences, more and more users are exploring sensory channels beyond sight and hearing, such as touch, smell, and taste. With the rapid development of the tactile internet under 5G communication, a new class of haptic services is being merged with traditional audio-visual services into multi-modal services, providing immersive experiences in application scenarios such as virtual games, remote education, and rehabilitation medicine. Supporting such multi-modal services composed of audio, video, and touch requires efficient transmission of heterogeneous code streams. In actual transmission, however, noise, transmission loss, and other factors frequently cause serious image-data distortion or loss at the receiving end, making the quality of the multi-modal signal difficult to guarantee.
At present, image distortion and partial loss during transmission are mainly addressed through image restoration. Mainstream image restoration methods fall into three categories: 1) signal-processing-based; 2) loss-function-based; and 3) deep-model-based.
The first category restores images with classical signal-processing techniques such as low-rank matrix recovery and sparse representation; the main idea is to decompose the original data matrix into low-rank and sparse parts for analysis. For example, researchers proposed the sparse optimization method "l0TV-PADMM", which converts the image-distortion problem into an equivalent biconvex Mathematical Program with Equilibrium Constraints (MPEC), uses sparse representation to find signals with similar structures, and estimates the missing image content. To further exploit local and non-local image characteristics, others applied transform learning to adaptive sparse representation and proposed an image restoration scheme based on joint low-rank regularized transform learning. These signal-processing-based methods are simple in principle and easy to implement, but their restoration quality is poor.
The second category regularizes the restoration process by strengthening the constraints imposed by the loss function. For example, researchers exploited prior knowledge of visual images to construct a hybrid regularized loss function, using spatial-domain and spectral-domain regularization terms to fit the errors of all image frames and achieve high-quality restoration. To further improve the controllability of image denoising, researchers introduced a diversity objective function that correlates the input noise with image semantics, minimizes the restoration distance between images, and lets users manipulate the output image by manually adjusting the noise. Constraining the loss function in this way regularizes the restoration process and helps improve restored-image quality, but such methods still fall short in capturing restoration details.
The third category achieves accurate restoration through the strength of deep learning. With the rapid development of artificial intelligence, deep learning has become the mainstream image-restoration technology owing to its strong learning ability. For example, researchers designed a deep quality-enhancement network that uses a residual network and recursive learning to significantly reduce image artifacts with similar frequency characteristics and to remove noise-induced distortion and blurring. Others modeled denoising as optimization of a designed cost function, estimating model parameters via semi-supervised learning from a large number of defective images and a small number of labeled training samples, thereby removing noise accurately and achieving high-quality restoration. These methods usually achieve good restoration results, but suffer from high model complexity and heavy computation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a cross-modal image restoration method and device based on an attention mechanism, in which tactile signals are applied to image restoration. Specifically, the method selects a standard multi-modal dataset for training and testing, establishes an Attention-Guided cross-modal Visual image Inpainting (AGVI) model driven by tactile signals, and realizes cross-modal image restoration. With the method and device, image defect regions can be accurately located and images restored with high quality and fine granularity at low model complexity, improving image restoration quality and ensuring terminal service level and immersive user experience.
The invention specifically adopts the following technical scheme to solve the technical problems:
A cross-modal image restoration method based on an attention mechanism comprises the following steps:
step 1, selecting a multi-modal dataset comprising defective image data, real image data and tactile signals, and dividing it into a training set and a test set;
step 2, designing an attention-mechanism-based cross-modal image restoration AGVI model, wherein the model comprises:
the learnable feature extraction module is used for extracting the features of the tactile signals, the defective image data and the real image data and participating in subsequent end-to-end model training;
the feature attention transfer module, used for introducing an attention mechanism, locating the image defect region, and acquiring the transfer features characterizing the defect region;
the correlation embedding learning module, used for constructing a correlation embedding learning space with the real label information; while completing the semantic feature learning task by minimizing a semantic association objective function, it measures the difference between predicted and real labels with a cross-entropy-based classification objective, yielding the total objective function of semantic similarity learning between modalities in the correlation embedding learning stage, and mining the tactile features most relevant to the image defect region;
the cross-modal image restoration module, used for performing cross-modal restoration of the defective image data with the mined tactile features most relevant to the defect region, under an inter-pixel perceptual constraint loss function, an appearance constraint loss function, and an adversarial loss function;
step 3, training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters;
and step 4, performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model, using the tactile signals and defective image data in the test set.
Further, as a preferred technical solution of the present invention, the selecting a multi-modal dataset in step 1 specifically includes:
step (1-1), selecting data of three different modalities, namely defective image data I, real image data R, and tactile signals H, to form a multi-modal dataset D; the real image data are original color image signals, the tactile signals are tactile power spectral densities obtained by preprocessing the raw tactile signals, and the defective image data are images with defect rate λ obtained by preprocessing the real image data, where λ ranges between 0 and 1;
step (1-2), for the data of different modalities in the multi-modal dataset D, compiling the real label information Y of the data, i.e. using a one-hot code to mark the category label of the content represented by each datum;
step (1-3), randomly selecting a proportion α of the data from the multi-modal dataset D as the training set D_tr, with the remaining proportion 1 − α used as the test set D_te, where α ranges between 0 and 1.
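The random α / (1 − α) split of step (1-3) can be sketched as follows (a minimal illustration; the function name, tuple layout, and fixed seed are assumptions, not part of the patent):

```python
import random

def split_multimodal_dataset(dataset, alpha, seed=0):
    """Randomly split a list of (defect, real, haptic, label) samples into a
    training set of proportion alpha and a test set of proportion 1 - alpha."""
    assert 0.0 < alpha < 1.0
    rng = random.Random(seed)           # fixed seed for a reproducible split
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    n_train = int(round(alpha * len(dataset)))
    train = [dataset[k] for k in indices[:n_train]]
    test = [dataset[k] for k in indices[n_train:]]
    return train, test
```

Every sample lands in exactly one of the two sets, so the split is a partition of D.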
Further, as a preferred technical solution of the present invention, the extracting characteristics of the haptic signal, the defective image data, and the real image data by the learnable characteristic extracting module in step 2 specifically includes:
for the tactile signal H, a gated recurrent unit (GRU) and a 3-layer fully-connected network are adopted as the haptic mapping network to obtain the haptic feature h and its prediction label y^(h);
for the defective image data I and the real image data R, an image mapping network composed of a deep convolutional neural network is adopted to extract the shallow defective image feature i and the real image feature r; the specific process is:

h = F_h(H; θ_h); i = F_i(I; θ_i); r = F_i(R; θ_i),

wherein h, i and r are the haptic feature, the defective image feature, and the real image feature respectively, with dimensions d_h, d_i and d_r; θ_h and θ_i are the parameter sets of the haptic mapping network F_h(H; θ_h) and the image mapping network F_i(I/R; θ_i) respectively.
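As a rough illustration of the mapping networks, the 3-layer fully-connected portion of F_h can be sketched with plain numpy (the GRU front-end and the convolutional image network F_i are omitted for brevity; all names are illustrative):

```python
import numpy as np

def fc_mapping_network(x, weights, biases):
    """Minimal 3-layer fully-connected mapping: ReLU-activated hidden layers
    followed by a linear output layer, as a stand-in for the dense part of
    the haptic mapping network F_h."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ w + b)      # hidden layer with ReLU
    return h @ weights[-1] + biases[-1]     # linear output (feature h)
```

A practical implementation would learn the weights end-to-end with the rest of the AGVI model, as the patent's training step describes.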
Further, as a preferred technical solution of the present invention, the acquiring, by the attention transfer module in step 2, a transfer characteristic characterizing the defect region specifically includes:
step A, for the defective image feature i and the real image feature r, each feature value is defined as a feature unit, i.e. i = {i_k, k = 1, …, d_i} and r = {r_l, l = 1, …, d_r}, where i_k and r_l denote the k-th feature value of i and the l-th feature value of r, and d_i and d_r denote the dimensions of i and r; then, the normalized inner product between every i_k and every r_l is computed, giving the cosine similarity between all feature units of the two features:

c_{k,l} = ⟨ i_k / ||i_k|| , r_l / ||r_l|| ⟩,

wherein c_{k,l} is the cosine similarity matrix, || · || denotes the modulus operation, and ⟨·,·⟩ denotes the normalized inner product operation;
step B, the part of the real image feature most relevant to each feature unit characterizing the defect region is transferred, i.e. for the k-th row of the cosine similarity matrix c_{k,l}, the maximum over the columns l is taken:

a_k = argmax_l c_{k,l},

wherein a_k is the attention transfer index, denoting the feature unit of the real image feature r most correlated with the k-th position of the defective image feature i;
step C, based on the attention transfer index, a feature selection operation is performed on the real image feature r to obtain from it the transfer feature t characterizing the image defect region:

t_k = r_{a_k},

wherein t_k denotes the feature value at the a_k-th position selected from the real image feature r, transferred to become the feature value at the k-th position of the transfer feature t; the transfer feature is then classified through a sigmoid layer to obtain the transfer-feature prediction label y^(t).
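Steps A through C can be sketched in numpy; here each feature unit is treated as a d-dimensional vector (an assumption for illustration), and the cosine similarity, attention index, and transfer feature follow the formulas above:

```python
import numpy as np

def attention_transfer(i_feat, r_feat):
    """Given defective-image feature units i (shape [Ki, d]) and real-image
    feature units r (shape [Kr, d]), compute the cosine-similarity matrix
    c[k, l], the attention transfer index a_k = argmax_l c[k, l], and the
    transfer feature t with t_k = r[a_k]."""
    i_norm = i_feat / (np.linalg.norm(i_feat, axis=1, keepdims=True) + 1e-12)
    r_norm = r_feat / (np.linalg.norm(r_feat, axis=1, keepdims=True) + 1e-12)
    c = i_norm @ r_norm.T          # cosine similarity, shape [Ki, Kr]
    a = np.argmax(c, axis=1)       # attention transfer index a_k
    t = r_feat[a]                  # transfer feature: t_k = r_{a_k}
    return c, a, t
```

The small epsilon guards against division by zero for all-zero feature units; the argmax implements the hard selection of the single most correlated real-image unit per defect position.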
Further, as a preferred technical solution of the present invention, the mining by the correlation embedding learning module in step 2 of the tactile features most relevant to the image defect region specifically comprises:
step A, the correlation embedding learning space is constructed with the real label information Y = {y}, y ∈ {1, 2, …, C}, to complete the semantic feature learning task; this mainly consists of minimizing the semantic association objective function L_rem:

L_rem = −(1/N²) Σ_{p=1..N} Σ_{q=1..N} ( s_pq · δ_pq − log(1 + e^{δ_pq}) ),  δ_pq = h_p^T t_q,

wherein y denotes a true label, C the total number of training-data classes, and N the total number of training samples; s_pq is the class association factor (1 when the p-th and q-th samples share a class, 0 otherwise) and δ_pq the feature association factor; ŷ_p^(h) and ŷ_q^(t) are the prediction labels of the p-th haptic feature and the q-th transfer feature respectively; (·)^T denotes the transpose operation; h_p and t_q denote the p-th haptic feature and the q-th transfer feature; the semantic association objective function ensures that, in the correlation embedding learning space, transfer features with the same semantics assist the haptic modality in semantic feature learning, i.e. the tactile features most correlated with the image defect region are extracted;
step B, to further enhance feature discriminability, cross-entropy-based loss functions are adopted to minimize the difference between predicted and true labels and complete semantic classification:

L_1 = −(1/N) Σ_{q=1..N} y_q log ŷ_q^(t),  L_2 = −(1/N) Σ_{p=1..N} y_p log ŷ_p^(h),

wherein L_1 and L_2 denote the classification metric objective functions of the transfer and haptic features respectively; y_p and ŷ_p^(h) denote the true and predicted labels of the p-th haptic feature, and y_q and ŷ_q^(t) the true and predicted labels of the q-th transfer feature;
step C, combining steps A and B yields the total objective function of semantic similarity learning between modalities in the correlation embedding learning stage:

L_sim = L_rem + α_1 L_1 + α_2 L_2,

wherein L_sim is the final semantic similarity loss function, α_1 and α_2 are hyperparameters, and L_rem represents the semantic relatedness between the two modalities.
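A numpy sketch of the total objective L_sim = L_rem + α_1·L_1 + α_2·L_2; the logistic negative-log-likelihood form of L_rem and the 0/1 class association factor are assumed conventions, since the published formula is only partially reproduced here:

```python
import numpy as np

def l_rem(h, t, labels_h, labels_t):
    """Semantic association objective: delta_pq = h_p . t_q (feature
    association factor); s_pq = 1 when the p-th haptic and q-th transfer
    sample share a class (class association factor, assumed 0/1)."""
    delta = h @ t.T
    s = (np.asarray(labels_h)[:, None] == np.asarray(labels_t)[None, :]).astype(float)
    return float(np.mean(np.log1p(np.exp(delta)) - s * delta))

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Classification metric objective over one-hot labels."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1)))

def l_sim(h, t, labels_h, labels_t, yh_true, yh_pred, yt_true, yt_pred,
          alpha1=1.0, alpha2=1.0):
    l1 = cross_entropy(yt_true, yt_pred)   # transfer-feature classification L_1
    l2 = cross_entropy(yh_true, yh_pred)   # haptic-feature classification L_2
    return l_rem(h, t, labels_h, labels_t) + alpha1 * l1 + alpha2 * l2
```

Minimizing l_rem pushes same-class haptic/transfer pairs toward large inner products and different-class pairs toward small ones, which is the "semantic relatedness" role L_rem plays in L_sim.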
Further, as a preferred technical solution of the present invention, performing cross-modal restoration of the defective image data in step 2 with the mined tactile features most relevant to the defect region specifically comprises:
step A, a decoder network is adopted to realize cross-modal image restoration; specifically, the tactile feature h and the defective image feature i are added to obtain the repaired image feature î, which is input into the decoder De to obtain the repaired image R̂, i.e. R̂ = De(î; θ_de); the repair process is constrained by the real image data in terms of both appearance and perception, so that the repaired image R̂ is as similar as possible to the real image R:

L_a = E_{R̂~P(R̂)} [ ||R̂ − R||_1 ],  L_p = E_{R̂~P(R̂)} [ ||φ(R̂) − φ(R)||_1 ],

wherein θ_de is the network parameter set of the decoder De; L_a and L_p are the appearance constraint loss function and the perceptual constraint loss function respectively; E_{R̂~P(R̂)} is the expectation over the repaired-image distribution function P(R̂); φ(·) is a VGG-like feature extraction network; and || · ||_1 denotes the L1 norm;
step B, the distribution structure of the repaired image in step A is constrained; specifically, the distribution of the real image data is learned with an adversarial loss function L_adv, defined as:

L_adv = E_{R~P_data(R)} [ log D(R; θ_d) ] + E_{R̂~P(R̂)} [ log(1 − D(R̂; θ_d)) ],

wherein L_adv is the adversarial loss function of the discriminator D; θ_d is the network parameter set of the discriminator D; E_{R~P_data(R)} and E_{R̂~P(R̂)} are the expectations over the real-image distribution function P_data(R) and the repaired-image distribution function P(R̂) respectively; and D(R; θ_d) and D(R̂; θ_d) are the probabilities with which the discriminator D judges the real image and the repaired image to be real;
step C, synthesizing the appearance constraint loss function L_a, the perceptual constraint loss function L_p and the adversarial loss function L_adv of steps A and B, the loss function the decoder De ultimately needs to minimize is:

L_imp = L_a + β_1 L_p + β_2 L_adv,

wherein L_imp is the final repair loss function and β_1, β_2 are hyperparameters.
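The three repair-stage terms and their combination L_imp = L_a + β_1·L_p + β_2·L_adv can be sketched as follows (the feature extractor φ is passed in pre-computed; discriminator outputs are assumed to be probabilities in (0, 1), and the β values are illustrative):

```python
import numpy as np

def appearance_loss(r_hat, r):
    """L_a: expected L1 distance between the repaired and real images."""
    return float(np.mean(np.abs(r_hat - r)))

def perceptual_loss(phi_hat, phi):
    """L_p: L1 distance between VGG-like features of repaired/real images."""
    return float(np.mean(np.abs(phi_hat - phi)))

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """L_adv: E[log D(R)] + E[log(1 - D(R_hat))]."""
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def repair_loss(r_hat, r, phi_hat, phi, d_real, d_fake, beta1=1.0, beta2=0.1):
    """L_imp = L_a + beta1 * L_p + beta2 * L_adv."""
    return (appearance_loss(r_hat, r)
            + beta1 * perceptual_loss(phi_hat, phi)
            + beta2 * adversarial_loss(d_real, d_fake))
```

In training, the decoder minimizes L_imp while the discriminator is trained in opposition on L_adv, following the usual GAN arrangement.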
Further, as a preferred technical solution of the present invention, the training of the cross-modal image restoration AGVI model in step 3 by using a training set specifically includes:
step (3-1), combining the training set divided in step 1 with the real label information from step (1-2) into a standardized input training dataset D_tr:

D_tr = {(I_n, R_n, H_n, y_n), n = 1, 2, …, N},

wherein y_n is the shared real label of the n-th group of defective image data I_n, real image data R_n, and tactile signal H_n participating in training, and N is the total amount of training data;
step (3-2), initializing the network parameter sets of the AGVI model, comprising θ_h, θ_i, θ_de and θ_d, each initialized from a standard normal distribution; wherein θ_h and θ_i are the parameter sets of the haptic mapping network and the image mapping network respectively, θ_de is the network parameter set of the decoder De, and θ_d is the network parameter set of the discriminator D;
step (3-3), setting the total number of iterations to G and recording the current iteration count with g;
step (3-4), training the cross-modal image restoration AGVI model by the stochastic gradient descent method, specifically comprising the following steps:
step one, setting the hyperparameters α_1, α_2, β_1 and β_2, the learning rate μ_1 of the haptic and image mapping networks, and the learning rate μ_2 of the decoder and discriminator;
Step two, calculating the output of each network in the cross-modal image restoration AGVI model:
h = F_h(H; θ_h); i = F_i(I; θ_i); r = F_i(R; θ_i),
step three, starting the iteration; based on the gradient descent method and the Adam optimizer, the parameter set of each network is updated along the negative gradient direction of its objective:

θ^{g+1} = θ^g − μ ∇_θ L(θ^g),

wherein L_sim(·) is the final semantic similarity loss function L_sim and L_imp(·) is the final repair loss function L_imp; θ^{g+1} and θ^g denote the network parameter sets of the haptic mapping network F_h, the image mapping network F_i, the decoder network De, and the discriminator network D after g + 1 and g iterations respectively; ∇_θ denotes the derivative; the mapping networks are updated with learning rate μ_1 against L_sim and L_imp, the decoder with learning rate μ_2 against L_imp, and the discriminator with learning rate μ_2 on its adversarial objective L_adv;
if g < G, let g = g + 1 and jump to step (3-4) to continue the next iteration; otherwise, terminate the iteration;
step (3-5), after G rounds of iteration, finally outputting the structure and network parameters of the optimal cross-modal image restoration AGVI model.
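The per-iteration parameter update in step (3-4) reduces to standard gradient descent; a generic sketch (Adam's moment estimates are omitted for brevity, so this is plain gradient descent under that simplification):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr, n_iters):
    """Iterate theta_{g+1} = theta_g - lr * grad(theta_g) for n_iters rounds,
    mirroring the update of each AGVI parameter set along the negative
    gradient direction of its objective."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - lr * grad_fn(theta)
    return theta
```

For instance, minimizing f(x) = (x − 3)² with gradient 2(x − 3) converges toward x = 3, the pattern each of θ_h, θ_i, θ_de and θ_d follows against its own loss.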
Further, as a preferred technical solution of the present invention, the performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model in step 4 specifically includes:
step (4-1), the test set D_te divided in step 1 is:

D_te = {(I'_j, H'_j), j = 1, 2, …, F},

wherein I'_j and H'_j are the paired defective image data and tactile signal of the j-th group, used for model testing, and F is the total amount of test data;
step (4-2), the data in the test set D_te are input in pairs into the optimal cross-modal image restoration AGVI model trained in step 3; the output is the repaired image.
The invention also provides a cross-modal image restoration device based on an attention mechanism, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the computer program, when loaded into the processor, implements the above cross-modal image restoration method based on an attention mechanism.
By adopting the technical scheme, the invention can produce the following technical effects:
(1) in order to accurately position the defect area of the image, the invention introduces an attention mechanism to carry out weight distribution, focuses on the defect area of the image so as to fully obtain the transfer characteristic of the characteristic defect area and improve the accuracy and the integrity of model repair.
(2) When the similarity of different modes is measured by the model, common semantic and label information is introduced, the difference between transfer characteristics and tactile characteristics is continuously reduced through the double constraints of semantic association and category measurement objective functions, and the partial tactile characteristics with the highest correlation degree with an image defect area in the tactile characteristics are extracted and used for cross-mode restoration of defect image data.
(3) The invention uses a decoder network structure and, under the constraint of semantic and supervision information, achieves high-quality, fine-grained comprehensive repair of the defective image through an inter-pixel perceptual constraint loss function, an appearance constraint loss function, and an adversarial loss function based on the image distribution. Meanwhile, the image quality is improved in terms of semantics and distribution without increasing the complexity of the model.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a schematic structural diagram of a cross-modal image restoration model based on an attention mechanism according to the present invention.
FIG. 3 is a structural frame diagram of the device of the present invention.
FIG. 4 is a graph showing the results of comparing the method of the present invention with the conventional method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
An efficient and accurate cross-modal image restoration method is needed that can precisely locate defect areas, mine the effective information in tactile signals, and achieve high-quality restoration of image data. In recent years, encoder-decoder networks have been very successful in the field of image restoration, and attention mechanisms provide a simple and efficient way to precisely locate image defect areas. The invention therefore proposes a cross-modal image restoration method based on an attention mechanism. Based on the selective attention capacity of the attention mechanism, the image defect area can be accurately located and the transfer features of that area obtained. The correlation embedding learning module combines real label information to construct a correlation embedding learning space: while a semantic association objective function is minimized to complete the semantic feature learning task, a cross-entropy-based classification metric objective function is minimized to measure the difference between predicted and real labels, yielding the final total objective function for learning semantic similarity between modalities in the correlation embedding learning stage and mining the tactile features most correlated with the image defect area. Under the constraints of a perceptual constraint loss function and an appearance constraint loss function that fit inter-pixel differences, and an adversarial loss function that fits distribution differences, the autoencoder-based generative model uses the mined tactile features most relevant to the image defect area to achieve cross-modal restoration of the defective image data, improving image quality in terms of semantics and distribution without increasing model complexity.
Specifically, as shown in fig. 1, the present invention relates to a cross-modal image restoration method based on an attention mechanism, which includes the following steps:
Step 1: select a multi-modal data set, comprising defective image data, real image data and tactile signals, and divide it into a training set and a test set; this specifically includes:
Step (1-1): select data of three different modalities, namely defective image data I, real image data R and tactile signals H, to form a multi-modal data set D. The real image data are original color image signals; the tactile signals are the tactile power spectral densities obtained by preprocessing the raw tactile signals; the defective image data are images with defect rate λ obtained by preprocessing the real image data, with λ = 40%.
Step (1-2): for the data of different modalities in the multi-modal data set D, collect the real label information Y of the data, i.e. use a one-hot code to assign a category label to the content information represented by each datum.
Step (1-3): randomly select data in proportion α from the multi-modal data set D as the training set D_tr, and use the remaining data in proportion 1 − α as the test set D_te; here α = 0.8.
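The split in step (1-3) can be sketched as follows. This is an illustrative sketch only: the function name `split_dataset` and the toy tuples are hypothetical, not part of the invention.

```python
import numpy as np

def split_dataset(D, alpha=0.8, seed=0):
    """Randomly split a list of multi-modal samples into a training set and a
    test set in proportion alpha / (1 - alpha), as in step (1-3)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))
    n_tr = int(alpha * len(D))
    return [D[k] for k in idx[:n_tr]], [D[k] for k in idx[n_tr:]]

# toy multi-modal data set: (defective image, real image, touch signal, label)
D = [("I%d" % n, "R%d" % n, "H%d" % n, n % 3) for n in range(10)]
D_tr, D_te = split_dataset(D, alpha=0.8)
```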
Step 2: design an attention-mechanism-based cross-modal image restoration AGVI model, as shown in fig. 2. The model comprises four modules: a learnable feature extraction module, a transfer-feature attention module, a correlation embedding learning module and a cross-modal image restoration module. Their functions are as follows. First, the underlying features of the real image data, defective image data and tactile signals are extracted. Second, according to the selective attention capacity of the attention mechanism, the image defect area is located and the transfer features of that area are obtained. Then, similarity metrics and supervision information are introduced: a correlation embedding learning space is constructed by combining real label information under the constraint of the semantic association and category metric loss functions; the semantic association objective function is minimized to complete the semantic feature learning task, and a cross-entropy-based classification metric objective function is minimized to reduce the difference between predicted and real labels, yielding the total objective function for learning semantic similarity between modalities in the correlation embedding learning stage and mining, as far as possible, the part of the tactile features most correlated with the image defect area. Finally, combining the inter-pixel perceptual constraint loss function, the appearance constraint loss function and the adversarial loss function, cross-modal restoration of the defective image data is performed using the mined tactile features most relevant to the defective image area, ensuring the quality of the terminal video signal. The steps are specifically as follows:
Step (2-1): the learnable feature extraction module extracts the features of the tactile signal, the defective image data and the real image data, and participates in the subsequent end-to-end model training. The specific implementation process is as follows:
For the tactile signal H, a gated recurrent unit (GRU) and a 3-layer fully-connected network are used as the tactile mapping network to extract the tactile feature h. The GRU network consists of a reset gate and an update gate, with the number of hidden units set to 128; the output dimensions of the fully-connected layers are 1024, 128 and 8, so a 128-dimensional tactile feature h is output, and the last fully-connected layer is a sigmoid layer used to output the tactile feature prediction label y^(h).
For the defective image data I and the real image data R, of size 128 × 128, a deep convolutional neural network is used as the image mapping network to extract hierarchical features, ensuring consistency between the distributions of the defective-image and real-image features. The network comprises 3 convolutional layers and 3 fully-connected layers; the numbers of convolution kernels are 256, 128 and 64, the kernel size is 4 × 4, the output dimensions of the fully-connected layers are 1024 and 128, and the last fully-connected layer outputs the 128-dimensional defective image feature i and real image feature r. The specific process is as follows:

h = F_h(H; θ_h), i = F_i(I; θ_i), r = F_i(R; θ_i),

where h, i and r are the tactile feature, the defective image feature and the real image feature, respectively, each of dimension 128, and θ_h and θ_i are the parameter sets of the tactile mapping network F_h(H; θ_h) and the image mapping network F_i(I/R; θ_i), respectively.
Step (2-2): the transfer-feature attention module introduces an attention mechanism, accurately locates the image defect area, and acquires the transfer features characterizing the defect area. The specific implementation process is as follows:
Step A: for the defective image feature i and the real image feature r, define each feature value as a feature unit, i.e. i = {i_k}, k = 1, 2, …, d_i and r = {r_l}, l = 1, 2, …, d_r, where i_k and r_l denote the k-th feature value in the defective image feature i and the l-th feature value in the real image feature r, and d_i and d_r denote the dimensions of the defective image feature i and the real image feature r, respectively. Then, the normalized inner product of every i_k in the defective image feature i with every r_l in the real image feature r is computed, giving the cosine similarity between all feature units of the two features:

c_{k,l} = ⟨ i_k / ||i_k||, r_l / ||r_l|| ⟩,

where c_{k,l} is the cosine similarity matrix, || · || denotes the modulus operation, and ⟨ ·, · ⟩ denotes the normalized inner product operation.
Step B: the part most relevant to each feature unit characterizing the defect area in the defective image feature is transferred from the real image feature, i.e. for the cosine similarity matrix c_{k,l} with k rows and l columns, the index of the maximum value in each row is taken:

a_k = argmax_l c_{k,l},

where a_k is the attention transfer index, representing the feature unit in the real image feature r most correlated with the k-th position of the defective image feature i.
Step C: based on the attention transfer index, a feature selection operation is performed on the real image feature r to acquire the transfer feature t characterizing the image defect area from the real image feature:

t_k = r_{a_k},

where t_k, the feature value at the k-th position of the transfer feature t, is obtained by transferring the a_k-th feature value of the selected real image feature r. Finally, to enhance the discriminative power of the transfer feature, it is classified through a sigmoid layer to obtain the transfer feature prediction label y^(t).
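Steps A-C of the transfer-feature attention module can be sketched in a few lines of numpy. One assumption is made for illustration: each "feature unit" is treated as a short vector rather than a single scalar value, so that cosine similarity is informative; the function name `attention_transfer` is hypothetical.

```python
import numpy as np

def attention_transfer(i_feat, r_feat):
    """i_feat: (d_i, c) defect-image feature units; r_feat: (d_r, c) real-image
    feature units. Returns the transfer feature t and the attention indices a."""
    # Step A: cosine similarity c[k, l] between every pair of feature units
    i_n = i_feat / (np.linalg.norm(i_feat, axis=1, keepdims=True) + 1e-8)
    r_n = r_feat / (np.linalg.norm(r_feat, axis=1, keepdims=True) + 1e-8)
    c = i_n @ r_n.T                    # (d_i, d_r) cosine similarity matrix
    # Step B: attention transfer index a_k = argmax_l c[k, l]
    a = np.argmax(c, axis=1)
    # Step C: select, for each position k, the most correlated real-image unit
    t = r_feat[a]
    return t, a

rng = np.random.default_rng(0)
r = rng.normal(size=(8, 4))
t, a = attention_transfer(r.copy(), r)  # identical features -> identity transfer
```

With identical defect and real features, each row's best match is itself, so the transfer feature reproduces the real feature exactly.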
Step (2-3): the correlation embedding learning module constructs a correlation embedding learning space by combining real label information. While the semantic association objective function is minimized to complete the semantic feature learning task, a cross-entropy-based classification metric objective function is minimized to measure the difference between predicted and real labels, yielding the total objective function for learning semantic similarity between modalities in the correlation embedding learning stage and mining the tactile features most correlated with the image defect area. The specific implementation process is as follows:
Step A: a correlation embedding learning space is constructed using the category label information Y = {y}, y ∈ {1, 2, …, C}, to complete the semantic feature learning task. This mainly consists in minimizing the semantic association objective function L_rem, realized as:

L_rem = −(1/N²) Σ_{p=1}^{N} Σ_{q=1}^{N} ( s_pq δ_pq − log(1 + e^{δ_pq}) ), δ_pq = (h_p)^T t_q,

where y denotes the real label, C the total number of training data classes and N the total number of training data; s_pq is the class association factor and δ_pq the feature association factor; ŷ_p^(h) and ŷ_q^(t) are the prediction labels of the p-th tactile feature and the q-th transfer feature, respectively; (·)^T denotes the transpose operation; and h_p and t_q denote the p-th tactile feature and the q-th transfer feature. Observing the semantic association function L_rem, it can be found that when s_pq = 1, the larger δ_pq is, the smaller L_rem is, and vice versa. The semantic association objective function L_rem ensures that, in the correlation embedding learning space, transfer features with the same semantics assist the tactile modality in semantic feature learning, i.e. the tactile features most correlated with the image defect area are extracted from the tactile features produced by the learnable feature extraction module.
Step B: to further enhance the feature discrimination capability, a cross-entropy-based loss function is adopted to minimize the difference between the predicted and real labels and complete semantic classification:

L_1 = −(1/N) Σ_{q=1}^{N} (y_q)^T log ŷ_q^(t), L_2 = −(1/N) Σ_{p=1}^{N} (y_p)^T log ŷ_p^(h),

where L_1 and L_2 denote the classification metric objective functions of the transfer features and the tactile features, respectively; y_p and ŷ_p^(h) denote the real and predicted labels of the p-th tactile feature, and y_q and ŷ_q^(t) the real and predicted labels of the q-th transfer feature.
Step C: combining the loss functions in steps A and B gives the total objective function for learning semantic similarity between modalities in the correlation embedding learning stage:

L_sim = L_rem + α_1 L_1 + α_2 L_2,

where L_sim is the final semantic similarity loss function, α_1 and α_2 are hyperparameters, and L_rem represents the semantic association between the two modalities.
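A numpy sketch of L_sim is given below. The closed form of L_rem is not reproduced verbatim in the text above, so the sketch assumes the pairwise negative log-likelihood commonly used for semantic association learning, which matches the stated behaviour (for s_pq = 1, a larger δ_pq gives a smaller L_rem); all function and variable names are illustrative.

```python
import numpy as np

def semantic_similarity_loss(h, t, y_h, y_t, yhat_h, yhat_t, a1=1e-3, a2=1e-4):
    """h: (N, d) tactile features; t: (N, d) transfer features;
    y_h / y_t: (N,) true class ids; yhat_h / yhat_t: (N, C) predicted probs."""
    # feature association factor delta_pq = h_p^T t_q
    delta = h @ t.T
    # class association factor s_pq = 1 if labels match, else 0
    s = (y_h[:, None] == y_t[None, :]).astype(float)
    # L_rem: assumed pairwise negative log-likelihood (form not given verbatim)
    L_rem = np.mean(np.log1p(np.exp(delta)) - s * delta)
    # L_1, L_2: cross-entropy losses for transfer / tactile feature labels
    N = len(y_h)
    L1 = -np.mean(np.log(yhat_t[np.arange(N), y_t] + 1e-12))
    L2 = -np.mean(np.log(yhat_h[np.arange(N), y_h] + 1e-12))
    return L_rem + a1 * L1 + a2 * L2

# toy check: matched pairs with high correlation give a smaller loss
h = 3.0 * np.eye(2)
y = np.array([0, 1])
p = np.full((2, 2), 0.5)                # uninformative predicted probabilities
loss_good = semantic_similarity_loss(h, h, y, y, p, p)
loss_bad = semantic_similarity_loss(h, h[::-1], y, y, p, p)
```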
Step (2-4): the cross-modal image restoration module performs cross-modal restoration of the defective image data using the mined tactile features most relevant to the defective image area, combined with the inter-pixel perceptual constraint loss function, the appearance constraint loss function and the adversarial loss function. The specific implementation process is as follows:
Step A: a decoder network is used to realize cross-modal image restoration. Specifically, the tactile feature h and the defective image feature i are added to obtain the restored image feature î = h + i, and î is input to the decoder De to obtain the restored image R̂, i.e. R̂ = De(î; θ_de). The restoration process is constrained by the real image data in terms of both appearance and perceptual characteristics, so that the restored image R̂ is as similar as possible to the real image R:

L_a = E_{R̂ ∼ P(R̂)} [ ||R̂ − R||_1 ],
L_p = E_{R̂ ∼ P(R̂)} [ ||Φ(R̂) − Φ(R)||_1 ],

where θ_de is the network parameter set of the decoder De, L_a and L_p are the appearance constraint loss function and the perceptual constraint loss function respectively, E_{R̂ ∼ P(R̂)}[·] is the expectation over the distribution P(R̂) of the restored image, Φ is a feature extraction network similar to VGG, and || · ||_1 is the L1-norm operation.
In this module, the decoder De comprises 2 fully-connected layers and 4 deconvolution layers; the dimensions of the fully-connected layers are 128 and 1024 respectively, the numbers of deconvolution kernels are 64, 128, 256 and 512, and the output is the 128 × 128 restored image R̂. The discriminator D comprises 3 convolutional layers and 3 fully-connected layers; the convolutional layer output dimensions are 512, 256 and 128, the convolution kernel size is 5 × 5, and the fully-connected layer dimensions are 1024, 128 and 1. Finally, a number in the range (0, 1) is output, representing the probability that the input image is a real image.
Step B: after cross-modal image restoration, the distribution structure of the restored image from step A is further constrained. Specifically, the distribution of the real image data is learned using an adversarial loss function; the adversarial loss function L_adv is defined as:

L_adv = E_{R ∼ P_data(R)} [ log D(R; θ_d) ] + E_{R̂ ∼ P(R̂)} [ log(1 − D(R̂; θ_d)) ],

where L_adv is the adversarial loss function of the discriminator D, θ_d is the network parameter set of the discriminator D, E_{R ∼ P_data(R)}[·] and E_{R̂ ∼ P(R̂)}[·] are the expectations over the real image distribution P_data(R) and the restored image distribution P(R̂), and D(R; θ_d) and D(R̂; θ_d) are the probabilities with which the discriminator D identifies the real image and the restored image as real, respectively.
Step C: combining the appearance constraint loss function L_a, the perceptual constraint loss function L_p and the adversarial loss function L_adv from steps A and B, the loss function that the decoder De ultimately needs to minimize is:

L_imp = L_a + β_1 L_p + β_2 L_adv,

where L_imp is the final repair loss function, and β_1 and β_2 are hyperparameters.
Step 3: train the cross-modal image restoration AGVI model with the training set to obtain the optimal cross-modal image restoration AGVI model structure and network parameters, as follows:
Step (3-1): combine the training set divided in step 1 with the real label information in step (2-3) into a standardized training data set D_tr:

D_tr = {(I_n, R_n, H_n, y_n), n = 1, 2, …, N},

where y_n is the real label of the n-th group of defective image data I_n, real image data R_n and tactile signal H_n participating in training, and N is the total amount of training data.
Step (3-2): initialize the network parameter sets of the AGVI model, comprising θ_h, θ_i, θ_de and θ_d; these parameters are initialized from a standard normal distribution. Here θ_h and θ_i are the parameter sets of the tactile mapping network and the image mapping network, respectively; θ_de is the network parameter set of the decoder De, and θ_d is the network parameter set of the discriminator D.
Step (3-3): set the total number of iterations G = 400, and use g to record the current iteration number.
Step (3-4): train the AGVI model with the stochastic gradient descent method; the specific process is as follows:
First, set the hyperparameters α_1 = 10^{-3}, α_2 = 10^{-4}, β_1 = 10^{-4}, β_2 = 10^{-5}; the learning rate of the tactile mapping network and the image mapping network is μ_1 = 0.0005, and the learning rate of the decoder and the discriminator is μ_2 = 0.0003;
Step two, calculating the output of each network in the AGVI model:
h = F_h(H; θ_h); i = F_i(I; θ_i); r = F_i(R; θ_i),
Step three, start the iteration. Based on the gradient descent method and the Adam optimizer, update the parameter set of each network along the negative gradient direction of its objective:

θ_h^{g+1} = θ_h^g − μ_1 ∂L_sim/∂θ_h, θ_i^{g+1} = θ_i^g − μ_1 ∂L_sim/∂θ_i,
θ_de^{g+1} = θ_de^g − μ_2 ∂L_imp/∂θ_de, θ_d^{g+1} = θ_d^g + μ_2 ∂L_adv/∂θ_d,

where L_sim(·) is the final semantic similarity loss function L_sim and L_imp(·) is the final repair loss function L_imp; θ_h^{g+1}, θ_i^{g+1}, θ_de^{g+1} and θ_d^{g+1} (respectively θ_h^g, θ_i^g, θ_de^g and θ_d^g) are the network parameter sets of the tactile mapping network F_h, the image mapping network F_i, the decoder network De and the discriminator network D after g + 1 (respectively g) iterations, the discriminator ascending its adversarial objective L_adv; ∂/∂θ denotes the derivative.
Step four: if g < G, jump to step (3-4), add 1 to the iteration count (g = g + 1) and continue the next iteration; otherwise, terminate the iteration.
Step (3-5): after G rounds of iteration, output the optimal AGVI model structure and network parameters.
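The iterative parameter update of step (3-4), θ^{g+1} = θ^g − μ · ∂L/∂θ, can be illustrated on a toy objective; here a quadratic loss stands in for L_sim / L_imp, and the plain update below omits the Adam moment estimates for brevity.

```python
# Minimal gradient-descent loop illustrating the update rule used in
# step (3-4). G and mu are illustrative values, not the patent's settings.
def train(theta0, grad, mu=0.1, G=100):
    theta = theta0
    for g in range(G):                     # G rounds of iteration
        theta = theta - mu * grad(theta)   # negative-gradient update
    return theta

# toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3);
# the iterate converges to the minimizer theta = 3
theta_star = train(theta0=0.0, grad=lambda th: 2.0 * (th - 3.0))
```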
Step 4: perform cross-modal image restoration based on the optimal cross-modal image restoration AGVI model, using the tactile signals and defective image data in the test set; specifically:
Step (4-1): the test set D_te divided in step 1 is:

D_te = {(I'_j, H'_j), j = 1, 2, …, F},

where I'_j and H'_j are the j-th paired defective image data and tactile signal used for model testing, and F is the total amount of test data.

Step (4-2): the data in the test set D_te are input in pairs into the optimal cross-modal image restoration AGVI model trained in step 3, and the output is the restored image.
As shown in fig. 3, the present invention further relates to a cross-modal image restoration device based on an attention mechanism, which specifically comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein:
1. the memory is used to store at least one program;
2. the processor is used to load the at least one program and execute the attention-mechanism-based cross-modal image restoration method of the above embodiment, realizing high-quality, fine-grained restoration of the defective image.
Performance evaluation:
The invention carries out experiments according to the above process, selecting the LMT Material Surface standard data set as the experimental data set. The data set was published in the IEEE TRANSACTIONS ON HAPTICS journal in 2017 in the document "Multimodal Feature-based Surface Material Classification" (authors Matti Strese, Clemens Schuwerk, Albert Iepure, and Eckehard Steinbach), and contains material information in three modalities: image, sound and tactile acceleration. For each category, 80% of the (tactile, image) data were selected as the training set, and the remaining 20% were used as the test set.
Existing method one: the document "PatchMatch: A randomized correspondence algorithm for structural image editing" (authors Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman) proposes a typical block-matching image repair algorithm that repairs images by finding similar blocks in non-defective regions and copying them to the defective regions. This is an intra-modal restoration method that repairs mainly based on the information of the image itself.
Existing method two: the document "Context Encoders: Feature Learning by Inpainting" (authors Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, and Trevor Darrell) proposes an unsupervised image feature learning algorithm based on context pixels, which uses a codec as the basic structure and the pixel information of the image itself to complete repair and filling of the defect area content.
Existing method three: the document "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" (authors Alec Radford, Luke Metz, and Soumith Chintala) proposes an image restoration model based on a deep convolutional generative adversarial network, which includes two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample comes from the training data rather than from G; image restoration is realized through this adversarial game.
Existing method four: the document "Text-Guided Neural Image Inpainting" (authors Lisai Zhang, Qingcai Chen, Baotian Hu, and Shuoran Jiang) proposes a text-guided image restoration algorithm that takes descriptive text as a condition, compares the semantic content of the given text and the remaining image through an attention mechanism, and finds the semantic content with which to fill the defective part for image restoration.
Reference method one: the attention mechanism module is removed, and image restoration is realized only by fusing the tactile features and the defective image features. This method is used to verify the importance of the attention mechanism module.
The present invention: the method of this embodiment.
The quality of the repaired image is evaluated mainly from two aspects of subjective qualitative evaluation and objective quantitative evaluation.
First, in terms of subjective qualitative assessment, fig. 4 shows the image restoration results of the comparison schemes and of the AGVI model of the method of the present invention. The defect rate in this experiment was 40%, and the size of the defect region was about 51 × 51. Each row of images shows, from left to right: the original image, the defective image, existing methods one to four, reference method one, and the AGVI model of the present method. As can be seen from fig. 4, compared with the deep image restoration methods, the restoration result of existing method one has a poor visual effect: textures cannot be clearly resolved, and image blur and loss of detail are obvious. Existing method two, based on the self-encoder, only repairs the color of the damaged area accurately and reasonably, while the structural and textural details remain very blurry. Compared with methods one and two, the restoration results of existing methods three and four show clearer texture and structural features, but exhibit obvious blurring and artifacts at the edges. In the restoration result of the AGVI model of the present method, the semantic information of the repaired area fits the whole image, the artifacts and blurring at the edges disappear, and the restoration effect is more realistic. Meanwhile, compared with reference method one, the designed model recovers more complete texture and structure information, and its image restoration performance is the best.
Table I. Evaluation results of the present invention

Method | MSE | SSIM
Existing method one | 130.230 | 0.587
Existing method two | 125.622 | 0.616
Existing method three | 126.145 | 0.606
Existing method four | 118.547 | 0.657
Reference method one | 115.369 | 0.623
AGVI | 100.750 | 0.845
Second, in terms of objective quantitative evaluation, the mean square error (MSE) and the structural similarity (SSIM), two common image quality evaluation indexes, are adopted; the smaller the MSE and the larger the SSIM, the better the quality of the cross-modally restored image. Table I shows the MSE and SSIM scores of the AGVI model of the invention and of the other comparison models, evaluating performance from the two perspectives of image perception and structural comparison. As can be seen from the table, the AGVI model of the invention achieves the lowest MSE score and the highest SSIM score. Existing methods one, two and three mainly perform intra-modal repair based on the information of the defective image itself; when the image is severely defective, their repair effect is poor, with MSE scores around 125-130 and SSIM values of only about 0.6. Existing method four, reference method one and the AGVI model of the invention verify the ability of non-image modal information to repair the content of the defect area, and the attention mechanism clearly improves the repair effect. In particular, the AGVI model designed by the method of the present invention combines the repair capability of the self-encoder with the selective attention capability of the attention mechanism, and its repair results have the best visual quality.
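The two evaluation indexes can be sketched as follows. Note that the SSIM below is computed over a single global window for brevity, whereas the standard index averages the measure over local windows.

```python
import numpy as np

def mse(x, y):
    """Mean square error between two images."""
    return np.mean((x - y) ** 2)

def ssim_global(x, y, L=255.0):
    """Single-window (global) SSIM for dynamic range L; a simplified sketch
    of the standard formula, not a full windowed implementation."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))
```

For identical images, MSE is 0 and SSIM is exactly 1, the respective best scores.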
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
Claims (9)
1. A cross-mode image restoration method based on an attention mechanism is characterized by comprising the following steps:
step 1: selecting a multi-modal data set, comprising defective image data, real image data and tactile signals, and dividing it into a training set and a test set;
step 2: designing an attention-mechanism-based cross-modal image restoration AGVI model, wherein the model comprises:
the learnable feature extraction module is used for extracting the features of the tactile signals, the defective image data and the real image data and participating in subsequent end-to-end model training;
the transfer characteristic attention module is used for introducing an attention mechanism, positioning the image defect area and acquiring transfer characteristics representing the defect area;
the correlation embedding learning module, used to construct a correlation embedding learning space by combining real label information; while the semantic association objective function is minimized to complete the semantic feature learning task, a cross-entropy-based classification metric objective function is minimized to measure the difference between predicted and real labels, obtaining the total objective function for learning semantic similarity between modalities in the correlation embedding learning stage, and mining the tactile features most correlated with the image defect area among the extracted tactile features;
the cross-modal image restoration module, used to combine the inter-pixel perceptual constraint loss function, the appearance constraint loss function and the adversarial loss function, and to perform cross-modal restoration of the defective image data using the mined tactile features most relevant to the defective image area;
step 3, training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters;
and 4, performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signals and the defective image data in the test set.
2. The attention-mechanism-based cross-modal image restoration method of claim 1, wherein selecting a multi-modal data set in step 1 specifically comprises:
step (1-1): selecting data of three different modalities, namely defective image data I, real image data R and tactile signals H, to form a multi-modal data set D; the real image data are original color image signals, the tactile signals are the tactile power spectral densities obtained by preprocessing the raw tactile signals, and the defective image data are images with defect rate λ obtained by preprocessing the real image data, where λ ranges between 0 and 1;

step (1-2): for the data of different modalities in the multi-modal data set D, collecting the real label information Y of the data, i.e. using a one-hot code to assign a category label to the content information represented by each datum;

step (1-3): randomly selecting data in proportion α from the multi-modal data set D as the training set D_tr, and using the remaining data in proportion 1 − α as the test set D_te, where α ranges between 0 and 1.
3. The attention-mechanism-based cross-modal image restoration method of claim 1, wherein the learnable feature extraction module in step 2 extracts features of the tactile signal, the defective image data and the real image data, specifically comprising:
for the tactile signal H, a gated recurrent unit (GRU) and a 3-layer fully-connected network are adopted as the tactile mapping network to acquire the tactile feature h and the tactile feature prediction label y^(h);
for the defective image data I and the real image data R, an image mapping network formed by a deep convolutional neural network is adopted to extract the shallow defective image feature i and real image feature r, the specific process being:

i = F_i(I; θ_i), r = F_i(R; θ_i),

where θ_i is the parameter set of the image mapping network F_i.
4. The attention-mechanism-based cross-modal image restoration method of claim 1, wherein acquiring the transfer features characterizing the defect area by the transfer-feature attention module in step 2 specifically comprises:
step A: for the defective image feature i and the real image feature r, defining each feature value as a feature unit, i.e. i = {i_k}, k = 1, 2, …, d_i and r = {r_l}, l = 1, 2, …, d_r, where i_k and r_l denote the k-th feature value in the defective image feature i and the l-th feature value in the real image feature r, and d_i and d_r denote the dimensions of the defective image feature i and the real image feature r, respectively; then computing the normalized inner product of every i_k in the defective image feature i with every r_l in the real image feature r, giving the cosine similarity between all feature units of the two features:

c_{k,l} = ⟨ i_k / ||i_k||, r_l / ||r_l|| ⟩,

where c_{k,l} is the cosine similarity matrix, || · || denotes the modulus operation, and ⟨ ·, · ⟩ denotes the normalized inner product operation;
step B: transferring from the real image feature the part most relevant to each feature unit characterizing the defect area in the defective image feature, i.e. for the cosine similarity matrix c_{k,l} with k rows and l columns, taking the index of the maximum value in each row:

a_k = argmax_l c_{k,l},

where a_k is the attention transfer index, representing the feature unit in the real image feature r most correlated with the k-th position of the defective image feature i;
step C, based on the attention transfer index, performing a feature selection operation on the real image feature r to acquire the transfer feature t characterizing the image defect region from the real image feature, this process being expressed as:
t_k = r_{a_k},
wherein t_k represents selecting the feature value at the a_k-th position in the real image feature r as the feature value at the k-th position in the transfer feature t; the transfer feature is then subjected to feature classification through a sigmoid layer to obtain the prediction label y^(t) of the transfer feature.
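The three steps of claim 4 can be sketched in self-contained NumPy as follows, treating each feature unit as a d-dimensional vector (an assumption for illustration; the claim speaks of individual feature values) with hypothetical shapes:

```python
import numpy as np

def attention_transfer(i_feat, r_feat, eps=1e-8):
    """i_feat: (Ni, d) defect image feature units; r_feat: (Nr, d) real
    image feature units. Returns the transfer feature t of shape (Ni, d)."""
    # Step A: cosine similarity via normalized inner product, c[k, l].
    i_n = i_feat / (np.linalg.norm(i_feat, axis=1, keepdims=True) + eps)
    r_n = r_feat / (np.linalg.norm(r_feat, axis=1, keepdims=True) + eps)
    c = i_n @ r_n.T
    # Step B: attention transfer index a_k = argmax_l c[k, l].
    a = c.argmax(axis=1)
    # Step C: feature selection, t_k = r_{a_k}.
    return r_feat[a]

rng = np.random.default_rng(0)
i_feat = rng.standard_normal((6, 16))    # toy defect image feature
r_feat = rng.standard_normal((10, 16))   # toy real image feature
t = attention_transfer(i_feat, r_feat)
```

By construction every unit of t is copied from r, so the transfer feature carries real-image content aligned to the defect positions.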
5. The attention mechanism-based cross-modal image restoration method according to claim 1, wherein the step 2 of mining, by the correlation embedding learning module, the haptic feature most relevant to the image defect region from the haptic features specifically comprises:
step A, constructing a correlation embedding learning space by using the real label information Y = {y}, y ∈ {1, 2, …, C}, to complete the semantic feature learning task; this process is mainly realized by minimizing a semantic association objective function L_rem, whose feature association factor is defined as:
δ_pq = h_p * t_q,
wherein y represents the true label, C represents the total number of training data classes, N represents the total number of training data, s_pq is the class association factor, δ_pq is the feature association factor, y_p^(h) and y_q^(t) are respectively the prediction label of the p-th haptic feature and the prediction label of the q-th transfer feature, (·)^T denotes the transpose operation, and h_p and t_q respectively represent the p-th haptic feature and the q-th transfer feature; the semantic association objective function ensures that, in the correlation embedding learning space, transfer features with the same semantics can assist the haptic modality in semantic feature learning, namely the haptic feature with the highest correlation with the image defect region is extracted from the haptic features;
step B, in order to further enhance the feature discrimination capability, a cross-entropy-based loss function is adopted to minimize the difference between the predicted labels and the true labels to complete semantic classification, this process being expressed as:
L_1 = −(1/N) Σ_{q=1}^{N} y_q log y_q^(t),
L_2 = −(1/N) Σ_{p=1}^{N} y_p log y_p^(h),
wherein L_1 and L_2 represent the classification metric objective functions of the transfer features and the haptic features respectively, y_p and y_p^(h) represent respectively the true label and the predicted label of the p-th haptic feature, and y_q and y_q^(t) represent respectively the true label and the predicted label of the q-th transfer feature;
step C, combining step A and step B, the total objective function of semantic similarity learning between different modalities in the correlation embedding learning stage is obtained, expressed as:
L_sim = L_rem + α_1 L_1 + α_2 L_2,
wherein L_sim is the final semantic similarity loss function, α_1 and α_2 are hyperparameters, and L_rem represents the semantic relatedness between the two modalities.
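A minimal sketch of the combined objective L_sim of claim 5 follows. The exact form of L_rem is not reproduced in the text above, so a common pairwise negative-log-likelihood over s_pq and δ_pq is assumed here, and the classification terms use standard cross-entropy over predicted class probabilities; all shapes and data are toys:

```python
import numpy as np

def semantic_similarity_loss(h, t, labels, probs_h, probs_t,
                             alpha1=0.1, alpha2=0.1):
    """L_sim = L_rem + a1*L1 + a2*L2 over N paired samples.
    h, t: (N, d) haptic / transfer features; labels: (N,) int classes;
    probs_h, probs_t: (N, C) predicted class probabilities.
    The likelihood form of L_rem is an assumption."""
    delta = h @ t.T                                        # feature association delta_pq
    s = (labels[:, None] == labels[None, :]).astype(float) # class association s_pq
    # Assumed pairwise negative log-likelihood for L_rem.
    l_rem = np.mean(np.log1p(np.exp(delta)) - s * delta)
    idx = np.arange(len(labels))
    l1 = -np.mean(np.log(probs_t[idx, labels]))            # L1: transfer features
    l2 = -np.mean(np.log(probs_h[idx, labels]))            # L2: haptic features
    return l_rem + alpha1 * l1 + alpha2 * l2

rng = np.random.default_rng(1)
N, C, d = 8, 3, 16
h = rng.standard_normal((N, d)) * 0.1
t = rng.standard_normal((N, d)) * 0.1
labels = rng.integers(0, C, N)
probs = np.full((N, C), 1.0 / C)          # uniform predicted probabilities
loss = semantic_similarity_loss(h, t, labels, probs, probs)
```

With s_pq ∈ {0, 1}, each L_rem term is positive, so the total loss is a positive scalar driven down as same-class haptic and transfer features align.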
6. The attention mechanism-based cross-modal image restoration method according to claim 1, wherein the cross-modal image restoration module in step 2 performs cross-modal restoration on the defective image data by using the mined haptic features most relevant to the image defect region, specifically comprising:
step A, adopting a decoder network to realize cross-modal image restoration; specifically, the haptic feature h and the defect image feature i are added to obtain the repaired image feature î = h + i, and the repaired image feature î is then input into the decoder De to obtain the restored image Î, namely Î = De(î; θ_de); the restoration process is constrained from the two aspects of appearance features and perceptual features by utilizing the real image data, so that the restored image Î is as similar as possible to the real image R, this process being expressed as:
L_a = E_{Î∼P(Î)}[ ||Î − R||_1 ],
L_p = E_{Î∼P(Î)}[ ||F_vgg(Î) − F_vgg(R)||_1 ],
wherein θ_de is the network parameter set of the decoder De, L_a and L_p are respectively the appearance constraint loss function and the perceptual constraint loss function, E_{Î∼P(Î)} is the expectation over the restored image distribution P(Î), F_vgg is a feature extraction network similar to VGG, and ||·||_1 denotes the L1 norm operation;
step B, constraining the distribution structure of the repaired image in step A; specifically, the distribution of the real image data is learned by using an adversarial loss function, the adversarial loss function L_adv being defined as:
L_adv = E_{R∼P_data(R)}[ log D(R; θ_d) ] + E_{Î∼P(Î)}[ log(1 − D(Î; θ_d)) ],
wherein L_adv is the adversarial loss function of the discriminator D, θ_d is the network parameter set of the discriminator D, E_{R∼P_data(R)} and E_{Î∼P(Î)} are respectively the expectations over the real image distribution function P_data(R) and the repaired image distribution function P(Î), and D(R; θ_d) and D(Î; θ_d) are respectively the probabilities, identified by the discriminator D, that the real image and the restored image are true;
step C, synthesizing the appearance constraint loss function L_a and the perceptual constraint loss function L_p in step A and the adversarial loss function L_adv in step B, the loss function that the decoder De ultimately needs to minimize is:
L_imp = L_a + β_1 L_p + β_2 L_adv,
wherein L_imp is the final repair loss function, and β_1 and β_2 are both hyperparameters.
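The three loss terms of claim 6 can be sketched for a single image pair as follows; the toy "feature extractor" standing in for the VGG-like network, the fixed discriminator probabilities, and the image shapes are all assumptions:

```python
import numpy as np

def repair_loss(I_hat, R, feat, d_real, d_fake,
                beta1=0.1, beta2=0.01, eps=1e-8):
    """L_imp = L_a + b1*L_p + b2*L_adv for one restored/real image pair.
    feat: stand-in for the VGG-like feature extractor (an assumption);
    d_real/d_fake: discriminator probabilities D(R), D(I_hat)."""
    l_a = np.mean(np.abs(I_hat - R))              # appearance: L1 in pixel space
    l_p = np.mean(np.abs(feat(I_hat) - feat(R)))  # perceptual: L1 in feature space
    # Adversarial term as defined in step B (single-sample estimate).
    l_adv = np.log(d_real + eps) + np.log(1.0 - d_fake + eps)
    return l_a + beta1 * l_p + beta2 * l_adv

rng = np.random.default_rng(0)
R = rng.random((32, 32))                          # toy real image
I_hat = R + 0.05 * rng.standard_normal((32, 32))  # toy restored image
feat = lambda x: np.stack([x.mean(axis=0), x.mean(axis=1)])  # toy "VGG" stand-in
loss = repair_loss(I_hat, R, feat, d_real=0.9, d_fake=0.2)
```

Because L_a and L_p vanish only when the restored image matches the real image, a perfect restoration yields a strictly smaller loss than a noisy one at the same discriminator outputs.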
7. The cross-modal image restoration method based on the attention mechanism according to claim 1, wherein the training of the cross-modal image restoration AGVI model by using the training set in the step 3 specifically comprises:
step (3-1), combining the training set divided in step 1 with the real label information in step (2-3) into a standardized input training data set D_tr:
D_tr = {(I_n, R_n, H_n, y_n), n = 1, 2, …, N},
wherein y_n is the real label of the n-th group of defective image data I_n, real image data R_n, and tactile signal H_n participating in training, and N is the total capacity of the training data;
step (3-2), initializing the network parameter sets of the AGVI model, comprising θ_h, θ_i, θ_de, θ_d, by drawing from a standard normal distribution; wherein θ_h and θ_i are respectively the parameter sets of the haptic mapping network and the image mapping network, θ_de is the network parameter set of the decoder De, and θ_d is the network parameter set of the discriminator D;
step (3-3), setting the total iteration number as G, and recording the specific iteration number by using G;
step (3-4), training a cross-modal image restoration AGVI model by adopting a random gradient descent method, and specifically comprising the following steps:
firstly, setting the hyperparameters α_1, α_2, β_1, β_2, the learning rate μ_1 of the haptic mapping network and the image mapping network, and the learning rate μ_2 of the decoder and the discriminator;
Step two, calculating the output of each network in the cross-modal image restoration AGVI model:
h=Fh(H;θh);i=Fi(I;θi);r=Fi(R;θi),
step three, starting iteration; based on the gradient descent method and the Adam optimizer, the parameter set of each network is updated in the direction of the negative gradient of the objective:
θ_h^(g+1) = θ_h^(g) − μ_1 ∇L_sim;  θ_i^(g+1) = θ_i^(g) − μ_1 ∇L_sim;
θ_de^(g+1) = θ_de^(g) − μ_2 ∇L_imp;  θ_d^(g+1) = θ_d^(g) − μ_2 ∇L_imp,
wherein ∇L_sim is the gradient of the final semantic similarity loss function L_sim and ∇L_imp is the gradient of the final repair loss function L_imp; θ_h^(g+1), θ_i^(g+1), θ_de^(g+1), θ_d^(g+1) and θ_h^(g), θ_i^(g), θ_de^(g), θ_d^(g) are respectively the network parameter sets of the haptic mapping network F_h, the image mapping network F_i, the decoder network De, and the discriminator network D after g+1 and g iterations; ∇ denotes the gradient operator;
if g < G, setting g = g + 1 and jumping to step (3-4) to continue the next iteration; otherwise, terminating the iteration;
and step (3-5), after G rounds of iteration, finally outputting the structure and the network parameters of the optimal cross-modal image restoration AGVI model.
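The alternating update of step (3-4) can be sketched as a skeleton loop; plain gradient descent stands in for Adam, and the quadratic toy gradients are assumptions used only to exercise the loop:

```python
import numpy as np

def train_agvi(params, grad_sim, grad_imp, mu1=1e-3, mu2=1e-4, G=100):
    """Skeleton of the step-(3-4) loop: update the mapping networks
    (theta_h, theta_i) against L_sim and the decoder/discriminator
    (theta_de, theta_d) against L_imp. grad_sim / grad_imp are
    placeholders for the true gradient computations."""
    for g in range(G):
        g_sim = grad_sim(params)          # gradients of L_sim
        g_imp = grad_imp(params)          # gradients of L_imp
        for name in ('theta_h', 'theta_i'):
            params[name] -= mu1 * g_sim[name]
        for name in ('theta_de', 'theta_d'):
            params[name] -= mu2 * g_imp[name]
    return params

# Toy check: quadratic losses drive every parameter set toward zero.
rng = np.random.default_rng(0)
params = {k: rng.standard_normal(4)
          for k in ('theta_h', 'theta_i', 'theta_de', 'theta_d')}
quad = lambda p: {k: 2.0 * v for k, v in p.items()}  # grad of sum(theta**2)
params = train_agvi(params, quad, quad, mu1=0.1, mu2=0.1, G=200)
```

In the real model the two gradient callbacks would backpropagate through F_h, F_i, De and D respectively, with Adam in place of the plain update.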
8. The cross-modal image restoration method based on the attention mechanism according to claim 1, wherein the step 4 of performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model specifically comprises:
step (4-1), the test set D_te divided in step 1 is:
D_te = {(I'_j, H'_j), j = 1, 2, …, F},
wherein I'_j and H'_j are the j-th paired group of defective image data and tactile signal used for model testing, and F is the total capacity of the test data;
step (4-2), the data in the test set D_te are input in pairs into the optimal cross-modal image restoration AGVI model trained in step 3, and the output is the restored image.
9. An attention-based cross-modal image restoration device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements an attention-based cross-modal image restoration method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210205553.5A CN114677311A (en) | 2022-03-03 | 2022-03-03 | Cross-mode image restoration method and device based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114677311A true CN114677311A (en) | 2022-06-28 |
Family
ID=82072316
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210205553.5A Pending CN114677311A (en) | 2022-03-03 | 2022-03-03 | Cross-mode image restoration method and device based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114677311A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
GB2606836A (en) * | 2021-03-15 | 2022-11-23 | Adobe Inc | Generating modified digital images using deep visual guided patch match models for image inpainting
GB2606836B (en) * | 2021-03-15 | 2023-08-02 | Adobe Inc | Generating modified digital images using deep visual guided patch match models for image inpainting
CN116523799A (en) * | 2023-07-03 | 2023-08-01 | 贵州大学 | Text-guided image restoration model and method based on multi-granularity image-text semantic learning
CN116523799B (en) * | 2023-07-03 | 2023-09-19 | 贵州大学 | Text-guided image restoration model and method based on multi-granularity image-text semantic learning
CN116630203A (en) * | 2023-07-19 | 2023-08-22 | 科大乾延科技有限公司 | Integrated imaging three-dimensional display quality improving method
CN116630203B (en) * | 2023-07-19 | 2023-11-07 | 科大乾延科技有限公司 | Integrated imaging three-dimensional display quality improving method
CN116681980A (en) * | 2023-07-31 | 2023-09-01 | 北京建筑大学 | Deep learning-based large-deletion-rate image restoration method, device and storage medium
CN116681980B (en) * | 2023-07-31 | 2023-10-20 | 北京建筑大学 | Deep learning-based large-deletion-rate image restoration method, device and storage medium
CN117853492A (en) * | 2024-03-08 | 2024-04-09 | 厦门微亚智能科技股份有限公司 | Intelligent industrial defect detection method and system based on fusion model
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114677311A (en) | Cross-mode image restoration method and device based on attention mechanism | |
Dai et al. | Human action recognition using two-stream attention based LSTM networks | |
Jinjin et al. | Pipal: a large-scale image quality assessment dataset for perceptual image restoration | |
CN109544524B (en) | Attention mechanism-based multi-attribute image aesthetic evaluation system | |
CN113627482B (en) | Cross-modal image generation method and device based on audio-touch signal fusion | |
Huang et al. | Medical image segmentation using deep learning with feature enhancement | |
Ji et al. | Blind image quality assessment with semantic information | |
CN115546171A (en) | Shadow detection method and device based on attention shadow boundary and feature correction | |
CN115359074A (en) | Image segmentation and training method and device based on hyper-voxel clustering and prototype optimization | |
Li et al. | Spectral feature fusion networks with dual attention for hyperspectral image classification | |
Mi et al. | KDE-GAN: A multimodal medical image-fusion model based on knowledge distillation and explainable AI modules | |
Chen et al. | Video‐based action recognition using spurious‐3D residual attention networks | |
Hu et al. | Hierarchical discrepancy learning for image restoration quality assessment | |
Liu et al. | Dunhuang murals contour generation network based on convolution and self-attention fusion | |
Zeng et al. | Combining CNN and transformers for full-reference and no-reference image quality assessment | |
Qin et al. | Virtual reality video image classification based on texture features | |
CN116630964A (en) | Food image segmentation method based on discrete wavelet attention network | |
Sobal et al. | Joint embedding predictive architectures focus on slow features | |
Mu et al. | Underwater image enhancement using a mixed generative adversarial network | |
Chen et al. | Rethinking visual reconstruction: Experience-based content completion guided by visual cues | |
Li et al. | No‐reference image quality assessment based on multiscale feature representation | |
Han et al. | Blind image quality assessment with channel attention based deep residual network and extended LargeVis dimensionality reduction | |
Pirabaharan et al. | Improving interactive segmentation using a novel weighted loss function with an adaptive click size and two-stream fusion | |
Ye et al. | Human action recognition method based on Motion Excitation and Temporal Aggregation module | |
Lyra et al. | A multilevel pooling scheme in convolutional neural networks for texture image recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||