CN114677311A - Cross-mode image restoration method and device based on attention mechanism - Google Patents

Cross-mode image restoration method and device based on attention mechanism

Info

Publication number
CN114677311A
Authority
CN
China
Prior art keywords
image
cross
feature
modal
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210205553.5A
Other languages
Chinese (zh)
Inventor
魏昕
姚玉媛
周亮
高赟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210205553.5A priority Critical patent/CN114677311A/en
Publication of CN114677311A publication Critical patent/CN114677311A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/73Deblurring; Sharpening
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/529Depth or shape recovery from texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a cross-mode image restoration method and a cross-mode image restoration device based on an attention mechanism, wherein the method comprises the following steps: selecting a multi-mode data set comprising defect image data, real image data and a touch signal, and dividing the data set into a training set and a test set; designing an attention mechanism-based cross-modal image restoration AGVI model, wherein the model comprises four modules of learnable feature extraction, feature attention transfer, correlation embedding learning and cross-modal image restoration; training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters; and performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signal and the defect image data in the test set. According to the method, an attention mechanism is introduced, the image defect area is accurately positioned, the key information in the touch signal is utilized to repair, predict and fill the area, and high-quality and fine-grained repair of the image is achieved.

Description

Cross-mode image restoration method and device based on attention mechanism
Technical Field
The invention relates to a cross-mode image restoration method and device based on an attention mechanism, and belongs to the technical field of image restoration.
Background
In recent years, with the rapid development of wireless communication and multimedia technologies, multimedia audiovisual services such as ultra-high definition video and network live broadcast have basically met audiovisual requirements of users. In order to further pursue more sophisticated interactive feelings and scene experiences, more and more users are exploring sensory experiences other than audio-visual, such as touch, smell, taste, etc. With the rapid development of the haptic internet under 5G communication, a new haptic service, namely a multi-modal service, is merged into the traditional audio-visual service, and immersive experience is provided for users in various application scenes such as virtual games, remote education, rehabilitation and medical treatment. In order to support such multi-modal services composed of audio, video and touch, it is very important to realize efficient transmission of heterogeneous code streams. However, in actual transmission, due to the influence of noise, transmission loss and other factors, the receiving end often has a serious problem of image data distortion or loss, and the quality of the multi-mode signal is difficult to be guaranteed.
At present, the problems of image distortion and partial defects arising during transmission are mainly addressed through image restoration. Mainstream image restoration methods fall into three categories: 1) methods based on signal processing; 2) methods based on loss functions; 3) methods based on depth models.
The first category performs image restoration with typical signal processing techniques, such as low-rank matrices and sparse representation; the main idea is to represent the original data matrix as a low-rank part or a sparse part and to analyse and use these parts. For example, researchers have proposed a sparse optimization method "ℓ0TV-PADMM", which converts the image distortion problem into an equivalent biconvex mathematical program with equilibrium constraints (MPEC), finds specific signals with similar structures by using sparse representation, and estimates the defect content of the image. In order to further utilize the local and non-local characteristics of the image, researchers have applied transform learning to adaptive sparse representation and proposed an image restoration scheme based on joint low-rank regularized transform learning. These methods perform image restoration mainly on the basis of signal processing techniques; the principle is simple and the operation straightforward, but the restoration effect is poor and the quality is low.
The second category regularizes the image restoration process by strengthening the constraint of the loss function. For example, researchers have made full use of the prior knowledge of the visual image to construct a hybrid regularized loss function, searching for the fitting error of all image frames through regularization terms in the spatial and spectral domains, so as to realize high-quality image restoration. To further enhance the controllability of image denoising, researchers have also introduced a diversity objective function that correlates the input noise with the image semantics, minimizes the restoration distance between images, and allows users to manipulate the output image by manually adjusting the noise. The constraint of such methods on the loss function strengthens the image restoration process and helps improve the quality of the restored image, but it falls short in capturing the details of image restoration.
The third category achieves accurate image restoration by virtue of the superior performance of deep learning. With the rapid development of artificial intelligence technology, deep learning has become the mainstream technology for image restoration thanks to its strong learning ability. For example, researchers have designed a quality enhancement network based on a depth model which employs a residual network and recursive learning, significantly reduces image artifacts with similar frequency characteristics, and removes noise-induced distortion and blurring. In addition, the denoising problem has been modelled as the optimization of a designed cost function, and the model parameters are estimated through semi-supervised learning based on a large number of defect images and a small number of labelled training samples, removing noise accurately and effectively and achieving high-quality image restoration. These methods can often achieve a good restoration effect, but suffer from the serious problems of high model complexity and a large amount of computation.
Disclosure of Invention
The invention aims to solve the technical problem of overcoming the defects of the prior art, and provides a cross-modal image restoration method and device based on an attention mechanism in which the touch signal is applied to image restoration. Specifically, the method selects a standard multi-modal dataset for training and testing, establishes an AGVI (Attention-Guided cross-modal Visual Image Inpainting) model driven by the tactile signal, and realizes cross-modal image restoration. By adopting the method and the device, the image defect area can be accurately located, the image can be restored with high quality and fine granularity under low model complexity, the image restoration quality is improved, and the terminal service level and the immersive user experience are guaranteed.
The invention specifically adopts the following technical scheme to solve the technical problems:
a cross-mode image restoration method based on an attention mechanism comprises the following steps:
the method comprises the following steps that 1, a multi-mode data set is selected, wherein the multi-mode data set comprises defect image data, real image data and touch signals, and the multi-mode data set is divided into a training set and a testing set;
step 2, designing a cross-mode image restoration AGVI model based on an attention mechanism, wherein the model comprises the following steps:
the learnable feature extraction module is used for extracting the features of the tactile signals, the defective image data and the real image data and participating in subsequent end-to-end model training;
the transfer characteristic attention module is used for introducing an attention mechanism, positioning the image defect area and acquiring transfer characteristics representing the defect area;
the correlation embedding learning module is used for constructing a correlation embedding learning space by combining the real label information, completing the semantic feature learning task by minimizing a semantic association objective function while adopting a cross-entropy-based classification metric objective function to minimize the difference between the predicted labels and the real labels, obtaining the total objective function of semantic similarity learning between different modalities in the final correlation embedding learning stage, and mining, from the tactile features, those most relevant to the image defect area;
The cross-modal image restoration module is used for performing cross-modal restoration on the defective image data by utilizing the mined tactile features most relevant to the defective image area, in combination with the inter-pixel perceptual constraint loss function, the appearance constraint loss function and the adversarial loss function;
step 3, training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters;
and 4, performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signals and the defective image data concentrated in the test.
Further, as a preferred technical solution of the present invention, the selecting a multi-modal dataset in step 1 specifically includes:
selecting three different modal data including defect image data I, real image data R and tactile signals H to form a multi-modal data set D; the real image data is an original color image signal, the touch signal is touch power spectral density obtained by preprocessing the touch original signal, the defect image data is an image with a defect rate of lambda obtained by preprocessing the real image data, and the value range of lambda is between 0 and 1;
Step (1-2), for data of different modes in the multi-mode data set D, carrying out statistics on real label information Y of the data, namely using a one-hot code to mark a category label of content information represented on each data;
step (1-3) of randomly selecting data with a proportion of α from the multi-modal data set D as the training set D_tr, and using the remaining data with a proportion of 1−α as the test set D_te, wherein the value of α ranges from 0 to 1.
Further, as a preferred technical solution of the present invention, the extracting characteristics of the haptic signal, the defective image data, and the real image data by the learnable characteristic extracting module in step 2 specifically includes:
for the tactile signal H, a gate cycle unit GRU and a 3-layer fully-connected network are adopted as a tactile mapping network to acquire a tactile feature H and a tactile feature prediction label y(h)
Extracting the shallow defect image feature i and real image feature r by adopting an image mapping network formed by a deep convolutional neural network for the defect image data I and the real image data R; the specific process is as follows:
h = F_h(H; θ_h)
i = F_i(I; θ_i)
r = F_i(R; θ_i)
wherein h, i and r are the tactile feature, the defect image feature and the real image feature respectively, whose dimensions are denoted d_h, d_i and d_r; θ_h and θ_i are respectively the parameter sets of the tactile mapping network F_h(H; θ_h) and the image mapping network F_i(I/R; θ_i).
Further, as a preferred technical solution of the present invention, the acquiring, by the attention transfer module in step 2, a transfer characteristic characterizing the defect region specifically includes:
step A, for the defect image feature i and the real image feature r, defining each feature value as a feature unit, namely i = {i_k}, k = 1, 2, …, d_i and r = {r_l}, l = 1, 2, …, d_r, wherein i_k and r_l respectively represent the k-th feature value in the defect image feature i and the l-th feature value in the real image feature r, and d_i and d_r respectively represent the dimensions of the defect image feature i and the real image feature r; then, the normalized inner product of each i_k in the defect image feature i and each r_l in the real image feature r is calculated, giving the cosine similarity between all feature units of the two features, specifically expressed as:
c_{k,l} = < i_k/||i_k|| , r_l/||r_l|| >
wherein c_{k,l} is the cosine similarity matrix, ||·|| represents the modulus operation, and <·,·> represents the inner product operation;
step B, transferring from the real image feature the part most relevant to each feature unit characterizing the defect area in the defect image feature, namely taking, for the k-th row of the cosine similarity matrix c_{k,l}, the column index l with the maximum value; this process is expressed as:
a_k = argmax_l c_{k,l}
wherein a_k is the attention transfer index, representing the feature unit in the real image feature r most correlated with the k-th position of the defect image feature i;
step C, based on the attention transfer index, performing a feature selection operation on the real image feature r to acquire the transfer feature t characterizing the image defect area from the real image feature; this process is expressed as:
t_k = r_{a_k}
wherein t_k represents the feature value at the a_k-th position selected from the real image feature r and transferred to the k-th position of the transfer feature t; the transfer feature is then classified through a sigmoid layer to obtain the transfer feature prediction label y^(t).
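For illustration, steps A to C can be sketched as follows. This is a minimal NumPy sketch rather than the patented implementation, and treating each feature unit as a short vector (rather than a single scalar value) is an assumption made for generality:

```python
import numpy as np

def attention_transfer(i_units: np.ndarray, r_units: np.ndarray) -> np.ndarray:
    """Sketch of the transfer-feature attention step.

    i_units: (d_i, c) defect-image feature units, r_units: (d_r, c) real-image
    feature units (the unit length c is an assumption made for generality).
    """
    # normalise every unit; the inner product then gives the cosine similarity c_{k,l}
    i_n = i_units / (np.linalg.norm(i_units, axis=1, keepdims=True) + 1e-8)
    r_n = r_units / (np.linalg.norm(r_units, axis=1, keepdims=True) + 1e-8)
    c = i_n @ r_n.T                      # (d_i, d_r) cosine similarity matrix

    # attention transfer index: most correlated real-image unit for each defect unit
    a = c.argmax(axis=1)                 # (d_i,)

    # feature selection: t_k = r_{a_k}
    return r_units[a]                    # (d_i, c) transfer feature

# illustrative usage with random 128-dimensional features
rng = np.random.default_rng(0)
t = attention_transfer(rng.normal(size=(128, 1)), rng.normal(size=(128, 1)))
print(t.shape)  # (128, 1)
```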
Further, as a preferred technical solution of the present invention, the mining, by the associated embedded learning module, the haptic characteristics most relevant to the image defect region in the haptic characteristics in step 2 specifically includes:
step A, constructing a correlation embedding learning space by using the real label information Y = {y}, y ∈ {1, 2, …, C}, to complete the semantic feature learning task; this process is mainly realized by minimizing the semantic association objective function L_rem:
L_rem = −(1/N²) Σ_p Σ_q ( s_pq·δ_pq − log(1 + e^{δ_pq}) )
δ_pq = h_p^T·t_q
wherein y represents a real label, C represents the total number of training data classes, N represents the total number of training data, s_pq is the class association factor (equal to 1 when the p-th and q-th training samples share the same class label and 0 otherwise), δ_pq is the feature association factor, y_p^(h) and y_q^(t) are the prediction labels of the p-th tactile feature and the q-th transfer feature respectively, (·)^T denotes the transpose operation, and h_p and t_q represent the p-th tactile feature and the q-th transfer feature respectively; the semantic association objective function ensures that, in the correlation embedding learning space, the transfer features with the same semantics can assist the tactile modality in semantic feature learning, i.e. the tactile features with the highest degree of correlation with the image defect area are extracted from the tactile features;
step B, in order to further enhance the feature discrimination capability, a cross-entropy-based loss function is adopted to minimize the difference between the predicted labels and the real labels and complete the semantic classification; this process is expressed as:
L_1 = −(1/N) Σ_q y_q·log( y_q^(t) )
L_2 = −(1/N) Σ_p y_p·log( y_p^(h) )
wherein L_1 and L_2 represent the classification metric objective functions of the transfer features and the tactile features respectively, y_p and y_p^(h) represent the real label and the predicted label of the p-th tactile feature, and y_q and y_q^(t) represent the real label and the predicted label of the q-th transfer feature;
step C, combining step A and step B, the total objective function of semantic similarity learning between different modalities in the correlation embedding learning stage is obtained, expressed as:
L_sim = L_rem + α_1·L_1 + α_2·L_2
wherein L_sim is the final semantic similarity loss function, α_1 and α_2 are hyper-parameters, and L_rem represents the semantic correlation between the two modalities.
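A compact sketch of this overall similarity objective is given below. The likelihood form of L_rem, the same-class definition of s_pq, the default hyper-parameter values and all tensor shapes are assumptions consistent with the description above, not the literal patented formulas:

```python
import torch
import torch.nn.functional as F

def semantic_similarity_loss(h, t, logits_h, logits_t, y, a1=1e-3, a2=1e-4):
    """L_sim = L_rem + a1*L_1 + a2*L_2 (sketch).

    h, t       : (N, d) haptic features and transfer features
    logits_h/t : (N, C) classifier outputs behind y^(h) and y^(t)
    y          : (N,) integer class labels
    """
    delta = h @ t.t()                                 # feature association delta_pq
    s = (y.unsqueeze(1) == y.unsqueeze(0)).float()    # class association s_pq (assumed: same class)
    l_rem = -(s * delta - F.softplus(delta)).mean()   # semantic association term
    l1 = F.cross_entropy(logits_t, y)                 # transfer-feature classification L_1
    l2 = F.cross_entropy(logits_h, y)                 # haptic-feature classification L_2
    return l_rem + a1 * l1 + a2 * l2

# toy usage with random tensors
h = torch.randn(4, 128); t = torch.randn(4, 128)
y = torch.tensor([0, 1, 1, 2])
loss = semantic_similarity_loss(h, t, torch.randn(4, 8), torch.randn(4, 8), y)
```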
Further, as a preferred technical solution of the present invention, the performing, in step 2, cross-modal restoration on the defective image data by using the mined tactile features most relevant to the defective image area specifically includes:
step A, adopting a decoder network to realize the cross-modal image restoration; specifically, the tactile feature h and the defect image feature i are added to obtain the restored image feature, which is then input into the decoder De to obtain the restored image R̂, namely
R̂ = De(h + i; θ_de)
The restoration process is constrained from the two aspects of appearance characteristics and perceptual characteristics by utilizing the real image data, so that the restored image R̂ is as similar as possible to the real image R; this process is expressed as:
L_a = E_{R̂~P(R̂)} [ ||R̂ − R||_1 ]
L_p = E_{R̂~P(R̂)} [ ||φ(R̂) − φ(R)||_1 ]
wherein θ_de is the network parameter set of the decoder De, L_a and L_p are the appearance constraint loss function and the perceptual constraint loss function respectively, E_{R̂~P(R̂)}[·] is the expectation over the restored image distribution P(R̂), φ(·) is a feature extraction network similar to VGG, and ||·||_1 is the L1 norm operation;
step B, constraining the distribution structure of the restored image in step A; specifically, the distribution of the real image data is learned by using an adversarial loss function, and the adversarial loss function L_adv is defined as:
L_adv = E_{R~P_data(R)} [ log D(R; θ_d) ] + E_{R̂~P(R̂)} [ log(1 − D(R̂; θ_d)) ]
wherein L_adv is the loss function of the discriminator D, θ_d is the network parameter set of the discriminator D, E_{R~P_data(R)}[·] and E_{R̂~P(R̂)}[·] are the expectations over the real image distribution P_data(R) and the restored image distribution P(R̂) respectively, and D(R; θ_d) and D(R̂; θ_d) are the probabilities with which the discriminator D judges the real image and the restored image to be real;
step C, combining the appearance constraint loss function L_a, the perceptual constraint loss function L_p and the adversarial loss function L_adv in step A and step B, the loss function that the decoder De finally needs to minimize is:
L_imp = L_a + β_1·L_p + β_2·L_adv
wherein L_imp is the final restoration loss function, and β_1 and β_2 are both hyper-parameters.
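The three restoration losses and their combination can be sketched as follows. The VGG16-based stand-in for φ and the generator-side form of the adversarial term are assumptions (the patent defines L_adv from the discriminator side):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualExtractor(nn.Module):
    """Stand-in for the VGG-like feature extraction network phi (assumed: the first
    VGG16 conv blocks; weights=None keeps the sketch offline, pretrained weights
    would normally be used)."""
    def __init__(self):
        super().__init__()
        self.features = vgg16(weights=None).features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        return self.features(x)

def repair_loss(r_hat, r, d_fake, phi, beta1=1e-4, beta2=1e-5):
    """L_imp = L_a + beta1*L_p + beta2*L_adv from the decoder's point of view.

    r_hat, r : restored / real images, (N, 3, 128, 128)
    d_fake   : discriminator output D(r_hat) in (0, 1)
    """
    l_a = (r_hat - r).abs().mean()              # appearance constraint (L1 between pixels)
    l_p = (phi(r_hat) - phi(r)).abs().mean()    # perceptual constraint (L1 on phi features)
    l_adv = -torch.log(d_fake + 1e-8).mean()    # adversarial term (non-saturating generator form, assumed)
    return l_a + beta1 * l_p + beta2 * l_adv
```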
Further, as a preferred technical solution of the present invention, the training of the cross-modal image restoration AGVI model in step 3 by using a training set specifically includes:
step (3-1), combining the training set divided in step 1 and the real label information in step (2-3) into a standardized input training data set D_tr:
D_tr = {(I_n, R_n, H_n, y_n), n = 1, 2, …, N},
wherein y_n is the real label of the n-th group of defect image data I_n, real image data R_n and tactile signal H_n participating in training, and N is the total capacity of the training data;
step (3-2), initializing the network parameter sets of the AGVI model, comprising θ_h, θ_i, θ_de and θ_d, to a standard normal distribution; wherein θ_h and θ_i are the parameter sets of the tactile mapping network and the image mapping network respectively, θ_de is the network parameter set of the decoder De, and θ_d is the network parameter set of the discriminator D;
step (3-3), setting the total number of iterations as G, and recording the current iteration number with g;
step (3-4), training the cross-modal image restoration AGVI model by adopting a stochastic gradient descent method, specifically comprising the following steps:
firstly, setting the hyper-parameters α_1, α_2, β_1 and β_2, the learning rate μ_1 of the tactile mapping network and the image mapping network, and the learning rate μ_2 of the decoder and the discriminator;
step two, calculating the output of each network in the cross-modal image restoration AGVI model:
h = F_h(H; θ_h); i = F_i(I; θ_i); r = F_i(R; θ_i); R̂ = De(h + i; θ_de)
step three, starting the iteration; based on the gradient descent method and the Adam optimizer, the parameter set of each network is updated along the negative gradient direction of its objective:
θ_h^(g+1) = θ_h^(g) − μ_1·∇_{θ_h} L_sim
θ_i^(g+1) = θ_i^(g) − μ_1·∇_{θ_i} L_sim
θ_de^(g+1) = θ_de^(g) − μ_2·∇_{θ_de} L_imp
θ_d^(g+1) = θ_d^(g) − μ_2·∇_{θ_d} L_adv
wherein L_sim is the final semantic similarity loss function, L_imp is the final restoration loss function, the superscripts (g+1) and (g) denote the network parameter sets of the tactile mapping network F_h, the image mapping network F_i, the decoder network De and the discriminator network D after g+1 and g iterations respectively, and ∇ denotes the gradient operation;
if g is smaller than G, let g = g + 1 and jump to step (3-4) to continue the next iteration; otherwise, terminate the iteration;
and step (3-5), after G rounds of iteration, finally outputting the structure and network parameters of the optimal cross-modal image restoration AGVI model.
Further, as a preferred technical solution of the present invention, the performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model in step 4 specifically includes:
step (4-1), the test set D_te divided in step 1 is:
D_te = {(I'_j, H'_j), j = 1, 2, …, F},
wherein I'_j and H'_j are the paired defect image data and tactile signal of the j-th group, used for model testing, and F is the total amount of test data;
step (4-2), the data in the test set D_te are input in pairs into the optimal cross-modal image restoration AGVI model trained in step 3, and the output is the restored image.
The invention also provides an attention-based cross-modal image restoration device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the attention-based cross-modal image restoration method when being loaded to the processor.
By adopting the technical scheme, the invention can produce the following technical effects:
(1) in order to accurately position the defect area of the image, the invention introduces an attention mechanism to carry out weight distribution, focuses on the defect area of the image so as to fully obtain the transfer characteristic of the characteristic defect area and improve the accuracy and the integrity of model repair.
(2) When the similarity of different modes is measured by the model, common semantic and label information is introduced, the difference between transfer characteristics and tactile characteristics is continuously reduced through the double constraints of semantic association and category measurement objective functions, and the partial tactile characteristics with the highest correlation degree with an image defect area in the tactile characteristics are extracted and used for cross-mode restoration of defect image data.
(3) The invention uses a decoder network structure and, under the constraint of semantic and supervision information, realizes high-quality and fine-grained comprehensive restoration of the defective image through the inter-pixel perceptual and appearance constraint loss functions and the adversarial loss function based on the image distribution. Meanwhile, the image quality is improved in terms of both semantics and distribution without increasing the complexity of the model.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a schematic structural diagram of a cross-modal image restoration model based on an attention mechanism according to the present invention.
FIG. 3 is a structural frame diagram of the device of the present invention.
FIG. 4 is a graph showing the results of comparing the method of the present invention with the conventional method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
An efficient and accurate cross-modal image restoration method is needed which can accurately locate the defect area, mine the effective information in the touch signal, and achieve high-quality restoration of the image data. In recent years, encoder-decoder networks have achieved good results in the field of image restoration, and attention mechanisms provide a simple and efficient way to accurately locate image defect areas. Therefore, the invention provides a cross-modal image restoration method based on an attention mechanism. Based on the selective attention capability of the attention mechanism, the image defect region is accurately located in a focused manner and the transfer feature of that region is obtained; the correlation embedding learning module combines the real label information to construct a correlation embedding learning space, and while the semantic feature learning task is completed by minimizing a semantic association objective function, a cross-entropy-based classification objective function is adopted to minimize the difference between the predicted labels and the real labels, the total objective function of semantic similarity learning between different modalities in the final correlation embedding learning stage is obtained, and the tactile features most relevant to the image defect region are mined; under the constraints of a perceptual constraint loss function and an appearance constraint loss function, which fit the differences between pixels, and an adversarial loss function, which fits the difference between distributions, the generative model based on the autoencoder uses the mined tactile features most relevant to the image defect region to realize cross-modal restoration of the defect image data, improving image quality in terms of both semantics and distribution without increasing model complexity.
Specifically, as shown in fig. 1, the present invention relates to a cross-modal image restoration method based on attention mechanism, which specifically includes the following steps:
the method comprises the following steps of 1, selecting a multi-mode data set, wherein the multi-mode data set comprises defect image data, real image data and touch signals, and is divided into a training set and a testing set, and the method specifically comprises the following steps:
selecting three different modality data including defect image data I, real image data R and tactile signals H to form a multi-modal data set D; the real image data is an original color image signal, the tactile signal is tactile power spectral density obtained by preprocessing the tactile original signal, the defect image data is an image with a defect rate of lambda obtained by preprocessing the real image data, and the lambda is 40%.
And (1-2) for the data of different modes in the multi-mode data set D, counting the real label information Y of the data, namely using a one-hot code to print the category label of the content information represented by each data.
Step (1-3) of randomly selecting data with a proportion of α from the multi-modal data set D as the training set D_tr, and using the remaining data with a proportion of 1−α as the test set D_te; here, α = 0.8.
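The preprocessing and split of steps (1-1) to (1-3) might look as follows. The Welch estimator, the square defect mask and all parameter values other than λ = 0.4 and α = 0.8 are assumptions:

```python
import numpy as np
from scipy.signal import welch

def preprocess_haptic(raw: np.ndarray, fs: float = 10_000.0) -> np.ndarray:
    """Haptic power spectral density of a raw acceleration trace (Welch estimate)."""
    _, psd = welch(raw, fs=fs, nperseg=min(1024, len(raw)))
    return psd

def make_defect(image: np.ndarray, lam: float = 0.4, seed: int = 0) -> np.ndarray:
    """Blank out a square region covering roughly a fraction `lam` of the pixels."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    side = int(np.sqrt(lam * h * w))
    y0 = rng.integers(0, h - side + 1)
    x0 = rng.integers(0, w - side + 1)
    defect = image.copy()
    defect[y0:y0 + side, x0:x0 + side] = 0
    return defect

def split_dataset(samples: list, alpha: float = 0.8, seed: int = 0):
    """Random alpha / (1 - alpha) split into training and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    cut = int(alpha * len(samples))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]
```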
Step 2, designing the attention-based cross-modal image restoration AGVI model, as shown in fig. 2; the model includes four modules, namely a learnable feature extraction module, a feature attention transfer module, a correlation embedding learning module, and a cross-modal image restoration module, whose functions are as follows: first, the bottom-layer features of the real image data, the defect image data and the tactile signal are extracted; second, according to the selective attention capability of the attention mechanism, the image defect area is located and the transfer features of that area are obtained; then, similarity measurement and supervision information are introduced, a correlation embedding learning space is constructed by combining the real label information under the constraint of the semantic association and class metric loss functions, the semantic feature learning task is completed by minimizing the semantic association objective function, the difference between the predicted labels and the real labels is minimized by the cross-entropy-based classification metric objective function, the total objective function of semantic similarity learning between different modalities in the final correlation embedding learning stage is obtained, and the part of the tactile features most relevant to the image defect area is mined as far as possible; finally, combining the inter-pixel perceptual constraint loss function, the appearance constraint loss function and the adversarial loss function, the defect image data are restored across modalities by using the mined tactile features most relevant to the defect image area, so as to guarantee the quality of the terminal video signal. The specific steps are as follows:
Step (2-1), a learnable feature extraction module, which is used for extracting the features of the tactile signal, the defective image data and the real image data and participating in the subsequent end-to-end model training, and the concrete implementation process is as follows:
for the tactile signal H, a gated recurrent unit (GRU) and a 3-layer fully-connected network are used as the tactile mapping network to extract the tactile feature h; the GRU consists of a reset gate and an update gate with 128 hidden units, the output dimensions of the fully-connected layers are 1024, 128 and 8, a 128-dimensional tactile feature h is output, and the last fully-connected layer is a sigmoid layer used for outputting the tactile feature prediction label y^(h);
For the defect image data I and the real image data R with a size of 128 × 128, in order to ensure the consistency of the distributions of the defect image features and the real image features, a deep convolutional neural network is used as the image mapping network to extract hierarchical features; the network comprises 3 convolutional layers and 3 fully-connected layers, the numbers of convolution kernels are 256, 128 and 64 with a convolution kernel size of 4 × 4, the output dimensions of the fully-connected layers are 1024 and 128, and the last fully-connected layer outputs the 128-dimensional defect image feature i and real image feature r. The specific process is as follows:
h = F_h(H; θ_h)
i = F_i(I; θ_i)
r = F_i(R; θ_i)
In the above formulas, h, i and r are the tactile feature, the defect image feature and the real image feature respectively, whose dimensions are denoted d_h, d_i and d_r; θ_h and θ_i are the parameter sets of the tactile mapping network F_h(H; θ_h) and the image mapping network F_i(I/R; θ_i).
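A PyTorch sketch of the two mapping networks with the layer sizes quoted above is given below; strides, padding, activation functions and the exact layer ordering are not specified in the text and are assumed:

```python
import torch
import torch.nn as nn

class HapticNet(nn.Module):
    """F_h: GRU (128 units) followed by fully-connected layers 1024-128-8;
    the 128-d output is the haptic feature h, the sigmoid head gives y^(h)."""
    def __init__(self, in_dim=1, n_classes=8):
        super().__init__()
        self.gru = nn.GRU(in_dim, 128, batch_first=True)
        self.fc1 = nn.Linear(128, 1024)
        self.fc2 = nn.Linear(1024, 128)
        self.head = nn.Sequential(nn.Linear(128, n_classes), nn.Sigmoid())

    def forward(self, x):                    # x: (N, T, in_dim) haptic PSD sequence
        _, hidden = self.gru(x)              # hidden: (1, N, 128)
        feat = self.fc2(torch.relu(self.fc1(hidden[-1])))
        return feat, self.head(feat)         # h, y^(h)

class ImageMapNet(nn.Module):
    """F_i: 3 conv layers (256/128/64 kernels, 4x4) + fully-connected 1024-128."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 16 * 16, 1024), nn.ReLU(), nn.Linear(1024, 128),
        )

    def forward(self, x):                    # x: (N, 3, 128, 128) defect or real image
        return self.fc(self.conv(x))         # 128-d image feature
```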
Step (2-2), a transfer characteristic attention module for introducing an attention mechanism, accurately positioning the image defect area and acquiring the transfer characteristic representing the defect area, wherein the specific implementation process is as follows:
step A, for the defect image feature i and the real image feature r, defining each feature value as a feature unit, namely i = {i_k}, k = 1, 2, …, d_i and r = {r_l}, l = 1, 2, …, d_r, wherein i_k and r_l respectively represent the k-th feature value in the defect image feature i and the l-th feature value in the real image feature r, and d_i and d_r respectively represent the dimensions of the defect image feature i and the real image feature r. Then, the normalized inner product of each i_k in the defect image feature i and each r_l in the real image feature r is calculated, giving the cosine similarity between all feature units of the two features, specifically expressed as:
c_{k,l} = < i_k/||i_k|| , r_l/||r_l|| >
wherein c_{k,l} is the cosine similarity matrix, ||·|| represents the modulus operation, and <·,·> represents the inner product operation.
Step B, transferring the most relevant part of each characteristic unit for characterizing the defect area in the defect image characteristic from the real image characteristic, namely a cosine similarity matrix c for k rows and l columns k,lTakes the maximum value, this process is expressed as:
Figure BDA00035300281100001011
wherein, akFor distractive indexing, the feature metric that is most correlated with the k-th position of the defective image feature i in the real image feature r is represented.
Step C, based on the attention transfer index, performing feature selection operation on the real image features r to acquire transfer characteristics t representing the image defect area from the real image features, wherein the process is represented as follows:
tk=rak
wherein, tkRepresenting the a-th of the selected real image features rkThe characteristic value of the k-th position in the transfer characteristic t is obtained by transferring the characteristic value of each position. Finally, in order to enhance the distinguishing capability of the transfer characteristics, the transfer characteristics are classified through a sigmoid layer to obtain a prediction label y of the transfer characteristics(t)
Step (2-3), a correlation embedding learning module, which is used for constructing a correlation embedding learning space by combining the real label information; while the semantic feature learning task is completed by minimizing a semantic association objective function, a cross-entropy-based classification metric objective function is adopted to minimize the difference between the predicted labels and the real labels, the total objective function of semantic similarity learning between different modalities in the final correlation embedding learning stage is obtained, and the tactile features most relevant to the image defect area are mined from the tactile features; the specific implementation process is as follows:
Step A, constructing a correlation embedding learning space by using the category label information Y = {y}, y ∈ {1, 2, …, C}, to complete the semantic feature learning task; this process is mainly realized by minimizing the semantic association objective function L_rem:
L_rem = −(1/N²) Σ_p Σ_q ( s_pq·δ_pq − log(1 + e^{δ_pq}) )
δ_pq = h_p^T·t_q
wherein y represents a real label, C represents the total number of training data classes, N represents the total number of training data, s_pq is the class association factor (equal to 1 when the p-th and q-th training samples share the same class label and 0 otherwise), δ_pq is the feature association factor, y_p^(h) and y_q^(t) are the prediction labels of the p-th tactile feature and the q-th transfer feature respectively, (·)^T denotes the transpose operation, and h_p and t_q represent the p-th tactile feature and the q-th transfer feature respectively. By observing the semantic association function L_rem it can be found that when s_pq = 1, the larger δ_pq is, the smaller L_rem becomes, and vice versa. The semantic association objective function L_rem therefore ensures that, in the correlation embedding learning space, the transfer features with the same semantics can assist the tactile modality in semantic feature learning, i.e. the tactile features with the highest degree of correlation with the image defect region are extracted from the tactile features extracted by the learnable feature extraction module.
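Under the likelihood form of L_rem reconstructed above (itself an assumption), this monotonic behaviour can be checked directly:

```latex
% Per-pair contribution to L_rem under the reconstructed likelihood form:
\[
-\bigl(s_{pq}\,\delta_{pq} - \log(1 + e^{\delta_{pq}})\bigr)
 = \begin{cases}
     \log\bigl(1 + e^{-\delta_{pq}}\bigr), & s_{pq} = 1 \ \text{(same class)},\\
     \log\bigl(1 + e^{\delta_{pq}}\bigr),  & s_{pq} = 0 \ \text{(different class)},
   \end{cases}
\]
% so the same-class case decreases and the cross-class case increases with
% \(\delta_{pq} = h_p^{\top} t_q\): minimising L_rem raises the haptic-transfer
% correlation for same-class pairs and suppresses it for cross-class pairs.
```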
Step B, in order to further enhance the feature discrimination capability, a cross-entropy-based loss function is adopted to minimize the difference between the predicted labels and the real labels and complete the semantic classification; this process is expressed as:
L_1 = −(1/N) Σ_q y_q·log( y_q^(t) )
L_2 = −(1/N) Σ_p y_p·log( y_p^(h) )
wherein L_1 and L_2 represent the classification metric objective functions of the transfer features and the tactile features respectively, y_p and y_p^(h) represent the real label and the predicted label of the p-th tactile feature, and y_q and y_q^(t) represent the real label and the predicted label of the q-th transfer feature.
Step C, combining the loss functions in step A and step B, the total objective function of semantic similarity learning between different modalities in the correlation embedding learning stage is obtained, which can be expressed as:
L_sim = L_rem + α_1·L_1 + α_2·L_2
wherein L_sim is the final semantic similarity loss function, α_1 and α_2 are hyper-parameters, and L_rem represents the semantic correlation between the two modalities.
(2-4) a cross-mode image restoration module, configured to perform cross-mode restoration on defective image data by using the mined tactile features most relevant to the defective image region in combination with the inter-pixel perception constraint loss function, the appearance constraint loss function, and the countermeasure loss function, and the specific implementation process is as follows:
step A, adopting a decoder network to realize the cross-modal image restoration; specifically, the tactile feature h and the defect image feature i are added to obtain the restored image feature, which is then input into the decoder De to obtain the restored image R̂, namely
R̂ = De(h + i; θ_de)
The restoration process is constrained from the two aspects of appearance characteristics and perceptual characteristics by utilizing the real image data, so that the restored image R̂ is as similar as possible to the real image R; this process is expressed as:
L_a = E_{R̂~P(R̂)} [ ||R̂ − R||_1 ]
L_p = E_{R̂~P(R̂)} [ ||φ(R̂) − φ(R)||_1 ]
wherein θ_de is the network parameter set of the decoder De, L_a and L_p are the appearance constraint loss function and the perceptual constraint loss function respectively, E_{R̂~P(R̂)}[·] is the expectation over the restored image distribution P(R̂), φ(·) is a feature extraction network similar to VGG, and ||·||_1 is the L1 norm operation.
In this module, the decoder De comprises 2 fully-connected layers and 4 deconvolution layers; the dimensions of the fully-connected layers are 128 and 1024 respectively, the numbers of deconvolution kernels are 64, 128, 256 and 512, and the output is a 128 × 128 restored image R̂.
The discriminator D comprises 3 convolutional layers and 3 fully-connected layers; the convolutional layer output dimensions are 512, 256 and 128 with a convolution kernel size of 5 × 5, the fully-connected layer dimensions are 1024, 128 and 1, and finally a number in the range (0, 1) is output to represent the probability that the input image is a real image.
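These layer counts can be sketched in PyTorch as follows; strides, padding, the reshape between the fully-connected and deconvolution parts, and the final RGB projection are assumptions, and the quoted deconvolution kernel counts are applied in decreasing order so that the spatial size grows to 128 × 128:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """De: 2 fully-connected layers (dims 128, 1024) + 4 deconvolution layers."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(128, 128), nn.ReLU(),
                                nn.Linear(128, 1024), nn.ReLU())
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(16, 512, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 64
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 128
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),                        # RGB projection (assumed)
        )

    def forward(self, feat):                      # feat = h + i, (N, 128)
        x = self.fc(feat).view(-1, 16, 8, 8)
        return self.deconv(x)                     # (N, 3, 128, 128) restored image

class Discriminator(nn.Module):
    """D: 3 conv layers (512/256/128 kernels, 5x5) + fully-connected 1024-128-1."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 512, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 256, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 16 * 16, 1024), nn.ReLU(),
            nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, img):                       # img: (N, 3, 128, 128)
        return self.fc(self.conv(img))            # (N, 1) probability the image is real
```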
Step B, after the cross-modal image restoration, the distribution structure of the restored image in step A is further constrained; specifically, the distribution of the real image data is learned by using an adversarial loss function, and the adversarial loss function L_adv is defined as:
L_adv = E_{R~P_data(R)} [ log D(R; θ_d) ] + E_{R̂~P(R̂)} [ log(1 − D(R̂; θ_d)) ]
wherein L_adv is the loss function of the discriminator D, θ_d is the network parameter set of the discriminator D, E_{R~P_data(R)}[·] and E_{R̂~P(R̂)}[·] are the expectations over the real image distribution P_data(R) and the restored image distribution P(R̂) respectively, and D(R; θ_d) and D(R̂; θ_d) are the probabilities with which the discriminator D judges the real image and the restored image to be real.
Step C, combining the appearance constraint loss function L_a, the perceptual constraint loss function L_p and the adversarial loss function L_adv in step A and step B, the loss function that the decoder De finally needs to minimize is:
L_imp = L_a + β_1·L_p + β_2·L_adv
wherein L_imp is the final restoration loss function, and β_1 and β_2 are both hyper-parameters.
Step 3, training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters, which are as follows:
Step (3-1), combine the training set divided in step 1 and the real label information in step (2-3) into a standardized input training data set D_tr:
D_tr = {(I_n, R_n, H_n, y_n), n = 1, 2, …, N},
wherein y_n is the real label of the n-th group of defect image data I_n, real image data R_n and tactile signal H_n participating in training, and N is the total capacity of the training data.
Step (3-2), initialize the network parameter sets of the AGVI model, comprising θ_h, θ_i, θ_de and θ_d, to a standard normal distribution; wherein θ_h and θ_i are the parameter sets of the tactile mapping network and the image mapping network respectively, θ_de is the network parameter set of the decoder De, and θ_d is the network parameter set of the discriminator D.
Step (3-3), set the total number of iterations G = 400, and record the current iteration number with g.
Step (3-4), train the AGVI model by adopting a stochastic gradient descent method; the specific process is as follows:
Firstly, set the hyper-parameters α_1 = 10^-3, α_2 = 10^-4, β_1 = 10^-4, β_2 = 10^-5, the learning rate of the tactile mapping network and the image mapping network μ_1 = 0.0005, and the learning rate of the decoder and the discriminator μ_2 = 0.0003;
Step two, calculate the output of each network in the AGVI model:
h = F_h(H; θ_h); i = F_i(I; θ_i); r = F_i(R; θ_i); R̂ = De(h + i; θ_de)
Step three, start the iteration; based on the gradient descent method and the Adam optimizer, the parameter set of each network is updated along the negative gradient direction of its objective:
θ_h^(g+1) = θ_h^(g) − μ_1·∇_{θ_h} L_sim
θ_i^(g+1) = θ_i^(g) − μ_1·∇_{θ_i} L_sim
θ_de^(g+1) = θ_de^(g) − μ_2·∇_{θ_de} L_imp
θ_d^(g+1) = θ_d^(g) − μ_2·∇_{θ_d} L_adv
wherein L_sim is the final semantic similarity loss function, L_imp is the final restoration loss function, the superscripts (g+1) and (g) denote the network parameter sets of the tactile mapping network F_h, the image mapping network F_i, the decoder network De and the discriminator network D after g+1 and g iterations respectively, and ∇ denotes the gradient operation.
If g is less than G, add 1 to the iteration count (g = g + 1) and jump to step (3-4) to continue the next iteration; otherwise, terminate the iteration.
Step (3-5), after G rounds of iteration, the structure and network parameters of the optimal AGVI model are finally output.
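Steps (3-2) to (3-5) can be condensed into the following training skeleton; the loss callables refer to the sketches given earlier in this description, and their signatures as well as the assignment of parameters to the two Adam optimizers are assumptions:

```python
import torch

def train_agvi(loader, f_h, f_i, de, d, sim_loss, imp_loss, adv_loss,
               n_iters=400, mu1=5e-4, mu2=3e-4, device="cpu"):
    """Sketch of step (3-4): alternating Adam updates with the quoted learning rates
    (mu1 for the two mapping networks, mu2 for the decoder and the discriminator)."""
    f_h, f_i, de, d = (m.to(device) for m in (f_h, f_i, de, d))
    opt_map = torch.optim.Adam(list(f_h.parameters()) + list(f_i.parameters()), lr=mu1)
    opt_de = torch.optim.Adam(de.parameters(), lr=mu2)
    opt_d = torch.optim.Adam(d.parameters(), lr=mu2)

    for g in range(n_iters):
        for defect, real, haptic, label in loader:
            defect, real, haptic, label = (x.to(device) for x in (defect, real, haptic, label))
            h, y_h = f_h(haptic)               # haptic feature and predicted label
            i_feat = f_i(defect)               # defect-image feature
            r_feat = f_i(real)                 # real-image feature
            r_hat = de(h + i_feat)             # cross-modal restored image

            # discriminator update (adversarial objective L_adv)
            opt_d.zero_grad()
            adv_loss(d, real, r_hat.detach()).backward()
            opt_d.step()

            # mapping-network update (semantic similarity objective L_sim)
            opt_map.zero_grad()
            sim_loss(h, y_h, i_feat, r_feat, label).backward(retain_graph=True)
            opt_map.step()

            # decoder update (restoration objective L_imp)
            opt_de.zero_grad()
            imp_loss(r_hat, real, d(r_hat)).backward()
            opt_de.step()
    return f_h, f_i, de, d
```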
Step 4, performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signal and the defect image data in the test set, wherein the method specifically comprises the following steps:
Step (4-1), the test set D_te divided in step 1 is:
D_te = {(I'_j, H'_j), j = 1, 2, …, F},
wherein I'_j and H'_j are the paired defect image data and tactile signal of the j-th group, used for model testing, and F is the total amount of test data;
Step (4-2), the data in the test set D_te are input in pairs into the optimal cross-modal image restoration AGVI model trained in step 3, and the output is the restored image.
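Step (4-2) then reduces to a forward pass through the trained networks, for example (module interfaces follow the earlier sketches and are assumptions):

```python
import torch

@torch.no_grad()
def restore_test_set(test_pairs, f_h, f_i, de, device="cpu"):
    """Feed each (defect image, haptic signal) pair of D_te through the trained
    mapping networks and decoder and return the restored images."""
    f_h.eval(); f_i.eval(); de.eval()
    restored = []
    for defect, haptic in test_pairs:
        defect, haptic = defect.to(device), haptic.to(device)
        h, _ = f_h(haptic)                   # haptic feature (predicted label unused here)
        r_hat = de(h + f_i(defect))          # cross-modal restored image
        restored.append(r_hat.cpu())
    return restored
```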
As shown in fig. 3, the present invention further relates to a cross-mode image restoration apparatus based on attention mechanism, which specifically includes: a memory, a processor, and a computer program stored on the memory and executable on the processor; wherein:
1. The memory is used for storing at least one program.
2. The processor is used for loading the at least one program to execute the attention-based cross-modal image restoration method of the above embodiment, so as to realize high-quality and fine-grained restoration of the defective image.
Performance evaluation:
The invention carries out experiments according to the above process and selects the LMT surface material standard data set as the experimental data set; the data set was published in the IEEE TRANSACTIONS ON HAPTICS journal in 2017 in the document "Multimodal Feature-based Surface Material Classification" (authors Matti Strese, Clemens Schuwerk, Albert Iepure and Eckehard Steinbach). It contains material information in three modalities: image, sound and tactile acceleration. 80% of the data (tactile, image) of each category were selected as the training set, and the remaining 20% were used as the test set.
Existing method I: the document "PatchMatch: A randomized correspondence algorithm for structural image editing" (authors Connelly Barnes, Eli Shechtman, Adam Finkelstein, Dan B Goldman) proposes a typical block-matching image restoration algorithm, which restores the image by finding similar blocks in non-defective regions and copying them to the defective region. This is an intra-modal restoration method that relies mainly on the information of the image itself.
Existing method II: the document "Context Encoders: Feature Learning by Inpainting" (authors Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell) proposes an unsupervised image feature learning algorithm based on context pixels, which uses an encoder-decoder as the basic structure and the pixel information of the image itself to complete the restoration and filling of the content of the defective area.
Existing method III: the document "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" (authors Alec Radford, Luke Metz, Soumith Chintala) proposes an image restoration model based on a deep convolutional generative adversarial network, which includes two models: a generative model G used to capture the data distribution, and a discriminative model D used to estimate the probability that a sample comes from the training data rather than from G; image restoration is realized through this adversarial game.
Existing method IV: the document "Text-Guided Neural Image Inpainting" (authors Lisai Zhang, Qingcai Chen, Baotian Hu, Shuoran Jiang) proposes an image restoration algorithm guided by text, which takes a descriptive text as a condition, compares the semantic content of the given text and the remaining image through an attention mechanism, and finds the semantic content with which to fill the defective part for image restoration.
Reference method I: the attention mechanism module is removed, and only the tactile features and the defect image features are fused to realize image restoration. This method is used to verify the importance of the attention mechanism module.
The invention: the method of the present embodiment.
The quality of the repaired image is evaluated mainly from two aspects of subjective qualitative evaluation and objective quantitative evaluation.
First, in terms of subjective qualitative assessment, fig. 4 shows the image restoration results of the comparison schemes and of the AGVI model of the present method. The defect rate in this experiment was 40%, and the size of the defect region was about 51 × 51. From left to right, each row of images shows the original image, the defect image, and the results of existing method I, existing method II, existing method III, existing method IV, reference method I and the AGVI model of the present method. As can be seen from fig. 4, compared with the depth-based image restoration methods, the restoration result of existing method I has a poor visual perception effect, the texture cannot be clearly resolved, and the problems of image blur and loss of detail are obvious. Existing method II, based on the self-encoder, only restores the color of the damaged area accurately and reasonably, while the details of structure and texture remain very blurred. Compared with methods I and II, the restoration results of existing methods III and IV show clearer texture and structural features, but exhibit more obvious blurring and artifacts at the edges. In the restoration result of the AGVI model of the present method, the semantic information of the restored area fits that of the whole image, the artifacts and blurring at the edges disappear, and the restoration effect is more realistic. Meanwhile, compared with reference method I, the designed model recovers more complete texture and structure information and achieves the best image restoration performance.
Table I. Evaluation results of the present invention
Method                 MSE       SSIM
Existing method I      130.230   0.587
Existing method II     125.622   0.616
Existing method III    126.145   0.606
Existing method IV     118.547   0.657
Reference method I     115.369   0.623
AGVI (the invention)   100.750   0.845
Secondly, in the aspect of objective quantitative evaluation, the Mean Square Error (MSE) and the Structural Similarity (SSIM), two common image quality evaluation indexes, are adopted; the smaller the MSE and the larger the SSIM, the better the quality of the cross-modal restored image. Table I shows the MSE and SSIM scores of the AGVI model of the invention and of the other comparison models, evaluating the performance of the AGVI model from the two perspectives of image perception and structural comparison. As can be seen from the table, the AGVI model of the invention achieves the lowest MSE score and the highest SSIM score. Existing methods I, II and III mainly perform intra-modal restoration based on the information of the defective image itself; when the image is severely defective the restoration effect is poor, with MSE scores around 130 and SSIM values of only about 0.6. Existing method IV, reference method I and the AGVI model of the invention verify the ability of non-image modal information to restore the content of the defect area, and the attention mechanism obviously improves the restoration effect. In particular, the AGVI model designed by the method of the present invention combines the restoration capability of the self-encoder with the selective attention capability of the attention mechanism, and its restoration results have the best visual quality.
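The two scores can be computed with standard implementations, for example the scikit-image metrics below (the 8-bit data range and the channel_axis argument, which requires scikit-image >= 0.19, are assumptions):

```python
import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity

def evaluate_pair(restored: np.ndarray, reference: np.ndarray):
    """MSE (lower is better) and SSIM (higher is better) for one restored image
    against its ground truth; both are HxWx3 uint8 arrays."""
    mse = mean_squared_error(reference, restored)
    ssim = structural_similarity(reference, restored, channel_axis=-1, data_range=255)
    return mse, ssim
```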
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (9)

1. A cross-mode image restoration method based on an attention mechanism is characterized by comprising the following steps:
the method comprises the following steps of 1, selecting a multi-modal data set, wherein the multi-modal data set comprises defect image data, real image data and a touch signal, and is divided into a training set and a testing set;
step 2, designing a cross-mode image restoration AGVI model based on an attention mechanism, wherein the model comprises the following steps:
the learnable feature extraction module is used for extracting the features of the tactile signals, the defective image data and the real image data and participating in subsequent end-to-end model training;
the transfer characteristic attention module is used for introducing an attention mechanism, positioning the image defect area and acquiring transfer characteristics representing the defect area;
the relevant embedding learning module is used for constructing a relevant embedding learning space by combining real label information, completing the semantic feature learning task by minimizing a semantic association objective function while adopting a cross-entropy-based classification objective function to minimize the difference between the predicted labels and the real labels, obtaining the total objective function of semantic similarity learning among different modalities in the final relevant embedding learning stage, and mining, from the extracted tactile features, the tactile features most relevant to the image defect area;
the cross-modal image restoration module is used for combining the pixel-level appearance constraint loss function, the perceptual constraint loss function and the adversarial loss function, and performing cross-modal restoration of the defective image data by using the mined tactile features most relevant to the defective image area;
step 3, training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters;
and 4, performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signals and the defective image data in the test set.
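To make the data flow of claim 1 concrete, the following PyTorch sketch wires the four modules together in their simplest form. All layer sizes, the GRU haptic encoder, the fusion by element-wise addition and the class name AGVISketch are illustrative assumptions; the attention transfer and correlated embedding learning stages are only marked by a comment where they would operate on the extracted features.

```python
# Structural sketch (assumptions throughout) of the four modules named in claim 1.
import torch
import torch.nn as nn

class AGVISketch(nn.Module):
    def __init__(self, feat_dim=256, num_classes=10):
        super().__init__()
        # learnable feature extraction module: haptic branch + image branch
        self.haptic_net = nn.GRU(input_size=1, hidden_size=feat_dim, batch_first=True)
        self.image_net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # cross-modal image restoration module: a small transposed-convolution decoder
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, haptic_seq, defect_img, real_img):
        h = self.haptic_net(haptic_seq)[0][:, -1]   # tactile feature h
        i = self.image_net(defect_img)              # defect image feature i
        r = self.image_net(real_img)                # real image feature r
        # attention transfer and correlated embedding learning would act on h, i, r here
        restored = self.decoder(h + i)              # cross-modal restoration of the defect image
        return restored, self.classifier(h)

if __name__ == "__main__":
    model = AGVISketch()
    out, y_h = model(torch.randn(2, 100, 1), torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
    print(out.shape, y_h.shape)   # torch.Size([2, 3, 32, 32]) torch.Size([2, 10])
```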
2. The attention-based cross-modal image restoration method of claim 1, wherein the selecting a multi-modal dataset in step 1 specifically comprises:
step (1-1), selecting three different modality data, including defect image data I, real image data R and tactile signals H, to form a multi-modal data set D; the real image data are original color image signals, the tactile signal is the tactile power spectral density obtained by preprocessing the raw tactile signal, the defect image data are images with a defect rate λ obtained by preprocessing the real image data, and λ ranges between 0 and 1;
step (1-2), for the data of the different modalities in the multi-modal data set D, collecting their real label information Y, namely using a one-hot code to assign a category label to the content information represented by each piece of data;
step (1-3) of randomly selecting data with the proportion of alpha from the multi-modal data set D as a training set DtrThe remaining data of 1-alpha ratio is used as test set DteAnd the value of alpha ranges from 0 to 1.
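A minimal sketch of the data preparation described in this claim is given below: a defect image is derived from a real image by removing roughly a fraction λ of its pixels, and the sample indices are split into training and test sets with ratio α. The square defect mask and the helper names make_defect and split_dataset are assumptions made for illustration; the claim does not prescribe the defect pattern.

```python
# Sketch of defect-image generation (rate lambda) and train/test split (ratio alpha).
import numpy as np

def make_defect(real_img: np.ndarray, lam: float, rng: np.random.Generator) -> np.ndarray:
    """Zero out a randomly placed square so that about lam of the pixels are missing."""
    h, w = real_img.shape[:2]
    side = int(np.sqrt(lam * h * w))
    top = int(rng.integers(0, h - side + 1))
    left = int(rng.integers(0, w - side + 1))
    defect = real_img.copy()
    defect[top:top + side, left:left + side] = 0
    return defect

def split_dataset(num_samples: int, alpha: float, rng: np.random.Generator):
    """Randomly assign a fraction alpha of the samples to the training set."""
    perm = rng.permutation(num_samples)
    cut = int(alpha * num_samples)
    return perm[:cut], perm[cut:]          # training indices, test indices

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.integers(0, 256, size=(96, 96, 3))
    defect = make_defect(real, lam=0.4, rng=rng)
    train_idx, test_idx = split_dataset(1000, alpha=0.8, rng=rng)
    print(defect.shape, len(train_idx), len(test_idx))
```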
3. The attention mechanism-based cross-modal image inpainting method of claim 1, wherein the learnable feature extraction module in step 2 extracts features of the haptic signal, the defective image data, and the real image data, and specifically comprises:
for the tactile signal H, a gated recurrent unit (GRU) and a 3-layer fully connected network are adopted as the tactile mapping network to obtain the tactile feature h and the tactile feature prediction label y^(h);
for the defect image data I and the real image data R, an image mapping network formed by a deep convolutional neural network is adopted to extract the shallow defect image feature i and the real image feature r; the specific process is:

h = F_h(H; θ_h)

i = F_i(I; θ_i)

r = F_i(R; θ_i)

wherein h, i and r are the tactile feature, the defect image feature and the real image feature respectively, their dimensions being d_h, d_i and d_r; θ_h and θ_i are the parameter sets of the haptic mapping network F_h(H; θ_h) and the image mapping network F_i(I/R; θ_i), respectively.
4. The cross-modal image inpainting method based on the attention mechanism as claimed in claim 1, wherein the step 2 of acquiring the transfer characteristics characterizing the defect region by the attention transfer module specifically comprises:
step A, for the defect image feature i and the real image feature r, each feature value is defined as a feature unit, namely i = {i_k}, k = 1, 2, …, d_i and r = {r_l}, l = 1, 2, …, d_r, wherein i_k and r_l denote the k-th feature value in the defect image feature i and the l-th feature value in the real image feature r, and d_i and d_r denote the dimensions of the defect image feature i and the real image feature r respectively; then a normalized inner product is applied to each i_k in the defect image feature i and each r_l in the real image feature r to compute the cosine similarity between all feature units of the two features, specifically expressed as:

c_{k,l} = ⟨ i_k / ||i_k|| , r_l / ||r_l|| ⟩

wherein c_{k,l} denotes the cosine similarity matrix, || · || denotes the modulus operation, and ⟨ · , · ⟩ denotes the normalized inner product operation;
step B, the part most relevant to each feature unit characterizing the defect area in the defect image feature is transferred from the real image feature, namely, for the cosine similarity matrix c_{k,l} with k rows and l columns, the maximum over l is taken in each row; this process is expressed as:

a_k = arg max_l c_{k,l}

wherein a_k is the attention transfer index, representing the feature unit in the real image feature r most correlated with the k-th position of the defect image feature i;
step C, based on the attention transfer index, a feature selection operation is performed on the real image feature r to obtain, from the real image feature, the transfer feature t characterizing the image defect area; this process is expressed as:

t_k = r_{a_k}

wherein t_k denotes that the feature value at the a_k-th position of the real image feature r is selected and transferred to become the feature value at the k-th position of the transfer feature t; the transfer feature is then classified through a sigmoid layer to obtain the prediction label y^(t) of the transfer feature.
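The attention transfer of claim 4 amounts to a cosine-similarity matrix, an argmax transfer index and a gather. The sketch below assumes each feature unit is the channel vector at one spatial position of a feature map, so the normalized inner product is taken over channels; this interpretation of a feature unit and the tensor shapes are assumptions.

```python
# Sketch of attention transfer: cosine similarity, transfer index a_k, gather t_k = r_{a_k}.
import torch
import torch.nn.functional as F

def attention_transfer(i_feat: torch.Tensor, r_feat: torch.Tensor) -> torch.Tensor:
    """i_feat: (B, C, K) defect-image units, r_feat: (B, C, L) real-image units -> t: (B, C, K)."""
    i_norm = F.normalize(i_feat, dim=1)                      # normalise every unit i_k
    r_norm = F.normalize(r_feat, dim=1)                      # normalise every unit r_l
    c = torch.einsum("bck,bcl->bkl", i_norm, r_norm)         # cosine similarity c_{k,l}
    a = c.argmax(dim=2)                                      # transfer index a_k = argmax_l c_{k,l}
    index = a.unsqueeze(1).repeat(1, r_feat.size(1), 1)      # (B, C, K)
    t = torch.gather(r_feat, dim=2, index=index)             # t_k = r_{a_k}
    return t

if __name__ == "__main__":
    B, C, K, L = 2, 64, 16 * 16, 16 * 16
    t = attention_transfer(torch.randn(B, C, K), torch.randn(B, C, L))
    print(t.shape)    # torch.Size([2, 64, 256])
```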
5. The attention mechanism-based cross-mode image restoration method according to claim 1, wherein the step 2 of mining the haptic features most relevant to the image defect area in the haptic features by the relevant embedded learning module specifically comprises:
step A, the real label information Y = {y}, y ∈ {1, 2, …, C}, is used to construct the relevant embedding learning space and complete the semantic feature learning task; this is mainly achieved by minimizing the semantic association objective function L_rem:
L_rem = − Σ_{p=1}^{N} Σ_{q=1}^{N} ( s_{pq} δ_{pq} − log(1 + e^{δ_{pq}}) )

s_{pq} = (ŷ_p^{(h)})^T ŷ_q^{(t)}

δ_{pq} = h_p^T t_q

wherein y denotes the true label, C denotes the total number of training data classes, N denotes the total number of training data, s_{pq} is the class association factor, δ_{pq} is the feature association factor, ŷ_p^{(h)} and ŷ_q^{(t)} are the prediction label of the p-th haptic feature and the prediction label of the q-th transfer feature respectively, (·)^T denotes the transpose operation, and h_p and t_q denote the p-th tactile feature and the q-th transfer feature respectively; the semantic association objective function ensures that, in the relevant embedding learning space, transfer features with the same semantics can assist the haptic modality in semantic feature learning, namely the haptic features with the highest correlation with the image defect area are extracted from the haptic features;
step B, in order to further enhance the feature discrimination capability, a cross-entropy-based loss function is adopted to minimize the difference between the predicted labels and the real labels and complete the semantic classification; this process is expressed as:

L_1 = − (1/N) Σ_{q=1}^{N} y_q log ŷ_q^{(t)}

L_2 = − (1/N) Σ_{p=1}^{N} y_p log ŷ_p^{(h)}

wherein L_1 and L_2 denote the classification metric objective functions of the transfer features and the haptic features respectively, y_p and ŷ_p^{(h)} denote the true label and the predicted label of the p-th haptic feature respectively, and y_q and ŷ_q^{(t)} denote the true label and the predicted label of the q-th transfer feature respectively;
step C, combining step A and step B, the total objective function of semantic similarity learning among the different modalities in the relevant embedding learning stage is obtained, expressed as:

L_sim = L_rem + α_1 L_1 + α_2 L_2

wherein L_sim is the final semantic similarity loss function, α_1 and α_2 are hyper-parameters, and L_rem represents the semantic relatedness between the two modalities.
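A possible realisation of the objective of claim 5 is sketched below. Because the exact forms of L_rem and of the class association factor s_pq are given only as images in the filing, the sketch assumes the common pairwise negative-log-likelihood form, with s_pq taken as the inner product of the two predicted label distributions and δ_pq = h_p^T t_q; these forms are assumptions, not the filed equations.

```python
# Sketch of L_sim = L_rem + alpha1*L1 + alpha2*L2 under the assumptions stated above.
import torch
import torch.nn.functional as F

def semantic_similarity_loss(h, t, y_h_logits, y_t_logits, labels, alpha1=1.0, alpha2=1.0):
    """h, t: (N, d) haptic / transfer features; y_*_logits: (N, C); labels: (N,) class indices."""
    y_h_prob = torch.softmax(y_h_logits, dim=1)        # predicted label of the p-th haptic feature
    y_t_prob = torch.softmax(y_t_logits, dim=1)        # predicted label of the q-th transfer feature
    s = y_h_prob @ y_t_prob.T                          # class association factor s_pq (assumed form)
    delta = h @ t.T                                    # feature association factor delta_pq
    l_rem = -(s * delta - F.softplus(delta)).mean()    # pairwise log-likelihood term (assumed form)
    l1 = F.cross_entropy(y_t_logits, labels)           # classification loss L1 (transfer features)
    l2 = F.cross_entropy(y_h_logits, labels)           # classification loss L2 (haptic features)
    return l_rem + alpha1 * l1 + alpha2 * l2           # L_sim

if __name__ == "__main__":
    N, d, C = 8, 256, 10
    loss = semantic_similarity_loss(torch.randn(N, d), torch.randn(N, d),
                                    torch.randn(N, C), torch.randn(N, C),
                                    torch.randint(0, C, (N,)))
    print(float(loss))
```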
6. The attention-based cross-modal image inpainting method of claim 1, wherein the cross-modal image inpainting module in step 2 performs cross-modal inpainting on the defective image data by using the mined tactile features most relevant to the defective image region, and specifically comprises:
step A, a decoder network is adopted to realize the cross-modal image restoration; specifically, the tactile feature h and the defect image feature i are added to obtain the repaired image feature, which is then input into the decoder De to obtain the restored image R̂, namely R̂ = De(h + i; θ_de);
the restoration process is constrained from the two aspects of appearance features and perceptual features by using the real image data, so that the restored image R̂ is as similar as possible to the real image R; this process is expressed as:

L_a = E_{R̂~P(R̂)} [ || R̂ − R ||_1 ]

L_p = E_{R̂~P(R̂)} [ || Φ(R̂) − Φ(R) ||_1 ]

wherein θ_de is the network parameter set of the decoder De, L_a and L_p are the appearance constraint loss function and the perceptual constraint loss function respectively, E_{R̂~P(R̂)} denotes the expectation over the distribution function P(R̂) of the restored image, Φ(·) is a VGG-like feature extraction network, and || · ||_1 denotes the L1 norm operation;
step B, the distribution structure of the restored image in step A is constrained; specifically, the distribution of the real image data is learned by using an adversarial loss function, and the adversarial loss function L_adv is defined as:

L_adv = E_{R~P_data(R)} [ log D(R; θ_d) ] + E_{R̂~P(R̂)} [ log(1 − D(R̂; θ_d)) ]

wherein L_adv is the loss function of the discriminator D, θ_d is the network parameter set of the discriminator D, E_{R~P_data(R)} and E_{R̂~P(R̂)} denote the expectations over the real image distribution function P_data(R) and the restored image distribution function P(R̂) respectively, and D(R; θ_d) and D(R̂; θ_d) are the probabilities with which the discriminator D identifies the real image and the restored image as true, respectively;
step C, combining the appearance constraint loss function L_a, the perceptual constraint loss function L_p and the adversarial loss function L_adv of step A and step B, the loss function that the decoder De finally needs to minimize is:

L_imp = L_a + β_1 L_p + β_2 L_adv

wherein L_imp is the final repair loss function, and β_1 and β_2 are both hyper-parameters.
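The decoder objective of claim 6 combines an L1 appearance term, a perceptual term computed in the feature space of a VGG-like network Φ, and an adversarial term, weighted by β_1 and β_2. The sketch below assumes a VGG16 truncated after its third block as Φ (with random weights so the sketch stays self-contained) and a non-saturating generator-side adversarial term; both choices are assumptions, since the claim only fixes the overall form L_imp = L_a + β_1 L_p + β_2 L_adv.

```python
# Sketch of the combined restoration loss of claim 6 under the stated assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class RestorationLoss(nn.Module):
    def __init__(self, beta1=0.1, beta2=0.01):
        super().__init__()
        self.phi = vgg16(weights=None).features[:16].eval()   # feature extraction network Phi
        for p in self.phi.parameters():
            p.requires_grad_(False)                            # Phi is not trained
        self.beta1, self.beta2 = beta1, beta2

    def forward(self, restored, real, d_fake_logits):
        l_a = F.l1_loss(restored, real)                        # appearance constraint L_a
        l_p = F.l1_loss(self.phi(restored), self.phi(real))    # perceptual constraint L_p
        # generator-side adversarial term: push the discriminator to rate the restored image as real
        l_adv = F.binary_cross_entropy_with_logits(d_fake_logits,
                                                   torch.ones_like(d_fake_logits))
        return l_a + self.beta1 * l_p + self.beta2 * l_adv

if __name__ == "__main__":
    crit = RestorationLoss()
    loss = crit(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64), torch.randn(2, 1))
    print(float(loss))
```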
7. The cross-modal image restoration method based on the attention mechanism according to claim 1, wherein the training of the cross-modal image restoration AGVI model by using the training set in the step 3 specifically comprises:
step (3-1), the training set divided in step 1 and the real label information in step (2-3) are combined into a standardized input training data set D_tr:

D_tr = {(I_n, R_n, H_n, y_n), n = 1, 2, …, N},

wherein y_n is the real label of the n-th group of defect image data I_n, real image data R_n and tactile signal H_n participating in the training, and N is the total amount of training data;
step (3-2), initializing the network parameter set of the AGVI model, the set comprising θ_h, θ_i, θ_de and θ_d, and initializing these parameters to a standard normal distribution; wherein θ_h and θ_i are the parameter sets of the haptic mapping network and the image mapping network respectively, θ_de is the network parameter set of the decoder De, and θ_d is the network parameter set of the discriminator D;
step (3-3), setting the total iteration number as G, and recording the specific iteration number by using G;
step (3-4), training the cross-modal image restoration AGVI model by a stochastic gradient descent method, specifically comprising the following steps:
step one, setting the hyper-parameters α_1, α_2, β_1 and β_2, the learning rate μ_1 of the haptic mapping network and the image mapping network, and the learning rate μ_2 of the decoder and the discriminator;
Step two, calculating the output of each network in the cross-modal image restoration AGVI model:
h = F_h(H; θ_h); i = F_i(I; θ_i); r = F_i(R; θ_i),

R̂ = De(h + i; θ_de);
step three, starting the iteration; based on the gradient descent method and the Adam optimizer, the parameter set of each network is updated in the direction of the negative gradient of its objective:

θ_h^(g+1) = θ_h^(g) − μ_1 ∇_{θ_h} L_sim

θ_i^(g+1) = θ_i^(g) − μ_1 ∇_{θ_i} L_sim

θ_de^(g+1) = θ_de^(g) − μ_2 ∇_{θ_de} L_imp

θ_d^(g+1) = θ_d^(g) − μ_2 ∇_{θ_d} L_adv

wherein L_sim is the final semantic similarity loss function, L_imp is the final repair loss function, θ^(g+1) and θ^(g) denote the network parameter sets of the haptic mapping network F_h, the image mapping network F_i, the decoder network De and the discriminator network D after g+1 and g iterations respectively, and ∇ denotes the gradient operator;
if g is less than G, let g = g + 1, jump to step (3-4) and continue the next iteration; otherwise, terminate the iteration;
step (3-5), after G rounds of iteration, finally outputting the structure and the network parameters of the optimal cross-modal image restoration AGVI model.
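The iterative optimisation of claim 7 can be sketched as an alternating Adam update: the discriminator is stepped on the adversarial loss, then the mapping networks and the decoder are stepped on the semantic similarity and repair losses. The sub-network interfaces, the loss callables (sim_loss, imp_loss, disc_loss) and the update order are assumptions; the claim itself only requires updates along the negative gradient with learning rates μ_1 and μ_2.

```python
# Sketch of the alternating training loop of claim 7 under the stated assumptions.
import torch

def train_agvi(haptic_net, image_net, decoder, discriminator, loader,
               sim_loss, imp_loss, disc_loss, epochs=10, mu1=1e-4, mu2=1e-4):
    opt_map = torch.optim.Adam(list(haptic_net.parameters()) + list(image_net.parameters()), lr=mu1)
    opt_gen = torch.optim.Adam(decoder.parameters(), lr=mu2)
    opt_dis = torch.optim.Adam(discriminator.parameters(), lr=mu2)
    for _ in range(epochs):
        for defect_img, real_img, haptic_sig, label in loader:
            h = haptic_net(haptic_sig)              # tactile feature (assumed to be a tensor)
            i = image_net(defect_img)               # defect image feature
            r = image_net(real_img)                 # real image feature
            restored = decoder(h + i)               # cross-modal restoration

            # discriminator update on real vs. restored images
            opt_dis.zero_grad()
            disc_loss(discriminator(real_img), discriminator(restored.detach())).backward()
            opt_dis.step()

            # mapping-network and decoder update from the negative gradient of their objectives
            opt_map.zero_grad()
            opt_gen.zero_grad()
            loss = sim_loss(h, i, r, label) + imp_loss(restored, real_img, discriminator(restored))
            loss.backward()
            opt_map.step()
            opt_gen.step()
    return haptic_net, image_net, decoder, discriminator
```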
8. The cross-modal image restoration method based on the attention mechanism according to claim 1, wherein the step 4 of performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model specifically comprises:
step (4-1), the test set D_te divided in step 1 is:

D_te = {(I'_j, H'_j), j = 1, 2, …, F},

wherein I'_j and H'_j are the j-th paired group of defect image data and tactile signal, used for model testing, and F is the total amount of test data;
step (4-2), the data in the test set D_te are input in pairs into the optimal cross-modal image restoration AGVI model trained in step 3, and the output is the restored image.
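Test-time use as in claim 8 then reduces to a forward pass over the paired defect images and tactile signals; the module interfaces below follow the sketches above and are assumptions.

```python
# Sketch of batch inference over the test set.
import torch

@torch.no_grad()
def restore_test_set(haptic_net, image_net, decoder, test_loader):
    haptic_net.eval(); image_net.eval(); decoder.eval()
    restored_images = []
    for defect_img, haptic_sig in test_loader:
        h = haptic_net(haptic_sig)                # tactile feature
        i = image_net(defect_img)                 # defect image feature
        restored_images.append(decoder(h + i))    # restored image
    return torch.cat(restored_images, dim=0)
```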
9. An attention-based cross-modal image restoration device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements an attention-based cross-modal image restoration method according to any one of claims 1 to 8.
CN202210205553.5A 2022-03-03 2022-03-03 Cross-mode image restoration method and device based on attention mechanism Pending CN114677311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210205553.5A CN114677311A (en) 2022-03-03 2022-03-03 Cross-mode image restoration method and device based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210205553.5A CN114677311A (en) 2022-03-03 2022-03-03 Cross-mode image restoration method and device based on attention mechanism

Publications (1)

Publication Number Publication Date
CN114677311A true CN114677311A (en) 2022-06-28

Family

ID=82072316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210205553.5A Pending CN114677311A (en) 2022-03-03 2022-03-03 Cross-mode image restoration method and device based on attention mechanism

Country Status (1)

Country Link
CN (1) CN114677311A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2606836A (en) * 2021-03-15 2022-11-23 Adobe Inc Generating modified digital images using deep visual guided patch match models for image inpainting
GB2606836B (en) * 2021-03-15 2023-08-02 Adobe Inc Generating modified digital images using deep visual guided patch match models for image inpainting
CN116523799A (en) * 2023-07-03 2023-08-01 贵州大学 Text-guided image restoration model and method based on multi-granularity image-text semantic learning
CN116523799B (en) * 2023-07-03 2023-09-19 贵州大学 Text-guided image restoration model and method based on multi-granularity image-text semantic learning
CN116630203A (en) * 2023-07-19 2023-08-22 科大乾延科技有限公司 Integrated imaging three-dimensional display quality improving method
CN116630203B (en) * 2023-07-19 2023-11-07 科大乾延科技有限公司 Integrated imaging three-dimensional display quality improving method
CN116681980A (en) * 2023-07-31 2023-09-01 北京建筑大学 Deep learning-based large-deletion-rate image restoration method, device and storage medium
CN116681980B (en) * 2023-07-31 2023-10-20 北京建筑大学 Deep learning-based large-deletion-rate image restoration method, device and storage medium
CN117853492A (en) * 2024-03-08 2024-04-09 厦门微亚智能科技股份有限公司 Intelligent industrial defect detection method and system based on fusion model

Similar Documents

Publication Publication Date Title
CN114677311A (en) Cross-mode image restoration method and device based on attention mechanism
Dai et al. Human action recognition using two-stream attention based LSTM networks
Jinjin et al. Pipal: a large-scale image quality assessment dataset for perceptual image restoration
CN109544524B (en) Attention mechanism-based multi-attribute image aesthetic evaluation system
CN113627482B (en) Cross-modal image generation method and device based on audio-touch signal fusion
Huang et al. Medical image segmentation using deep learning with feature enhancement
Ji et al. Blind image quality assessment with semantic information
CN115546171A (en) Shadow detection method and device based on attention shadow boundary and feature correction
CN115359074A (en) Image segmentation and training method and device based on hyper-voxel clustering and prototype optimization
Li et al. Spectral feature fusion networks with dual attention for hyperspectral image classification
Mi et al. KDE-GAN: A multimodal medical image-fusion model based on knowledge distillation and explainable AI modules
Chen et al. Video‐based action recognition using spurious‐3D residual attention networks
Hu et al. Hierarchical discrepancy learning for image restoration quality assessment
Liu et al. Dunhuang murals contour generation network based on convolution and self-attention fusion
Zeng et al. Combining CNN and transformers for full-reference and no-reference image quality assessment
Qin et al. Virtual reality video image classification based on texture features
CN116630964A (en) Food image segmentation method based on discrete wavelet attention network
Sobal et al. Joint embedding predictive architectures focus on slow features
Mu et al. Underwater image enhancement using a mixed generative adversarial network
Chen et al. Rethinking visual reconstruction: Experience-based content completion guided by visual cues
Li et al. No‐reference image quality assessment based on multiscale feature representation
Han et al. Blind image quality assessment with channel attention based deep residual network and extended LargeVis dimensionality reduction
Pirabaharan et al. Improving interactive segmentation using a novel weighted loss function with an adaptive click size and two-stream fusion
Ye et al. Human action recognition method based on Motion Excitation and Temporal Aggregation module
Lyra et al. A multilevel pooling scheme in convolutional neural networks for texture image recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination