CN114677311A

CN114677311A - Cross-mode image restoration method and device based on attention mechanism

Info

Publication number: CN114677311A
Application number: CN202210205553.5A
Authority: CN
Inventors: 魏昕; 姚玉媛; 周亮; 高赟
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-03-03
Filing date: 2022-03-03
Publication date: 2022-06-28

Abstract

The invention discloses a cross-mode image restoration method and a cross-mode image restoration device based on an attention mechanism, wherein the method comprises the following steps: selecting a multi-mode data set comprising defect image data, real image data and a touch signal, and dividing the data set into a training set and a test set; designing an attention mechanism-based cross-modal image restoration AGVI model, wherein the model comprises four modules of learnable feature extraction, feature attention transfer, correlation embedding learning and cross-modal image restoration; training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters; and performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signal and the defect image data in the test set. According to the method, an attention mechanism is introduced, the image defect area is accurately positioned, the key information in the touch signal is utilized to repair, predict and fill the area, and high-quality and fine-grained repair of the image is achieved.

Description

Cross-mode image restoration method and device based on attention mechanism

Technical Field

The invention relates to a cross-mode image restoration method and device based on an attention mechanism, and belongs to the technical field of image restoration.

Background

In recent years, with the rapid development of wireless communication and multimedia technologies, multimedia audiovisual services such as ultra-high definition video and network live broadcast have basically met audiovisual requirements of users. In order to further pursue more sophisticated interactive feelings and scene experiences, more and more users are exploring sensory experiences other than audio-visual, such as touch, smell, taste, etc. With the rapid development of the haptic internet under 5G communication, a new haptic service, namely a multi-modal service, is merged into the traditional audio-visual service, and immersive experience is provided for users in various application scenes such as virtual games, remote education, rehabilitation and medical treatment. In order to support such multi-modal services composed of audio, video and touch, it is very important to realize efficient transmission of heterogeneous code streams. However, in actual transmission, due to the influence of noise, transmission loss and other factors, the receiving end often has a serious problem of image data distortion or loss, and the quality of the multi-mode signal is difficult to be guaranteed.

At present, the problems of image distortion and partial defects in the transmission process are mainly considered from image restoration. The mainstream image restoration method mainly comprises the following steps: 1) based on the signal processing; 2) based on a loss function; 3) three categories based on depth models.

Firstly, image restoration is performed by using typical signal processing techniques, such as low-rank matrix, sparse representation, and the like, and the main idea is to represent an original data matrix as a low-rank part or a sparse part, and analyze and apply the low-rank part or the sparse part. For example, researchers have proposed a sparse optimization method "l₀TV-PADMM ", the method converts the image distortion into the problem of equivalent biconvex Mathematical Programming (MPEC) with balanced constraint, finds specific signals with similar structures by using sparse representation, and estimates the defect content of the image. In order to further utilize the local and non-local characteristics of the image, a learner applies transformation learning to the adaptive sparse representation, and an image restoration scheme based on joint low-rank regularization transformation learning is provided. The method is mainly based on signal processing technology to carry out image restoration, has simple principle and simple and convenient operation, but has poor image restoration effect and low quality.

Secondly, the image restoration process is normalized by strengthening the constraint action of the loss function. For example, a scholars fully utilizes the priori knowledge of the visual image to construct a mixed regularized loss function, and searches for error fitting of all image frames through regularization terms of a spatial domain and a spectral domain, so as to realize high-quality image restoration. To further enhance the controllability of image denoising, researchers have also introduced a diversity objective function that correlates input noise with image semantics, minimizes the restoration distance between images, and allows users to manipulate the output image by manually adjusting the noise. The constraint of the method on the loss function strengthens the image restoration process, is beneficial to improving the quality of the restored image, but has defects on the detail capture of the image restoration.

And finally, realizing accurate image restoration by virtue of the superior performance of deep learning. With the rapid development of artificial intelligence technology, deep learning has become the mainstream technology of image restoration at present due to its strong learning ability. For example, researchers have designed a quality enhancement network based on a depth model, which employs a residual network and recursive learning, significantly reduces image artifacts with similar frequency characteristics, and removes noise-induced distortion and blurring. In addition, the denoising problem is modeled into a function optimization process with design cost, model parameters are estimated through semi-supervised learning based on a large number of defect images and a small number of labeled training samples, noise is accurately and effectively removed, and high-quality image restoration is achieved. The method can often achieve a good repairing effect, but has the serious problems of high model complexity, large calculation amount and the like.

Disclosure of Invention

The invention aims to solve the technical problem of overcoming the defects of the prior art and provides a cross-mode image restoration method and a device based on an attention mechanism, wherein a touch signal is applied to the cross-mode image restoration method and the device; specifically, the method selects a standard multi-Modal dataset model for training and testing, establishes an AGVI (Attention-Guided Cross-Modal Visual Image Inpainting) model based on the tactile signals, and realizes Cross-Modal Image restoration. By adopting the method and the device, the image defect area can be accurately positioned, the high-quality and fine-grained image can be restored under the condition of low model complexity, the image restoration quality is improved, and the terminal service level and the user immersive experience are ensured.

The invention specifically adopts the following technical scheme to solve the technical problems:

a cross-mode image restoration method based on an attention mechanism comprises the following steps:

the method comprises the following steps that 1, a multi-mode data set is selected, wherein the multi-mode data set comprises defect image data, real image data and touch signals, and the multi-mode data set is divided into a training set and a testing set;

step 2, designing a cross-mode image restoration AGVI model based on an attention mechanism, wherein the model comprises the following steps:

the learnable feature extraction module is used for extracting the features of the tactile signals, the defective image data and the real image data and participating in subsequent end-to-end model training;

the transfer characteristic attention module is used for introducing an attention mechanism, positioning the image defect area and acquiring transfer characteristics representing the defect area;

the relevant embedded learning module is used for constructing a relevant embedded learning space by combining the information of the real labels, measuring the difference between the target function minimum prediction label and the real label by adopting classification based on cross entropy while completing a semantic feature learning task by minimizing a semantic associated target function, obtaining a total target function of semantic similarity learning between different modalities in a final relevant embedded learning stage, and excavating the tactile features which are most relevant to an image defect area in the tactile features;

The cross-modal image restoration module is used for performing cross-modal restoration on the defective image data by utilizing the mined tactile characteristics most relevant to the defective image area in combination with the inter-pixel perception constraint loss function, the appearance constraint loss function and the countermeasure loss function;

step 3, training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters;

and 4, performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signals and the defective image data concentrated in the test.

Further, as a preferred technical solution of the present invention, the selecting a multi-modal dataset in step 1 specifically includes:

selecting three different modal data including defect image data I, real image data R and tactile signals H to form a multi-modal data set D; the real image data is an original color image signal, the touch signal is touch power spectral density obtained by preprocessing the touch original signal, the defect image data is an image with a defect rate of lambda obtained by preprocessing the real image data, and the value range of lambda is between 0 and 1;

Step (1-2), for data of different modes in the multi-mode data set D, carrying out statistics on real label information Y of the data, namely using a one-hot code to mark a category label of content information represented on each data;

step (1-3) of randomly selecting data with the proportion of alpha from the multi-modal data set D as a training set D_trThe remaining data of 1-alpha ratio is used as test set D_teAnd the value of alpha ranges from 0 to 1.

Further, as a preferred technical solution of the present invention, the extracting characteristics of the haptic signal, the defective image data, and the real image data by the learnable characteristic extracting module in step 2 specifically includes:

for the tactile signal H, a gate cycle unit GRU and a 3-layer fully-connected network are adopted as a tactile mapping network to acquire a tactile feature H and a tactile feature prediction label y^(h)；

Extracting shallow defect image characteristics I and real image characteristics R by adopting an image mapping network formed by a depth convolution neural network for defect image data I and real image data R; the specific process is as follows:

wherein, h, i and r are respectively tactile feature, defect image feature and real image feature, and the dimensionality of h, i and r is respectively

And

θ_hand theta _iRespectively a haptic mapping network F_h(H；θ_h) And an image mapping network F_i(I/R；θ_i) The parameter set of (2).

Further, as a preferred technical solution of the present invention, the acquiring, by the attention transfer module in step 2, a transfer characteristic characterizing the defect region specifically includes:

step A, defining each characteristic value as a characteristic unit for the defect image characteristic i and the real image characteristic r, namely

And

i_kand r_lRespectively representing the kth characteristic value in the defect image characteristic i and the l characteristic value in the real image characteristic r,

and

respectively representing the dimensions of the defect image characteristic i and the real image characteristic r; then, normalized inner product is adopted to each i in defect image characteristics i_kAnd each of the real image features r_lCalculating to obtain cosine similarity between all feature units of the two features, wherein the cosine similarity is specifically represented as follows:

wherein, c_k,lExpressing cosine similarity matrix, | | | | represents modulus operation, < |, > represents normalized inner product operation;

step B, transferring the most relevant part of each characteristic unit for characterizing the defect area in the defect image characteristic from the real image characteristic, namely a cosine similarity matrix c for k rows and l columns_k,lTakes the maximum value, this process is expressed as:

Wherein, a_kRepresenting a feature metric most correlated with the k-th position of the defect image feature i in the real image feature r for the distraction index;

step C, based on the attention transfer index, performing feature selection operation on the real image features r to acquire transfer characteristics t representing the image defect area from the real image features, wherein the process is represented as follows:

t_k＝ra_k，

wherein, t_kRepresenting the a-th of the selected real image features r_kThe characteristic value of each position is transferred to obtain the characteristic value of the kth position in the transfer characteristic t, and then the characteristic value is subjected to characteristic classification through a sigmoid layer to obtain a prediction label y of the transfer characteristic^(t)。

Further, as a preferred technical solution of the present invention, the mining, by the associated embedded learning module, the haptic characteristics most relevant to the image defect region in the haptic characteristics in step 2 specifically includes:

step A, constructing a relevant embedding learning space by using real label information Y ═ Y }, Y ∈ 1,2, … and C to finish a semantic feature learning task, wherein the process mainly comprises the step of minimizing a semantic association objective function L_remThe realization is as follows:

δ_pq＝h_p*t_q，

wherein y represents a true label, C represents a total number of training data classes, N represents a total number of training data, s_pqIs a class association factor, δ _pqIn order to be a feature association factor, the feature association factor,

and

predictive labels for the pth haptic feature and the qth branch feature, respectively, (.)^TDenotes a transpose operation, h_pAnd t_qRespectively representing the pth tactile feature and the qth transfer feature; the semantic association objective function ensures that in a relevant embedding learning space, the transfer features with the same semantic can assist the touch sense to carry out semantic feature learning, namely the touch features with the highest degree of correlation with the image defect area are extracted from the touch features;

step B, in order to further enhance the feature resolution capability, the difference between the predicted label and the real label is minimized by adopting a loss function based on cross entropy to complete semantic classification, and the process is expressed as follows:

wherein L is₁And L₂Classification metric objective function, y, representing transfer and haptic features, respectively_pAnd

true and predicted labels, y, representing respectively the p-th haptic feature_qAnd

true and predicted labels representing the qth branch feature, respectively;

step C, the step A and the step B are combined to obtain a total objective function of semantic similarity learning among different modes in a relevant embedding learning stage, and the total objective function is expressed as:

L_sim＝L_rem+α₁L₁+α₂L₂，

wherein L is_simFor the final semantic similarity loss function, α ₁And alpha₂Is hyperparametric, L_remTo represent the semantic relatedness between the two modalities.

Further, as a preferred technical solution of the present invention, the step 2 of performing cross-modal image restoration on the defective image data by using the mined haptic features most related to the defective image area in the step 2 specifically includes:

step A, adopting a decoder network to realize cross-modal image restoration; specifically, adding the tactile feature h and the defective image feature i to obtain a repaired image feature

Then restoring the image features

Inputting the decoder De to obtain a restored image

Namely, it is

The restoration process is restrained from two aspects of appearance characteristics and perception characteristics by utilizing real image data, so that the restored image

As similar as possible to the real image R, this process is expressed as:

wherein, theta_deSet of network parameters, L, for the decoder De_aAnd L_pRespectively an appearance constraint loss function and a perceptual constraint loss function,

is a function of distribution of the restored image

In the expectation that the position of the target is not changed,

is a feature extraction network similar to VGG, | | · |. non-woven phosphor₁An L1 norm operation is taken;

b, restraining the distribution structure of the repaired image in the step A; specifically, the distribution of real image data is learned using a countermeasure loss function, defining a countermeasure loss function L _advExpressed as:

wherein L is_advIs a penalty function of the discriminator D, theta_dFor the set of network parameters of the arbiter D,

and

respectively, the true image distribution function P_data(R)And repairing the image distribution function

D (R; theta)_d) And

a rule for discriminating the true image and the restored image respectively for the discriminator DRate;

step C, synthesizing appearance constraint loss functions L in step A and step B_aPerceptual constraint loss function L_pAnd a penalty function L_advThe loss function that the decoder De ultimately needs to minimize is:

L_imp＝L_a+β₁L_p+β₂L_adv，

wherein L is_impAs a function of the final repair loss, beta₁And beta₂Are all hyper-parameters.

Further, as a preferred technical solution of the present invention, the training of the cross-modal image restoration AGVI model in step 3 by using a training set specifically includes:

step (3-1) of restoring the image

Combining the training set divided in step 1 and the real label information in step (2-3) into a standardized input training data set D_tr：

D_tr＝{(I_n,R_n,H_n,y_n),n＝1,2,…,N}，

Wherein, y_nDefective image data I for the n-th group participating in training_nTrue image data R_nAnd tactile signal H_nN is the total capacity of training data;

step (3-2), initializing a network parameter set of the AGVI model, wherein the set comprises theta _h,θ_i,θ_de,θ_dInitializing these parameters to a standard normal distribution; wherein, theta_hAnd theta_iParameter sets for the haptic mapping network and the image mapping network, respectively; theta_deSet of network parameters, θ, for the decoder De_dA set of network parameters for discriminator D;

step (3-3), setting the total iteration number as G, and recording the specific iteration number by using G;

step (3-4), training a cross-modal image restoration AGVI model by adopting a random gradient descent method, and specifically comprising the following steps:

firstly, setting a hyper-parameter alpha₁,α₂,β₁,β₂Learning rate μ for haptic mapping network and image mapping network₁Learning rate mu of decoder and discriminator₂；

Step two, calculating the output of each network in the cross-modal image restoration AGVI model:

h＝F_h(H；θ_h)；i＝F_i(I；θ_i)；r＝F_i(R；θ_i),

step three, starting iteration; updating the parameter set of each network from the negative gradient direction of the target based on a gradient descent method and an Adam optimizer:

wherein L is_sim(v) is the final semantic similarity loss function L_sim，L_imp(v) is the final repair loss function L_imp；

And

haptic mapping network F after g +1 and g iterations, respectively_hImage mapping network F_iDecoder network De, network parameter set of discriminator network D;

is a derivative;

if G is smaller than G, skipping to the step (3-4) and continuing the next iteration, wherein G is G + 1; otherwise, terminating the iteration;

And (3-5) after G-round iteration, finally outputting the structure and network parameters of the optimal cross-modal image restoration AGVI model.

Further, as a preferred technical solution of the present invention, the performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model in step 4 specifically includes:

step (4-1) and test set D divided in step 1_teComprises the following steps:

D_te＝{(I'_j,H'_j),j＝1,2,…,F},

wherein, I'_jAnd H'_jDefective image data and tactile signal, I ', forming a pair of the j-th group'_jAnd H'_jThe method is used for model testing, and F is the total amount of test data;

step (4-2), test set D_teThe data in (3) are input in pairs into the optimal cross-mode image restoration AGVI model trained in the step (3), and the output is the restoration image.

The invention also provides an attention-based cross-modal image restoration device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the attention-based cross-modal image restoration method when being loaded to the processor.

By adopting the technical scheme, the invention can produce the following technical effects:

(1) in order to accurately position the defect area of the image, the invention introduces an attention mechanism to carry out weight distribution, focuses on the defect area of the image so as to fully obtain the transfer characteristic of the characteristic defect area and improve the accuracy and the integrity of model repair.

(2) When the similarity of different modes is measured by the model, common semantic and label information is introduced, the difference between transfer characteristics and tactile characteristics is continuously reduced through the double constraints of semantic association and category measurement objective functions, and the partial tactile characteristics with the highest correlation degree with an image defect area in the tactile characteristics are extracted and used for cross-mode restoration of defect image data.

(3) The invention uses a decoder network structure, and realizes the high-quality and fine-grained comprehensive repair of the defective image through a perception constraint loss function and an appearance constraint loss function among pixels and an antagonistic loss function based on image distribution under the constraint of semantic and supervision information. Meanwhile, the image quality is improved from the semantics and the distribution on the basis of not increasing the complexity of the model.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention.

FIG. 2 is a schematic structural diagram of a cross-modal image restoration model based on an attention mechanism according to the present invention.

FIG. 3 is a structural frame diagram of the device of the present invention.

FIG. 4 is a graph showing the results of comparing the method of the present invention with the conventional method.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

An efficient and accurate cross-mode image restoration method needs to be drawn up, a defect area can be accurately positioned, effective information in a touch signal is mined, and high-quality image data restoration is achieved. In recent years, encoder-decoder networks have achieved good success in the field of image restoration, and attention mechanisms also provide a simple and efficient way for accurately positioning image defect areas. Therefore, the invention provides a cross-mode image restoration method based on an attention mechanism. Based on the selective attention capacity of an attention mechanism, the image defect region can be accurately positioned in a focused manner, and the transfer characteristic of the region is obtained; the correlation embedding learning module is combined with real label information to construct a correlation embedding learning space, when a semantic feature learning task is completed by a minimum semantic correlation target function, the difference between a target function minimum prediction label and a real label is measured by adopting classification based on cross entropy, a final total target function of semantic similarity learning among different modalities in a correlation embedding learning stage is obtained, and the most relevant tactile features in the tactile features and the image defect region are mined; under the constraints of a perception constraint loss function, an appearance constraint loss function and a resistance loss function which fit the difference and the distribution difference among pixels, the generated model based on the self-encoder utilizes the mined tactile features most relevant to the image defect region to realize the cross-mode restoration of the defect image data, and improves the image quality from the semantics and the distribution on the basis of not increasing the complexity of the model.

Specifically, as shown in fig. 1, the present invention relates to a cross-modal image restoration method based on attention mechanism, which specifically includes the following steps:

the method comprises the following steps of 1, selecting a multi-mode data set, wherein the multi-mode data set comprises defect image data, real image data and touch signals, and is divided into a training set and a testing set, and the method specifically comprises the following steps:

selecting three different modality data including defect image data I, real image data R and tactile signals H to form a multi-modal data set D; the real image data is an original color image signal, the tactile signal is tactile power spectral density obtained by preprocessing the tactile original signal, the defect image data is an image with a defect rate of lambda obtained by preprocessing the real image data, and the lambda is 40%.

And (1-2) for the data of different modes in the multi-mode data set D, counting the real label information Y of the data, namely using a one-hot code to print the category label of the content information represented by each data.

Step (1-3) of randomly selecting data with the proportion of alpha from the multi-modal data set D as trainingExercise and Collection D_trThe remaining data of 1-alpha ratio is used as test set D _teHere, α is 0.8.

Step 2, designing an attention-based cross-modal image restoration AGVI model, as shown in fig. 2, where the model includes four modules, namely a learnable feature extraction module, a feature attention transfer module, a correlation embedding learning module, and a cross-modal image restoration module, and the modules have functions of: firstly, extracting bottom layer characteristics of real image data, defect image data and tactile signals; secondly, according to the selective attention capacity of an attention mechanism, an image defect area is mainly positioned, and transfer characteristics of the area are obtained; then, introducing similarity measurement and supervision information, constructing a relevant embedded learning space by combining real label information under the constraint of semantic association and category measurement loss functions, minimizing a semantic feature learning task by using a semantic association target function, minimizing the difference between a predicted label and a real label by using a cross entropy-based classification measurement target function, obtaining a total target function of semantic similar learning between different modalities in a final relevant embedded learning stage, and excavating a haptic feature part which is most relevant to an image defect area in haptic features as far as possible; and finally, combining an inter-pixel perception constraint loss function, an appearance constraint loss function and an antagonistic loss function, and performing cross-mode restoration on defective image data by utilizing the mined tactile characteristics most related to the defective image area to ensure the quality of a terminal video signal, wherein the method specifically comprises the following steps:

Step (2-1), a learnable feature extraction module, which is used for extracting the features of the tactile signal, the defective image data and the real image data and participating in the subsequent end-to-end model training, and the concrete implementation process is as follows:

for the tactile signal H, a gate cycle Unit (GRU) and a 3-layer fully-connected network are used as a tactile mapping network to extract a tactile feature H, the GRU network consists of a reset gate and an update gate, the number of the set units is 128, the output dimensions of the fully-connected layer are 1024, 128 and 8, a 128-dimensional tactile feature H is output, and the last layer of the fully-connected layer is a sigmoid layer and used for outputting a tactile feature prediction label y^(h)。

For defect image data I and real image data R with a size of 128 × 128, in order to ensure consistency of distribution of defect images and real image features, a deep convolutional neural network is used as an image mapping network to extract hierarchical features, the network comprises 3 convolutional layers and 3 full-connected layers, the number of convolutional kernels is 256, 128 and 64, the size of the convolutional kernel is 4x4, the output dimension of the full-connected layer is 1024 and 128, and the last full-connected layer outputs 128-dimensional defect image features I and real image features R, and the specific process is as follows:

In the above formula, h, i and r are the tactile feature, the defective image feature and the real image feature, respectively, and the dimensions of h, i and r are

And

θ_hand theta_iRespectively haptic mapping network F_h(H；θ_h) And an image mapping network F_i(I/R；θ_i) The parameter set of (2).

Step (2-2), a transfer characteristic attention module for introducing an attention mechanism, accurately positioning the image defect area and acquiring the transfer characteristic representing the defect area, wherein the specific implementation process is as follows:

And

and

representing dimensions of the defect image feature i and the real image feature r, respectively. Then, normalized inner product is adopted to each i in defect image characteristics i_kAnd each of the real image features r_lCalculating to obtain cosine similarity between all feature units of the two features, wherein the cosine similarity is specifically represented as follows:

wherein, c_k,lExpressing cosine similarity matrix, | | | | | represents modular operation, < · represents normalized inner product operation.

Step B, transferring the most relevant part of each characteristic unit for characterizing the defect area in the defect image characteristic from the real image characteristic, namely a cosine similarity matrix c for k rows and l columns _k,lTakes the maximum value, this process is expressed as:

wherein, a_kFor distractive indexing, the feature metric that is most correlated with the k-th position of the defective image feature i in the real image feature r is represented.

t_k＝ra_k，

wherein, t_kRepresenting the a-th of the selected real image features r_kThe characteristic value of the k-th position in the transfer characteristic t is obtained by transferring the characteristic value of each position. Finally, in order to enhance the distinguishing capability of the transfer characteristics, the transfer characteristics are classified through a sigmoid layer to obtain a prediction label y of the transfer characteristics^(t)。

Step (2-3), a correlation embedding learning module, which is used for constructing a correlation embedding learning space by combining real label information, and measuring the difference between a target function minimum prediction label and a real label by adopting classification based on cross entropy while completing a semantic feature learning task by minimizing a semantic correlation target function, so as to obtain a total target function of semantic similarity learning between different modalities in a final correlation embedding learning stage, and mining the haptic features most relevant to an image defect area in the haptic features, wherein the specific implementation process is as follows:

Step A, constructing a relevant embedded learning space by using category label information Y ═ Y }, Y ∈ {1,2, …, C }, and completing a semantic feature learning task, wherein the process mainly comprises the step of minimizing a semantic association objective function L_remThe realization method comprises the following steps:

δ_pq＝h_p*t_q，

wherein y represents the true label, C represents the total number of training data classes, N represents the total number of training data, s_pqIs a class association factor, δ_pqIn order to be a feature correlation factor,

and

predictive labels for the pth haptic feature and the qth branch feature, respectively, (.)^TDenotes a transpose operation, and h_pAnd t_qRespectively representing the pth haptic feature and the qth transition feature. By observing the semantic relevance function L_remIt can be found that when s_pq＝1，δ_pqThe larger, L_remThe smaller and vice versa. The semantic relevance objective function L_remIn the correlation embedding learning space, the transfer features with the same semantics can assist the sense of touch to carry out semantic feature learning, namely, the tactile features with the highest correlation degree with the image defect region are extracted from the tactile features extracted by the learnable feature extraction module.

Wherein L is₁And L₂Classification metric objective functions representing the transfer features and the haptic features, respectively. y is_pAnd

a true label and a predicted label representing the qth branch feature, respectively.

Step C, synthesizing the loss functions in step A, B to obtain a total objective function of semantic similarity learning between different modalities in the relevant embedded learning stage, which can be expressed as:

L_sim＝L_rem+α₁L₁+α₂L₂，

wherein L is_simFor the final semantic similarity loss function, α₁And alpha₂Is hyperparametric, L_remTo represent the semantic relatedness between the two modalities.

(2-4) a cross-mode image restoration module, configured to perform cross-mode restoration on defective image data by using the mined tactile features most relevant to the defective image region in combination with the inter-pixel perception constraint loss function, the appearance constraint loss function, and the countermeasure loss function, and the specific implementation process is as follows:

Then restoring the image features

Inputting the decoder De to obtain a restored image

Namely, it is

As similar as possible to the real image R, this process is expressed as:

wherein, theta_deSet of network parameters, L, for decoder De_aAnd L_pRespectively an appearance constraint loss function and a perceptual constraint loss function,

is a function of distribution of the restored image

The expectation is that.

Is a feature extraction network similar to VGG, | | · |. non-woven phosphor₁Operate to take the norm of L1.

In this module, the decoder De includes 2 fully-connected layers and 4 deconvolution layers, the dimensions of the fully-connected layers are 128 and 1024 respectively, the number of deconvolution is 64, 128, 256 and 512, and the output is a 128 × 128 repaired image

The discriminator D includes 3 convolutional layers and 3 fully-connected layers, the convolutional layer output dimension is 512, 256, 128, the convolutional kernel size is 5x5, the fully-connected layer dimension is 1024, 128, 1, and finally, the number in a range of (0, 1) is output to represent the probability that the input image is a real image.

Step B, after the image is restored in a mode crossing manner, further restricting the distribution structure of the restored image in the step A; specifically, the distribution of real image data is learned using a countermeasure loss function, defining a countermeasure loss function L_advCan be expressed as:

And

are respectively the true image distribution function P_data(R)And a repair image distribution function

D (R; θ)_d) And

the probabilities of the true image and the restored image being true are identified for the discriminator D, respectively.

L_imp＝L_a+β₁L_p+β₂L_adv，

Step 3, training the cross-modal image restoration AGVI model by using a training set to obtain an optimal cross-modal image restoration AGVI model structure and network parameters, which are as follows:

step (3-1) of restoring the image

D_tr＝{(I_n,R_n,H_n,y_n),n＝1,2,…,N}，

Wherein, y_nDefective image data I for the n-th group participating in training_nTrue image data R_nAnd tactile signal H_nAnd N is the total capacity of the training data.

Step (3-2), initializing a network parameter set of the AGVI model, wherein the set comprises theta_h,θ_i,θ_de,θ_dThese parameters are initialized to a standard normal distribution. Wherein, theta_hAnd theta_iParameter sets for the haptic mapping network and the image mapping network, respectively; theta _deSet of network parameters, θ, for the decoder De_dIs the set of network parameters for arbiter D.

And (3-3) setting the total iteration number as G-400, and recording the specific iteration number by using G.

Step (3-4), training an AGVI model by adopting a random gradient descent method; the specific process is as follows:

firstly, setting a hyper-parameter alpha₁＝10^-3，α₂＝10^-4，β₁＝10^-4，β₂＝10^-5Learning rate μ for haptic mapping network and image mapping network₁0.0005, learning rate μ of decoder arbiter₂＝0.0003；

Step two, calculating the output of each network in the AGVI model:

h＝F_h(H；θ_h)；i＝F_i(I；θ_i)；r＝F_i(R；θ_i),

And

is the derivative.

If G is less than G, jumping to the step (3-4), adding 1 to the iteration times (G is G +1), and continuing the next iteration; otherwise, the iteration is terminated.

And (3) finally outputting the optimal AGVI model structure and network parameters after G-round iteration.

Step 4, performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signal and the defect image data in the test set, wherein the method specifically comprises the following steps:

step (4-1) and test set D divided in step 1_teComprises the following steps:

D_te＝{(I'_j,H'_j),j＝1,2,…,F},

wherein, I'_jAnd H'_jDefective image data and tactile signal I 'of the j-th group pair'_jAnd H'_jThe method is used for model testing, and F is the total amount of test data;

(4-2) test set D_teThe data in the step (3) are input in pairs into the optimal cross-mode image restoration AGVI model trained in the step (3), and the output is the restored image.

As shown in fig. 3, the present invention further relates to a cross-mode image restoration apparatus based on attention mechanism, which specifically includes: a memory, a processor, and a computer program stored on the memory and executable on the processor; wherein:

1. a memory is used for storing at least one program.

2. And using a processor for loading at least one program to execute the attention-based cross-mode image restoration method in the embodiment to realize high-quality and fine-grained restoration of the defective image.

Performance evaluation:

the invention carries out experiments according to the process, selects an LMT Material Surface standard data set as an experimental data set, and the data set is published in IEEE TRANSACTIONS ON HANDICS journal in 2017 by the literature of 'Multimodal Feature-based Surface Material Classification' (the authors are Matti Strese, Clemens Schuwerk, Albert Iepure, and Eckey Steinbach). Material information containing three instances of image, sound and tactile acceleration, 80% of the data (tactile, image) from each category were selected as the training set, and the remaining 20% were used as the test set.

The prior method I comprises the following steps: the document "PatchMatch: A random corrected texture algorithm for structural image editing" (by Connelly Barnes, Eli Shechman, Adam Finkelstein, Dan B Goldman et al), proposes a typical block matching image repair algorithm for image repair by finding similar blocks in non-defective regions and copying them to defective regions. This is a restoration method in a modality, and restoration is performed mainly based on information of an image itself.

The prior method II comprises the following steps: document "Context Encoders: Feature Learning by Inpaiting "(Author: deep Pathak, Philipp)

Jeff Donahue, Trevor Darrell) proposes an unsupervised image feature learning algorithm based on context pixels, and uses the codec as a basic structure and the pixel information of the image itself to complete the repair and filling of the content of the defect area.

The existing method is three: the document "unserviced restoration left with Deep adaptive genetic additive Networks" (author Radford Alec, Metz Luke, Chintala Soumith) proposes an image restoration model based on Deep convolution generation of a countermeasure network, which includes two models: a generating model G used for capturing data distribution and a discriminant model D used for estimating the probability that the sample comes from training data instead of G, and the image restoration is realized based on the mode of the game fighting.

The prior method comprises the following steps: the document "Text-Guided Neural Image Inpainting" (author Lisai Zhang, Qingcai Chen, Baotian Hu, Shuoran Jiang) proposes an Image restoration algorithm based on character guidance, which takes descriptive Text as a condition, compares semantic contents of a given Text and residual images through an attention mechanism, and finds out semantic contents filled for defective parts to carry out Image restoration.

The first reference method comprises the following steps: and removing the attention mechanism module, and only fusing the tactile features and the defect image features to realize the image restoration. The method can be used to verify the importance of the attention mechanism module.

The invention comprises the following steps: the method of the present embodiment.

The quality of the repaired image is evaluated mainly from two aspects of subjective qualitative evaluation and objective quantitative evaluation.

First, in terms of subjective qualitative assessment, fig. 4 shows image restoration effect graphs of AGVI models of the comparison scheme and the method of the present invention, respectively. The defect rate in this experiment was 40% and the size of the defect region was about 51 × 51. Each row of images represents the original image, the defect image, the prior method I, the prior method II, the prior method III, the prior method IV, the reference method I and the AGVI model of the method from left to right. As can be seen from fig. 4, compared with the depth image restoration method, the restoration result of the first conventional method has a poor visual perception effect, the texture cannot be clearly resolved, and the problems of image blur and detail loss are also obvious. The second existing method based on the self-encoder only has accurate and reasonable color repair to the damaged area, and the details of the structure and the texture are still very fuzzy. Compared with the first method and the second method, the repair result shows clearer texture and structural features, but has more obvious blurring and artifact phenomena at the edge. In the repair result of the AGVI model of the method, the semantic information of the repair area and the whole image is relatively fit, the artifacts and the fuzzy phenomenon at the edge disappear, and the repair effect is more vivid. Meanwhile, compared with the first reference method, the designed model has more complete texture and structure information and the image restoration performance is optimal.

Table I evaluation results of the present invention show

Method	MSE	SSIM
			Prior document I	130.230	0.587
Prior art document 2	125.622	0.616
			Prior art document three	126.145	0.606
Prior art document four	118.547	0.657
			Reference method 1	115.369	0.623
AGVI	100.750	0.845

Secondly, in the aspect of objective quantitative evaluation, a common Mean Square Error (MSE) and Structure metric (SSIM) in image quality evaluation indexes are adopted, and the smaller the MAE and the smaller the SSIM, the better the image quality of cross-modal restoration is. Table I shows the MSE scores and SSIM scores of the AGVI model of the invention and other comparative models, and the performance of the AGVI model was evaluated from two perspectives, image perception and structural comparison, respectively. As can be seen from the table, the AGVI model of the invention has the highest MSE score and the highest SSIM score. The first, second and third existing methods mainly perform intra-modal repair based on the information of the defective image, but when the image is severely defective, the repair effect is poor, the MSE score reaches about 130, and the SSIM value is only about 0.6. The existing method four, the reference method one and the AGVI model of the invention verify the content repair capability of the non-image modal information on the defect area and obviously improve the repair effect by the attention mechanism. In particular, the AGVI model designed by the method of the present invention combines the repair capability of the self-encoder with the selective attention capability of the attention mechanism, and the repair results have the best visual quality.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A cross-mode image restoration method based on an attention mechanism is characterized by comprising the following steps:

the method comprises the following steps of 1, selecting a multi-modal data set, wherein the multi-modal data set comprises defect image data, real image data and a touch signal, and is divided into a training set and a testing set;

the relevant embedding learning module is used for constructing a relevant embedding learning space by combining real label information, measuring the difference between a target function minimum prediction label and a real label by adopting classification based on cross entropy while completing a semantic feature learning task by a minimum semantic associated target function, obtaining a total target function of semantic similarity learning among different modalities in a final relevant embedding learning stage, and mining the most relevant tactile features in the extracted tactile features and the image defect area;

The cross-modal image restoration module is used for combining the perception constraint loss function, the appearance constraint loss function and the antagonistic loss function among the pixels and performing cross-modal restoration on the defective image data by utilizing the mined tactile characteristics most related to the defective image area;

and 4, performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model by using the touch signals and the defective image data in the test set.

2. The attention-based trans-modal image inpainting method of claim 1, wherein the selecting a multi-modal dataset in step 1 specifically comprises:

selecting three different modality data including defect image data I, real image data R and tactile signals H to form a multi-modal data set D; the real image data is an original color image signal, the touch signal is touch power spectral density obtained by preprocessing the touch original signal, the defect image data is an image with a defect rate of lambda obtained by preprocessing the real image data, and the value range of lambda is between 0 and 1;

Step (1-2), for data of different modes in the multi-mode data set D, carrying out statistics on real label information Y of the data, namely using a one-hot code to print a category label of content information represented by each data;

3. The attention mechanism-based cross-modal image inpainting method of claim 1, wherein the learnable feature extraction module in step 2 extracts features of the haptic signal, the defective image data, and the real image data, and specifically comprises:

For defect image data I and real image data R, extracting shallow defect image characteristics I and real image characteristics R by adopting an image mapping network formed by a depth convolution neural network, and the specific process is as follows:

And

4. The cross-modal image inpainting method based on the attention mechanism as claimed in claim 1, wherein the step 2 of acquiring the transfer characteristics characterizing the defect region by the attention transfer module specifically comprises:

And

i_kand r_lRespectively represent the defective image characteristics ik feature values and the l-th feature value in the real image feature r,

and

wherein, c_k,lExpressing a cosine similarity matrix, | | | | | represents a modular operation, and <, > represents a normalized inner product operation;

step B, transferring the most relevant part of each characteristic unit for characterizing the defect area in the defect image characteristic from the real image characteristic, namely a cosine similarity matrix c for k rows and l columns _k,lTakes the maximum value for each column of (a), this process is expressed as:

t_k＝ra_k，

wherein, t_kRepresenting the selection of the a-th in the real image features r_kThe characteristic value of each position is transferred to obtain the characteristic value of the kth position in the transfer characteristic t, and then the characteristic value is subjected to characteristic classification through a sigmoid layer to obtain a prediction label y of the transfer characteristic^(t)。

5. The attention mechanism-based cross-mode image restoration method according to claim 1, wherein the step 2 of mining the haptic features most relevant to the image defect area in the haptic features by the relevant embedded learning module specifically comprises:

step A, constructing a relevant embedding learning space by using real label information Y ═ Y }, Y ∈ 1,2, … and C to finish a semantic feature learning task, wherein the process mainly comprises the step of minimizing a semantic association objective function L_remThe realization method comprises the following steps:

δ_pq＝h_p*t_q，

wherein y represents the true label, C represents the total number of training data classes, N represents the total number of training data, s _pqIs a class association factor, δ_pqIn order to be a feature association factor, the feature association factor,

and

a predictive label for the p-th haptic feature and a predictive label for the q-th branch feature, respectively, (. cndot.)^TDenotes a transpose operation, h_pAnd t_qRespectively representing the p tactile feature and the q transfer feature; the semantic association objective function ensures that in the relevant embedded learning space, transfer features with the same semantics can assist the haptic sense to proceedSemantic feature learning, namely extracting the haptic features with the highest correlation degree with the image defect area from the haptic features;

true and predicted labels representing the qth branch feature, respectively;

L_sim＝L_rem+α₁L₁+α₂L₂，

wherein L is_simFor the final semantic similarity loss function, α ₁And alpha₂Is a hyperparameter, L_remTo represent the semantic relatedness between two modalities.

6. The attention-based cross-modal image inpainting method of claim 1, wherein the cross-modal image inpainting module in step 2 performs cross-modal inpainting on the defective image data by using the mined tactile features most relevant to the defective image region, and specifically comprises:

step A, adopting a decoder network to realize cross-modal image restoration; specifically, adding the tactile feature h and the defect image feature i to obtain a repaired image feature

Then restoring the image features

Inputting the decoder De to obtain a restored image

Namely, it is

As similar as possible to the real image R, this process is expressed as:

wherein, theta_deSet of network parameters, L, for the decoder De_aAnd L_pRespectively, appearance constraint loss functionAnd a perceptual constraint loss function that,

is a function of distribution of the restored image

In the expectation that the position of the target is not changed,

wherein L is_advIs a penalty function of the discriminator D, θ_dIs a set of network parameters for the arbiter D,

and

D (R; theta)_d) And

the probabilities that the real image and the restored image are true are respectively identified for the discriminator D;

L_imp＝L_a+β₁L_p+β₂L_adv，

7. The cross-modal image restoration method based on the attention mechanism according to claim 1, wherein the training of the cross-modal image restoration AGVI model by using the training set in the step 3 specifically comprises:

step (3-1) of restoring the image

D_tr＝{(I_n,R_n,H_n,y_n),n＝1,2,…,N}，

h＝F_h(H；θ_h)；i＝F_i(I；θ_i)；r＝F_i(R；θ_i),

step three, starting iteration; updating the parameter set of each network from the direction of the negative gradient of the target based on a gradient descent method and an Adam optimizer:

And

haptic mapping network F after g +1 and g iterations, respectively_hImage mapping network F_iDecoder network De, network parameter set of discriminator network D; v is a derivative;

if G is less than G, jumping to the step (3-4), and continuing the next iteration, wherein G is G + 1; otherwise, terminating the iteration;

And (3-5) finally outputting the structure and the network parameters of the optimal cross-mode image restoration AGVI model after G-round iteration.

8. The cross-modal image restoration method based on the attention mechanism according to claim 1, wherein the step 4 of performing cross-modal image restoration based on the optimal cross-modal image restoration AGVI model specifically comprises:

step (4-1), test set D divided in step 1_teComprises the following steps:

D_te＝{(I'_j,H'_j),j＝1,2,…,F},

step (4-2), test set D_teThe data in the step (3) are input in pairs into the optimal cross-mode image restoration AGVI model trained in the step (3), and the output is the restored image.

9. An attention-based cross-modal image restoration device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements an attention-based cross-modal image restoration method according to any one of claims 1 to 8.