CN109033095B

CN109033095B - Target transformation method based on attention mechanism

Info

Publication number: CN109033095B
Application number: CN201810866277.0A
Authority: CN
Inventors: 胡伏原; 叶子寒; 李林燕; 孙钰; 付保川
Original assignee: Suzhou University of Science and Technology
Current assignee: Suzhou University of Science and Technology
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2022-10-18
Anticipated expiration: 2038-08-01
Also published as: CN109033095A

Abstract

The invention relates to a target transformation method based on an attention mechanism, which comprises the following steps: training a neural network model: step 1, initializing parameters of a neural network model by using random numbers; step 2, inputting an image X belonging to the category X into a generator G of the model, entering an encoding stage, and calculating a first-layer characteristic diagram f by the X through a convolution layer ¹ . And performing target transformation on the image by using the trained neural network model, and introducing an attention mechanism into the model to enable the model to identify a target object to be converted in a target change task so as to distinguish the target from the background. Meanwhile, the consistency of the background of the original image and the converted image is ensured by constructing an attention consistency loss function and a background consistency loss function.

Description

Target transformation method based on attention mechanism

Technical Field

The invention relates to image translation, in particular to a target transformation method based on an attention mechanism.

Background

Object transformation (Object transformation) is a special task of image translation, whose purpose is to transform a specific type of Object in an image into another type of Object. Image translation (Image translation) aims at converting an original Image into an Image of a target style by learning a mapping relationship between two types of images, and has been applied to many aspects such as Image super-resolution reconstruction, artistic style migration, and the like in recent years. Researchers have proposed many efficient transformation methods under supervised conditions. However, the conversion method under the unsupervised condition becomes a research hotspot in image translation due to the large labor cost and time cost required for acquiring paired data. Visual Attribute Transfer (VAT) is a representation of the convolutional neural network CNN-based approach, which uses features at different levels in the model to match the most likely corresponding features in another graph. In addition, a method using a Generative Adaptive Network (GAN) achieves more significant effects than a method based on a convolutional neural network. Isola P et al explored the potential of GAN in image translation tasks. Subsequently, cycle-dependent Loss was proposed by Zhu j.y et al to solve the problem of unsupervised image translation, which assumed that the mapping relationship learned in the image translation task was a bi-directional mapping, and thus enhanced the effect of image translation of the model in an unsupervised environment.

The traditional technology has the following technical problems:

most of the current image translation methods do not take into account the difference between the conversion object and the background region. In a target change task, most models are difficult to effectively distinguish a conversion target from a background, and the consistency of an original image background and a conversion image background cannot be ensured. Therefore, the model generates the effects of blurring, discoloring and the like on the image background in the conversion process, and the quality of the converted image is reduced.

Disclosure of Invention

In view of the above, it is necessary to provide an object transformation method based on attention mechanism, which can distinguish the object from the background by introducing attention mechanism into the model to enable the model to identify the object to be transformed in the task of object transformation. Meanwhile, the consistency of the background of the original image and the converted image is ensured by constructing an attention consistency loss function and a background consistency loss function.

An attention-based target transformation method, comprising:

training a neural network model:

step 1, initializing parameters of a neural network model by using random numbers;

step 2, inputting an image X belonging to the category X into a generator G of the model, entering an encoding stage, and calculating a first-layer characteristic diagram f by the X through a convolution layer ¹ ；

Step 3, then f ¹ Two branch networks will be traversed: (a) One convolution layer obtains the characteristic diagram of the second layer without attention mask processing

(b) First passes through two convolutional layersThen passing through a deconvolution layer to obtain an

Corresponding attention mask M ² (ii) a Will M ² And

element by element multiplication, the product being obtained and

are added one by one to obtain a processed second layer characteristic diagram f ² ；

Step 4, f ² Obtaining the characteristic diagram f of the next layer according to the mode of the step 3 ³ (ii) a Then, f ³ Further fine features are obtained by 6 layers of residual convolution layers with convolution kernel size of 3 x 3 and step size of 1;

step 5, entering a decoding stage, and taking the deconvolution layer as a decoder; f. of ³ Two branch networks will be traversed: (a) An deconvolution layer is subjected to a second layer profile without attention masking

(b) First through two deconvolution layers and then through one convolution layer to obtain a sum

Corresponding attention mask M ⁴ (ii) a Will M ⁴ And

element by element multiplication, the product being obtained and

are added one by one to obtain a processed second layer characteristic diagram f ⁵ ；

Step 6, entering an output stage, f ⁵ Two branch networks will be traversed: (a) an deconvolution layer to obtain a transformed image y'; (b) The y' is obtained by two deconvolution layers and one convolution layerAttention mask M _G(x) ；

Step 7, y 'is input into another generator F, and the same operation as in step 2-6 is performed to obtain x' and the corresponding attention mask M _F(G(x)) ；

Step 8, inputting x and x' into a discriminator D _x Middle, discriminator D _x The probability that the input image belongs to the category X is returned; likewise, y and y' are input to the discriminator D _Y Obtaining the probability that Y and Y' belong to the category Y; the value of the opposition loss function is thus calculated:

step 9, calculating the value of the cycle consistent loss function according to x, x ', y, y':

L _cyc (G，F)＝||x′-x|| ₁ +||y′-y|| ₁ #(3)

step 10, use M _G(x) Separating the background in x and y' from the conversion target, calculating the background change loss:

L _bg (x，G)＝γ*||B(x，M _G(x) )-B(y′，M _G(x) )|| ₁ #(4)

B(x，M _G(x) )＝H(x，1-M _G(x) )#(5)

γ is set to 0.000075 to 0.0075; the value of the H (K, L) function is that elements in K are multiplied by elements in L one by one; also, M may be used _F(G(x)) Calculating the background change loss L by using y and x _bg (y，F)；

Step 11, with M _G(x) And M _F(G(x)) Calculating the attention change loss:

L _att (x，G，F)＝α*||M _G(x) -M _F(G(x)) || ₁ +β*(M _G(x) +M _F(G(x)) )#(6)

α is set to 0.000003 to 0.00015, β is set to 0.0000005 to 0.00005;

step 12, adjusting model parameters according to the error obtained in the previous step 8-11 by a back propagation algorithm with the learning rate of 0.00002 to 0.002;

step 13, taking y as an input image, and calculating an error through the operations of the steps 2 to 11, except that the y passes through a generator F and then a generator G; adjusting the model parameters according to the method in the step 12;

step 14, continuously repeating the steps 2-13 until the model parameters are converged;

and carrying out target transformation on the image by using the neural network model obtained by training.

The above target transformation method based on the attention mechanism enables the model to identify the target object needing to be converted in the target change task by introducing the attention mechanism into the model, so as to distinguish the target from the background. Meanwhile, the consistency of the background of the original image and the converted image is ensured by constructing an attention consistency loss function and a background consistency loss function.

In another embodiment, α is set to 0.000015.

In another embodiment, β is set to 0.000005.

In another embodiment, γ is set to 0.00075.

In another embodiment, the back propagation algorithm is optimized by Adam.

In another embodiment, the learning rate of the back propagation algorithm is 0.0002.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when the program is executed.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of any of the methods.

A processor for running a program, wherein the program when running performs any of the methods.

Drawings

Fig. 1 is an overall schematic diagram of a model structure of an attention-based target transformation method according to an embodiment of the present application.

Fig. 2 shows three different DAU structures in an attention-based target transformation method according to an embodiment of the present application. (DAU) _decode And DAU _final Structurally, the Attention Mask depth is different only for output. )

FIG. 3 is a comparison of experimental results of an attention-based objective transformation method with the CycleGAN and VAT methods on ImageNet datasets, provided by an embodiment of the present application.

FIG. 4 is a comparison of the results of experiments on CelebA data sets with the cycleGAN and VAT methods using an attention-based target transformation method provided in the examples of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

An attention-based target transformation method, comprising:

training a neural network model:

step 2, inputting an image X belonging to the category X into a generator G of the model, entering a coding stage, and calculating a first-layer characteristic diagram f by the X through a convolution layer ¹ ；

(b) First through two convolutional layers and then through a deconvolution layer to obtain a sum

Corresponding attention mask M ² (ii) a Will M ² And with

Element by element multiplication, the product being further multiplied by

(b) First through two deconvolution layers and then through one convolution layer to obtain the sum

Attention mask M ⁴ (ii) a Will M ⁴ And

element by element multiplication, the product being further multiplied by

Step 6, entering an output stage, f ⁵ Two branch networks will be traversed: (a) an deconvolution layer to obtain a transformed image y'; (b) Obtaining an attention mask M corresponding to y' through two deconvolution layers and a convolution layer _G(x) ；

Step 7, y' will beInputting into another generator F, and obtaining x' and corresponding attention mask M after the same operation as the step 2-6 _F(G(x)) ；

Step 8, inputting x and x' into a discriminator D _x Middle, discriminator D _x The probability that the input image belongs to category X will be returned; likewise, y and y' are input to the discriminator D _Y Obtaining the probability that Y and Y' belong to the category Y; the value of the opposition loss function is thus calculated:

L _cyc (G，F)＝||x′-x|| ₁ +||y′-y|| ₁ #(3)

L _bg (x，G)＝γ*||B(x，M _G(x) )-B(y′，M _G(x) )|| ₁ #(4)

B(x，M _G(x) )＝H(x，1-M _G(x) )#(5)

γ is set to 0.000075 to 0.0075; the value of the H (K, L) function is that elements in K are multiplied by elements in L one by one; also, M may be used _F(G(x)) Calculating the background change loss L from y and x _bg (y，F)；

Step 11, with M _G(x) And M _F(G(x)) Calculating attention change loss:

α is set to 0.000003 to 0.00015, β is set to 0.0000005 to 0.00005;

The above object transformation method based on the attention mechanism enables the model to identify the object needing to be transformed in the object change task by introducing the attention mechanism into the model, so as to distinguish the object from the background. Meanwhile, the consistency of the background of the original image and the converted image is ensured by constructing an attention consistency loss function and a background consistency loss function.

In another embodiment, α is set to 0.000015.

In another embodiment, β is set to 0.000005.

In another embodiment, γ is set to 0.00075.

In another embodiment, the back propagation algorithm is optimized by Adam.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.

A specific application scenario of the present invention is described below:

the invention studies to enable a model to distinguish objects from the background while learning to map an image set X containing one type of object to an image set Y containing another type of object. The following figure shows the architecture of the model herein, our model comprising 4 modules: generator G, generator F, and discriminator D _X And a sum discriminator D _Y . G is used to learn the mapping function G: x → Y. The generator F learns another inverse mapping function F: y → X. D _X For distinguishing between the original image x and the converted image F (y), and, correspondingly, D _Y To distinguish between the original image y and the transformed image G (x). We build a Deep Attention Unit (DAU) in both generator G and generator F to extract the critical areas.

(1) Depth attention unit:

attention was calculated separately on each modality as follows: attention mask M ∈ R is extracted herein by constructing Deep Attention Unit (DAU) ³ The model has the capability of distinguishing the target from the background. The structure of the generator after the depth attention unit is added is shown in the lower part of fig. 1.

In the encoding Stage (Encode Stage), as shown in the lower half of FIG. 1, a feature map f of the n-1 st layer of an input image x is given ^n-1 (n is equal to {2,3 }), and a convolution layer is used as an encoder to obtain a characteristic diagram of the next layer of x

As shown in FIG. 2 (a), DAU will f ⁿ After being encoded by two convolutional layers, the coded signal is further encoded by a sigmoid function (y = 1/(1 + e) ^-x ) Performing one-time up-sampling on the deconvolution layer as an activation function to obtain a feature map

Mask M with consistent foot size ⁿ ：

In the decoding stage and the output stage, as shown in FIG. 3 (b), a deep attention unit, denoted as DAU, is used herein as well _decode And DAU _final . But its process and DAU _encode In contrast:

the value range of the sigmoid function is [0,1 ]]In between, therefore attention is paid to the mask M ⁿ Can be seen as a pair

The weight distribution of (2) can enhance the expression of meaningful features and suppress meaningless information. We will M ⁿ And with

An element-wise product is made, denoted as H (#). Furthermore, referring to the residual network and the residual attention network, we add shortcut to suppress the gradient vanishing problem.

Finally obtaining the characteristic diagram f of the n-th layer through the operation ⁿ ：

(2) Round consistent loss function:

CycleGAN uses a cyclic consistent loss function to improve the image translation effect, which is referred to as Dual learning (Dual learning) in the field of machine translation, and it is considered that for each image X in the data set X, this conversion cycle can map X back to the original image: x '= F (y') = F (G (x)) ≈ x. Accordingly: y '= F (x') = G (F (x)) ≈ y. Since the model herein is also a dual learning structure. We also use the round robin uniform loss function

Improving the effect of converting the model into the image:

L _cvc (G，F)＝||F(G(x))-x|| ₁ +||G(F(y))-y|| ₁ #(6)

(3) Attention consistent loss function:

considering that the spatial position of the target in the image should remain unchanged in the conversion process F (G (x)), an Attention Consistency Loss function (Attention Consistency Loss) is therefore constructed herein to constrain the model:

L _att (x，G，F)＝α*||M _G(x) -M _F(G(x)) || ₁ +β*(M _G(x) +M _F(G(x)) )#(7)

M _G(x) and M _F(G(x)) Representing the masks output by the model in the last layer during the generation of G (x) and F (G (x)), respectively, where the values of the elements represent the probabilities that the corresponding elements belong to the conversion targets in the original image. The second term is a regularization term that prevents over-fitting of the model. α, β are the weights of both terms in the formula.

(4) Background consistent loss function:

when the DAU obtains the attention mask corresponding to the feature map, the model can distinguish the target from the background. A Background consistent Loss function (Background Consistency Loss) was constructed here:

L _bg (x，G)＝γ*||B(x，M _G(x) )-B(G(x)，M _G(x) )|| ₁ #(8)

B(x，M _G(x) )＝H(x，1-M _G(x) )#(9)

γ is a hyperparameter. B (x, M) _G(x) ) Is a background function, 1-M _G(x) The value of the middle element represents the probability that the corresponding element belongs to the background in the original image. For x and 1-M _G(x) And obtaining the background of x by calculating an element-wise product. B (G (x), M) _G(x) ) The same is true.

(5) Background consistent loss function:

the effectiveness of the generated image may be enhanced by an adaptive Loss. For the mapping function G: x → Y and its discriminator D _Y Expressed as:

g will attempt to make the generated image G (x) indistinguishable from the image of the data set Y, and D _Y The aim is to distinguish G (x) from y as much as possible. The goal of G is to minimize this objective function, instead D will try to maximize it.

(6) The complete objective function:

this translates into a min-max optimization problem:

the invention has the advantages that the model can effectively identify the target object in the image, neglect irrelevant background and further improve the final visual nominal effect, and the model obtains the best effect on a plurality of comparison experiments with other current most methods.

The text firstly constructs a Deep Attention Unit (DAU) module based on an Attention accumulation mechanism, and the purpose of the module is to identify a target object in an image, so as to guide a model to eliminate background interference and further prompt a conversion effect.

The experiment was validated on both data sets, imageNet and CelebA. ImageNet is a large-scale image dataset specifically used for machine vision studies. We extracted 995 apple images, 1019 orange images, 1067 horse images and 1334 zebra images from ImageNet for training the model.

Fig. 3 shows the results of comparative experiments on the ImageNet dataset and fig. 4 shows the results of comparative experiments on the CelebA dataset. It is clear that CycleGAN and VAT have a great influence on the background of the original image. For example, in the second column of fig. 3 (a) (b), the leaves fade from green to gray. In fig. four, the conversion of VAT completely fails: the face of the transformed image has been completely deformed and the due transformation features have not appeared. For example, in fig. 4 (b), the conversion between the non-glasses image → glasses image is not converted into an image with glasses on one face by VAT. However, the DAU-GAN method not only successfully completes the conversion task, but also effectively retains the background of the original image. For example, in the figure 3 (c) conversion of horse image → zebra image, the zebra image generated by DAU-GAN not only preserves the background with more natural streaks.

Table 1 mean change value of background for each transformed image.

To more accurately demonstrate the effectiveness of our method, we quantitatively counted the mean change in the transformed image background over the test set. Table 1 shows the results of the experiment. For each conversion, the background variation value of the DAU-GAN converted image is minimal. It strongly demonstrates that our model can preserve the background in target changes.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent should be subject to the appended claims.

Claims

1. An attention-based target transformation method, comprising:

training a neural network model:

Corresponding attention mask M ² (ii) a Will M ² And

element by element multiplication, the product being further multiplied by

Step 4, f ² Obtaining the characteristic diagram f of the next layer according to the mode of the step 3 ³ (ii) a Then, f ³ Further fine features are obtained by 6 layers of residual convolution layers with convolution kernel size of 3 x 3 and step length of 1;

Corresponding attention mask M ⁴ (ii) a Will M ⁴ And

element by element multiplication, the product being further multiplied by

Step 6, enter the output stage, f ⁵ Two branch networks will be traversed: (a) obtaining a transformed image y' from a deconvolution layer; (b) Obtaining an attention mask M corresponding to y' through two deconvolution layers and a convolution layer _G(x) ；

Step 8, inputting x and x' into a discriminator D _x Middle, discriminator D _x The probability that the input image belongs to the category X is returned; similarly, y and y' are input to a discriminator D _Y Obtaining the probability that Y and Y' belong to the category Y; the value of the opposition loss function is thus calculated:

L _cyc (G，F)＝||x′-x|| ₁ +||y′-y|| ₁ #(3)

L _bg (x，G)＝γ*||B(x，M _G(x) )-B(y′，M _G(x) )|| ₁ #(4)

B(x，M _G(x) )＝H(x，1-M _G(x) )#(5)

γ is set to 0.000075 to 0.0075; the value of the H (K, L) function is that elements in K are multiplied by elements in L one by one; likewise, with M _F(G(x)) Calculating the background change loss L from y and x _bg (y，F)；

Step 11, with M _G(x) And M _F(G(x)) Calculating the attention change loss:

α is set to 0.000003 to 0.00015, β is set to 0.0000005 to 0.00005;

2. The attention-based mechanism target transformation method of claim 1, wherein α is set to 0.000015.

3. The attention-based mechanism target translation method of claim 1, wherein β is set to 0.000005.

4. The attention-based mechanism target translation method of claim 1, wherein γ is set to 0.00075.

5. An attention-based mechanism target transformation method according to claim 1, characterized in that the back propagation algorithm is Adam optimized.

6. The attention-based mechanism target transformation method of claim 1, wherein the back-propagation algorithm has a learning rate of 0.0002.

7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the program is executed by the processor.

8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

9. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 6.