CN113379655B - Image synthesis method based on a dynamic self-attention generative adversarial network
- Publication number
- CN113379655B (application CN202110537516.XA)
- Authority
- CN
- China
- Prior art keywords
- attention
- feature map
- self
- generator
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses an image synthesis method based on a dynamic self-attention generative adversarial network, and belongs to the field of computer vision. The method first selects a generative adversarial network (GAN) as the basic framework, normalizes the training pictures, and samples a normal distribution to obtain noise samples. Drawing on the Linformer algorithm and the dynamic convolution algorithm, the invention improves the multi-head self-attention mechanism it uses and adds connections and constraints between the self-attention heads, so that the heads can learn knowledge of the various patterns in the image. The invention combines the advantages of the dynamic self-attention mechanism and the GAN: the proposed dynamic self-attention module greatly reduces the computational complexity of the multi-head self-attention mechanism and alleviates problems of GANs such as mode collapse and unstable training.
Description
Technical Field
The invention belongs to the field of computer vision, and mainly relates to the problem of image synthesis; the method is mainly applied to the fields of image restoration, editing, enhancement, retrieval and the like.
Background
Image synthesis is a technique that uses computer vision to understand image content and to generate a specified image as required. It is generally divided into two types: unsupervised image synthesis and supervised image synthesis. Unsupervised image synthesis learns a mapping function from a noise distribution to an image distribution and synthesizes images with this mapping function. Supervised image synthesis learns the conditional distribution of the image data and then generates an image under a given condition. Image synthesis is an active research topic in computer vision and is the basis for image restoration, editing and enhancement. It can address the lack of visual data in fields such as military affairs, medical treatment and security, and can also be applied in fields such as film and television entertainment and graphic design.
Since human beings are sensitive to information such as the details and edges of images, image synthesis algorithms need to guarantee the realism and diversity of the synthesized images. To improve the realism and diversity of synthesized images, many researchers have used deep generative techniques to improve earlier image synthesis algorithms. However, when the target data distribution is very complex, early deep generative methods often suffer from heavy computation and are difficult to solve. The generative adversarial network (GAN) proposed by Goodfellow et al. in 2014 solved this problem well. Compared with earlier deep generative methods, the GAN has the following clear advantages: 1. A GAN can generate samples of higher dimension simply by increasing the output dimension of the generator and the input dimension of the discriminator. 2. A GAN makes no prior assumption about the data distribution, so the distribution of the model does not need to be designed by hand. 3. The data distribution synthesized by a GAN is very close to the distribution of real samples, so the realism and diversity of the synthesized images can be well ensured. Because of these advantages, the present invention performs the image synthesis task with a GAN.
At present, existing GAN methods still suffer from problems such as mode collapse and unstable training. To alleviate these problems, Zhang, Goodfellow et al. modeled long-range correlations between synthesized pixels by introducing a non-local self-attention mechanism, and their Self-Attention Generative Adversarial Network (SAGAN) achieved a major breakthrough in image synthesis tasks in various fields. Reference: H. Zhang, I. Goodfellow, D. Metaxas, et al. Self-attention generative adversarial networks [C]. International Conference on Machine Learning, 2019, 7354-7363. However, this model has high computational complexity, low computational efficiency, and poor performance in modeling long-range correlations between pixels. Building on the SAGAN model and drawing on the Linformer algorithm and the dynamic convolution algorithm, the invention provides an image synthesis method based on a dynamic self-attention generative adversarial network and obtains excellent results. Reference: Y. Chen, X. Dai, M. Liu, et al. Dynamic convolution: Attention over convolution kernels [C]. IEEE Conference on Computer Vision and Pattern Recognition, 2020, 11030-.
Disclosure of Invention
The invention relates to an image synthesis method based on a dynamic self-attention generative adversarial network. It addresses the high computational complexity, low computational efficiency, and poor modeling of long-range correlations between pixels found in existing GAN methods based on a self-attention mechanism. The method first selects a generative adversarial network (GAN) as the basic framework, normalizes the training pictures, and samples a normal distribution to obtain noise. Drawing on the Linformer algorithm and the dynamic convolution algorithm, the invention improves the multi-head self-attention mechanism it uses and adds connections and constraints between the self-attention heads, so that the heads can learn knowledge of the various patterns in the image. During training, the method feeds noise and pictures into the network together and trains the model with the GAN algorithm. After training is completed, the image synthesis task can be carried out by feeding noise into the generator network. In this way the advantages of the dynamic self-attention mechanism and the GAN are combined: the proposed dynamic self-attention module greatly reduces the computational complexity of the multi-head self-attention mechanism, and problems of GANs such as mode collapse and unstable training are alleviated.
For convenience in describing the present disclosure, certain terms are first defined.
Definition 1: Generative adversarial network (GAN). A GAN consists of a pair of adversarial neural networks, called the generator and the discriminator. The generator G takes as input noise z obtained by random sampling from some data distribution $p_z(z)$ and establishes a mapping between this distribution and the target data distribution. The input of the discriminator D is either a real sample or the output of the generator G, and the discriminator tries to distinguish the two as well as possible. The output of the discriminator D is a scalar D(x), which represents the probability that the input sample x comes from the real data rather than the synthetic data. In practice, an alternating training scheme is generally used so that the discriminator and the generator both move toward their optimum. First, the parameters of the generator G are fixed and the discriminator D is trained by maximizing the objective function, which improves the discrimination accuracy of D; then the parameters of the discriminator D are fixed and the objective function is minimized so that the results generated by G approach the real data, which lowers the accuracy of the discriminator. This alternating process is repeated, and when the output of the generator matches the real data distribution, the objective function reaches its global optimum. The objective function of the optimization process can be expressed by the following formula:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
In the above formula, min and max denote minimization and maximization respectively, $\mathbb{E}[\cdot]$ denotes the mathematical expectation over the data distribution, x denotes the real data, $p_{data}(x)$ is the real data distribution, and z denotes a vector randomly sampled from some data distribution $p_z(z)$. $\mathbb{E}_{x \sim p_{data}(x)}$ and $\mathbb{E}_{z \sim p_z(z)}$ denote the expectations over $x \sim p_{data}(x)$ and $z \sim p_z(z)$, respectively.
Definition 2: Non-local self-attention mechanism. The non-local self-attention mechanism typically includes three modules: query, key, and value. The query and the key first undergo a correlation operation, and the result is then used to weight the value. The core operator is
$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$
where i denotes the index of the output position and j denotes the index enumerating all possible positions; x represents the input image, f(·) computes the correlation between the pixel at position i and the pixels at all possible positions, g(·) represents an arbitrary transformation of x, and C(x) normalizes the computed result.
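As an illustration of Definition 2, the following is a minimal NumPy sketch of the normalized weighted sum over all positions j. It assumes an exponentiated dot product for f and linear maps for the query, key and value transforms; the names `non_local`, `w_theta`, `w_phi` and `w_g` are hypothetical and are not taken from the patent.

```python
import numpy as np

def non_local(x, w_theta, w_phi, w_g):
    """Weighted sum over all positions j of f(x_i, x_j) * g(x_j), divided by C(x)."""
    query = x @ w_theta                      # (N, d) query embedding of each position
    key = x @ w_phi                          # (N, d) key embedding of each position
    value = x @ w_g                          # (N, d) g(x_j), the transformed input
    scores = query @ key.T                   # (N, N) pairwise correlations
    # Subtracting the row maximum rescales f by a per-row constant, which
    # cancels in the normalization below and avoids overflow in exp.
    scores = scores - scores.max(axis=1, keepdims=True)
    f = np.exp(scores)                       # f(x_i, x_j) up to that constant
    c = f.sum(axis=1, keepdims=True)         # C(x), one normalizer per output position
    return (f / c) @ value                   # (N, d) outputs y_i

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                 # 16 positions, 8 channels (arbitrary sizes)
w_theta, w_phi, w_g = (rng.normal(size=(8, 4)) for _ in range(3))
y = non_local(x, w_theta, w_phi, w_g)
```

With this choice of f and C(x), each output is a convex combination of the transformed inputs, which is the same as applying a row-wise softmax to the correlation matrix.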
Definition 3: Multi-head self-attention mechanism. The multi-head self-attention mechanism can directly model more complex long-range correlations between pixels, and each self-attention head can learn a correlation matrix for a different pattern, which plays an important role in improving the generated result. Its core operator is:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
Each self-attention head in the formula is computed in the same way as the non-local self-attention mechanism in Definition 2; the results of the heads are concatenated and then projected back to the original size by the matrix $W^O$. Since parameters are not shared between the self-attention heads, the heads can be computed in parallel, which greatly reduces the computation time.
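The concatenate-then-project structure of Definition 3 can be sketched as follows. This is a simplified NumPy illustration with hypothetical names and sizes; it assumes plain dot-product attention per head without any scaling factor, since the text does not specify one.

```python
import numpy as np

def softmax_rows(a):
    a = a - a.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def multi_head(x, w_q, w_k, w_v, w_o):
    """Run h independent attention heads, concatenate them, project with w_o."""
    heads = []
    for wq, wk, wv in zip(w_q, w_k, w_v):    # each head has its own parameters
        q, k, v = x @ wq, x @ wk, x @ wv     # (N, d_head) each
        heads.append(softmax_rows(q @ k.T) @ v)
    return np.concatenate(heads, axis=1) @ w_o   # back to (N, C)

rng = np.random.default_rng(0)
n, c, h = 16, 8, 4                           # positions, channels, heads (arbitrary)
d_head = c // h
x = rng.normal(size=(n, c))
w_q, w_k, w_v = (rng.normal(size=(h, c, d_head)) for _ in range(3))
w_o = rng.normal(size=(h * d_head, c))
out = multi_head(x, w_q, w_k, w_v, w_o)
```

The loop over heads has no data dependency between iterations, which is what allows the heads to be evaluated in parallel.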
Definition 3: a neural network. Neural networks typically include an input layer, an output layer, and a hidden layer. The input layer is a collection of neurons that accept a large amount of non-linear input data. The output layer is the neuron combination of the final output result. The hidden layer is a layer composed of a plurality of neurons between the input layer and the output layer.
Definition 4: Non-linear activation function. The non-linear activation function is an indispensable basic unit of a neural network; it increases the non-linearity of the network and improves its ability to model non-linear data. Common activation functions include the Sigmoid function, the tanh function, and the rectified linear unit (ReLU).
Definition 5: Image convolution and transposed convolution. In deep learning, image convolution and transposed convolution are commonly used for feature extraction and image synthesis respectively, and can be viewed as operations in opposite directions. The convolution operation works somewhat like the human eye in that it extracts local features of the image; it also provides parameter sharing and data dimension reduction. The transposed convolution is also called deconvolution; low-dimensional image features can be turned into a high-dimensional image through a series of transposed convolutions, so the transposed convolution is mostly used for image generation.
Definition 6: convolutional Neural Network (CNN). Convolutional neural networks are typically composed of one or more convolutional layers together with a top fully-connected layer, and often also contain pooling layers. Compared with other depth models, the convolutional neural network can obtain better results in the field of image and voice recognition.
Definition 7: residual Neural Network (Residual Neural Network). Compared with the traditional convolutional neural network, the residual error network adds a short connection mode which is proved to exceed the traditional straight-through convolutional neural network in efficiency and accuracy. When the network is trained, the residual error network module has obvious advantages, and the gradient which is propagated reversely can be directly propagated from the high layer to the bottom layer when passing through the residual error network module, so that the network can select which modules are to be adjusted, and the network module can be kept stable during training.
Definition 8: Normal distribution. Also known as the Gaussian distribution, it is a very common continuous probability distribution. The normal distribution is of great importance in statistics and is often used in the natural sciences and engineering to represent an unknown random variable. If the probability density function of a random variable x satisfies
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mathematical expectation of the normal distribution and $\sigma^2$ is its variance, then x is said to follow the normal distribution, which is often written $x \sim N(\mu, \sigma^2)$.
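For illustration, the normal density with mean μ and variance σ² and the sampling of a noise vector can be sketched with the Python standard library. The vector length of 128 and the function names are assumptions made for this example; the text does not state the noise dimension.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def sample_noise(dim, seed=None):
    """Draw a noise vector whose components are independent N(0, 1) samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

noise = sample_noise(128, seed=0)            # 128 is an assumed noise dimension
```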
Definition 9: Softmax function. The softmax function compresses a K-dimensional vector x of arbitrary real numbers into another K-dimensional real vector softmax(x) such that each element lies in (0, 1) and all elements sum to 1. The formula can be expressed as:
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}, \quad i = 1, \ldots, K$$
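A small standard-library sketch of the softmax function follows. Subtracting the maximum before exponentiating is a common numerical safeguard added here; it does not change the result.

```python
import math

def softmax(x):
    """Map a list of real numbers to positive values that sum to 1."""
    m = max(x)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

Larger inputs receive larger outputs, so the ordering of the elements is preserved.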
Definition 10: One-hot encoding. Because a computer cannot work directly with non-binary data, one-hot encoding converts class label data into a uniform binary numeric format that is convenient for machine learning algorithms to process and compute. The image labels in the invention are converted into one-hot vectors of fixed dimension with this encoding. Most of the entries in a one-hot vector are 0, and using this sparse data structure saves computer memory.
Therefore, the technical solution of the invention is an image synthesis method based on a dynamic self-attention generative adversarial network, comprising the following steps:
Step 1: preprocessing the data set;
after the CIFAR10 data set is obtained, the images are first grouped according to the class labels of the data set, and the class labels are then encoded as one-hot vectors; finally, the image pixel values are normalized and the data is saved as tensors for use by the generative adversarial network;
Step 2: constructing a convolutional neural network;
constructing a convolutional neural network comprising two sub-networks, one of which is a generator and the other a discriminator; the input of the generator is Gaussian noise and its output is an image, while the input of the discriminator is an image and its output is a scalar; the first layer of the generator network is a linear fully connected layer, followed in sequence by three upsampling residual network blocks and finally a standard convolution block; the discriminator network consists, in sequence, of two downsampling residual network blocks, two standard residual network blocks and a linear fully connected layer; the standard convolution block, the upsampling residual network block, the downsampling residual network block, and the residual network block are shown in fig. 4.
Step 3: constructing a dynamic multi-head attention module;
after Gaussian noise is fed to the generator of the convolutional neural network, the feature map output by an upsampling residual network block in the generator is X, of size H×W×C, where C is the number of channels of the feature map and H and W are its height and width respectively; X is reshaped to N×C, where N = H×W;
in the first step, the dynamic attention weight $z \in \mathbb{R}^M$ of X is calculated, where M is the number of self-attention heads; in the second step, X is input into grouped convolutions to obtain a query feature map group set, a key feature map group set and a value feature map group set; in the third step, the dynamic attention weight z is used to select the corresponding optimal query feature map, key feature map and value feature map from the three feature map group sets; in the fourth step, the selected key feature map and value feature map are reduced in dimension by the dimension reduction transforms E and F respectively, and the query feature map and the two reduced feature maps are used to obtain the reconstructed feature map $X^*$;
Step 4: designing the overall neural network;
embedding the dynamic multi-head self-attention module of step 3 into the generator of step 2, after the last upsampling residual network block of the generator; during training, after Gaussian noise is fed into the generator, the feature map X is obtained from the output of the upsampling residual network block in the generator, the reconstructed feature map $X^*$ is obtained by passing X through the dynamic multi-head self-attention module of step 3, an output picture is obtained by passing $X^*$ through the output convolution layer of the generator, and the output picture of the generator is used as input to the discriminator.
Step 5: designing the loss function;
the picture obtained in step 1 is denoted I, and a vector v is obtained by randomly sampling the normal distribution; the generator network of step 2 is denoted G and the discriminator network D; the input of the generator G is v and its output is denoted G(v); the inputs of the discriminator D are I and G(v), and its outputs are denoted D(I) and D(G(v)) respectively; the loss of the network is:
the loss consists of a loss function of the discriminator and a loss function of the generator, in which the expectations are taken over I and v respectively;
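The loss formula itself is not reproduced in this text. As a hedged sketch only, the code below shows discriminator and generator losses that would be consistent with the min-max objective of Definition 1 (assumed here to be the standard log-likelihood form); the patent's actual loss may differ, for example a hinge loss. The function names and the sample values are hypothetical.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Assumed form: -(mean log D(I) + mean log(1 - D(G(v))))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)          # D minimizes this, i.e. maximizes the objective

def generator_loss(d_fake):
    """Assumed form: mean log(1 - D(G(v))), the part of the objective that depends on G."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

d_real = [0.9, 0.8]                          # example discriminator outputs D(I)
d_fake = [0.2, 0.1]                          # example discriminator outputs D(G(v))
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

Under this assumption, a discriminator that outputs 0.5 everywhere has a loss of 2 log 2, which corresponds to the equilibrium described in Definition 1.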
Step 6: training the overall neural network;
performing network training with the loss function constructed in step 5; the parameters of D are fixed when G is updated, and the parameters of G are fixed when D is updated; the two are updated alternately, once each per iteration;
Step 7: testing the overall neural network;
after the model has been trained in step 6, only the generator G is used; different noise samples drawn from the normal distribution are input into G to obtain a number of different output pictures.
The specific method of step 3 comprises the following steps:
Step 3.1: calculating the dynamic attention weight;
calculating the probability $z \in \mathbb{R}^M$ of each self-attention head being selected, where M is the number of self-attention heads; z is obtained from the attention module π; $z_i$ denotes the i-th component of the vector z and represents the probability that the i-th self-attention head is selected; the input of the attention module π is the feature map X, and the output component $z_i$ is produced by the i-th neuron of the fully connected layer in the attention module π;
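One possible form of the attention module π is sketched below in NumPy. The text only states that π takes X as input and contains a fully connected layer whose i-th neuron outputs the probability of head i; the global average pooling before that layer and the softmax after it are assumptions borrowed from common dynamic-convolution designs, and all names and sizes are hypothetical.

```python
import numpy as np

def attention_weights(x, w_fc, b_fc):
    """Map an (N, C) feature map to M head-selection probabilities."""
    pooled = x.mean(axis=0)                  # (C,) assumed global average pooling
    logits = pooled @ w_fc + b_fc            # (M,) one neuron per self-attention head
    logits = logits - logits.max()           # numerical stability
    e = np.exp(logits)
    return e / e.sum()                       # assumed softmax so that z sums to 1

rng = np.random.default_rng(0)
n, c, m = 64, 16, 4                          # positions, channels, heads (arbitrary)
x = rng.normal(size=(n, c))
z = attention_weights(x, rng.normal(size=(c, m)), np.zeros(m))
```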
Step 3.2: calculating the feature map group sets;
calculating the query feature map, key feature map and value feature map of each self-attention head; the query feature maps of all self-attention heads together form the query feature map group set, denoted Q, of size M×C×N, where M is the number of attention heads and C and N have the same meaning and values as for the feature map X; the key feature map group set and the value feature map group set are obtained in the same way and are denoted K and V respectively, with the same size as Q; Q, K and V are obtained by inputting the feature map X into three different grouped convolutions; $Q_i$, $K_i$ and $V_i$ denote the query feature map, key feature map and value feature map of the i-th self-attention head;
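The shapes in step 3.2 can be illustrated with the following NumPy sketch. It assumes that each grouped convolution uses a 1×1 kernel with one group per head, so that every head applies its own C×C linear map at each position; the kernel size is not given in the text, and the names are hypothetical.

```python
import numpy as np

def grouped_projection(x, weights):
    """Apply one C x C projection per head to an (N, C) map; returns (M, C, N)."""
    # result[m, d, n] = sum over c of x[n, c] * weights[m, c, d]
    return np.einsum("nc,mcd->mdn", x, weights)

rng = np.random.default_rng(0)
n, c, m = 64, 16, 4                          # positions, channels, heads (arbitrary)
x = rng.normal(size=(n, c))
w_q, w_k, w_v = (rng.normal(size=(m, c, c)) for _ in range(3))
q, k, v = (grouped_projection(x, w) for w in (w_q, w_k, w_v))
```

Here `q[i]`, `k[i]` and `v[i]` play the role of the query, key and value feature maps of the i-th self-attention head.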
Step 3.3: selecting the self-attention head;
selecting the optimal self-attention head according to the dynamic attention weight z; first, the largest component of z is found; suppose the j-th component $z_j$ is the largest, which means the j-th self-attention head has the highest probability of being selected; so that the gradient can be computed and back-propagated, the dynamic attention weight z is sparsified, i.e. the j-th component $z_j$ is set to 1 and all other components are set to 0; the sparsified dynamic attention weight $z^*$ is then used to weight Q, K and V separately; since only the j-th component of $z^*$ is 1 and the remaining components are 0, the weighting in effect selects $Q_j$, $K_j$ and $V_j$ of the j-th self-attention head from Q, K and V; $Q_j$, $K_j$ and $V_j$ denote the query feature map, key feature map and value feature map of the j-th self-attention head, and their size is the same as that of the feature map X;
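The selection in step 3.3 can be sketched as a weighting by a one-hot vector. The NumPy example below covers only the forward computation; how gradients flow through the sparsification is not shown. The function name and the example values are hypothetical.

```python
import numpy as np

def select_head(z, q, k, v):
    """Sparsify z to a one-hot vector and use it to weight the (M, C, N) sets."""
    j = int(np.argmax(z))                    # index of the most probable head
    z_star = np.zeros_like(z)
    z_star[j] = 1.0                          # component j set to 1, all others 0
    # Weighted sum over the head axis; with a one-hot weight this picks head j.
    q_j, k_j, v_j = (np.tensordot(z_star, t, axes=1) for t in (q, k, v))
    return j, z_star, q_j, k_j, v_j

rng = np.random.default_rng(0)
m, c, n = 4, 16, 64
q, k, v = (rng.normal(size=(m, c, n)) for _ in range(3))
z = np.array([0.1, 0.6, 0.2, 0.1])           # example dynamic attention weights
j, z_star, q_j, k_j, v_j = select_head(z, q, k, v)
```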
Step 3.4: dimension reduction and reconstruction of the feature map;
the selected $K_j$ and $V_j$ are reduced in dimension by the dimension reduction transforms; the reduction uses an ordinary pooling operation or a downsampling convolution; the reduced $K_j^*$ and $V_j^*$ both have size D×C, where D = $H^*$×$W^*$, and $H^*$ and $W^*$ are the height and width of the feature map after dimension reduction; first, $Q_j$ is multiplied by the transpose of $K_j^*$, and softmax normalization is applied to the result to obtain the self-attention correlation matrix $B = \mathrm{softmax}(Q_j (K_j^*)^T)$, whose size is N×D; then B is multiplied by $V_j^*$ to obtain the reconstructed feature map $X' = B V_j^*$, whose size is N×C, the same as the original feature map X; X' and X are added to give the output reconstructed feature map, denoted $X^*$, whose size is the same as X; finally, $X^*$ is reshaped to H×W×C, the shape of the original input feature map.
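Step 3.4 can be illustrated with the following NumPy sketch. It assumes average pooling with a 2×2 window as the dimension reduction and treats the selected query, key and value maps as N×C arrays; the pooling type, window size and names are choices made for this example, not details confirmed by the text. H and W must be divisible by the stride.

```python
import numpy as np

def avg_pool(feat, h, w, stride):
    """Average-pool an (H*W, C) feature map over stride x stride windows."""
    c = feat.shape[1]
    grid = feat.reshape(h // stride, stride, w // stride, stride, c)
    return grid.mean(axis=(1, 3)).reshape(-1, c)      # (D, C) with D = H* x W*

def softmax_rows(a):
    a = a - a.max(axis=1, keepdims=True)              # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def reconstruct(x, q_j, k_j, v_j, h, w, stride=2):
    k_red = avg_pool(k_j, h, w, stride)               # reduced key map, (D, C)
    v_red = avg_pool(v_j, h, w, stride)               # reduced value map, (D, C)
    b = softmax_rows(q_j @ k_red.T)                   # correlation matrix B, (N, D)
    x_prime = b @ v_red                               # X', (N, C)
    x_star = x_prime + x                              # add X' and X
    return x_star.reshape(h, w, -1), b                # back to H x W x C

rng = np.random.default_rng(0)
h, w, c = 8, 8, 16
x, q_j, k_j, v_j = (rng.normal(size=(h * w, c)) for _ in range(4))
x_star, b = reconstruct(x, q_j, k_j, v_j, h, w)
```

Because B has N×D entries rather than N×N, its cost grows linearly with the number of positions N when D is held fixed.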
The innovation of the invention is that:
(1) To address the high computational complexity of the existing non-local self-attention mechanism, the key feature map and the value feature map in the self-attention computation are reduced in dimension by convolution transforms and pooling operations, as shown in fig. 2.
(2) To address the high computational complexity of the existing multi-head self-attention mechanism and the lack of connections and constraints between different attention heads, the idea of dynamic convolution is introduced into the computation of the multi-head self-attention mechanism, and a suitable self-attention head is selected using the computed dynamic self-attention weight, as shown in fig. 3.
(3) This mechanism is introduced into the generative adversarial network method to carry out image synthesis experiments, and excellent results are achieved in the experiments.
The improvements in (1) and (2) further reduce the computational complexity and allow different attention heads to connect and cooperate well; combining the two improves the results of the image synthesis experiments.
Drawings
FIG. 1 is a diagram of the main network structure of the method of the present invention
FIG. 2 is a schematic view of the self-attention mechanism dimension reduction of the method of the present invention
FIG. 3 is a schematic diagram of the dynamic self-attention mechanism of the method of the present invention
FIG. 4 is a diagram of the standard convolution block, the residual block, the upsampling residual block, and the downsampling block of the method of the present invention
Detailed Description
Step 1: preprocessing the data set;
The CIFAR10 data set is obtained. It consists of 32 × 32 natural color images in 10 classes together with their class labels, 60000 labeled images in total. First, the images are grouped into the 10 classes according to the class labels of the data set. The class labels are then encoded as one-hot vectors. Finally, the picture pixel values are normalized to the range [-1, 1], and the data is saved as tensors for use by the generative adversarial network.
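The two preprocessing operations of step 1 can be sketched in plain Python as follows. The sketch assumes 8-bit pixel values in the range 0 to 255; the function names are hypothetical, and loading the CIFAR10 files is left out.

```python
def normalize_pixels(pixels):
    """Map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]

def one_hot(label, num_classes=10):
    """Encode a class index as a one-hot vector of fixed dimension."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vec = [0] * num_classes
    vec[label] = 1
    return vec

scaled = normalize_pixels([0, 128, 255])     # example pixel values
encoded = one_hot(3)                         # example class index
```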
Step 2: constructing a convolutional neural network;
the step of constructing the convolutional neural network comprises two sub-networks, wherein one sub-network is a generator, and the other sub-network is a discriminator; the generator inputs gaussian noise and its output is an image, while the discriminator inputs an image and its output is a scalar. The first layer of the generator network is a linear full-connection layer, then three up-sampling residual error network blocks are connected, and finally a standard convolution block is connected; the discriminator network sequentially adopts two down-sampling residual error network blocks, two standard residual error network blocks and a linear full-connection layer.
And step 3: constructing a dynamic multi-head attention module;
after Gaussian noise is sent to a generator in a convolutional neural network, a feature map obtained through output of an up-sampling residual error network block in the generator is X, and the size of the feature map is H multiplied by W multiplied by C, wherein C is the number of channels of the feature map, and H and W are the height and width of the feature map respectively; reshaping X to nxc, wherein N ═ hxw;
the first step is to calculate the dynamic attention weight of XWherein M is the number of the self-attention heads; inputting the X into a grouping convolution to obtain a query feature map group set, a key feature map group set and a value feature map group set; thirdly, selecting corresponding optimal query feature maps, key feature maps and value feature maps from the 3 feature map group sets by using the dynamic attention weight z; fourthly, respectively carrying out dimension reduction transformation on the selected key characteristic diagram and the value characteristic diagram by using dimension reduction transformation E and dimension reduction transformation F, and reconstructing a characteristic diagram X by using the query characteristic diagram and the two characteristic diagrams after dimension reduction;
Step 4: designing the overall neural network;
The dynamic multi-head self-attention module of step 3 is embedded into the generator of step 2, immediately after the last up-sampling residual network block of the generator; during training, after Gaussian noise is fed to the generator, the feature map X is obtained from the output of the up-sampling residual network block in the generator, X is passed through the dynamic multi-head self-attention module of step 3 to obtain the reconstructed feature map X*, X* is passed through the output convolution layer of the generator to obtain an output picture, and the output picture of the generator is used as the input of the discriminator.
Step 5: designing a loss function;
The picture obtained in step 1 is denoted I, and a vector v is obtained by randomly sampling the normal distribution. The generator network of step 2 is denoted G and the discriminator network is denoted D; the input of the generator G is v and its output is denoted G(v); the inputs of the discriminator D are I and G(v), and its outputs are denoted D(I) and D(G(v)), respectively. The loss of the network consists of two parts:
a loss function for the discriminator and a loss function for the generator, in which the expectations are taken over I and v, respectively;
Step 6: training the overall neural network;
Network training is performed with the loss function constructed in step 5; the parameters of D are fixed while G is updated, and the parameters of G are fixed while D is updated; the two updates alternate, each performed once per iteration, and 200000 iterations are used in actual training;
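The alternating schedule of step 6 can be sketched as a plain loop. The two callbacks are hypothetical placeholders for one optimizer step each; the order (discriminator first, then generator) is an assumption, since the text only states that the two updates alternate once per iteration.

```python
def train(num_iterations, update_discriminator, update_generator):
    """Alternate one D update and one G update per iteration.

    update_discriminator(step): one step on D with G's parameters frozen.
    update_generator(step):     one step on G with D's parameters frozen.
    Both callbacks are hypothetical stand-ins for real optimizer steps.
    """
    for step in range(num_iterations):
        update_discriminator(step)  # G is held fixed here
        update_generator(step)      # D is held fixed here


# Record the order of updates for a short run.
log = []
train(3,
      update_discriminator=lambda s: log.append(("D", s)),
      update_generator=lambda s: log.append(("G", s)))
print(log)
```

In actual training the text uses 200000 iterations in place of the 3 shown here.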
Step 7: testing the overall neural network;
After the model has been trained in step 6, only the generator G is kept. Different noise samples drawn from the normal distribution are input to G to obtain a number of different output pictures, and the quality and diversity of the obtained pictures are tested. According to the experimental results, on the CIFAR10 test data set the Inception Score of the generated pictures improves by 0.17 points over the previous 8.31 and reaches 8.48; the FID of the generated pictures improves by 0.95 points over the previous 12.02 and reaches 11.07; the computational complexity of the model is reduced from O(kn²) to O(n);
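The complexity claim can be illustrated by counting the entries of the self-attention correlation matrix. Without dimension reduction every one of the n positions attends to all n positions; with reduced keys and values each position attends to only d positions, so for a fixed d the cost grows linearly in n. The sizes below (32 × 32 input, 8 × 8 after reduction) are assumptions for illustration, and only the correlation matrix is counted, not the whole model.

```python
def attention_matrix_entries(n, d=None):
    """Number of entries in the correlation matrix.

    n: number of query positions (H * W).
    d: number of key positions after reduction; None means no reduction.
    """
    return n * (n if d is None else d)


n = 32 * 32   # assumed feature map resolution
d = 8 * 8     # assumed resolution after dimension reduction

full = attention_matrix_entries(n)         # n x n matrix
reduced = attention_matrix_entries(n, d)   # n x d matrix

# Doubling n quadruples the full cost but only doubles the reduced cost.
full_2n = attention_matrix_entries(2 * n)
reduced_2n = attention_matrix_entries(2 * n, d)
print(full, reduced, full_2n // full, reduced_2n // reduced)
```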
The specific method of step 3 is as follows:
Step 3.1: calculating the dynamic attention weight;
The dynamic attention weight z ∈ R^M gives the probability of each self-attention head being selected, where M is the number of self-attention heads; z is obtained from the attention module π; z_i denotes the i-th component of the vector z and represents the probability that the i-th self-attention head is selected; the input of the attention module π is the feature map X, and the output component z_i is produced by the i-th neuron of the fully connected layer in the attention module π;
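A minimal sketch of an attention module π of this kind is given below. The text only states that z_i is produced by the i-th neuron of a fully connected layer; the global average pooling before that layer and the softmax after it are assumptions added so that z is a valid probability vector, and the names `W_fc` and `b_fc` are hypothetical.

```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def head_probabilities(X, W_fc, b_fc):
    """X: (N, C) feature map; W_fc: (C, M); b_fc: (M,). Returns z: (M,)."""
    pooled = X.mean(axis=0)        # assumed global average pooling -> (C,)
    logits = pooled @ W_fc + b_fc  # fully connected layer, one neuron per head
    return softmax(logits)         # assumed softmax so that z sums to 1


rng = np.random.default_rng(0)
N, C, M = 1024, 8, 4               # illustrative sizes
X = rng.standard_normal((N, C))
W_fc = rng.standard_normal((C, M)) * 0.1
b_fc = np.zeros(M)

z = head_probabilities(X, W_fc, b_fc)
print(z, z.sum())
```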
Step 3.2: calculating the feature map group sets;
A query feature map, a key feature map and a value feature map are calculated for each self-attention head; the query feature maps of all self-attention heads together form the query feature map group set, denoted Q, of size M × C × N, where M is the number of self-attention heads and C and N have the same meaning and values as for the feature map X; similarly, the key feature map group set and the value feature map group set are obtained, denoted K and V respectively, each of the same size as Q; Q, K and V are obtained by feeding the feature map X into three different grouped convolutions; Q_i, K_i and V_i denote the query feature map, key feature map and value feature map of the i-th self-attention head, respectively;
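The grouped convolutions of step 3.2 can be sketched as one independent projection per head. The sketch assumes 1 × 1 kernels, in which case a grouped convolution with M groups reduces to M separate C-to-C linear maps applied at every position; the kernel size and the weight names are assumptions. The result is stored as M × N × C so that each per-head map has the same N × C layout as X.

```python
import numpy as np


def grouped_pointwise_conv(X, W):
    """Apply one C->C projection per head.

    X: (N, C) feature map; W: (M, C, C), one kernel per head.
    Returns an array of shape (M, N, C); slice m equals X @ W[m].
    """
    return np.einsum("nc,mcd->mnd", X, W)


rng = np.random.default_rng(0)
N, C, M = 1024, 8, 4                      # illustrative sizes
X = rng.standard_normal((N, C))

# Three different sets of weights, one each for queries, keys and values.
W_q = rng.standard_normal((M, C, C)) * 0.1
W_k = rng.standard_normal((M, C, C)) * 0.1
W_v = rng.standard_normal((M, C, C)) * 0.1

Q = grouped_pointwise_conv(X, W_q)
K = grouped_pointwise_conv(X, W_k)
V = grouped_pointwise_conv(X, W_v)
print(Q.shape, K.shape, V.shape)
```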
Step 3.3: selecting the self-attention head;
The optimal self-attention head is selected according to the dynamic attention weight z; first, the maximum component of z is found; suppose the j-th component z_j is the largest, which means that the j-th self-attention head has the highest probability of being selected; so that the gradient can be calculated and propagated backwards, z is sparsified, i.e. its j-th component z_j is set to 1 and all other components are set to 0; the sparsified dynamic attention weight z* is then used to weight Q, K and V separately; since only the j-th component of z* is 1 and the remaining components are 0, the weighting in effect selects Q_j, K_j and V_j of the j-th self-attention head from Q, K and V; Q_j, K_j and V_j denote the query feature map, key feature map and value feature map of the j-th self-attention head, respectively; their size is the same as that of the feature map X;
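The selection of step 3.3 can be written as a weighted sum over heads with a one-hot weight, which returns exactly the maps of the head with the largest probability. The sketch covers the forward computation only; the array layout (M × N × C) and the example values of z are assumptions.

```python
import numpy as np


def select_head(z, Q, K, V):
    """Pick the head with the largest weight via a one-hot weighting.

    z: (M,) dynamic attention weight; Q, K, V: (M, N, C).
    Returns (j, z_star, Q_j, K_j, V_j).
    """
    j = int(np.argmax(z))
    z_star = np.zeros_like(z)
    z_star[j] = 1.0                        # sparsified weight: 1 at j, 0 elsewhere
    # Weighted sum over the head axis; only head j contributes.
    Q_j = np.tensordot(z_star, Q, axes=1)  # (N, C)
    K_j = np.tensordot(z_star, K, axes=1)
    V_j = np.tensordot(z_star, V, axes=1)
    return j, z_star, Q_j, K_j, V_j


rng = np.random.default_rng(0)
M, N, C = 4, 16, 3                         # illustrative sizes
Q, K, V = (rng.standard_normal((M, N, C)) for _ in range(3))
z = np.array([0.1, 0.6, 0.2, 0.1])         # example probabilities

j, z_star, Q_j, K_j, V_j = select_head(z, Q, K, V)
print(j, z_star)
```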
Step 3.4: dimension reduction transformation and reconstruction of the feature map;
The selected K_j and V_j are reduced in dimension by dimension reduction transformations; the reduction uses an ordinary pooling operation or a down-sampling convolution, and the reduced K_j* and V_j* both have size D × C, where D = H* × W*, and H* and W* are the height and width of the feature map after dimension reduction, respectively; first, Q_j is matrix-multiplied with the transposed K_j*, and softmax normalization is applied to the result to obtain the self-attention correlation matrix B, B = softmax(Q_j (K_j*)^T), whose size is N × D; B is then matrix-multiplied with V_j* to obtain the reconstructed feature map X', X' = B V_j*, whose size is N × C, the same as that of the original feature map X; X' and X are added to give the output reconstructed feature map, denoted X*, whose size is the same as that of X; finally, X* is reshaped to H × W × C, the shape of the original input feature map.
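Step 3.4 can be sketched end to end as follows. The sketch assumes 2 × 2 average pooling as the dimension reduction (the text allows pooling or a down-sampling convolution), requires even H and W, and applies the softmax over the D key positions; the function names are hypothetical.

```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool_2x2(F, H, W):
    """F: (N, C) with N = H * W -> (D, C) with D = (H // 2) * (W // 2)."""
    C = F.shape[1]
    blocks = F.reshape(H // 2, 2, W // 2, 2, C)  # split rows and columns in pairs
    return blocks.mean(axis=(1, 3)).reshape(-1, C)


def reconstruct(X, Q_j, K_j, V_j, H, W):
    """Return (X_star of shape (H, W, C), correlation matrix B of shape (N, D))."""
    K_red = avg_pool_2x2(K_j, H, W)           # K_j*: (D, C)
    V_red = avg_pool_2x2(V_j, H, W)           # V_j*: (D, C)
    B = softmax(Q_j @ K_red.T, axis=-1)       # (N, D), each row sums to 1
    X_prime = B @ V_red                       # X': (N, C)
    X_star = X_prime + X                      # residual addition
    return X_star.reshape(H, W, -1), B


rng = np.random.default_rng(0)
H, W, C = 8, 8, 4                             # illustrative sizes
X = rng.standard_normal((H * W, C))
Q_j, K_j, V_j = (rng.standard_normal((H * W, C)) for _ in range(3))

X_star, B = reconstruct(X, Q_j, K_j, V_j, H, W)
print(X_star.shape, B.shape)
```

Because B has N × D entries instead of N × N, the cost of the two matrix products grows linearly with N when D is kept fixed.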
Claims (2)
1. An image synthesis method based on a dynamic self-attention generative adversarial network, the method comprising:
Step 1: preprocessing the data set;
After the CIFAR10 data set is obtained, the images are first grouped according to the class labels of the data set, and the class labels are then encoded as one-hot vectors; finally, the image pixel values are normalized and the data are saved as tensors for use by the generative adversarial network;
Step 2: constructing a convolutional neural network;
The convolutional neural network comprises two sub-networks, one of which is a generator and the other a discriminator; the input of the generator is Gaussian noise and its output is an image, and the input of the discriminator is an image and its output is a scalar; the first layer of the generator network is a linear fully connected layer, followed in sequence by three up-sampling residual network blocks and finally by a standard convolution block; the discriminator network consists, in order, of two down-sampling residual network blocks, two standard residual network blocks and a linear fully connected layer;
Step 3: constructing a dynamic multi-head self-attention module;
After Gaussian noise is fed to the generator of the convolutional neural network, the feature map output by an up-sampling residual network block in the generator is denoted X; its size is H × W × C, where C is the number of channels of the feature map and H and W are its height and width, respectively; X is reshaped to N × C, where N = H × W;
First, the dynamic attention weight z ∈ R^M of X is calculated, where M is the number of self-attention heads; second, X is fed into grouped convolutions to obtain a query feature map group set, a key feature map group set and a value feature map group set; third, the dynamic attention weight z is used to select the corresponding optimal query feature map, key feature map and value feature map from the three feature map group sets; fourth, the dimension reduction transformations E and F are applied to the selected key feature map and value feature map, respectively, and the feature map X is reconstructed from the query feature map and the two reduced feature maps;
Step 4: designing the overall neural network;
The dynamic multi-head self-attention module of step 3 is embedded into the generator of step 2, immediately after the last up-sampling residual network block of the generator; during training, after Gaussian noise is fed to the generator, the feature map X is obtained from the output of the up-sampling residual network block in the generator, X is passed through the dynamic multi-head self-attention module of step 3 to obtain the reconstructed feature map X*, X* is passed through the output convolution layer of the generator to obtain an output picture, and the output picture of the generator is used as the input of the discriminator;
Step 5: designing a loss function;
The picture obtained in step 1 is denoted I, and a vector v is obtained by randomly sampling the normal distribution; the generator network of step 2 is denoted G and the discriminator network is denoted D; the input of the generator G is v and its output is denoted G(v); the inputs of the discriminator D are I and G(v), and its outputs are denoted D(I) and D(G(v)), respectively; the loss of the network consists of two parts:
a loss function for the discriminator and a loss function for the generator, in which the expectations are taken over I and v, respectively;
Step 6: training the overall neural network;
Network training is performed with the loss function constructed in step 5; the parameters of D are fixed while G is updated, and the parameters of G are fixed while D is updated; the two updates alternate, each performed once per iteration;
Step 7: testing the overall neural network;
After the model has been trained in step 6, only the generator G is kept; different noise samples drawn from the normal distribution are input to G to obtain a number of different output pictures.
2. The image synthesis method based on a dynamic self-attention generative adversarial network as claimed in claim 1, wherein the specific method of step 3 is:
Step 3.1: calculating the dynamic attention weight;
The dynamic attention weight z ∈ R^M gives the probability of each self-attention head being selected, where M is the number of self-attention heads; z is obtained from the attention module π; z_i denotes the i-th component of the vector z and represents the probability that the i-th self-attention head is selected; the input of the attention module π is the feature map X, and the output component z_i is produced by the i-th neuron of the fully connected layer in the attention module π;
Step 3.2: calculating the feature map group sets;
A query feature map, a key feature map and a value feature map are calculated for each self-attention head; the query feature maps of all self-attention heads together form the query feature map group set, denoted Q, of size M × C × N, where M is the number of self-attention heads and C and N have the same meaning and values as for the feature map X; similarly, the key feature map group set and the value feature map group set are obtained, denoted K and V respectively, each of the same size as Q; Q, K and V are obtained by feeding the feature map X into three different grouped convolutions; Q_i, K_i and V_i denote the query feature map, key feature map and value feature map of the i-th self-attention head, respectively;
Step 3.3: selecting the self-attention head;
The optimal self-attention head is selected according to the dynamic attention weight z; first, the maximum component of z is found; suppose the j-th component z_j is the largest, which means that the j-th self-attention head has the highest probability of being selected; so that the gradient can be calculated and propagated backwards, z is sparsified, i.e. its j-th component z_j is set to 1 and all other components are set to 0; the sparsified dynamic attention weight z* is then used to weight Q, K and V separately; since only the j-th component of z* is 1 and the remaining components are 0, the weighting in effect selects Q_j, K_j and V_j of the j-th self-attention head from Q, K and V; Q_j, K_j and V_j denote the query feature map, key feature map and value feature map of the j-th self-attention head, respectively; their size is the same as that of the feature map X;
Step 3.4: dimension reduction transformation and reconstruction of the feature map;
The selected K_j and V_j are reduced in dimension by dimension reduction transformations; the reduction uses an ordinary pooling operation or a down-sampling convolution, and the reduced K_j* and V_j* both have size D × C, where D = H* × W*, and H* and W* are the height and width of the feature map after dimension reduction, respectively; first, Q_j is matrix-multiplied with the transposed K_j*, and softmax normalization is applied to the result to obtain the self-attention correlation matrix B, B = softmax(Q_j (K_j*)^T), whose size is N × D; B is then matrix-multiplied with V_j* to obtain the reconstructed feature map X', X' = B V_j*, whose size is N × C, the same as that of the original feature map X; X' and X are added to give the output reconstructed feature map, denoted X*, whose size is the same as that of X; finally, X* is reshaped to H × W × C, the shape of the original input feature map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110537516.XA CN113379655B (en) | 2021-05-18 | 2021-05-18 | Image synthesis method for generating antagonistic network based on dynamic self-attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113379655A CN113379655A (en) | 2021-09-10 |
CN113379655B CN113379655B (en) | 2022-07-29 |
Family
ID=77571206
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110537516.XA Active CN113379655B (en) | 2021-05-18 | 2021-05-18 | Image synthesis method for generating antagonistic network based on dynamic self-attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113379655B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114022506B (en) * | 2021-11-16 | 2024-05-17 | 天津大学 | Image restoration method for edge prior fusion multi-head attention mechanism |
CN114494814A (en) * | 2022-01-27 | 2022-05-13 | 北京百度网讯科技有限公司 | Attention-based model training method and device and electronic equipment |
CN114758145A (en) * | 2022-03-08 | 2022-07-15 | 深圳集智数字科技有限公司 | Image desensitization method and device, electronic equipment and storage medium |
CN114677515B (en) * | 2022-04-25 | 2023-05-26 | 电子科技大学 | Weak supervision semantic segmentation method based on similarity between classes |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110969589A (en) * | 2019-12-03 | 2020-04-07 | 重庆大学 | Dynamic scene fuzzy image blind restoration method based on multi-stream attention countermeasure network |
CN111429433A (en) * | 2020-03-25 | 2020-07-17 | 北京工业大学 | Multi-exposure image fusion method based on attention generation countermeasure network |
CN111476717A (en) * | 2020-04-07 | 2020-07-31 | 西安电子科技大学 | Face image super-resolution reconstruction method based on self-attention generation countermeasure network |
CN111583210A (en) * | 2020-04-29 | 2020-08-25 | 北京小白世纪网络科技有限公司 | Automatic breast cancer image identification method based on convolutional neural network model integration |
CN111696027A (en) * | 2020-05-20 | 2020-09-22 | 电子科技大学 | Multi-modal image style migration method based on adaptive attention mechanism |
CN111798369A (en) * | 2020-06-29 | 2020-10-20 | 电子科技大学 | Face aging image synthesis method for generating confrontation network based on circulation condition |
CN112561838A (en) * | 2020-12-02 | 2021-03-26 | 西安电子科技大学 | Image enhancement method based on residual self-attention and generation countermeasure network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10997464B2 (en) * | 2018-11-09 | 2021-05-04 | Adobe Inc. | Digital image layout training using wireframe rendering within a generative adversarial network (GAN) system |
Non-Patent Citations (4)
Title |
---|
Cross-modal Feature Alignment based Hybrid Attentional Generative Adversarial Networks for text-to-image synthesis;Qingrong Cheng et al.;《Digital Signal Processing》;20200930;1-17 * |
Missing Data Repairs for Traffic Flow With Self-Attention Generative Adversarial Imputation Net;Weibin Zhang et al.;《IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS》;20210504;1-12 * |
生成对抗网络的改进及其应用研究;王博文;《中国优秀硕士学位论文全文数据库 信息科技辑》;20220115(第1期);I138-2322 * |
融合残差和对抗网络的跨模态PET图像合成方法;肖晨晨 等;《计算机工程与应用》;20210223;第58卷(第1期);218-223 * |
Also Published As
Publication number | Publication date |
---|---|
CN113379655A (en) | 2021-09-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||