CN113379655B - Image synthesis method based on a dynamic self-attention generative adversarial network - Google Patents


Info

Publication number
CN113379655B
Authority
CN
China
Prior art keywords
attention
feature map
self
generator
network
Prior art date
Legal status
Active
Application number
CN202110537516.XA
Other languages
Chinese (zh)
Other versions
CN113379655A (en)
Inventor
王博文
潘力立
李宏亮
孟凡满
吴庆波
许林峰
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date: 2021-05-18
Filing date: 2021-05-18
Publication date: 2022-07-29
Application filed by University of Electronic Science and Technology of China
Priority to CN202110537516.XA
Publication of CN113379655A (2021-09-10)
Application granted
Publication of CN113379655B (2022-07-29)
Legal status: Active

Classifications

    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction (G PHYSICS; G06 COMPUTING; G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T 5/00 Image enhancement or restoration)
    • G06N 3/045: Combinations of networks (G PHYSICS; G06 COMPUTING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent (G06N 3/08 Learning methods)


Abstract

The invention discloses an image synthesis method based on a generative adversarial network with dynamic self-attention, and belongs to the field of computer vision. The method first adopts a generative adversarial network as the basic framework, normalizes the training pictures, and samples a normal distribution to obtain noise samples. Drawing on the Linformer algorithm and dynamic convolution, the invention improves the multi-head self-attention mechanism it uses and adds connections and constraints between the self-attention heads, so that the heads can learn knowledge of the image's various modes. The invention fully exploits the advantages of both the dynamic self-attention mechanism and the generative adversarial network: the proposed dynamic self-attention module greatly reduces the computational complexity of multi-head self-attention and alleviates problems such as mode collapse and unstable training in generative adversarial networks.

Description

Image synthesis method based on a dynamic self-attention generative adversarial network
Technical Field
The invention belongs to the field of computer vision and mainly relates to the problem of image synthesis; it is mainly applied in fields such as image restoration, editing, enhancement, and retrieval.
Background
Image synthesis is a technique that understands image content using computer vision and generates specified images as needed. It can generally be divided into two types: unsupervised image synthesis and supervised image synthesis. Unsupervised image synthesis learns a mapping function from a noise distribution to an image distribution and synthesizes images through that mapping. Supervised image synthesis learns the conditional distribution of the image data and then generates images under given conditions. As a hot problem in computer vision, image synthesis is the basis of image restoration, editing, and enhancement. It can compensate for missing visual data in fields such as the military, medicine, and security, and can also be applied in film and television entertainment, graphic design, and related fields.
Since humans are sensitive to details, edges, and similar image information, image synthesis algorithms must guarantee both the realism and the diversity of the synthesized images. To improve realism and diversity, many researchers have used deep generative techniques to improve earlier image synthesis algorithms. However, when the target data distribution is very complex, early deep generative methods often require heavy computation and are difficult to solve. The generative adversarial network (GAN) proposed by Goodfellow et al. in 2014 solves this problem well. Compared with earlier deep generative methods, GANs have the following clear advantages: 1. a GAN can generate higher-dimensional samples simply by increasing the output dimension of the generator and the input dimension of the discriminator; 2. a GAN makes no prior assumptions about the data distribution, so the model's distribution does not need to be designed by hand; 3. the data distribution synthesized by a GAN is very close to that of real samples, which ensures the realism and diversity of the synthesized images. Because of these clear advantages, the present invention performs the image synthesis task with a generative adversarial network.
At present, existing generative adversarial network methods still suffer from problems such as mode collapse and unstable training. To address them, Zhang, Goodfellow, et al. modeled long-range correlations between synthesized pixels by introducing a non-local self-attention mechanism, and their Self-Attention Generative Adversarial Network (SAGAN) achieved a major breakthrough in image synthesis tasks across various fields. Reference: H. Zhang, I. Goodfellow, D. Metaxas, et al. Self-attention generative adversarial networks [C]. International Conference on Machine Learning, 2019, 7354-7363. However, that model still has problems such as high computational complexity, low computational efficiency, and limited ability to model long-range correlations between pixels. Building on the SAGAN model and drawing on the Linformer algorithm and dynamic convolution, the invention proposes an image synthesis method based on a generative adversarial network with dynamic self-attention and obtains excellent results. Reference: Y. Chen, X. Dai, M. Liu, et al. Dynamic convolution: Attention over convolution kernels [C]. IEEE Conference on Computer Vision and Pattern Recognition, 2020, 11030-11039.
Disclosure of Invention
The invention relates to an image synthesis method based on a generative adversarial network with dynamic self-attention, which addresses the high computational complexity, low computational efficiency, and limited long-range pixel-correlation modeling of existing self-attention-based GAN methods. The method first adopts a generative adversarial network as the basic framework, normalizes the training pictures, and samples a normal distribution to obtain noise. Drawing on the Linformer algorithm and dynamic convolution, the invention improves the multi-head self-attention mechanism it uses and adds connections and constraints between the self-attention heads, so that the heads can learn knowledge of the image's various modes. During training, the noise and the pictures are fed into the network simultaneously, and the model is trained with the generative adversarial network algorithm. After training, the image synthesis task is completed by feeding noise into the generator network. In this way the advantages of the dynamic self-attention mechanism and the generative adversarial network are fully exploited: the proposed dynamic self-attention module greatly reduces the computational complexity of multi-head self-attention and alleviates mode collapse, unstable training, and related problems of generative adversarial networks.
For convenience in describing the present disclosure, certain terms are first defined.
Definition 1: Generative Adversarial Network (GAN). A GAN consists of a pair of adversarial neural networks, called the generator and the discriminator. The generator G takes as input a noise vector z randomly sampled from some data distribution p_z(z) and establishes a mapping from that distribution to the target data distribution. The input of the discriminator D is either a real sample or the output of the generator G, and the discriminator tries to distinguish the generator's outputs from real samples as well as possible. The output of the discriminator D is a scalar D(x) representing the probability that the input sample x comes from the real data rather than the synthesized data. In practice, the discriminator and the generator are usually trained alternately so that both advance toward the optimum. First, the parameters of the generator G are fixed and the discriminator D is trained by maximizing the objective function, optimizing its discrimination accuracy; then the parameters of the discriminator D are fixed and the objective function is minimized so that the generator's results approach the real data, reducing the discriminator's accuracy; this alternating process is repeated, and when the generator's results match the true data distribution the objective function reaches the global optimum. The objective of this optimization process can be written as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

In the formula, min and max are the usual minimum and maximum operators, $\mathbb{E}[\cdot]$ denotes the mathematical expectation over the given data distribution, x denotes real data, $p_{data}(x)$ is the real data distribution, z denotes a vector randomly sampled from some data distribution $p_z(z)$, and $\mathbb{E}_{x \sim p_{data}(x)}$ and $\mathbb{E}_{z \sim p_z(z)}$ denote the expectations over x and z respectively.
Definition 2: Non-local self-attention mechanism. The non-local self-attention mechanism typically comprises three components: query, key, and value. A correlation is first computed between the query and the key, and the result is then used to weight the value. Its core operator is

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$

where i indexes an output position and j enumerates all possible positions. x is the input image, f(·,·) computes the correlation between the pixel at position i and the pixels at all possible positions, g(x) is an arbitrary transformation of x, and C(x) normalizes the computed result.
Definition 3: Multi-head self-attention mechanism. Multi-head self-attention can directly model more complex long-range correlations between pixels, and each self-attention head can learn a correlation matrix of a different mode, which plays an important role in improving the generated results. Its core operator is:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

Each attention head in the formula is computed in the same way as the non-local attention mechanism of Definition 2, except that the results of all heads are concatenated and then projected back to the original size by the matrix $W^O$. Since parameters are not shared between the self-attention heads, the heads can be computed in parallel, which greatly reduces computation time.
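A small sketch of this concatenate-then-project pattern follows; it assumes the model dimension splits evenly across the heads and that W_O is a learned d_model × d_model matrix (illustrative only; the patent's own heads operate on feature-map group sets, as described in step 3.2 below).

```python
import torch
import torch.nn.functional as F

def multi_head(Q, K, V, heads: int, W_O: torch.Tensor) -> torch.Tensor:
    """MultiHead(Q,K,V) = Concat(head_1,...,head_h) W^O  (illustrative sketch).

    Q, K, V: (N, d_model) tensors; d_model is simply split across `heads`,
    which assumes d_model % heads == 0.
    """
    n, d = Q.shape
    dh = d // heads
    outputs = []
    for i in range(heads):
        q, k, v = (t[:, i * dh:(i + 1) * dh] for t in (Q, K, V))  # one head's slice
        attn = F.softmax(q @ k.T / dh ** 0.5, dim=-1)             # N x N weights
        outputs.append(attn @ v)                                  # head_i, N x dh
    return torch.cat(outputs, dim=-1) @ W_O                       # project back
```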
Definition 4: Neural network. A neural network typically includes an input layer, an output layer, and hidden layers. The input layer is the set of neurons that accepts a large amount of nonlinear input data; the output layer is the set of neurons producing the final result; a hidden layer is a layer of neurons between the input layer and the output layer.
Definition 5: Nonlinear activation function. The nonlinear activation function is an indispensable basic unit of a neural network; it strengthens the network's nonlinearity and improves its ability to model nonlinear data. Common activation functions include the Sigmoid function, the tanh function, and the rectified linear unit (ReLU).
Definition 6: Image convolution and transposed convolution. In deep learning, image convolution and transposed convolution are commonly used for feature extraction and image synthesis respectively, and can be viewed as operations in opposite directions. The convolution operation plays a role similar to the human eye by extracting local features of the image, while also providing parameter sharing and dimensionality reduction of the data. Transposed convolution, also called deconvolution, can turn low-dimensional image features into a high-dimensional image through a series of transposed convolution operations, so it is mostly used for image generation.
Definition 7: Convolutional Neural Network (CNN). A convolutional neural network typically consists of one or more convolutional layers together with a fully connected layer on top, and often also contains pooling layers. Compared with other deep models, convolutional neural networks give better results in image and speech recognition.
Definition 8: Residual Neural Network (ResNet). Compared with a traditional convolutional neural network, a residual network adds shortcut connections, which have been shown to surpass plain feed-forward convolutional networks in both efficiency and accuracy. Residual modules have a clear advantage during training: the back-propagated gradient can flow directly from high layers to low layers through the residual modules, so the network can choose which modules to adjust and the modules remain stable during training.
Definition 9: Normal distribution. Also known as the Gaussian distribution, it is a very common continuous probability distribution. The normal distribution is statistically important and is often used in the natural sciences and engineering to represent an unknown random variable. A random variable x is said to satisfy the normal distribution if its probability density function is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where μ is the mathematical expectation of the distribution and σ² is its variance; this is often written $x \sim N(\mu, \sigma^2)$.
Definition 10: Softmax function. The softmax function compresses a K-dimensional vector x of arbitrary real numbers into another K-dimensional real vector softmax(x) whose elements all lie in the range (0, 1) and sum to 1. It can be written as:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$
Definition 11: One-hot encoding. Because computers cannot interpret non-binary categorical data directly, one-hot encoding converts class-label data into a uniform binary numeric format, which facilitates processing and computation by machine learning algorithms. In the invention, each image label is converted into a one-hot vector of fixed dimension using this encoding. Most entries of a one-hot vector are 0, and this sparse data structure saves computer memory.
Therefore, the technical scheme of the invention is an image synthesis method based on a generative adversarial network with dynamic self-attention, comprising the following steps:
Step 1: preprocessing the data set;
After the CIFAR10 data set is obtained, the images are first grouped according to the data set's class labels, and the class labels are then encoded as one-hot vectors; finally, the image pixel values are normalized and the data is saved as tensors for use by the generative adversarial network;
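A minimal sketch of this preprocessing with torchvision follows (the batch size and file path are illustrative, not specified by the patent):

```python
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

# Normalize pixels to [-1, 1] (per step 1 of the detailed description)
transform = T.Compose([
    T.ToTensor(),                                   # scales pixels to [0, 1]
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # maps [0, 1] -> [-1, 1]
])
trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform)
loader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
onehot = F.one_hot(labels, num_classes=10).float()  # one-hot class labels
print(images.shape, onehot.shape)  # [64, 3, 32, 32] and [64, 10]
```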
Step 2: constructing a convolutional neural network;
The constructed convolutional neural network comprises two sub-networks: one is the generator and the other is the discriminator. The input of the generator is Gaussian noise and its output is an image; the input of the discriminator is an image and its output is a scalar. The first layer of the generator network is a linear fully connected layer, followed by three up-sampling residual network blocks in sequence and finally a standard convolution block. The discriminator network consists, in sequence, of two down-sampling residual network blocks, two standard residual network blocks, and a linear fully connected layer. The standard convolution block, up-sampling residual network block, down-sampling residual network block, and residual network block are shown in Fig. 4.
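The following PyTorch sketch shows one plausible realization of these two sub-networks. The internal layout of the residual blocks is specified only in Fig. 4, so the block internals, channel widths, and noise dimension here are assumptions; in the discriminator sketch, plain convolutions stand in for the residual blocks.

```python
import torch
import torch.nn as nn

class UpResBlock(nn.Module):
    """Assumed up-sampling residual block (exact layout is given in Fig. 4)."""
    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(cin), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(cin, cout, 3, padding=1),
            nn.BatchNorm2d(cout), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1),
        )
        self.skip = nn.Sequential(nn.Upsample(scale_factor=2),
                                  nn.Conv2d(cin, cout, 1))

    def forward(self, x):
        return self.body(x) + self.skip(x)

class Generator(nn.Module):
    """Linear layer -> three up-sampling residual blocks -> standard conv block,
    mapping a noise vector to a 32 x 32 image in [-1, 1]."""
    def __init__(self, z_dim: int = 128, ch: int = 256):
        super().__init__()
        self.ch = ch
        self.fc = nn.Linear(z_dim, 4 * 4 * ch)           # first layer: linear FC
        self.blocks = nn.Sequential(UpResBlock(ch, ch),   # 4x4  -> 8x8
                                    UpResBlock(ch, ch),   # 8x8  -> 16x16
                                    UpResBlock(ch, ch))   # 16x16 -> 32x32
        self.out = nn.Sequential(nn.BatchNorm2d(ch), nn.ReLU(),
                                 nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, z):
        x = self.fc(z).view(z.size(0), self.ch, 4, 4)
        return self.out(self.blocks(x))

class Discriminator(nn.Module):
    """Two down-sampling blocks, two standard blocks, then a linear layer
    producing the scalar D(x); residual blocks are simplified to plain convs."""
    def __init__(self, ch: int = 128):
        super().__init__()
        def down(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                                 nn.AvgPool2d(2))
        self.features = nn.Sequential(down(3, ch), down(ch, ch),
                                      nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(ch, 1)

    def forward(self, x):
        h = self.features(x).sum(dim=(2, 3))          # global sum pooling
        return torch.sigmoid(self.fc(h)).squeeze(1)   # scalar probability D(x)
```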
Step 3: constructing a dynamic multi-head self-attention module;
After the Gaussian noise is fed into the generator of the convolutional neural network, the feature map output by an up-sampling residual network block in the generator is X, of size H × W × C, where C is the number of channels of the feature map and H and W are its height and width; X is reshaped to N × C, where N = H × W.
First, the dynamic attention weight $z \in \mathbb{R}^M$ of X is computed, where M is the number of self-attention heads. Second, X is fed into grouped convolutions to obtain a query feature map group set, a key feature map group set, and a value feature map group set. Third, the dynamic attention weight z is used to select the corresponding optimal query, key, and value feature maps from the three feature map group sets. Fourth, the selected key and value feature maps are dimension-reduced by the transforms E and F respectively, and the feature map X is reconstructed from the query feature map and the two dimension-reduced feature maps; a code sketch of the whole module is given after step 3.4 below.
Step 4: designing the overall neural network;
The dynamic multi-head self-attention module of step 3 is embedded into the generator of step 2, immediately after the generator's last up-sampling residual network block. During training, after the Gaussian noise is fed into the generator, the feature map X is obtained from the output of an up-sampling residual network block in the generator; X passes through the dynamic multi-head self-attention module of step 3 to obtain the reconstructed feature map X*, which passes through the generator's output convolution layer to produce the output picture; the generator's output picture serves as the input of the discriminator.
Step 5: designing the loss function;
Denote the pictures obtained in step 1 by I, and let v be a vector obtained by randomly sampling the normal distribution N(0, I). Denote the generator network of step 2 by G and the discriminator network by D. The input of the generator G is v and its output is written G(v); the inputs of the discriminator D are I and G(v), and its outputs are written D(I) and D(G(v)) respectively. Consistent with the objective in Definition 1, the losses of the network are:

$$\mathcal{L}_D = -\mathbb{E}_{I}[\log D(I)] - \mathbb{E}_{v}[\log(1 - D(G(v)))]$$
$$\mathcal{L}_G = \mathbb{E}_{v}[\log(1 - D(G(v)))]$$

where $\mathcal{L}_D$ is the loss function of the discriminator, $\mathcal{L}_G$ is the loss function of the generator, and $\mathbb{E}_I[\cdot]$ and $\mathbb{E}_v[\cdot]$ denote the expectations over I and v respectively.
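As a sketch, these losses can be written directly from the formulas above, assuming a discriminator with sigmoid output as in Definition 1 (the small epsilon is only for numerical stability and is our addition):

```python
import torch

def d_loss(D, G, real, v, eps=1e-8):
    """Discriminator loss L_D: maximize log D(I) + log(1 - D(G(v)))."""
    fake = G(v).detach()                      # fix G's parameters for this step
    return -(torch.log(D(real) + eps).mean()
             + torch.log(1 - D(fake) + eps).mean())

def g_loss(D, G, v, eps=1e-8):
    """Generator loss L_G: minimize log(1 - D(G(v)))."""
    return torch.log(1 - D(G(v)) + eps).mean()
```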
Step 6: training the overall neural network;
Network training uses the loss function constructed in step 5: the parameters of D are fixed while G is updated, the parameters of G are fixed while D is updated, and the two updates alternate once per iteration;
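A minimal sketch of this alternating update, reusing the Generator, Discriminator, loader, and loss functions from the sketches above (the optimizer choice and learning rates are assumptions, not specified in this step):

```python
import torch

G, D = Generator(), Discriminator()
z_dim = 128
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))

for real, _ in loader:
    v = torch.randn(real.size(0), z_dim)      # noise sampled from N(0, I)
    # update D with G's parameters fixed
    opt_d.zero_grad()
    d_loss(D, G, real, v).backward()
    opt_d.step()
    # update G with D's parameters fixed
    opt_g.zero_grad()
    g_loss(D, G, torch.randn(real.size(0), z_dim)).backward()
    opt_g.step()
```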
Step 7: testing the overall neural network;
After the model is trained in step 6, only the generator G is kept; feeding different noise samples from the normal distribution into G yields a variety of different output pictures.
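A short inference sketch, reusing the trained generator G and z_dim from the sketches above:

```python
import torch

G.eval()
with torch.no_grad():
    v = torch.randn(64, z_dim)       # 64 different noise samples
    samples = G(v)                   # 64 x 3 x 32 x 32, values in [-1, 1]
    samples = (samples + 1) / 2      # map back to [0, 1] for saving or display
```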
The specific method of step 3 comprises the following sub-steps:
Step 3.1: computing the dynamic attention weight;
The dynamic attention weight $z \in \mathbb{R}^M$ gives the probability that each self-attention head is selected, where M is the number of self-attention heads. z is produced by the attention module π; $z_i$, the i-th component of the vector z, represents the probability that the i-th self-attention head is selected. The input of the attention module π is the feature map X, and the component $z_i$ is the output of the i-th neuron of the fully connected layer in the attention module π.
Step 3.2: computing the feature map group sets;
A query feature map, a key feature map, and a value feature map are computed for every self-attention head. The query feature maps of all the self-attention heads together form the query feature map group set, denoted Q, of size M × C × N, where M is the number of attention heads and C and N have the same meanings and values as for the feature map X. The key feature map group set and the value feature map group set are obtained in the same way, denoted K and V respectively, with the same size as Q. Q, K, and V are obtained by feeding the feature map X into three different grouped convolutions; $Q_i$, $K_i$, $V_i$ denote the query, key, and value feature maps of the i-th self-attention head.
Step 3.3: selecting the self-attention head;
The optimal self-attention head is selected according to the dynamic attention weight z. First the largest component of z is found; suppose the j-th component $z_j$ of z is the largest, meaning the j-th self-attention head has the highest probability of being selected. So that the gradient can be back-propagated and computed, the dynamic attention weight z is sparsified: the j-th component $z_j$ is set to 1 and the components of all other dimensions are set to 0. The sparsified dynamic attention weight $z^*$ then weights Q, K, and V separately; since only the j-th component $z^*_j$ of $z^*$ is 1 and the remaining components are 0, the weighting in effect selects the j-th head's $Q_j$, $K_j$, and $V_j$ from Q, K, and V. $Q_j$, $K_j$, $V_j$ denote the query, key, and value feature maps of the j-th attention head; their sizes are consistent with the feature map X.
Step 3.4: dimension-reducing and reconstructing the feature map;
The selected $K_j$ and $V_j$ are reduced in dimension using the dimension-reduction transforms; the reduction uses a common pooling operation or a strided (down-sampling) convolution. After reduction, $K_j^*$ and $V_j^*$ both have size D × C, where $D = H^* \times W^*$, and $H^*$ and $W^*$ are respectively the height and width of the dimension-reduced feature map. First $Q_j$ is matrix-multiplied with the transposed $(K_j^*)^T$ and the result is normalized by softmax to obtain the self-attention correlation matrix $B = \mathrm{softmax}(Q_j (K_j^*)^T)$, of size N × D. Then B is matrix-multiplied with $V_j^*$ to obtain the reconstructed feature map $X' = B V_j^*$, of size N × C, consistent with the original feature map X. X' and X are added to give the output reconstructed feature map, denoted $X^*$, whose size is consistent with X. Finally $X^*$ is reshaped back to the original input feature map shape H × W × C.
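Putting steps 3.1-3.4 together, the following PyTorch sketch shows one way the dynamic multi-head self-attention module could be realized. The grouped-convolution shapes, the pooling-based dimension-reduction transforms E and F, and the straight-through trick used to keep the hard head selection differentiable are assumptions on our part, not taken verbatim from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSelfAttention(nn.Module):
    """Sketch of the dynamic multi-head self-attention module (steps 3.1-3.4)."""

    def __init__(self, channels: int, heads: int = 4, reduce: int = 2):
        super().__init__()
        self.M = heads
        # step 3.1: attention module pi, a fully connected layer whose i-th
        # neuron outputs the (pre-softmax) score of the i-th head
        self.pi = nn.Linear(channels, heads)
        # step 3.2: grouped convolutions producing the M-head Q/K/V group sets
        # (assumes channels % heads == 0)
        self.q = nn.Conv2d(channels, channels * heads, 1, groups=heads)
        self.k = nn.Conv2d(channels, channels * heads, 1, groups=heads)
        self.v = nn.Conv2d(channels, channels * heads, 1, groups=heads)
        # step 3.4: dimension-reduction transforms E and F (pooling variant)
        self.E = nn.AvgPool2d(reduce)
        self.Fred = nn.AvgPool2d(reduce)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # step 3.1: dynamic attention weight z over the M heads
        z = F.softmax(self.pi(x.mean(dim=(2, 3))), dim=-1)          # B x M
        # step 3.3: sparsify z to z* (one-hot at the argmax); the
        # straight-through form keeps a gradient path back to pi
        hard = F.one_hot(z.argmax(dim=-1), self.M).float()
        z_star = hard - z.detach() + z                              # B x M
        # step 3.2: Q/K/V feature-map group sets, B x M x C x H x W
        Q = self.q(x).view(b, self.M, c, h, w)
        K = self.k(x).view(b, self.M, c, h, w)
        V = self.v(x).view(b, self.M, c, h, w)
        sel = z_star.view(b, self.M, 1, 1, 1)
        Qj = (Q * sel).sum(1)       # weighting selects the j-th head's maps
        Kj = (K * sel).sum(1)
        Vj = (V * sel).sum(1)
        # step 3.4: reduce K_j, V_j to D = H* x W* positions, then reconstruct
        Ks = self.E(Kj).flatten(2)                                  # B x C x D
        Vs = self.Fred(Vj).flatten(2)                               # B x C x D
        B_mat = F.softmax(
            torch.bmm(Qj.flatten(2).transpose(1, 2), Ks), dim=-1)   # B x N x D
        X_prime = torch.bmm(B_mat, Vs.transpose(1, 2))              # B x N x C
        return X_prime.transpose(1, 2).view(b, c, h, w) + x         # X* = X' + X
```

Because B is N × D rather than N × N, the correlation computation scales linearly in N for a fixed reduced size D, which is the source of the complexity reduction reported in step 7 of the detailed description.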
The innovations of the invention are:
(1) To address the high computational complexity of the existing non-local self-attention mechanism, the key and value feature maps in the self-attention computation are reduced in dimension by convolution transforms and pooling operations, as shown in Fig. 2.
(2) To address the high computational complexity of the existing multi-head self-attention mechanism and the lack of connections and constraints between different attention heads, the idea of dynamic convolution is introduced into the multi-head self-attention computation, and the computed dynamic self-attention weight is used to select a suitable self-attention head, as shown in Fig. 3.
(3) This mechanism is introduced into the generative adversarial network approach to carry out the image synthesis experiments, where it achieves excellent results.
The improvements in (1) and (2) further reduce the computational complexity and allow the different attention heads to connect and cooperate well; combined, the two improve the final image synthesis results.
Drawings
FIG. 1 is a diagram of the main network structure of the method of the present invention
FIG. 2 is a schematic view of the self-attention mechanism dimension reduction of the method of the present invention
FIG. 3 is a schematic diagram of the dynamic self-attention mechanism of the method of the present invention
FIG. 4 is a diagram of the standard convolution block, residual network block, up-sampling residual block, and down-sampling residual block of the method of the present invention
Detailed Description
Step 1: preprocessing the data set;
The CIFAR10 data set is obtained; it consists of 10 classes of 32 × 32 natural color images and their corresponding class labels, 60000 images and labels in total. First, the images are divided into the 10 categories according to the data set's class labels. The class labels are then encoded as one-hot vectors. Finally, the picture pixel values are normalized to the range [-1, 1] and the data is saved as tensors for use by the generative adversarial network.
Step 2: constructing a convolutional neural network;
The convolutional neural network is constructed exactly as in step 2 of the technical scheme above: a generator (a linear fully connected layer, three up-sampling residual network blocks, and a standard convolution block) whose input is Gaussian noise and whose output is an image, and a discriminator (two down-sampling residual network blocks, two standard residual network blocks, and a linear fully connected layer) whose input is an image and whose output is a scalar.
Steps 3, 4, and 5 (constructing the dynamic multi-head self-attention module, designing the overall neural network, and designing the loss function) proceed exactly as described in steps 3 to 5 of the technical scheme above.
Step 6: training the overall neural network;
Network training uses the loss function constructed in step 5: the parameters of D are fixed while G is updated and the parameters of G are fixed while D is updated, the two updates alternating once per iteration; 200000 iterations are used in actual training.
Step 7: testing the overall neural network;
The model is trained as in step 6 and only the generator G is kept. Feeding different noise samples from the normal distribution into G yields a variety of different output pictures, whose quality and diversity are then evaluated. Experimentally, on the CIFAR10 test data set, the Inception Score of the generated pictures improves by 0.17 points over the previous 8.31, reaching 8.48; the FID of the generated pictures improves by 0.95 points over the previous 12.02, reaching 11.07; and the computational complexity of the model improves from O(kn²) to O(n).
The specific method of step 3 (steps 3.1 to 3.4: computing the dynamic attention weight, computing the feature map group sets, selecting the self-attention head, and dimension-reducing and reconstructing the feature map) is identical to that described, and sketched in code, in the technical scheme above.

Claims (2)

1. An image synthesis method based on a dynamic self-attention generative adversarial network, the method comprising:
step 1: preprocessing the data set;
after the CIFAR10 data set is obtained, the images are first grouped according to the data set's class labels, and the class labels are then encoded as one-hot vectors; finally, the image pixel values are normalized and the data is saved as tensors for use by the generative adversarial network;
step 2: constructing a convolutional neural network;
the convolutional neural network comprises two sub-networks, one being a generator and the other a discriminator; the input of the generator is Gaussian noise and its output is an image; the input of the discriminator is an image and its output is a scalar; the first layer of the generator network is a linear fully connected layer, followed by three up-sampling residual network blocks in sequence and finally a standard convolution block; the discriminator network consists, in sequence, of two down-sampling residual network blocks, two standard residual network blocks, and a linear fully connected layer;
step 3: constructing a dynamic multi-head self-attention module;
after the Gaussian noise is fed into the generator of the convolutional neural network, the feature map output by an up-sampling residual network block in the generator is X, of size H × W × C, where C is the number of channels of the feature map and H and W are its height and width; X is reshaped to N × C, where N = H × W;
first, the dynamic attention weight $z \in \mathbb{R}^M$ of X is computed, where M is the number of self-attention heads; second, X is fed into grouped convolutions to obtain a query feature map group set, a key feature map group set, and a value feature map group set; third, the dynamic attention weight z is used to select the corresponding optimal query, key, and value feature maps from the three feature map group sets; fourth, the selected key and value feature maps are dimension-reduced by the transforms E and F respectively, and the feature map X is reconstructed from the query feature map and the two dimension-reduced feature maps;
step 4: designing the overall neural network;
the dynamic multi-head self-attention module of step 3 is embedded into the generator of step 2, immediately after the generator's last up-sampling residual network block; during training, after the Gaussian noise is fed into the generator, the feature map X is obtained from the output of an up-sampling residual network block in the generator, the feature map X passes through the dynamic multi-head self-attention module of step 3 to obtain a reconstructed feature map X*, the reconstructed feature map X* passes through the output convolution layer in the generator to obtain an output picture, and the output picture of the generator serves as the input of the discriminator;
step 5: designing the loss function;
the pictures obtained in step 1 are denoted I, and a vector v is obtained by randomly sampling the normal distribution N(0, I); the generator network of step 2 is denoted G and the discriminator network D; the input of the generator G is v and its output is denoted G(v); the inputs of the discriminator D are I and G(v) and its outputs are denoted D(I) and D(G(v)) respectively; the losses of the network are:

$$\mathcal{L}_D = -\mathbb{E}_{I}[\log D(I)] - \mathbb{E}_{v}[\log(1 - D(G(v)))]$$
$$\mathcal{L}_G = \mathbb{E}_{v}[\log(1 - D(G(v)))]$$

where $\mathcal{L}_D$ is the loss function of the discriminator, $\mathcal{L}_G$ is the loss function of the generator, and $\mathbb{E}_I[\cdot]$ and $\mathbb{E}_v[\cdot]$ denote the expectations over I and v respectively;
step 6: training the overall neural network;
network training is performed with the loss function constructed in step 5: the parameters of D are fixed while G is updated, the parameters of G are fixed while D is updated, and the two updates alternate once per iteration;
step 7: testing the overall neural network;
after the model is trained in step 6, only the generator G is kept; feeding different noise samples from the normal distribution into G yields a variety of different output pictures.
2. The image synthesis method based on a dynamic self-attention generative adversarial network according to claim 1, wherein the specific method of step 3 is:
step 3.1: computing the dynamic attention weight;
the dynamic attention weight $z \in \mathbb{R}^M$ gives the probability that each self-attention head is selected, where M is the number of self-attention heads; z is produced by the attention module π; $z_i$, the i-th component of the vector z, represents the probability that the i-th self-attention head is selected; the input of the attention module π is the feature map X, and the component $z_i$ is the output of the i-th neuron of the fully connected layer in the attention module π;
step 3.2: computing the feature map group sets;
a query feature map, a key feature map, and a value feature map are computed for every self-attention head; the query feature maps of all the self-attention heads together form the query feature map group set, denoted Q, of size M × C × N, where M is the number of attention heads and C and N have the same meanings and values as for the feature map X; the key feature map group set and the value feature map group set are obtained in the same way, denoted K and V respectively, with the same size as Q; Q, K, and V are obtained by feeding the feature map X into three different grouped convolutions; $Q_i$, $K_i$, $V_i$ denote the query, key, and value feature maps of the i-th self-attention head;
step 3.3: selecting the self-attention head;
the optimal self-attention head is selected according to the dynamic attention weight z; the largest component of z is found first; supposing the j-th component $z_j$ of z is the largest, the j-th self-attention head has the highest probability of being selected; so that the gradient can be back-propagated and computed, the dynamic attention weight z is sparsified, i.e. the j-th component $z_j$ of the weight z is set to 1 and the components of the other dimensions are set to 0; the sparsified dynamic attention weight $z^*$ then weights Q, K, and V separately; since only the j-th component $z^*_j$ of $z^*$ is 1 and the remaining components are 0, the weighting in effect selects the j-th head's $Q_j$, $K_j$, and $V_j$ from Q, K, and V; $Q_j$, $K_j$, $V_j$ denote the query, key, and value feature maps of the j-th attention head; their sizes are consistent with the feature map X;
step 3.4: dimension-reducing and reconstructing the feature map;
the selected $K_j$ and $V_j$ are reduced in dimension using the dimension-reduction transforms; the reduction uses a common pooling operation or a down-sampling convolution transform; after reduction, $K_j^*$ and $V_j^*$ both have size D × C, where $D = H^* \times W^*$, and $H^*$ and $W^*$ are respectively the height and width of the dimension-reduced feature map; first $Q_j$ is matrix-multiplied with the transposed $(K_j^*)^T$ and the result is normalized by softmax to obtain the self-attention correlation matrix $B = \mathrm{softmax}(Q_j (K_j^*)^T)$, of size N × D; then B is matrix-multiplied with $V_j^*$ to obtain the reconstructed feature map $X' = B V_j^*$, of size N × C, consistent with the original feature map X; X' and X are added to give the output reconstructed feature map, denoted $X^*$, whose size is consistent with X; finally $X^*$ is reshaped back to the original input feature map shape H × W × C.
CN202110537516.XA (priority and filing date: 2021-05-18) Image synthesis method based on a dynamic self-attention generative adversarial network, Active, granted as CN113379655B (en)

Priority Applications (1)

Application Number: CN202110537516.XA | Priority Date: 2021-05-18 | Filing Date: 2021-05-18 | Title: Image synthesis method based on a dynamic self-attention generative adversarial network
Publications (2)

Publication Number | Publication Date
CN113379655A (en) | 2021-09-10
CN113379655B (en) | 2022-07-29

Family

ID=77571206

Family Applications (1)

Application Number: CN202110537516.XA | Title: Image synthesis method based on a dynamic self-attention generative adversarial network | Priority Date: 2021-05-18 | Filing Date: 2021-05-18

Country Status (1)

Country: CN | Link: CN113379655B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022506B (en) * 2021-11-16 2024-05-17 天津大学 Image restoration method for edge prior fusion multi-head attention mechanism
CN114494814A (en) * 2022-01-27 2022-05-13 北京百度网讯科技有限公司 Attention-based model training method and device and electronic equipment
CN114758145A (en) * 2022-03-08 2022-07-15 深圳集智数字科技有限公司 Image desensitization method and device, electronic equipment and storage medium
CN114677515B (en) * 2022-04-25 2023-05-26 电子科技大学 Weak supervision semantic segmentation method based on similarity between classes


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10997464B2 (en) * 2018-11-09 2021-05-04 Adobe Inc. Digital image layout training using wireframe rendering within a generative adversarial network (GAN) system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969589A (en) * 2019-12-03 2020-04-07 重庆大学 Dynamic scene fuzzy image blind restoration method based on multi-stream attention countermeasure network
CN111429433A (en) * 2020-03-25 2020-07-17 北京工业大学 Multi-exposure image fusion method based on attention generation countermeasure network
CN111476717A (en) * 2020-04-07 2020-07-31 西安电子科技大学 Face image super-resolution reconstruction method based on self-attention generation countermeasure network
CN111583210A (en) * 2020-04-29 2020-08-25 北京小白世纪网络科技有限公司 Automatic breast cancer image identification method based on convolutional neural network model integration
CN111696027A (en) * 2020-05-20 2020-09-22 电子科技大学 Multi-modal image style migration method based on adaptive attention mechanism
CN111798369A (en) * 2020-06-29 2020-10-20 电子科技大学 Face aging image synthesis method for generating confrontation network based on circulation condition
CN112561838A (en) * 2020-12-02 2021-03-26 西安电子科技大学 Image enhancement method based on residual self-attention and generation countermeasure network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Cross-modal Feature Alignment based Hybrid Attentional Generative Adversarial Networks for text-to-image synthesis; Qingrong Cheng et al.; Digital Signal Processing; 2020-09-30; 1-17 *
Missing Data Repairs for Traffic Flow With Self-Attention Generative Adversarial Imputation Net; Weibin Zhang et al.; IEEE Transactions on Intelligent Transportation Systems; 2021-05-04; 1-12 *
Improvement of generative adversarial networks and research on their applications (in Chinese); Wang Bowen; China Master's Theses Full-text Database, Information Science and Technology; 2022-01-15; No. 1; I138-2322 *
Cross-modal PET image synthesis method fusing residual and adversarial networks (in Chinese); Xiao Chenchen et al.; Computer Engineering and Applications; 2021-02-23; Vol. 58, No. 1; 218-223 *

Also Published As

Publication number Publication date
CN113379655A (en) 2021-09-10


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant