CN111696027A - Multi-modal image style migration method based on adaptive attention mechanism - Google Patents
Multi-modal image style migration method based on adaptive attention mechanism
- Publication number
- CN111696027A CN202010431594.7A CN202010431594A
- Authority
- CN
- China
- Prior art keywords
- network
- generator
- output
- picture
- discriminator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 37
- 230000003044 adaptive effect Effects 0.000 title claims abstract description 27
- 230000007246 mechanism Effects 0.000 title claims abstract description 26
- 230000005012 migration Effects 0.000 title claims abstract description 24
- 238000013508 migration Methods 0.000 title claims abstract description 24
- 238000012549 training Methods 0.000 claims abstract description 14
- 230000006870 function Effects 0.000 claims description 44
- 239000013598 vector Substances 0.000 claims description 29
- 238000009826 distribution Methods 0.000 claims description 23
- 238000013528 artificial neural network Methods 0.000 claims description 22
- 238000005070 sampling Methods 0.000 claims description 18
- 238000010586 diagram Methods 0.000 claims description 10
- 238000013527 convolutional neural network Methods 0.000 claims description 8
- 239000011159 matrix material Substances 0.000 claims description 6
- 238000010276 construction Methods 0.000 claims description 5
- 239000000203 mixture Substances 0.000 claims description 5
- 238000012360 testing method Methods 0.000 claims description 5
- 230000004913 activation Effects 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 238000004364 calculation method Methods 0.000 claims description 3
- 238000007781 pre-processing Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 238000004422 calculation algorithm Methods 0.000 abstract description 9
- 230000009466 transformation Effects 0.000 abstract description 4
- 230000008901 benefit Effects 0.000 abstract description 3
- 238000005457 optimization Methods 0.000 description 4
- 230000006872 improvement Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 230000009471 action Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000009795 derivation Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/04—Context-preserving transformations, e.g. by using an importance map
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a multi-modal image style migration method based on an adaptive attention mechanism, belonging to the field of computer vision. The method first adopts a generative adversarial network as the basic framework and, drawing on the ideas of the EM (expectation-maximization) attention algorithm and channel-wise scale transformation, improves the EM attention mechanism in the channel domain so that the network pays more attention to style features; the bases in the attention module are weighted with noise so that they change adaptively, which finally changes the style. Noise and pictures are then input into the network simultaneously, and the generative adversarial network is trained with an adversarial training algorithm. Once the network is trained, multi-modal style migration can be performed by changing the noise. The method makes full use of the advantages of the EM attention mechanism and generative adversarial networks, proposes an adaptive channel-domain EM attention module, and improves the post-migration image quality and image diversity of existing methods.
Description
Technical Field
The invention belongs to the field of computer vision and mainly relates to the multi-modal image style migration problem; it is mainly applied in the film and television entertainment industry, human-computer interaction, machine vision understanding, and similar areas.
Background
Image style migration refers to the technology of converting the style of a picture into other, different styles while keeping the picture content, after analyzing pictures of different styles by computer. Demand for image style migration keeps increasing in the film and television entertainment industry, human-computer interaction, machine vision understanding, and other fields. For example, a camera can convert a person's portrait into a cartoon-character portrait in real time, and in automatic driving, style migration can assist in converting a picture into a segmented picture. Existing image style migration methods are mainly divided into image-optimization-based and model-optimization-based methods.
The picture-optimization-based style migration method appeared earliest and is the more established of the two; its basic principle can be divided into three steps. The first step selects a neural network capable of extracting picture features; the second step uses that network to extract features from the original picture and the target picture and uses those features to design a loss function; the third step differentiates the loss function with respect to the original picture and iteratively optimizes it so that the style of the original picture approaches that of the target picture. This type of method does not require a large amount of data and is therefore simple and convenient to operate, but it has the disadvantage that the iteration takes too long to convert pictures in real time. Reference: L.A. Gatys, A.S. Ecker, M. Bethge. Image Style Transfer Using Convolutional Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414-2423.
The model-optimization-based method mainly trains a model on a large number of pictures of different styles so that the model learns a mapping function between one picture style and other picture styles; a style picture fed into the trained model then yields, at the output, pictures of different styles with consistent content. Its advantages are that no iteration steps are needed once the model is trained, so image style migration runs in real time, and that by feeding additional variables the model can take one class of picture as input and output several pictures of different styles at once. Its disadvantages are that style classes absent from training cannot be transferred well at test time, and that sample diversity in multi-modal style transfer is still insufficient. Reference: Y. Alharbi, N. Smith, P. Wonka. Latent Filter Scaling for Multimodal Unsupervised Image-to-Image Translation. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1458-1466.
In recent years, model-optimization-based methods have matured and the demand for multi-modal style migration has increased, yet the diversity and picture quality of current multi-modal methods remain insufficient. Multi-modal style migration means that, given one input picture, pictures of different styles can be output at the same time; in fig. 2, the first image is the input picture and the others are the multi-style pictures output simultaneously. Addressing this field and these shortcomings, the invention proposes a multi-modal image style migration method based on an adaptive attention mechanism and obtains excellent results.
Disclosure of Invention
The invention discloses a multi-modal style migration method with an adaptive channel-domain EM attention mechanism, which addresses the lack of style diversity in the prior art.
The method first adopts a generative adversarial network as the basic framework, normalizes the training pictures to 256 × 256 × 3, and samples the normal distribution to obtain noise. Drawing on the ideas of the EM (expectation-maximization) attention algorithm and channel-wise scale transformation, it improves the EM attention mechanism in the channel domain so that the network pays more attention to style features; the bases in the attention module are weighted with the noise so that they change adaptively, which finally changes the style. Noise and pictures are then input into the network simultaneously, and the generative adversarial network is trained with an adversarial training algorithm. Once the network is trained, multi-modal style migration can be performed by changing the noise. The method makes full use of the advantages of the EM attention mechanism and generative adversarial networks, proposes the adaptive channel-domain EM attention module, and improves the post-migration image quality and image diversity of existing methods. The general structure of the algorithm is shown schematically in fig. 1.
For convenience in describing the present disclosure, certain terms are first defined.
Definition 1: normal distribution. Also known as the Gaussian distribution, it is a probability distribution of great importance in mathematics, physics, and engineering, with significant influence on many branches of statistics. A random variable $x$ satisfies the normal distribution if its probability density function satisfies

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mathematical expectation of the normal distribution and $\sigma^2$ is its variance; this is often written as $x \sim N(\mu, \sigma^2)$.
Definition 2: generative adversarial network. A generative adversarial network comprises two different neural networks, one called the generator $G$ and the other called the discriminator $D$, which oppose each other during the training process. The purpose of the discriminator is to distinguish the true data distribution $P_{data}$ from the generator distribution $P_G$; the purpose of the generator is to make the two distributions indistinguishable to the discriminator, so that the final result is $P_{data} = P_G$.
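For reference, the classical adversarial objective that this definition describes (a textbook formulation; the patent does not reproduce it here) can be written as:

$$\min_G \max_D \; \mathbb{E}_{x \sim P_{data}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$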
Definition 3: EM algorithm, i.e., the expectation-maximization algorithm. Given observed data X and unobservable latent data Z, the pair is collectively called the complete data D = (X, Z). The EM algorithm first initializes a model and its parameters and uses the model to estimate Z; this is called the E step. The model is then updated with the estimated Z; this is called the M step.
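To make the E/M alternation concrete, below is a minimal sketch for a spherical Gaussian mixture. This is an illustrative example only, not the patent's module (which applies the same pattern to attention bases in step 3), and all names are assumptions:

```python
import numpy as np

def em_gmm_means(X, K, iters=10):
    """Minimal EM loop for a spherical Gaussian mixture (means only).
    X: (n, d) data matrix. Illustrates the E/M alternation of Definition 3."""
    n, d = X.shape
    mu = X[np.random.choice(n, K, replace=False)]          # initialize means
    for _ in range(iters):
        # E step: responsibility z[c, k] of component k for sample c
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        z = np.exp(-0.5 * (dist - dist.min(axis=1, keepdims=True)))
        z /= z.sum(axis=1, keepdims=True)
        # M step: update each mean as a responsibility-weighted average
        mu = (z.T @ X) / z.T.sum(axis=1, keepdims=True)
    return mu, z
```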
Definition 4: generalized kernel function. A function describing the relationship between pairs of points, and equivalently a mapping between different spaces. Many choices are possible, such as the dot product between vectors.
Definition 5: attention mechanism. An attention mechanism typically comprises 3 modules: query, key, and value. The query and key first undergo a correlation operation, and the result is then used to weight the value. The core operator is

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$

where $f(\cdot,\cdot)$ represents a generalized kernel function, $x$ represents the input, $C(x)$ represents a normalization factor (the sum of $f(x_i, x_j)$ over $j$), and $g$ represents an arbitrary transformation. The structure is shown schematically in fig. 3.
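A minimal sketch of this core operator, assuming the exponential dot-product kernel $f(a, b) = \exp(a^\top b)$ and a linear $g$ (both are assumptions for illustration; the definition leaves the choices open):

```python
import torch

def attention_operator(x, g_weight):
    """y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j), with f(a, b) = exp(a^T b);
    the normalization C(x) then reduces to a row-wise softmax.
    x: (n, d) input vectors; g_weight: (d, d) linear map implementing g."""
    weights = torch.softmax(x @ x.t(), dim=1)   # (n, n) normalized affinities
    values = x @ g_weight                       # g(x_j)
    return weights @ values                     # (n, d) weighted aggregation

# usage: y = attention_operator(torch.randn(16, 64), torch.randn(64, 64))
```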
Definition 6: EM attention mechanism. The combination of the EM algorithm and the attention mechanism, obtained mainly by modifying the attention mechanism and adding loop-iteration steps.
Definition 7: adaptive channel-domain EM attention mechanism. The method proposed by the invention, an improvement of the EM attention mechanism formed by moving the domain on which attention acts to the channel domain and adding a new input. See step 3 for details.
Definition 8: softmax function, or normalized exponential function. It "compresses" a K-dimensional vector $x$ of arbitrary real numbers into another K-dimensional real vector $\mathrm{softmax}(x)$ such that each element lies in (0, 1) and all elements sum to 1. The formula can be expressed as:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$
Definition 9: the Relu function, or rectified linear unit, an activation function commonly used in artificial neural networks. It generally refers to the nonlinear ramp function and its variants, with expression $f(x) = \max(0, x)$.
Therefore, the technical scheme of the invention is a multi-modal image style migration method based on an adaptive attention mechanism, which comprises the following steps:
step 1: preprocessing the data set;
Acquire the edges2shoes dataset, which contains shoe outlines and real shoe pictures, 49825 picture pairs in total. Divide the dataset into two classes, shoe outlines in one class and real shoes in the other, and randomly shuffle the order. Finally, normalize the picture pixel values to the range [-1, 1];
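A minimal preprocessing sketch under these specifications (file handling and names are assumptions):

```python
import random
import numpy as np
from PIL import Image

def load_picture(path):
    """Resize a picture to 256x256 and scale pixel values to [-1, 1]."""
    img = Image.open(path).convert("RGB").resize((256, 256))
    return np.asarray(img, dtype=np.float32) / 127.5 - 1.0  # (256, 256, 3)

# outline_paths / shoe_paths are assumed lists of file names, one per class
# random.shuffle(outline_paths); random.shuffle(shoe_paths)
```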
step 2: construct a convolutional neural network and a fully-connected neural network;
1) Construct a convolutional neural network comprising two sub-networks, one a generator and the other a discriminator. The input and output of the generator are pictures, while the input of the discriminator is a picture and its output is a scalar. The first two layers of the generator network are 2 downsampling convolution blocks, followed by 9 residual network blocks and finally 2 upsampling convolution blocks. The discriminator network uses, in order, 4 downsampling convolution blocks and two standard convolution blocks. The standard, upsampling, downsampling, and residual network blocks are shown in fig. 5; a code sketch of both sub-networks follows item 2) below.
2) Construct a fully-connected network whose input is a vector $v \in \mathbb{R}^8$. Assuming the total number of channels across the generator in the constructed convolutional neural network is $L$, the output of the fully-connected network comprises two parts: the first part is a vector $d \in \mathbb{R}^L$, and the other part is a vector $S \in \mathbb{R}^K$, where $K$ is the number of bases in step 3. The network contains two 128-dimensional hidden layers; the hidden layers use the Relu function as the activation function, and the output layer uses the Tanh function as its activation function;
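The following PyTorch sketch assembles the two sub-networks and the fully-connected mapping network described above. It is a sketch under assumptions: channel widths, kernel sizes, normalization layers, and the discriminator's pooling to a scalar are not specified by the patent and are chosen here only for illustration.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # downsampling convolution block (internal layout assumed)
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

def up(cin, cout):
    # upsampling convolution block (internal layout assumed)
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.InstanceNorm2d(c),
            nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.InstanceNorm2d(c))
    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """2 downsampling blocks, 9 residual blocks, 2 upsampling blocks."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            down(3, 64), down(64, 128),
            *[ResBlock(128) for _ in range(9)],
            up(128, 64), up(64, 3))
    def forward(self, x):
        return torch.tanh(self.net(x))           # pictures in [-1, 1]

class Discriminator(nn.Module):
    """4 downsampling blocks followed by two standard convolution blocks."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            down(3, 64), down(64, 128), down(128, 256), down(256, 512),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 1, 3, padding=1))
    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))   # scalar per picture (assumed pooling)

class MappingNet(nn.Module):
    """v in R^8 -> control codes d in R^L and S in R^K."""
    def __init__(self, L, K):
        super().__init__()
        self.L = L
        self.body = nn.Sequential(
            nn.Linear(8, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, L + K), nn.Tanh())
    def forward(self, v):
        out = self.body(v)
        return out[..., :self.L], out[..., self.L:]   # d, S
```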
Step 3: construct the adaptive channel-domain EM attention module (see fig. 4), which corresponds to the process in a Gaussian mixture model. After a picture is fed to the generator in the convolutional neural network, the feature map output by a convolution block in the generator is $X$, of size $C \times H \times W$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map. Let $x_c \in \mathbb{R}^N$ with $N = H \times W$ denote the $N$-dimensional vector of the $c$-th channel, so that $X \in \mathbb{R}^{C \times N}$. A group of $K$ basis vectors $\mu_k \in \mathbb{R}^N$, initialized by random sampling from the normal distribution, forms the matrix $M \in \mathbb{R}^{K \times N}$, where $K < N$. Step 3 comprises three sub-steps: the first estimates the hidden variables $Z \in \mathbb{R}^{C \times K}$; the second updates the basis matrix $M$ using the estimate from the first; the first and second steps iterate in a loop until $\mu$ and $Z$ converge; the third reconstructs $X$ from $M$ and $Z$, after multiplying $M$ by the $S$ obtained in step 2;
Step 4: the overall neural network;
Embed the adaptive channel-domain EM attention module of step 3 into the generator of step 2 at 3 different places in total: first, before the first residual network block, after the second downsampling convolution block; second, replacing the 5th residual network block; and third, embedded at the position of the first upsampling convolution block, after the last residual network block. The feature-map control code $d$ in the output of the fully-connected neural network is multiplied into the outputs of all convolution layers in the generator, and the base control code $S$ is multiplied into the bases $M$ of the adaptive channel-domain EM attention module obtained in step 3. The output of the generator serves as the input of the discriminator, and the output of the discriminator is the output of the overall neural network;
the overall network framework is shown in FIG. 1;
Step 5: design the loss function;
In the pictures acquired in step 1, denote a shoe-outline picture by $I_A$ and a real shoe picture by $I_B$. A vector $v$ is obtained by randomly sampling the normal distribution. The generator together with the fully-connected network of step 2 is denoted $G$, and the discriminator $D$. The generator input in $G$ is $I_A$ and the fully-connected network input is $v$; acting together, their output is denoted $G(I_A, v)$. The inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are denoted $D(I_B)$ and $D(G(I_A, v))$ respectively. The network losses can be described as:

$$\mathcal{L}_D = -\,\mathbb{E}_{I_B}\big[\log D(I_B)\big] - \mathbb{E}_{(I_A, v)}\big[\log\big(1 - D(G(I_A, v))\big)\big]$$

$$\mathcal{L}_G = -\,\mathbb{E}_{(I_A, v)}\big[\log D(G(I_A, v))\big]$$

where $\mathcal{L}_D$ is the loss function of the discriminator and $\mathcal{L}_G$ the loss function of the generator, and $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denote expectations taken over $(I_A, v)$ and $I_B$ respectively;
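A hedged sketch of these two losses using the standard binary cross-entropy adversarial formulation (assuming a combined callable G(I_A, v) and logit outputs from D; both interfaces are assumptions):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, I_A, I_B, v):
    """L_D: score real shoe pictures as 1 and generated pictures as 0."""
    real = D(I_B)
    fake = D(G(I_A, v).detach())                 # detach: G is fixed here
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def generator_loss(D, G, I_A, v):
    """L_G: generated pictures should be scored as real by D."""
    fake = D(G(I_A, v))
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```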
Step 6: train the overall neural network with the loss function constructed in step 5, fixing the parameters of D when updating G and fixing the parameters of G when updating D; the two updates alternate once per iteration;
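A minimal alternating-update loop matching step 6 (the data loader and the loss helpers from the previous sketch are assumed):

```python
import torch

# G, D, discriminator_loss, generator_loss as in the sketches above;
# `loader` (an assumed iterable of (I_A, I_B) batches) is not shown.
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
for I_A, I_B in loader:
    v = torch.randn(I_A.size(0), 8)              # noise sampled from N(0, I)
    opt_d.zero_grad()                            # update D with G fixed
    discriminator_loss(D, G, I_A, I_B, v).backward()
    opt_d.step()
    opt_g.zero_grad()                            # update G with D fixed
    generator_loss(D, G, I_A, v).backward()
    opt_g.step()
```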
Step 7: in the testing stage, take the model trained in step 6 and keep only the network G part. Given an input picture $I_A$, multiple output pictures of different styles are obtained by using different normal-distribution samples $v$.
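At test time only G is kept and varying v varies the style; a sketch (assuming the combined G(picture, noise) interface from the earlier sketches):

```python
import torch

G.eval()                       # keep only the trained G part for testing
with torch.no_grad():
    outputs = [G(I_A, torch.randn(1, 8)) for _ in range(5)]  # 5 styles
```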
Further, the specific method of step 3 is as follows:
Step 3.1: estimate the hidden variables $Z \in \mathbb{R}^{C \times K}$. This step computes the responsibility of each base for each channel, i.e., the likelihood that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the $k$-th base $\mu_k$ for the $c$-th channel $x_c$, with $1 \le k \le K$ and $1 \le c \le C$. The posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$$
The kernel function $\mathcal{K}(a, b)$ is chosen as $\exp(a^\top b)$. For the $t$-th iteration, the hidden variable $Z$ is then calculated in matrix form as:

$$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^\top\big)$$
Step 3.2: update the basis vectors $\mu$. This step maximizes the likelihood function of the complete data and corresponds to the Gaussian mixture model: the weights computed in the first step, i.e., the likelihood that each sample belongs to a given base, are used to take a weighted sum of the samples and update the value of that base. For the $t$-th iteration, the update of the basis vectors is the responsibility-weighted average of $X$:

$$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)}\, x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$$
Step 3.3: after steps 3.1 and 3.2 have been executed alternately $T$ times, proceed to step 3.3: reconstruct $X$ from $M$ and $Z$, multiplying $\mu$ by the $S$ obtained in step 2. The $S$ obtained in step 2 has length $K$, equal to the number of bases $\mu$. $X$ is finally reconstructed using the following formula:

$$\tilde{X} = Z\,\big(\mathrm{diag}(S)\, M\big)$$
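Putting steps 3.1-3.3 together, a compact sketch of the adaptive channel-domain EM attention module (shapes follow the definitions in step 3; the function name is an assumption):

```python
import torch
import torch.nn.functional as F

def adaptive_channel_em_attention(X, M, S, T=3):
    """Sketch of steps 3.1-3.3.
    X: (C, N) feature map, one row per channel (N = H * W)
    M: (K, N) basis matrix, rows are the bases mu_k
    S: (K,)  base control code from the fully-connected network
    """
    for _ in range(T):
        # Step 3.1 (E step): Z = softmax(X M^T), i.e. kernel exp(a^T b)
        Z = F.softmax(X @ M.t(), dim=1)                   # (C, K)
        # Step 3.2 (M step): bases as responsibility-weighted averages of X
        M = (Z.t() @ X) / Z.t().sum(dim=1, keepdim=True)  # (K, N)
    # Step 3.3: weight the bases with S, then reconstruct X from Z and M
    return Z @ (S.unsqueeze(1) * M)                       # (C, N)
```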
the innovation of the invention is that:
1) The spatial domain of the attention mechanism is converted into the channel domain. Spatial-domain attention takes pixels as the variables and computes the weight of each base for each pixel; converting to the channel domain means computing the weight of each base for each channel instead, as shown in fig. 6.
2) Adaptive weighting of the attention mechanism. Weighting the feature map can change the style of the output picture, but we replace weighting of the feature map with weighting of the bases in attention, as shown in fig. 7.
3) We introduced this approach into multi-modal style migration and achieved excellent results in the experiments.
The improvement in 1) makes the attention mechanism pay more attention to style, and the improvement in 2) makes the output style change more precisely; combined, the two finally improve the experimental results.
Drawings
FIG. 1 is a diagram of the main network structure of the method of the present invention.
FIG. 2 is a diagram illustrating multi-modal style migration results of the present invention.
FIG. 3 is a schematic view of the attention mechanism of the present invention.
FIG. 4 is a diagram of the adaptive channel-domain EM attention mechanism according to the present invention.
Fig. 5 is a diagram of a standard convolutional block, an upsampled convolutional block, a downsampled convolutional block, and a residual block according to the present invention.
FIG. 6 is a schematic view of transforming the spatial domain attention into the channel domain attention according to the present invention.
FIG. 7 is a diagram illustrating an adaptive weighting method according to the present invention.
Detailed Description
Step 1: preprocessing the data set;
Acquire the edges2shoes dataset (http://efrosgans.eecs.berkeley.edu/pix2pix/datasets/edges2shoes.tar.gz), which contains shoe outlines and real shoe pictures, 49825 picture pairs in total. Divide the dataset into two classes, shoe outlines in one class and real shoes in the other, and randomly shuffle the order. Finally, normalize the picture pixel values to the range [-1, 1].
Step 2: constructing a convolution neural network and a full-connection neural network;
1) The convolutional neural network constructed in this step comprises two sub-networks, one a generator and the other a discriminator. The input and output of the generator are pictures, while the input of the discriminator is a picture and its output is a scalar. The first two layers of the generator network are 2 downsampling convolution blocks, followed by 9 residual network blocks and finally 2 upsampling convolution blocks. The discriminator network uses, in order, 4 downsampling convolution blocks and two standard convolution blocks. The standard, upsampling, downsampling, and residual network blocks are shown in fig. 5.
2) The fully-connected network constructed in this step takes an 8-dimensional vector $v \in \mathbb{R}^8$ as input. Assuming the total number of convolution kernels across the generator in the constructed convolutional neural network is $L$, the output of the fully-connected network comprises two parts: the first part is a vector $d \in \mathbb{R}^L$, and the other part is a vector $S \in \mathbb{R}^K$, where $K$ is the number of bases in step 3. The network contains two 128-dimensional hidden layers; the hidden layers use the Relu function as the activation function, and the output layer uses the Tanh function as its activation function;
3) The $d$ in the output of the fully-connected neural network is multiplied into the outputs of all convolutions in the generator, while $S$ is multiplied into the bases $M$ of the adaptive channel-domain EM attention module (constructed in step 3).
Step 3, constructing an adaptive channel domain EM attention module, referring to fig. 4, corresponding to the process in the Gaussian mixture model, after a picture is sent to a generator in a convolutional neural network, a feature graph obtained through the output of a convolution block in the generator is X, the size of the feature graph is C × H × W, wherein C is the number of channels, and H and W are the height and the width of the feature graph respectively;an N-dimensional vector representing an ith channel; given aAnd initializing a group of K basis vectors by normal distribution random samplingComposed matrixWherein K is less than N; step 3 is divided into the following three sub-steps; the first step is to estimate the hidden variablesThe second step is to update the base vector matrix M by using the estimation result of the first step; the first step and the second step are iterated circularly until mu and Z converge; thirdly, reconstructing X by using M and Z, and multiplying M by using S obtained in the step 2;
Step 4: the overall neural network structure;
Embed the adaptive channel-domain EM attention module of step 3 into the generator of step 2 at 3 different places in total: first, before the first residual network block, after the second downsampling convolution block; second, replacing the 5th residual network block; and third, embedded at the position of the first upsampling convolution block, after the last residual network block. The overall network framework is shown in FIG. 1;
Step 5: design the loss function;
In the pictures acquired in step 1, denote a shoe-outline picture by $I_A$ and a real shoe picture by $I_B$. A vector $v$ is obtained by randomly sampling the normal distribution. The generator together with the fully-connected network of step 2 is denoted $G$, and the discriminator $D$. The generator input in $G$ is $I_A$ and the fully-connected network input is $v$; acting together, their output is denoted $G(I_A, v)$. The inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are denoted $D(I_B)$ and $D(G(I_A, v))$ respectively. The network losses can be described as:

$$\mathcal{L}_D = -\,\mathbb{E}_{I_B}\big[\log D(I_B)\big] - \mathbb{E}_{(I_A, v)}\big[\log\big(1 - D(G(I_A, v))\big)\big]$$

$$\mathcal{L}_G = -\,\mathbb{E}_{(I_A, v)}\big[\log D(G(I_A, v))\big]$$

where $\mathcal{L}_D$ is the loss function of the discriminator and $\mathcal{L}_G$ the loss function of the generator, and $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denote expectations taken over $(I_A, v)$ and $I_B$ respectively;
Step 6: train the network with the loss function constructed in step 5, fixing the parameters of D when updating G and fixing the parameters of G when updating D; the two updates alternate once per iteration. In actual training, 1000000 iterations are used;
Step 7: in the testing stage, take the model trained in step 6 and keep only the network G part. Given an input picture $I_A$ and different normal-distribution samples $v$, multiple output pictures of different styles are obtained, completing the tests of picture quality and picture diversity. According to the experimental results on the edges2shoes dataset, the picture quality score improves by 0.15 points over the previous 10.32, reaching 10.47, and the picture diversity score improves by 0.005 points over the previous 0.109, reaching 0.114.
Further, the specific method of step 3 is as follows:
Step 3.1: estimate the hidden variables $Z \in \mathbb{R}^{C \times K}$. This step computes the responsibility of each base for each channel, i.e., the probability that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the $k$-th base $\mu_k$ for the $c$-th channel $x_c$, with $1 \le k \le K$ and $1 \le c \le C$. The posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$$
The kernel function $\mathcal{K}(a, b)$ is chosen as $\exp(a^\top b)$. For the $t$-th iteration, the hidden variable $Z$ is then calculated in matrix form as:

$$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^\top\big)$$
Step 3.2: update the basis vectors $\mu$. This step maximizes the likelihood function of the complete data and corresponds to the Gaussian mixture model: the weights computed in the first step, i.e., the likelihood that each sample belongs to a given base, are used to take a weighted sum of the samples and update the value of that base. For the $t$-th iteration, the update of the basis vectors is the responsibility-weighted average of $X$:

$$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)}\, x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$$
Step 3.3: after steps 3.1 and 3.2 have been executed alternately $T$ times, proceed to step 3.3: reconstruct $X$ from $M$ and $Z$, multiplying $\mu$ by the $S$ obtained in step 2. The $S$ obtained in step 2 has length $K$, equal to the number of bases $\mu$. $X$ is finally reconstructed using the following formula:

$$\tilde{X} = Z\,\big(\mathrm{diag}(S)\, M\big)$$
the picture size is as follows: 256*256*3
Learning rate: 0.0002 and decreases linearly with the number of iterations
Training batch size: 1
Iteration times are as follows: 1000000
Iteration times T of an adaptive channel domain EM attention module: 3.
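Collected for reference, the hyperparameters above as a single configuration sketch (the names are illustrative, not from the patent):

```python
config = dict(
    picture_size=(256, 256, 3),
    learning_rate=2e-4,      # decreases linearly with the iteration count
    batch_size=1,
    iterations=1_000_000,
    em_iterations_T=3,       # loop count of the channel-domain EM module
    noise_dim=8,             # dimension of the sampled vector v
)
```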
Claims (2)
1. An adaptive attention mechanism-based multi-modal image style migration method, comprising:
step 1: preprocessing the data set;
Acquire the edges2shoes dataset, which contains shoe outlines and real shoe pictures, 49825 picture pairs in total. Divide the dataset into two classes, shoe outlines in one class and real shoes in the other, and randomly shuffle the order. Finally, normalize the picture pixel values to the range [-1, 1];
step 2: constructing a convolution neural network and a full-connection neural network;
1) Construct a convolutional neural network comprising two sub-networks, one a generator and the other a discriminator; the input and output of the generator are pictures, while the input of the discriminator is a picture and its output is a scalar; the first two layers of the generator network are 2 downsampling convolution blocks, followed by 9 residual network blocks and finally 2 upsampling convolution blocks; the discriminator network uses, in order, 4 downsampling convolution blocks and two standard convolution blocks;
2) Construct a fully-connected network whose input is a vector $v \in \mathbb{R}^8$. Assuming the total number of channels across the generator in the constructed convolutional neural network is $L$, the output of the fully-connected network comprises two parts: the first part is a vector $d \in \mathbb{R}^L$, and the other part is a vector $S \in \mathbb{R}^K$, where $K$ is the number of bases in step 3. The network contains two 128-dimensional hidden layers; the hidden layers use the Relu function as the activation function, and the output layer uses the Tanh function as its activation function;
Step 3: construct the adaptive channel-domain EM attention module, which corresponds to the process in a Gaussian mixture model. After a picture is fed to the generator in the convolutional neural network, the feature map output by a convolution block in the generator is $X$, of size $C \times H \times W$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map. Let $x_c \in \mathbb{R}^N$ with $N = H \times W$ denote the $N$-dimensional vector of the $c$-th channel, so that $X \in \mathbb{R}^{C \times N}$. A group of $K$ basis vectors $\mu_k \in \mathbb{R}^N$, initialized by random sampling from the normal distribution, forms the matrix $M \in \mathbb{R}^{K \times N}$, where $K < N$. Step 3 comprises three sub-steps: the first estimates the hidden variables $Z \in \mathbb{R}^{C \times K}$; the second updates the basis matrix $M$ using the estimate from the first; the first and second steps iterate in a loop until $\mu$ and $Z$ converge; the third reconstructs $X$ from $M$ and $Z$, after multiplying $M$ by the $S$ obtained in step 2;
Step 4: the overall neural network;
Embed the adaptive channel-domain EM attention module of step 3 into the generator of step 2 at 3 different places in total: first, before the first residual network block, after the second downsampling convolution block; second, replacing the 5th residual network block; and third, embedded at the position of the first upsampling convolution block, after the last residual network block. The feature-map control code $d$ in the output of the fully-connected neural network is multiplied into the outputs of all convolution layers in the generator, and the base control code $S$ is multiplied into the bases $M$ of the adaptive channel-domain EM attention module obtained in step 3. The output of the generator serves as the input of the discriminator, and the output of the discriminator is the output of the overall neural network;
Step 5: design the loss function;
In the pictures acquired in step 1, denote a shoe-outline picture by $I_A$ and a real shoe picture by $I_B$. A vector $v$ is obtained by randomly sampling the normal distribution. The generator together with the fully-connected network of step 2 is denoted $G$, and the discriminator $D$. The generator input in $G$ is $I_A$ and the fully-connected network input is $v$; acting together, their output is denoted $G(I_A, v)$. The inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are denoted $D(I_B)$ and $D(G(I_A, v))$ respectively. The network losses can be described as:

$$\mathcal{L}_D = -\,\mathbb{E}_{I_B}\big[\log D(I_B)\big] - \mathbb{E}_{(I_A, v)}\big[\log\big(1 - D(G(I_A, v))\big)\big]$$

$$\mathcal{L}_G = -\,\mathbb{E}_{(I_A, v)}\big[\log D(G(I_A, v))\big]$$

where $\mathcal{L}_D$ is the loss function of the discriminator and $\mathcal{L}_G$ the loss function of the generator, and $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denote expectations taken over $(I_A, v)$ and $I_B$ respectively;
Step 6: train the overall neural network with the loss function constructed in step 5, fixing the parameters of D when updating G and fixing the parameters of G when updating D; the two updates alternate once per iteration;
Step 7: in the testing stage, take the model trained in step 6 and keep only the network G part; given an input picture $I_A$, multiple output pictures of different styles are obtained by using different normal-distribution samples $v$.
2. The adaptive attention mechanism-based multi-modal image style migration method of claim 1, wherein the specific method of step 3 is:
Step 3.1: estimate the hidden variables $Z \in \mathbb{R}^{C \times K}$. This step computes the responsibility of each base for each channel, i.e., the probability that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the $k$-th base $\mu_k$ for the $c$-th channel $x_c$, with $1 \le k \le K$ and $1 \le c \le C$. The posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$$
The kernel function $\mathcal{K}(a, b)$ is chosen as $\exp(a^\top b)$. For the $t$-th iteration, the hidden variable $Z$ is then calculated in matrix form as:

$$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^\top\big)$$
Step 3.2: for the $t$-th iteration, the update of the basis vectors is the responsibility-weighted average of $X$:

$$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)}\, x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$$
Step 3.3: after steps 3.1 and 3.2 have been executed alternately $T$ times, proceed to step 3.3: reconstruct $X$ from $M$ and $Z$, multiplying $\mu$ by the $S$ obtained in step 2. The $S$ obtained in step 2 has length $K$, equal to the number of bases $\mu$. $X$ is finally reconstructed using the following formula:

$$\tilde{X} = Z\,\big(\mathrm{diag}(S)\, M\big)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010431594.7A CN111696027B (en) | 2020-05-20 | 2020-05-20 | Multi-modal image style migration method based on adaptive attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010431594.7A CN111696027B (en) | 2020-05-20 | 2020-05-20 | Multi-modal image style migration method based on adaptive attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111696027A true CN111696027A (en) | 2020-09-22 |
CN111696027B CN111696027B (en) | 2023-04-07 |
Family
ID=72478084
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010431594.7A Expired - Fee Related CN111696027B (en) | 2020-05-20 | 2020-05-20 | Multi-modal image style migration method based on adaptive attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111696027B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112614047A (en) * | 2020-12-18 | 2021-04-06 | 西北大学 | Facial makeup image style migration method based on TuiGAN improvement |
CN112819692A (en) * | 2021-02-21 | 2021-05-18 | 北京工业大学 | Real-time arbitrary style migration method based on double attention modules |
CN113379655A (en) * | 2021-05-18 | 2021-09-10 | 电子科技大学 | Image synthesis method for generating antagonistic network based on dynamic self-attention |
CN113421318A (en) * | 2021-06-30 | 2021-09-21 | 合肥高维数据技术有限公司 | Font style migration method and system based on multitask generation countermeasure network |
CN113450313A (en) * | 2021-06-04 | 2021-09-28 | 电子科技大学 | Image significance visualization method based on regional contrast learning |
CN113538224A (en) * | 2021-09-14 | 2021-10-22 | 深圳市安软科技股份有限公司 | Image style migration method and device based on generation countermeasure network and related equipment |
CN114037770A (en) * | 2021-10-27 | 2022-02-11 | 电子科技大学长三角研究院(衢州) | Discrete Fourier transform-based attention mechanism image generation method |
CN114037600A (en) * | 2021-10-11 | 2022-02-11 | 长沙理工大学 | New cycleGAN style migration network based on new attention mechanism |
CN115375601A (en) * | 2022-10-25 | 2022-11-22 | 四川大学 | Decoupling expression traditional Chinese painting generation method based on attention mechanism |
CN117635418A (en) * | 2024-01-25 | 2024-03-01 | 南京信息工程大学 | Training method for generating countermeasure network, bidirectional image style conversion method and device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160169782A1 (en) * | 2014-12-10 | 2016-06-16 | Nikhilesh Chawla | Fixture for in situ electromigration testing during x-ray microtomography |
CN108564119A (en) * | 2018-04-04 | 2018-09-21 | 华中科技大学 | A kind of any attitude pedestrian Picture Generation Method |
CN110110745A (en) * | 2019-03-29 | 2019-08-09 | 上海海事大学 | Based on the semi-supervised x-ray image automatic marking for generating confrontation network |
CN110288609A (en) * | 2019-05-30 | 2019-09-27 | 南京师范大学 | A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance |
CN110322423A (en) * | 2019-04-29 | 2019-10-11 | 天津大学 | A kind of multi-modality images object detection method based on image co-registration |
CN110415184A (en) * | 2019-06-28 | 2019-11-05 | 南开大学 | A kind of multi-modality images Enhancement Method based on orthogonal first space |
CN110503598A (en) * | 2019-07-30 | 2019-11-26 | 西安理工大学 | The font style moving method of confrontation network is generated based on condition circulation consistency |
CN110580509A (en) * | 2019-09-12 | 2019-12-17 | 杭州海睿博研科技有限公司 | multimodal data processing system and method for generating countermeasure model based on hidden representation and depth |
CN111161200A (en) * | 2019-12-22 | 2020-05-15 | 天津大学 | Human body posture migration method based on attention mechanism |
CN111161272A (en) * | 2019-12-31 | 2020-05-15 | 北京理工大学 | Embryo tissue segmentation method based on generation of confrontation network |
-
2020
- 2020-05-20 CN CN202010431594.7A patent/CN111696027B/en not_active Expired - Fee Related
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160169782A1 (en) * | 2014-12-10 | 2016-06-16 | Nikhilesh Chawla | Fixture for in situ electromigration testing during x-ray microtomography |
CN108564119A (en) * | 2018-04-04 | 2018-09-21 | 华中科技大学 | A kind of any attitude pedestrian Picture Generation Method |
CN110110745A (en) * | 2019-03-29 | 2019-08-09 | 上海海事大学 | Based on the semi-supervised x-ray image automatic marking for generating confrontation network |
CN110322423A (en) * | 2019-04-29 | 2019-10-11 | 天津大学 | A kind of multi-modality images object detection method based on image co-registration |
CN110288609A (en) * | 2019-05-30 | 2019-09-27 | 南京师范大学 | A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance |
CN110415184A (en) * | 2019-06-28 | 2019-11-05 | 南开大学 | A kind of multi-modality images Enhancement Method based on orthogonal first space |
CN110503598A (en) * | 2019-07-30 | 2019-11-26 | 西安理工大学 | The font style moving method of confrontation network is generated based on condition circulation consistency |
CN110580509A (en) * | 2019-09-12 | 2019-12-17 | 杭州海睿博研科技有限公司 | multimodal data processing system and method for generating countermeasure model based on hidden representation and depth |
CN111161200A (en) * | 2019-12-22 | 2020-05-15 | 天津大学 | Human body posture migration method based on attention mechanism |
CN111161272A (en) * | 2019-12-31 | 2020-05-15 | 北京理工大学 | Embryo tissue segmentation method based on generation of confrontation network |
Non-Patent Citations (2)
Title |
---|
LILI PAN et al.: "Latent Dirichlet Allocation in Generative Adversarial Networks", MACHINE LEARNING *
LI Zetian et al.: "Image denoising based on local expectation-maximization attention", Chinese Journal of Liquid Crystals and Displays *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112614047A (en) * | 2020-12-18 | 2021-04-06 | 西北大学 | Facial makeup image style migration method based on TuiGAN improvement |
CN112614047B (en) * | 2020-12-18 | 2023-07-28 | 西北大学 | TuiGAN-based improved facial makeup image style migration method |
CN112819692A (en) * | 2021-02-21 | 2021-05-18 | 北京工业大学 | Real-time arbitrary style migration method based on double attention modules |
CN112819692B (en) * | 2021-02-21 | 2023-10-31 | 北京工业大学 | Real-time arbitrary style migration method based on dual-attention module |
CN113379655B (en) * | 2021-05-18 | 2022-07-29 | 电子科技大学 | Image synthesis method for generating antagonistic network based on dynamic self-attention |
CN113379655A (en) * | 2021-05-18 | 2021-09-10 | 电子科技大学 | Image synthesis method for generating antagonistic network based on dynamic self-attention |
CN113450313A (en) * | 2021-06-04 | 2021-09-28 | 电子科技大学 | Image significance visualization method based on regional contrast learning |
CN113450313B (en) * | 2021-06-04 | 2022-03-15 | 电子科技大学 | Image significance visualization method based on regional contrast learning |
CN113421318A (en) * | 2021-06-30 | 2021-09-21 | 合肥高维数据技术有限公司 | Font style migration method and system based on multitask generation countermeasure network |
CN113538224B (en) * | 2021-09-14 | 2022-01-14 | 深圳市安软科技股份有限公司 | Image style migration method and device based on generation countermeasure network and related equipment |
CN113538224A (en) * | 2021-09-14 | 2021-10-22 | 深圳市安软科技股份有限公司 | Image style migration method and device based on generation countermeasure network and related equipment |
CN114037600A (en) * | 2021-10-11 | 2022-02-11 | 长沙理工大学 | New cycleGAN style migration network based on new attention mechanism |
CN114037770A (en) * | 2021-10-27 | 2022-02-11 | 电子科技大学长三角研究院(衢州) | Discrete Fourier transform-based attention mechanism image generation method |
CN114037770B (en) * | 2021-10-27 | 2024-08-16 | 电子科技大学长三角研究院(衢州) | Image generation method of attention mechanism based on discrete Fourier transform |
CN115375601A (en) * | 2022-10-25 | 2022-11-22 | 四川大学 | Decoupling expression traditional Chinese painting generation method based on attention mechanism |
CN117635418A (en) * | 2024-01-25 | 2024-03-01 | 南京信息工程大学 | Training method for generating countermeasure network, bidirectional image style conversion method and device |
CN117635418B (en) * | 2024-01-25 | 2024-05-14 | 南京信息工程大学 | Training method for generating countermeasure network, bidirectional image style conversion method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111696027B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111696027B (en) | Multi-modal image style migration method based on adaptive attention mechanism | |
Putzky et al. | Recurrent inference machines for solving inverse problems | |
EP3298576B1 (en) | Training a neural network | |
US20190087726A1 (en) | Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications | |
CN109711426B (en) | Pathological image classification device and method based on GAN and transfer learning | |
CN114187331B (en) | Unsupervised optical flow estimation method based on Transformer feature pyramid network | |
Sprechmann et al. | Supervised sparse analysis and synthesis operators | |
CN113379655B (en) | Image synthesis method for generating antagonistic network based on dynamic self-attention | |
EP4341914A1 (en) | Generating images using sequences of generative neural networks | |
Hu et al. | Image super-resolution with self-similarity prior guided network and sample-discriminating learning | |
CN114648787A (en) | Face image processing method and related equipment | |
CN111986085A (en) | Image super-resolution method based on depth feedback attention network system | |
CN116797456A (en) | Image super-resolution reconstruction method, system, device and storage medium | |
CN114037770B (en) | Image generation method of attention mechanism based on discrete Fourier transform | |
Huang et al. | Learning deep analysis dictionaries for image super-resolution | |
Fakhari et al. | A new restricted boltzmann machine training algorithm for image restoration | |
Gao et al. | Rank-one network: An effective framework for image restoration | |
Moeller et al. | Image denoising—old and new | |
Zhao et al. | Face super-resolution via triple-attention feature fusion network | |
Gangloff et al. | A general parametrization framework for pairwise Markov models: An application to unsupervised image segmentation | |
CN115601787A (en) | Rapid human body posture estimation method based on abbreviated representation | |
CN115410000A (en) | Object classification method and device | |
CN115601257A (en) | Image deblurring method based on local features and non-local features | |
Zandavi | Post-trained convolution networks for single image super-resolution | |
Basioti et al. | Image restoration from parametric transformations using generative models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20230407