CN111696027A - Multi-modal image style migration method based on adaptive attention mechanism - Google Patents

Multi-modal image style migration method based on adaptive attention mechanism

Info

Publication number
CN111696027A
Authority
CN
China
Prior art keywords
network
generator
output
picture
discriminator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010431594.7A
Other languages
Chinese (zh)
Other versions
CN111696027B (en)
Inventor
程深 (Cheng Shen)
潘力立 (Pan Lili)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010431594.7A priority Critical patent/CN111696027B/en
Publication of CN111696027A publication Critical patent/CN111696027A/en
Application granted granted Critical
Publication of CN111696027B publication Critical patent/CN111696027B/en
Legal status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/04 Context-preserving transformations, e.g. by using an importance map
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing


Abstract

The invention discloses a multi-modal image style migration method based on an adaptive attention mechanism, and belongs to the field of computer vision. The method first adopts a generative adversarial network as the basic framework; meanwhile, drawing on the ideas of the EM (expectation-maximization) attention mechanism and channel-wise scale transformation, it improves the EM attention mechanism on the channel domain so that the network pays more attention to style features, and weights the bases in the attention module with noise so that the bases, and finally the style, change adaptively. Noise and pictures are input into the network together, and the generative adversarial network is trained with an adversarial training algorithm. Once the network is trained, multi-modal style migration is performed by varying the noise. In this way the method fully exploits the advantages of the EM attention mechanism and of generative adversarial networks, proposes an adaptive channel-domain EM attention module, and improves the image quality and image diversity of existing methods after style migration.

Description

Multi-modal image style migration method based on adaptive attention mechanism
Technical Field
The invention belongs to the field of computer vision and mainly relates to the problem of multi-modal image style migration; it is mainly applied in the film and television entertainment industry, human-computer interaction, machine vision understanding and similar areas.
Background
Image style migration refers to the technique of converting the style of a picture into other, different styles while keeping its content, after analyzing pictures of different styles by computer. Demand for image style migration is growing in the film and television entertainment industry, human-computer interaction, machine vision understanding and other fields. For example, a camera can convert a person's portrait into a cartoon avatar in real time, and in automatic driving, style migration can assist in converting camera pictures into segmented pictures. Existing image style migration methods divide mainly into picture-optimization-based and model-optimization-based methods.
The style migration method based on picture optimization appeared earliest and gives stable results; its basic principle can be divided into three steps. The first step selects a neural network capable of extracting picture features; the second step uses that network to extract features from the original picture and the target picture and designs a loss function on those features; the third step differentiates the loss function with respect to the original picture and optimizes it iteratively until its style approaches that of the target picture (a sketch follows). This type of method does not require a large amount of data and is therefore simple and convenient to operate, but the iteration takes too long to convert pictures in real time. Reference: L. A. Gatys, A. S. Ecker, M. Bethge. Image Style Transfer Using Convolutional Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414-2423.
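The following is a minimal sketch of this prior-art picture-optimization loop, not the method of the present invention; it assumes a pretrained VGG-19 as the feature extractor, and the layer indices, step count and style weight are illustrative choices.

```python
import torch
import torchvision.models as models

vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract(x, layers=(1, 6, 11, 20)):
    # collect intermediate feature maps from a few VGG layers (indices illustrative)
    feats, h = [], x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in layers:
            feats.append(h)
    return feats

def gram(f):
    # Gram matrix as the style statistic of a feature map
    b, c, hh, ww = f.shape
    f = f.view(b, c, hh * ww)
    return f @ f.transpose(1, 2) / (c * hh * ww)

def stylize(content, style, steps=300, style_weight=1e6):
    # content, style: (1, 3, H, W) tensors; the picture itself is optimized
    x = content.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=0.02)
    with torch.no_grad():
        c_feats = extract(content)
        s_grams = [gram(f) for f in extract(style)]
    for _ in range(steps):
        opt.zero_grad()
        feats = extract(x)
        loss = sum((a - b).pow(2).mean() for a, b in zip(feats, c_feats))
        loss = loss + style_weight * sum(
            (gram(a) - g).pow(2).mean() for a, g in zip(feats, s_grams))
        loss.backward()
        opt.step()
    return x.detach()
```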
The model-optimization-based method mainly trains a model on a large number of pictures of different styles, so that the model learns a mapping function from one picture style to other picture styles; a picture of one style fed into the trained model then yields, at its output, pictures of different styles with consistent content. Its advantages are that no iteration is needed once the model is trained, so style migration runs in real time, and that by feeding additional input variables, one type of picture can be converted into several pictures of different styles at once. Its disadvantages are that style classes absent from training transfer poorly at test time, and that sample diversity remains insufficient in multi-modal style transfer. Reference: Y. Alharbi, N. Smith, P. Wonka. Latent Filter Scaling for Multimodal Unsupervised Image-to-Image Translation. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1458-1466.
In recent years, model-optimization-based methods have matured, and the demand for multi-modal style migration has grown; in multi-modal style migration, the diversity and picture quality of current methods remain insufficient. Multi-modal style migration means that, given one input picture, pictures of several different styles can be output at the same time; in fig. 2, the first picture is the input and the others are the multiple styled pictures output simultaneously. Addressing this field and these shortcomings, the invention proposes a multi-modal image style migration method based on an adaptive attention mechanism and obtains excellent results.
Disclosure of Invention
The invention discloses a multi-modal style migration method with an adaptive channel-domain EM attention mechanism, addressing the lack of style diversity in the prior art.
The method first adopts a generative adversarial network as the basic framework, normalizes the training pictures to 256 × 256 × 3, and samples a normal distribution to obtain noise. Meanwhile, drawing on the ideas of the EM (expectation-maximization) attention mechanism and channel-wise scale transformation, it improves the EM attention mechanism on the channel domain so that the network pays more attention to style features, and weights the bases in the attention module with the noise so that the bases, and finally the style, change adaptively. Noise and pictures are then input into the network together, and the generative adversarial network is trained with an adversarial training algorithm. Once the network is trained, multi-modal style migration is performed by varying the noise. In this way the method fully exploits the advantages of the EM attention mechanism and of generative adversarial networks, proposes the adaptive channel-domain EM attention module, and improves the image quality and image diversity of existing methods after style migration. The general structure of the algorithm is shown schematically in fig. 1.
For convenience in describing the present disclosure, certain terms are first defined.
Definition 1: normal distribution. Also known as the Gaussian distribution, it is a probability distribution of great importance in mathematics, physics, engineering and other fields, with significant influence on many aspects of statistics. A random variable x follows it if its probability density function satisfies

$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

where μ is the mathematical expectation of the normal distribution and σ² its variance; a variable satisfying this is often written $x \sim N(\mu, \sigma^2)$.
Definition 2: generative adversarial network. A generative adversarial network comprises two different neural networks, one called the generator G and the other called the discriminator D, which oppose each other during training; the purpose of the discriminator is to distinguish the true data distribution $P_{data}$ from the generator distribution $P_G$, while the purpose of the generator is to make the two distributions indistinguishable to the discriminator, so that finally $P_{data} = P_G$.
Definition 3: EM algorithm, i.e. the expectation-maximization algorithm. For observed data X and unobservable data Z, the two together are called the complete data D = (X, Z). The EM algorithm first initializes a model and its parameters and uses the model to estimate Z; this is called the E step. The model is then updated with the estimated Z; this is called the M step.
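For concreteness, the E and M steps for a Gaussian mixture model, the classical setting that step 3 later mirrors, can be written in their standard textbook form (not quoted from the patent):

```latex
% E step: responsibility of component k for sample x_n
\gamma_{nk} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
% M step: re-estimate the parameters using the responsibilities as weights
\mu_k = \frac{\sum_{n} \gamma_{nk}\, x_n}{\sum_{n} \gamma_{nk}}, \qquad
\pi_k = \frac{1}{N} \sum_{n} \gamma_{nk}
```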
Definition 4: generalized kernel function. A function describing the relationship between points, and also a function describing mappings between different spaces. Many choices are possible, such as the dot product between vectors.
Definition 5: attention mechanism. An attention mechanism typically comprises 3 modules: query, key and value. The query and the key first undergo a correlation operation, and the result is then used in a weighting operation with the values; the core operator is

$y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j)\, g(x_j)$

where f(·,·) denotes a generalized kernel function, x the input, C(x) a normalization factor summed over the input, and g an arbitrary transformation; the structure is shown schematically in fig. 3.
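A minimal sketch of this operator, assuming the dot-product kernel normalized by a softmax (one common choice of f and C(x); g is taken as the identity here):

```python
import torch
import torch.nn.functional as F

def attention(query, key, value):
    """query: (n_q, d); key: (n_k, d); value: (n_k, d_v).
    weights[i, j] = f(q_i, k_j) / C(x) = exp(q_i . k_j) normalized over j."""
    weights = F.softmax(query @ key.t(), dim=-1)
    return weights @ value  # each output is a weighted sum of the values
```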
Definition 6: EM attention mechanism, i.e. the combination of the EM algorithm with the attention mechanism, obtained mainly by modifying the attention mechanism and adding loop-iteration steps to it.
Definition 7: adaptive channel-domain EM attention mechanism. The method proposed by the invention, an improvement on the EM attention mechanism formed by moving the attention domain to the channel domain and adding a new input. See step 3 for details.
Definition 8: Softmax function. Also called the normalized exponential function, it "compresses" a K-dimensional vector x of arbitrary real numbers into another K-dimensional real vector softmax(x) whose elements each lie in (0, 1) and sum to 1. The formula can be expressed as:

$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$

Definition 9: ReLU function. The rectified linear unit is an activation function commonly used in artificial neural networks; it generally refers to the nonlinear ramp function and its variants, with expression f(x) = max(0, x).

Definition 10: Tanh function. It can be expressed as

$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
Therefore, the technical scheme of the invention is a multi-modal image style migration method based on an adaptive attention mechanism, which comprises the following steps:
step 1: preprocessing the data set;
acquiring the edges2shoes dataset, which contains shoe contours and real shoe pictures, 49,825 picture pairs in total; classifying the data into two classes, shoe contours in one and real shoes in the other, and randomly shuffling their order; finally, normalizing the pixel values of the pictures to the range [-1, 1];
step 2: constructing a convolutional neural network and a fully-connected neural network;
1) constructing a convolutional neural network comprising two sub-networks, a generator and a discriminator; the input and output of the generator are pictures, while the input of the discriminator is a picture and its output is a scalar; the generator network consists of 2 downsampling convolution blocks, followed by 9 residual network blocks and finally 2 upsampling convolution blocks; the discriminator network consists of 4 downsampling convolution blocks followed by two standard convolution blocks; the standard, upsampling, downsampling and residual network blocks are shown in fig. 5.
2) constructing a fully-connected network whose input is an 8-dimensional vector $v \in \mathbb{R}^8$; assuming the total number of generator channels in the constructed convolutional neural network is L, the output of the fully-connected network comprises two parts, a vector $d \in \mathbb{R}^L$ and a vector $S \in \mathbb{R}^K$, where K is the number of bases in step 3; the network contains two 128-dimensional hidden layers, the middle layers use the ReLU function as the activation function, and the output layer uses the Tanh function as its activation (a sketch of this mapping network follows);
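A minimal PyTorch sketch of the fully-connected mapping network described above; the concrete values of L and K are illustrative assumptions, since the patent fixes neither here.

```python
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps 8-D noise v to a feature-map control code d (one scale per
    generator channel, length L) and a basis control code S (length K)."""
    def __init__(self, noise_dim=8, hidden=128, L=1024, K=64):
        super().__init__()
        self.L = L
        self.body = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),   # two 128-D hidden layers
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, L + K), nn.Tanh(),       # Tanh on the output layer
        )

    def forward(self, v):
        out = self.body(v)
        return out[..., :self.L], out[..., self.L:]    # d, S
```

In use, d scales the generator's convolution outputs and S rescales the attention bases, as described in step 4.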
step 3: constructing the adaptive channel-domain EM attention module, referring to fig. 4, corresponding to the procedure in a Gaussian mixture model; after a picture is fed into the generator of the convolutional neural network, the feature map output by a convolution block in the generator is X, of size C × H × W, where C is the number of channels and H and W are the height and width of the feature map; $x_c \in \mathbb{R}^N$ denotes the N-dimensional vector of the c-th channel (N = H × W), giving $X \in \mathbb{R}^{C \times N}$; a matrix $M \in \mathbb{R}^{K \times N}$ composed of a group of K basis vectors $\mu_k \in \mathbb{R}^N$ is initialized by random sampling from a normal distribution, where K < N; step 3 comprises the following three sub-steps: the first estimates the hidden variables $Z \in \mathbb{R}^{C \times K}$; the second updates the basis-vector matrix M using the estimate from the first step; the first and second steps iterate in a loop until μ and Z converge; the third reconstructs X from M and Z, multiplying M by the S obtained in step 2;
step 4: the overall neural network;
the adaptive channel-domain EM attention module of step 3 is embedded into the generator of step 2 at 3 different places: first, before the first residual network block following the second downsampling convolution block; second, in place of the 5th residual network block; third, before the first upsampling convolution block after the last residual network block; the feature-map control code d in the output of the fully-connected neural network is multiplied into the outputs of all convolution layers in the generator, and the basis control code S is multiplied into the bases M of the adaptive channel-domain EM attention module obtained in step 3; the output of the generator serves as the input of the discriminator, and the output of the discriminator is the output of the overall neural network;
the overall network framework is shown in FIG. 1 (a structural sketch of the generator follows);
step 5: designing the loss function;
among the pictures acquired in step 1, denote a shoe-contour picture by $I_A$ and a real shoe picture by $I_B$; randomly sample the normal distribution to obtain a vector v; record the generator of step 2 together with the fully-connected network as G, and the discriminator as D; the generator input in G is $I_A$ and the input of the fully-connected network is v; the two act together and the output is written $G(I_A, v)$; the inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are written $D(I_B)$ and $D(G(I_A, v))$ respectively. The network loss can be described by a discriminator loss $\mathcal{L}_D$, a function of $D(I_B)$ and $D(G(I_A, v))$, and a generator loss $\mathcal{L}_G$, a function of $D(G(I_A, v))$; $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denote taking expectations over $(I_A, v)$ and over $I_B$ respectively;
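The exact forms of $\mathcal{L}_D$ and $\mathcal{L}_G$ are not spelled out above, so the sketch below assumes the standard non-saturating GAN objective purely for illustration; D is assumed to output raw logits.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake):
    # L_D: push D(I_B) toward "real" and D(G(I_A, v)) toward "fake";
    # detach() keeps the generator fixed while D is trained
    r, f = D(real), D(fake.detach())
    return (F.binary_cross_entropy_with_logits(r, torch.ones_like(r)) +
            F.binary_cross_entropy_with_logits(f, torch.zeros_like(f)))

def generator_loss(D, fake):
    # L_G: push D(G(I_A, v)) toward "real"
    f = D(fake)
    return F.binary_cross_entropy_with_logits(f, torch.ones_like(f))
```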
step 6: training the overall neural network with the loss functions constructed in step 5; when G is updated the parameters of D are fixed, and when D is updated the parameters of G are fixed, the two updates alternating once per iteration;
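A sketch of this alternating schedule, reusing the loss helpers above; G, D, the mapping network and the data loader are assumed to exist as in the earlier sketches, and G is assumed here to consume the picture together with the control codes d and S.

```python
import torch

opt_g = torch.optim.Adam(list(G.parameters()) + list(mapping.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for I_A, I_B in loader:                 # contour / real-shoe pairs from step 1
    v = torch.randn(I_A.size(0), 8)     # noise sampled from the normal distribution
    d, S = mapping(v)
    fake = G(I_A, d, S)

    opt_d.zero_grad()                   # update D with G's parameters fixed
    discriminator_loss(D, I_B, fake).backward()
    opt_d.step()

    opt_g.zero_grad()                   # update G (and the mapping) with D fixed
    generator_loss(D, fake).backward()
    opt_g.step()
```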
step 7: the testing stage; with the model trained in step 6, only the network G part is kept; given an input picture $I_A$, different normal-distribution samples v yield multiple output pictures of different styles.
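A short test-time sketch under the same assumptions: one contour picture, several noise draws, several styles.

```python
import torch

G.eval()
with torch.no_grad():
    styles = []
    for _ in range(5):               # five differently styled outputs of one input
        v = torch.randn(1, 8)        # a fresh normal-distribution sample
        d, S = mapping(v)
        styles.append(G(I_A, d, S))
```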
Further, the specific method of step 3 is as follows:
step 3.1: estimating the hidden variables $Z \in \mathbb{R}^{C \times K}$; this step computes the degree of responsibility of each base for each channel, i.e. the likelihood that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the k-th base $\mu_k$ for the c-th channel $x_c$, where 1 ≤ k ≤ K and 1 ≤ c ≤ C; the posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$

the kernel function $\mathcal{K}$ is chosen as $\mathcal{K}(a, b) = \exp(a^T b)$; for the t-th iteration, the hidden variable Z is calculated using the following formula:

$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^T\big)$

step 3.2: updating the basis vectors μ; this step is obtained by maximizing the likelihood function of the complete data and corresponds to the Gaussian mixture model: the responsibilities calculated in the first step are used as weights for a weighted summation of the samples, each base being updated according to the likelihood that the samples belong to it; for the t-th iteration, the update of the basis vectors can be expressed as the weighted sum of X:

$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)} x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$

step 3.3: after steps 3.1 and 3.2 have been executed alternately T times, step 3.3 reconstructs X from M and Z, multiplying μ by the S obtained in step 2; the S obtained in step 2 has length K, equal to the number of bases μ; the reconstruction of X is finally performed using the following formula:

$\tilde{X} = Z\, \mathrm{diag}(S)\, M$
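The three sub-steps combine into the following PyTorch sketch of the adaptive channel-domain EM attention module; T = 3 matches the hyperparameter table at the end of the description, while K and the initialization scale are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveEMAttention(nn.Module):
    """Treats each of the C channels as one N-dimensional sample (N = H*W),
    runs T EM iterations against K bases, then reconstructs the feature map
    from bases rescaled by the basis control code S."""
    def __init__(self, n_pixels, K=64, T=3):
        super().__init__()
        self.T = T
        # K basis vectors mu_k in R^N, initialized from a normal distribution
        self.register_buffer("M0", torch.randn(K, n_pixels))

    def forward(self, x, S):
        B, C, H, W = x.shape
        X = x.view(B, C, H * W)                      # rows are the channels x_c
        M = self.M0.unsqueeze(0).expand(B, -1, -1)   # (B, K, N)
        for _ in range(self.T):
            # step 3.1 (E step): Z = softmax(X M^T), responsibilities (B, C, K)
            Z = F.softmax(torch.bmm(X, M.transpose(1, 2)), dim=-1)
            # step 3.2 (M step): mu_k = sum_c z_ck x_c / sum_c z_ck
            M = torch.bmm(Z.transpose(1, 2), X)
            M = M / (Z.sum(dim=1).unsqueeze(-1) + 1e-6)
        # step 3.3: rescale the bases by S and reconstruct X ~ Z diag(S) M
        M = M * S.view(B, -1, 1)
        return torch.bmm(Z, M).view(B, C, H, W)
```

Because the bases live in $\mathbb{R}^N$, this sketch assumes a fixed feature-map resolution, which holds for the 256 × 256 × 3 inputs used here.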
The innovations of the invention are:
1) converting the spatial domain of the attention mechanism into a channel domain: spatial-domain attention takes pixels as variables and computes the weights of the bases for each pixel; converting to the channel domain means computing the weights of the bases for each channel instead, as shown in fig. 6.
2) adaptive weighting of the attention mechanism: weighting the feature map can change the style of the output picture, but here the weighting of the feature map is replaced with a weighting of the bases in attention, as shown in fig. 7.
3) introducing this approach into multi-modal style migration, where it achieves excellent experimental results.
The improvement in 1) lets the attention mechanism focus more on style, and the improvement in 2) lets the output style be changed more precisely; combined, the two improve the experimental results.
Drawings
FIG. 1 is a diagram of the main network structure of the method of the present invention.
FIG. 2 is a diagram illustrating multi-modal style migration results of the present invention.
FIG. 3 is a schematic view of the attention mechanism of the present invention.
FIG. 4 is a schematic diagram of the adaptive channel-domain EM attention mechanism of the present invention.
Fig. 5 is a diagram of a standard convolutional block, an upsampled convolutional block, a downsampled convolutional block, and a residual block according to the present invention.
FIG. 6 is a schematic view of transforming the spatial domain attention into the channel domain attention according to the present invention.
FIG. 7 is a diagram illustrating an adaptive weighting method according to the present invention.
Detailed Description
Step 1: preprocessing the data set;
acquiring an edges2 photos (http:// efrosgans. eecs. berkeley. edu/pix2pix/datasets/edges2 photos. tar. gz) data set, wherein the edges2 photos data set contains a shoe outline and real shoe pictures, and the total number is 49825 pairs of pictures; classifying the data sets, wherein the shoe outlines are of one class, the real shoes are of the other class, and the sequential processing is randomly disturbed; and finally, normalizing the pixel values of the picture to the range of [ -1,1 ].
Step 2: constructing a convolution neural network and a full-connection neural network;
1) the convolutional neural network constructed in the step comprises two sub-networks, wherein one sub-network is a generator, and the other sub-network is a discriminator; the input and output of the generator are pictures, while the input of the discriminator is a picture and the output is a scalar; the first two layers of the generator network are 2 down-sampling volume blocks, then 9 residual network blocks, and finally 2 up-sampling volume blocks; the discriminator network sequentially adopts 4 downsampling convolution blocks and two standard convolution blocks; the standard, up-sampled, down-sampled and residual network blocks are shown in fig. 5.
2) The full-connection network constructed by the step inputs 8-dimensional vectors
Figure BDA0002500722230000061
Figure BDA0002500722230000062
Representing dimensionality, assuming that the number of all convolution kernels of a generator in the constructed convolution neural network is L, the output of the fully-connected network comprises two parts, wherein the first part is a vector
Figure BDA0002500722230000063
The other part is a vector
Figure BDA0002500722230000064
Wherein K is the number of the base groups in the step 3; the whole system comprises two hidden layers with 128 dimensions, the middle layer uses Relu function as an activation function, and the output layer uses Tanh function as a loss function;
3) d in the output of the fully-connected neural network is multiplied into the outputs of all convolutions in the generator, while S is multiplied into the bases M of the adaptive channel-domain EM attention module (constructed in step 3).
Step 3, constructing an adaptive channel domain EM attention module, referring to fig. 4, corresponding to the process in the Gaussian mixture model, after a picture is sent to a generator in a convolutional neural network, a feature graph obtained through the output of a convolution block in the generator is X, the size of the feature graph is C × H × W, wherein C is the number of channels, and H and W are the height and the width of the feature graph respectively;
Figure BDA0002500722230000071
an N-dimensional vector representing an ith channel; given a
Figure BDA0002500722230000072
And initializing a group of K basis vectors by normal distribution random sampling
Figure BDA0002500722230000073
Composed matrix
Figure BDA0002500722230000074
Wherein K is less than N; step 3 is divided into the following three sub-steps; the first step is to estimate the hidden variables
Figure BDA0002500722230000075
The second step is to update the base vector matrix M by using the estimation result of the first step; the first step and the second step are iterated circularly until mu and Z converge; thirdly, reconstructing X by using M and Z, and multiplying M by using S obtained in the step 2;
step 4: the overall neural network structure;
the adaptive channel-domain EM attention module of step 3 is embedded into the generator of step 2 at 3 different places: first, before the first residual network block following the second downsampling convolution block; second, in place of the 5th residual network block; third, before the first upsampling convolution block after the last residual network block; the overall network framework is shown in FIG. 1;
step 5: designing the loss function;
among the pictures acquired in step 1, denote a shoe-contour picture by $I_A$ and a real shoe picture by $I_B$; randomly sample the normal distribution to obtain a vector v; record the generator of step 2 together with the fully-connected network as G, and the discriminator as D; the generator input in G is $I_A$ and the input of the fully-connected network is v; the two act together and the output is written $G(I_A, v)$; the inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are written $D(I_B)$ and $D(G(I_A, v))$ respectively. The network loss can be described by a discriminator loss $\mathcal{L}_D$, a function of $D(I_B)$ and $D(G(I_A, v))$, and a generator loss $\mathcal{L}_G$, a function of $D(G(I_A, v))$; $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denote taking expectations over $(I_A, v)$ and over $I_B$ respectively;
step 6: training the network with the loss functions constructed in step 5; when G is updated the parameters of D are fixed, and when D is updated the parameters of G are fixed, the two updates alternating once per iteration; 1,000,000 iterations are used in actual training;
step 7: the testing stage; with the model trained in step 6, only the network G part is kept. Given an input picture $I_A$, different normal-distribution samples v yield multiple output pictures of different styles, completing the tests of picture quality and picture diversity. Experimentally, on the edges2shoes dataset the picture-quality score improves from the previous 10.32 points by 0.15 points to reach 10.47 points, and the picture-diversity score improves from the previous 0.109 points by 0.005 points to reach 0.114 points.
Further, the specific method of step 3 is as follows:
step 3.1: estimating the hidden variables $Z \in \mathbb{R}^{C \times K}$; this step computes the degree of responsibility of each base for each channel, i.e. the probability that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the k-th base $\mu_k$ for the c-th channel $x_c$, where 1 ≤ k ≤ K and 1 ≤ c ≤ C; the posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$

the kernel function $\mathcal{K}$ is chosen as $\mathcal{K}(a, b) = \exp(a^T b)$; for the t-th iteration, the hidden variable Z is calculated using the following formula:

$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^T\big)$

step 3.2: updating the basis vectors μ; this step is obtained by maximizing the likelihood function of the complete data and corresponds to the Gaussian mixture model: the responsibilities calculated in the first step are used as weights for a weighted summation of the samples, each base being updated according to the likelihood that the samples belong to it; for the t-th iteration, the update of the basis vectors can be expressed as the weighted sum of X:

$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)} x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$

step 3.3: after steps 3.1 and 3.2 have been executed alternately T times, step 3.3 reconstructs X from M and Z, multiplying μ by the S obtained in step 2; the S obtained in step 2 has length K, equal to the number of bases μ; the reconstruction of X is finally performed using the following formula:

$\tilde{X} = Z\, \mathrm{diag}(S)\, M$
the picture size is as follows: 256*256*3
Learning rate: 0.0002, decreasing linearly with the number of iterations
Training batch size: 1
Number of iterations: 1,000,000
Number of EM iterations T in the adaptive channel-domain EM attention module: 3.
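Collected as a configuration sketch for the training scripts above (the key names are illustrative):

```python
config = {
    "image_size": (256, 256, 3),
    "learning_rate": 2e-4,      # decays linearly with the iteration count
    "batch_size": 1,
    "iterations": 1_000_000,
    "em_iterations_T": 3,       # EM iterations inside the attention module
    "noise_dim": 8,
}
```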

Claims (2)

1. A multi-modal image style migration method based on an adaptive attention mechanism, comprising:
step 1: preprocessing the data set;
acquiring the edges2shoes dataset, which contains shoe contours and real shoe pictures, 49,825 picture pairs in total; classifying the data into two classes, shoe contours in one and real shoes in the other, and randomly shuffling their order; finally, normalizing the pixel values of the pictures to the range [-1, 1];
step 2: constructing a convolutional neural network and a fully-connected neural network;
1) constructing a convolutional neural network comprising two sub-networks, a generator and a discriminator; the input and output of the generator are pictures, while the input of the discriminator is a picture and its output is a scalar; the generator network consists of 2 downsampling convolution blocks, followed by 9 residual network blocks and finally 2 upsampling convolution blocks; the discriminator network consists of 4 downsampling convolution blocks followed by two standard convolution blocks;
2) constructing a fully-connected network whose input is an 8-dimensional vector $v \in \mathbb{R}^8$; assuming the total number of generator channels in the constructed convolutional neural network is L, the output of the fully-connected network comprises two parts, a vector $d \in \mathbb{R}^L$ and a vector $S \in \mathbb{R}^K$, where K is the number of bases in step 3; the network contains two 128-dimensional hidden layers, the middle layers use the ReLU function as the activation function, and the output layer uses the Tanh function as its activation;
step 3: constructing the adaptive channel-domain EM attention module, corresponding to the procedure in a Gaussian mixture model; after a picture is fed into the generator of the convolutional neural network, the feature map output by a convolution block in the generator is X, of size C × H × W, where C is the number of channels and H and W are the height and width of the feature map; $x_c \in \mathbb{R}^N$ denotes the N-dimensional vector of the c-th channel (N = H × W), giving $X \in \mathbb{R}^{C \times N}$; a matrix $M \in \mathbb{R}^{K \times N}$ composed of a group of K basis vectors $\mu_k \in \mathbb{R}^N$ is initialized by random sampling from a normal distribution, where K < N; step 3 comprises the following three sub-steps: the first estimates the hidden variables $Z \in \mathbb{R}^{C \times K}$; the second updates the basis-vector matrix M using the estimate from the first step; the first and second steps iterate in a loop until μ and Z converge; the third reconstructs X from M and Z, multiplying M by the S obtained in step 2;
step 4: the overall neural network;
embedding the adaptive channel-domain EM attention module of step 3 into the generator of step 2 at 3 different places: first, before the first residual network block following the second downsampling convolution block; second, in place of the 5th residual network block; third, before the first upsampling convolution block after the last residual network block; multiplying the feature-map control code d in the output of the fully-connected neural network into the outputs of all convolution layers in the generator, and multiplying the basis control code S into the bases M of the adaptive channel-domain EM attention module obtained in step 3; the output of the generator serves as the input of the discriminator, and the output of the discriminator is the output of the overall neural network;
step 5: designing the loss function;
among the pictures acquired in step 1, denoting a shoe-contour picture by $I_A$ and a real shoe picture by $I_B$; randomly sampling the normal distribution to obtain a vector v; recording the generator of step 2 together with the fully-connected network as G, and the discriminator as D; the generator input in G is $I_A$ and the input of the fully-connected network is v; the two act together and the output is written $G(I_A, v)$; the inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are written $D(I_B)$ and $D(G(I_A, v))$ respectively; the network loss comprises a discriminator loss $\mathcal{L}_D$, a function of $D(I_B)$ and $D(G(I_A, v))$, and a generator loss $\mathcal{L}_G$, a function of $D(G(I_A, v))$, with $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denoting expectations over $(I_A, v)$ and over $I_B$ respectively;
step 6: training the overall neural network with the loss functions constructed in step 5; when G is updated the parameters of D are fixed, and when D is updated the parameters of G are fixed, the two updates alternating once per iteration;
step 7: the testing stage; with the model trained in step 6, only the network G part is kept; given an input picture $I_A$, different normal-distribution samples v yield multiple output pictures of different styles.
2. The adaptive-attention-mechanism-based multi-modal image style migration method of claim 1, wherein the specific method of step 3 is:
step 3.1: estimating the hidden variables $Z \in \mathbb{R}^{C \times K}$; this step computes the degree of responsibility of each base for each channel, i.e. the probability that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the k-th base $\mu_k$ for the c-th channel $x_c$, where 1 ≤ k ≤ K and 1 ≤ c ≤ C; the posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$

the kernel function $\mathcal{K}$ is chosen as $\mathcal{K}(a, b) = \exp(a^T b)$; for the t-th iteration, the hidden variable Z is calculated using the following formula:

$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^T\big)$

step 3.2: for the t-th iteration, the update of the basis vectors can be expressed as the weighted sum of X:

$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)} x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$

step 3.3: after steps 3.1 and 3.2 have been executed alternately T times, step 3.3 reconstructs X from M and Z, multiplying μ by the S obtained in step 2; the S obtained in step 2 has length K, equal to the number of bases μ; the reconstruction of X is finally performed using the following formula:

$\tilde{X} = Z\, \mathrm{diag}(S)\, M$
CN202010431594.7A 2020-05-20 2020-05-20 Multi-modal image style migration method based on adaptive attention mechanism Expired - Fee Related CN111696027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010431594.7A CN111696027B (en) 2020-05-20 2020-05-20 Multi-modal image style migration method based on adaptive attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010431594.7A CN111696027B (en) 2020-05-20 2020-05-20 Multi-modal image style migration method based on adaptive attention mechanism

Publications (2)

Publication Number Publication Date
CN111696027A true CN111696027A (en) 2020-09-22
CN111696027B CN111696027B (en) 2023-04-07

Family

ID=72478084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010431594.7A Expired - Fee Related CN111696027B (en) 2020-05-20 2020-05-20 Multi-modal image style migration method based on adaptive attention mechanism

Country Status (1)

Country Link
CN (1) CN111696027B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614047A (en) * 2020-12-18 2021-04-06 西北大学 Facial makeup image style migration method based on TuiGAN improvement
CN112819692A (en) * 2021-02-21 2021-05-18 北京工业大学 Real-time arbitrary style migration method based on double attention modules
CN113379655A (en) * 2021-05-18 2021-09-10 电子科技大学 Image synthesis method for generating antagonistic network based on dynamic self-attention
CN113421318A (en) * 2021-06-30 2021-09-21 合肥高维数据技术有限公司 Font style migration method and system based on multitask generation countermeasure network
CN113450313A (en) * 2021-06-04 2021-09-28 电子科技大学 Image significance visualization method based on regional contrast learning
CN113538224A (en) * 2021-09-14 2021-10-22 深圳市安软科技股份有限公司 Image style migration method and device based on generation countermeasure network and related equipment
CN114037770A (en) * 2021-10-27 2022-02-11 电子科技大学长三角研究院(衢州) Discrete Fourier transform-based attention mechanism image generation method
CN114037600A (en) * 2021-10-11 2022-02-11 长沙理工大学 New cycleGAN style migration network based on new attention mechanism
CN115375601A (en) * 2022-10-25 2022-11-22 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
CN117635418A (en) * 2024-01-25 2024-03-01 南京信息工程大学 Training method for generating countermeasure network, bidirectional image style conversion method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160169782A1 (en) * 2014-12-10 2016-06-16 Nikhilesh Chawla Fixture for in situ electromigration testing during x-ray microtomography
CN108564119A (en) * 2018-04-04 2018-09-21 华中科技大学 A kind of any attitude pedestrian Picture Generation Method
CN110110745A (en) * 2019-03-29 2019-08-09 上海海事大学 Based on the semi-supervised x-ray image automatic marking for generating confrontation network
CN110288609A (en) * 2019-05-30 2019-09-27 南京师范大学 A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN110415184A (en) * 2019-06-28 2019-11-05 南开大学 A kind of multi-modality images Enhancement Method based on orthogonal first space
CN110503598A (en) * 2019-07-30 2019-11-26 西安理工大学 The font style moving method of confrontation network is generated based on condition circulation consistency
CN110580509A (en) * 2019-09-12 2019-12-17 杭州海睿博研科技有限公司 multimodal data processing system and method for generating countermeasure model based on hidden representation and depth
CN111161200A (en) * 2019-12-22 2020-05-15 天津大学 Human body posture migration method based on attention mechanism
CN111161272A (en) * 2019-12-31 2020-05-15 北京理工大学 Embryo tissue segmentation method based on generation of confrontation network

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160169782A1 (en) * 2014-12-10 2016-06-16 Nikhilesh Chawla Fixture for in situ electromigration testing during x-ray microtomography
CN108564119A (en) * 2018-04-04 2018-09-21 华中科技大学 A kind of any attitude pedestrian Picture Generation Method
CN110110745A (en) * 2019-03-29 2019-08-09 上海海事大学 Based on the semi-supervised x-ray image automatic marking for generating confrontation network
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN110288609A (en) * 2019-05-30 2019-09-27 南京师范大学 A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance
CN110415184A (en) * 2019-06-28 2019-11-05 南开大学 A kind of multi-modality images Enhancement Method based on orthogonal first space
CN110503598A (en) * 2019-07-30 2019-11-26 西安理工大学 The font style moving method of confrontation network is generated based on condition circulation consistency
CN110580509A (en) * 2019-09-12 2019-12-17 杭州海睿博研科技有限公司 multimodal data processing system and method for generating countermeasure model based on hidden representation and depth
CN111161200A (en) * 2019-12-22 2020-05-15 天津大学 Human body posture migration method based on attention mechanism
CN111161272A (en) * 2019-12-31 2020-05-15 北京理工大学 Embryo tissue segmentation method based on generation of confrontation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LILI PAN et al.: "Latent Dirichlet Allocation in Generative Adversarial Networks", 《Machine Learning》 *
李泽田 (LI Zetian) et al.: "Image Denoising Based on Local Expectation-Maximization Attention" (基于局部期望最大化注意力的图像降噪), 《液晶与显示》 (Liquid Crystals and Displays) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614047A (en) * 2020-12-18 2021-04-06 西北大学 Facial makeup image style migration method based on TuiGAN improvement
CN112614047B (en) * 2020-12-18 2023-07-28 西北大学 TuiGAN-based improved facial makeup image style migration method
CN112819692A (en) * 2021-02-21 2021-05-18 北京工业大学 Real-time arbitrary style migration method based on double attention modules
CN112819692B (en) * 2021-02-21 2023-10-31 北京工业大学 Real-time arbitrary style migration method based on dual-attention module
CN113379655B (en) * 2021-05-18 2022-07-29 电子科技大学 Image synthesis method for generating antagonistic network based on dynamic self-attention
CN113379655A (en) * 2021-05-18 2021-09-10 电子科技大学 Image synthesis method for generating antagonistic network based on dynamic self-attention
CN113450313A (en) * 2021-06-04 2021-09-28 电子科技大学 Image significance visualization method based on regional contrast learning
CN113450313B (en) * 2021-06-04 2022-03-15 电子科技大学 Image significance visualization method based on regional contrast learning
CN113421318A (en) * 2021-06-30 2021-09-21 合肥高维数据技术有限公司 Font style migration method and system based on multitask generation countermeasure network
CN113538224B (en) * 2021-09-14 2022-01-14 深圳市安软科技股份有限公司 Image style migration method and device based on generation countermeasure network and related equipment
CN113538224A (en) * 2021-09-14 2021-10-22 深圳市安软科技股份有限公司 Image style migration method and device based on generation countermeasure network and related equipment
CN114037600A (en) * 2021-10-11 2022-02-11 长沙理工大学 New cycleGAN style migration network based on new attention mechanism
CN114037770A (en) * 2021-10-27 2022-02-11 电子科技大学长三角研究院(衢州) Discrete Fourier transform-based attention mechanism image generation method
CN114037770B (en) * 2021-10-27 2024-08-16 电子科技大学长三角研究院(衢州) Image generation method of attention mechanism based on discrete Fourier transform
CN115375601A (en) * 2022-10-25 2022-11-22 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
CN117635418A (en) * 2024-01-25 2024-03-01 南京信息工程大学 Training method for generating countermeasure network, bidirectional image style conversion method and device
CN117635418B (en) * 2024-01-25 2024-05-14 南京信息工程大学 Training method for generating countermeasure network, bidirectional image style conversion method and device

Also Published As

Publication number Publication date
CN111696027B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111696027B (en) Multi-modal image style migration method based on adaptive attention mechanism
Putzky et al. Recurrent inference machines for solving inverse problems
EP3298576B1 (en) Training a neural network
US20190087726A1 (en) Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
CN109711426B (en) Pathological image classification device and method based on GAN and transfer learning
CN114187331B (en) Unsupervised optical flow estimation method based on Transformer feature pyramid network
Sprechmann et al. Supervised sparse analysis and synthesis operators
CN113379655B (en) Image synthesis method for generating antagonistic network based on dynamic self-attention
EP4341914A1 (en) Generating images using sequences of generative neural networks
Hu et al. Image super-resolution with self-similarity prior guided network and sample-discriminating learning
CN114648787A (en) Face image processing method and related equipment
CN111986085A (en) Image super-resolution method based on depth feedback attention network system
CN116797456A (en) Image super-resolution reconstruction method, system, device and storage medium
CN114037770B (en) Image generation method of attention mechanism based on discrete Fourier transform
Huang et al. Learning deep analysis dictionaries for image super-resolution
Fakhari et al. A new restricted boltzmann machine training algorithm for image restoration
Gao et al. Rank-one network: An effective framework for image restoration
Moeller et al. Image denoising—old and new
Zhao et al. Face super-resolution via triple-attention feature fusion network
Gangloff et al. A general parametrization framework for pairwise Markov models: An application to unsupervised image segmentation
CN115601787A (en) Rapid human body posture estimation method based on abbreviated representation
CN115410000A (en) Object classification method and device
CN115601257A (en) Image deblurring method based on local features and non-local features
Zandavi Post-trained convolution networks for single image super-resolution
Basioti et al. Image restoration from parametric transformations using generative models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20230407