CN113378949A - Dual generative adversarial learning method based on capsule network and mixed attention - Google Patents

Dual generative adversarial learning method based on capsule network and mixed attention

Info

Publication number
CN113378949A
CN113378949A (application CN202110690163.7A)
Authority
CN
China
Prior art keywords
layer
vector space
attention
self
discriminator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110690163.7A
Other languages
Chinese (zh)
Inventor
王蒙
陈家兴
王强
李鑫凯
邵逸轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110690163.7A
Publication of CN113378949A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a dual generative adversarial learning method based on a capsule network and mixed attention, and belongs to the field of artificial intelligence and image processing. The invention relates to an image generation method combining a capsule network, self-attention, mixed attention and adversarial learning. The sample space generator in the sample space self-adversarial module is a deep generative model based on mixed attention, and the sample space discriminator draws on the LeNet-5 network. The invention improves the accuracy and efficiency of adversarial learning in the task of generating clear images from few training samples, reduces the training time and the number of training samples required, has stronger generalization capability, and its effectiveness is verified on MNIST and other benchmark datasets.

Description

Dual generative adversarial learning method based on capsule network and mixed attention
Technical Field
The invention relates to a method for generating clear images from few samples, in particular to a dual generative adversarial learning method based on a capsule network and mixed attention, and belongs to the field of artificial intelligence and image processing.
Background
Image generation is an important issue in computer vision. In computer vision research over the years, attention mechanisms have been extensively studied and have been used to improve the performance of modern deep neural networks. Attention mechanisms have proven useful in a variety of computer vision tasks, such as image generation and image classification.
Much recent work has proposed using channel attention, spatial attention, or both to improve the performance of these neural networks. Such attention mechanisms improve the feature representations produced by standard convolutional layers by modelling correlations between channels (channel attention) or by weighting spatial positions (spatial attention). The intuition behind learning attention weights is to enable the network to learn where to attend and to focus further on the target object. This idea has been advanced further by introducing convolutions with large kernels to encode spatial information.
In 2017, Hinton et al. proposed the capsule network together with a dynamic routing algorithm applied to the update from primary capsules to digit capsules, and successfully applied it to recognition on the MNIST dataset. A matrix capsule structure was proposed later, which uses a matrix to express the pose relations between objects and an EM algorithm to perform dynamic updates between capsules. Recently, Efficient-CapsNet added an attention module between capsule layers, reducing the number of primary capsules to 2% of the original and demonstrating the effectiveness of attention modules in improving capsule network performance.
At the same time, with the advent of the generative adversarial network (GAN), significant progress has been made on the image generation task (Goodfellow et al., 2014), but many problems remain unsolved. GANs based on deep convolutional networks have been particularly successful. However, careful inspection of the generated samples shows that, while advanced ImageNet GAN models are adept at generating image classes with few structural constraints (e.g., ocean, sky and landscape classes, which are distinguished more by texture than by geometry), they fail to capture geometric or structural patterns that recur in certain classes.
A generative adversarial network comprises two models, a generator G and a discriminator D, which are trained simultaneously: D is trained to maximize the probability of correctly labelling training samples and samples from G, while the parameters of the generator G are adjusted by minimizing log(1 - D(G(z))).
In an unconditional generator, the mode of the generated data is not controllable. However, when the data are labelled, the label can conveniently be used as a condition on the network input, as in CGAN. A related idea decomposes the noise source into an incompressible source and a latent code via a variational autoencoder, seeking the latent factors of feature variation by maximizing the mutual information between the variational autoencoder and the generator. Such latent codes can be used to discover object classes in an unsupervised manner; the learned representations carry rich semantic information and can handle complex, entangled factors of image appearance, including pose, lighting and changes in the emotional content of facial images.
Variational autoencoders and generative adversarial networks have become increasingly mature, but each has its own advantages and disadvantages. The variational autoencoder trains quickly and stably, but the generated images are blurry; generative adversarial networks often suffer from unstable training, mode collapse, incomplete extraction of intermediate feature information and loss of feature information. Capsule networks have mostly been used for classification tasks, but their effectiveness on generative tasks has also been documented. Future work aims to combine these deep network models so that they complement one another and overcome their respective limitations.
Disclosure of Invention
The invention aims to provide a dual generative adversarial learning method based on a capsule network and mixed attention for the task of generating clear images from few training samples, addressing the defects and shortcomings of the prior art.
The technical scheme adopted by the invention is as follows: a dual generative adversarial learning method based on a capsule network and mixed attention comprises an autoencoder module E based on self-attention and a capsule network, a vector space self-adversarial module and a sample space self-adversarial module;
the autoencoder module E based on self-attention and a capsule network comprises an autoencoder input layer, a parallel convolution layer, a self-attention layer, a primary capsule layer and a final capsule layer;
the vector space self-adversarial module comprises a vector space generator G_A and a vector space discriminator D_A;
the sample space self-adversarial module comprises a sample space generator G_B and a sample space discriminator D_B.
On the basis of a basic generative adversarial learning model, the autoencoder module based on self-attention and a capsule network is applied to the construction of a real vector space, which improves the accuracy and efficiency of image feature extraction for the image generation task and reduces the number of training samples required. The invention also adds a vector space self-adversarial module and a sample space self-adversarial module, improving the robustness of the whole framework and the clarity and realism of the finally generated virtual images.
In addition, the basic architecture of the sample space generator and the sample space discriminator is improved: the sample space generator introduces a channel attention module and a spatial attention module into the feature mapping process, attending respectively to the information of different channels of the convolutional layer and to its spatial information, thereby reducing information loss during model training and improving the stability and accuracy of feature extraction; the sample space discriminator uses a LeNet-5-style network, in which the first-stage and second-stage convolution-pooling operations aim to improve the accuracy of discrimination.
The overall architecture of the method is shown in Fig. 1, and the total training loss function L is:

L = L_E + L_{G_A} + L_{D_A} + L_{G_B} + L_{D_B}

where L_E is the loss function of the autoencoder, L_{G_A} and L_{D_A} are the loss functions of the vector space generator G_A and discriminator D_A respectively, and L_{G_B} and L_{D_B} are the training loss functions of the sample space generator G_B and discriminator D_B. The method comprises the following specific steps (a minimal training-loop sketch is given after the steps below):
(1) preprocessing a real picture input to the autoencoder input layer, and randomly sampling random noise z and a sample label L from the feature distribution of the real picture;
(2) the autoencoder module E encodes the real picture to obtain a real vector space Z_e composed of the final capsule layer, which is input to the vector space discriminator D_A; the vector space generator G_A generates a near-real virtual vector space Z_a from the random noise z and the sample label L extracted in step (1), which is input to the vector space discriminator D_A and to the sample space generator G_B;
(3) the vector space discriminator D_A judges whether the vector space input to it in step (2) is the real vector space Z_e or the virtual vector space Z_a, and feeds the judgment back to the vector space discriminator D_A and the vector space generator G_A;
(4) the sample space generator G_B generates a virtual image from the virtual vector space Z_a input in step (2) and the sample label L extracted in step (1), and inputs the virtual image to the sample space discriminator D_B; the real picture from step (1) is also input to the sample space discriminator D_B;
(5) the sample space discriminator D_B judges whether the image input to it in step (4) is a real picture or a virtual image, and feeds the judgment back to the sample space discriminator D_B and the sample space generator G_B.
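As an illustration of how the five steps above can be wired together, the following PyTorch-style sketch performs one training iteration with standard binary cross-entropy adversarial losses. It is a minimal sketch under assumptions: the module classes E, G_A, D_A, G_B, D_B, their constructor arguments and the opts dictionary of optimizers are hypothetical placeholders for the architectures described later in this document, and the autoencoder's own margin-loss update is omitted here.

```python
# Minimal sketch of one training iteration for the dual adversarial scheme in steps (1)-(5).
import torch
import torch.nn.functional as F

def train_step(E, G_A, D_A, G_B, D_B, x, label, opts, latent_dim=128):
    """x: batch of real pictures, label: one-hot sample labels, opts: dict of optimizers."""
    bs = x.size(0)
    z = torch.randn(bs, latent_dim)                       # step (1): random noise z
    real, fake = torch.ones(bs, 1), torch.zeros(bs, 1)

    z_e = E(x)                                            # step (2): real vector space Z_e
    z_a = G_A(z, label)                                   # step (2): virtual vector space Z_a

    # step (3): vector space discriminator D_A, then vector space generator G_A
    loss_D_A = F.binary_cross_entropy(D_A(z_e.detach()), real) + \
               F.binary_cross_entropy(D_A(z_a.detach()), fake)
    opts['D_A'].zero_grad(); loss_D_A.backward(); opts['D_A'].step()

    loss_G_A = F.binary_cross_entropy(D_A(G_A(z, label)), real)
    opts['G_A'].zero_grad(); loss_G_A.backward(); opts['G_A'].step()

    # step (4): sample space generator G_B produces a virtual image from Z_a and the label
    x_fake = G_B(z_a.detach(), label)

    # step (5): sample space discriminator D_B, then sample space generator G_B
    loss_D_B = F.binary_cross_entropy(D_B(x), real) + \
               F.binary_cross_entropy(D_B(x_fake.detach()), fake)
    opts['D_B'].zero_grad(); loss_D_B.backward(); opts['D_B'].step()

    loss_G_B = F.binary_cross_entropy(D_B(G_B(z_a.detach(), label)), real)
    opts['G_B'].zero_grad(); loss_G_B.backward(); opts['G_B'].step()
    # The autoencoder E is additionally trained with the margin loss L_E described below.
```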
As described in the background, the basic generative adversarial network is in fact a minimax game between the generator G and the discriminator D over the cost function V(D, G):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))]

where P_data is the feature distribution of the input samples, P_z is the distribution of the random noise, x is a real picture, z is random noise, E_{z~P_z} denotes the expectation with z drawn from P_z, and E_{x~P_data} denotes the expectation with x drawn from P_data.
However, by adding extra information to condition the model, the data generation process can be guided. The generative adversarial network can be generalized to a conditional generative model, provided that both the generator and the discriminator receive some additional information y. Here y may be any type of auxiliary information; it is fed to the discriminator and the generator as an additional input layer, acting as a control condition. The objective function of the conditional generative adversarial network is then:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x \mid y)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z \mid y)))]

The training loss functions of all generators and discriminators involved in the present invention are based on the above theory and formulas.
Further, the autoencoder module E in step (2) is built from an attention-based capsule network; its architecture is shown in detail in Fig. 2. The specific operation steps are:
(1.1) inputting a real picture at the autoencoder input layer;
(1.2) performing a parallel convolution operation on the real picture through the parallel convolution layer to obtain feature maps of the real picture;
(1.3) repeatedly extracting and compressing the information in the feature maps of step (1.2) through the self-attention layer, and outputting the result as the primary capsule layer;
(1.4) further compressing the primary capsule layer into the final capsule layer through a compression operation, and feeding the real vector space Z_e formed by the final capsule layer to the vector space discriminator D_A.
The parallel convolution layer of this module applies convolution kernels of size 3 × 3, 5 × 5, 7 × 7 and 9 × 9 in parallel. On the one hand, the parallel convolution with 4 different kernel sizes captures positional information of the real picture at different resolutions; on the other hand, it speeds up model training and reduces the parameters and complexity of the network. The parallel convolution layer yields 256 feature maps of size 4 × 4, where each group of 64 feature maps carries the spatial position information obtained with one kernel size. The operation formula is:

T = \mathrm{ReLU}(\mathrm{Conv}_{k \times k}(x))

where x is the real picture, ReLU is the activation function, Conv is the convolution operation, k denotes the size of the convolution kernel, and T denotes the feature maps obtained after the parallel convolution operation.
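As a concrete illustration, the following sketch shows one way to realize such a parallel convolution layer in PyTorch. The strides and padding needed to reach 64 maps of size 4 × 4 per branch are not specified in the text, so the adaptive pooling used here is an assumption.

```python
import torch
import torch.nn as nn

class ParallelConv(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 / 9x9 convolutions, 64 maps each (256 in total).

    The reduction of each branch to 4x4 feature maps is done here with adaptive
    pooling; the text does not state how the 4x4 resolution is reached, so this
    detail is an assumption.
    """
    def __init__(self, in_ch=1):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, 64, kernel_size=k, padding=k // 2) for k in (3, 5, 7, 9)
        ])
        self.pool = nn.AdaptiveAvgPool2d(4)

    def forward(self, x):
        # T = ReLU(Conv_{k x k}(x)) for each kernel size, concatenated on channels
        feats = [self.pool(torch.relu(conv(x))) for conv in self.branches]
        return torch.cat(feats, dim=1)            # shape: (batch, 256, 4, 4)

# usage sketch on an MNIST-sized input:
# maps = ParallelConv()(torch.randn(8, 1, 28, 28))   # -> (8, 256, 4, 4)
```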
Then, the 64 feature maps obtained with the same convolution kernel form one group (4 groups in total). The feature maps of each group are passed n times through a self-attention module (Attention) for feature extraction and compression, and the resulting feature matrices from every pass are accumulated to obtain the primary capsule layer:

T_n = \mathrm{Attention}(T_{n-1})

U = \sum_{i=1}^{n} T_i

where Attention denotes the self-attention module, T_n denotes the result obtained after the n-th extraction and compression, T_{n-1} denotes the result obtained after the (n-1)-th extraction and compression, i = 1, 2, …, n, and U is the primary capsule layer.
Then, the primary capsule layer is compressed to obtain the final capsule layer, which forms the output Z_e; each capsule in the final capsule layer stores high-level characteristics of the corresponding category, such as pose, orientation and stroke thickness. The compression (squash) operation is:

Z_e = \mathrm{squash}(U) = \frac{\|U\|^2}{1 + \|U\|^2} \cdot \frac{U}{\|U\|}
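For reference, the squash nonlinearity above can be written as a short function; this follows the standard capsule-network formulation, which the text appears to use.

```python
import torch

def squash(u: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Standard capsule squash: shrinks short vectors toward 0, long vectors toward unit length."""
    sq_norm = (u ** 2).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * u / torch.sqrt(sq_norm + eps)
```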
in the overall model, the task of the self-encoder module is to map the samples of each class into a potential vector space. The potential vector space can express the characteristics of each class sample to the maximum extent. To achieve this, the self-encoder model uses a method similar to training a classifier, i.e., minimizing the edge loss function LaTo train:
La=Tamax(0,m+-||va||)2+λ(1-Ta)max(0,||va||-m-)
Taindicating whether the predicted class is the current class, m+For the upper bound of the loss, 0.9 m is taken-To lower bound of loss, take 0.1, vaFor the representative vector of the a-th category, the length is taken to represent the existing probability, and finally, the ratio between the two is balanced by a hyperparameter lambda. Loss L from the encoderEFor the average loss for each class, the formula is as follows:
Figure BDA0003125886380000052
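A hedged sketch of this margin loss, written directly against the formula above; the default λ = 0.5 is the conventional capsule-network value and is an assumption, as the text does not state it.

```python
import torch

def margin_loss(v, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v: final capsule vectors (batch, num_classes, dim); target: one-hot labels (batch, num_classes)."""
    v_norm = v.norm(dim=-1)                                    # ||v_a||
    pos = target * torch.clamp(m_pos - v_norm, min=0.0) ** 2   # T_a * max(0, m+ - ||v_a||)^2
    neg = lam * (1 - target) * torch.clamp(v_norm - m_neg, min=0.0) ** 2
    return (pos + neg).sum(dim=-1).mean()                      # L_E: averaged over classes and batch

# usage: loss_E = margin_loss(final_capsules, one_hot_labels)
```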
further, a vector space generator G in the vector space self-countermeasure moduleASum vector space discriminator DASee fig. 3 and 4 for details of their training loss function LGAAnd LDAThe formulas are respectively as follows:
Figure BDA0003125886380000053
Figure BDA0003125886380000054
where x is the real picture, z is random noise, PrIs a distribution of data samples, PzFor the distribution of random noise, E (x) is the result of the real picture passing through the self-encoder module, Ez~PzDenotes z corresponds to PzDistribution of (E), Ex~PrIndicates that x corresponds to PrDistribution of (2).
The vector space generator G_A in step (2) comprises a vector space generator input layer, three hidden layers and a linear output layer; the three hidden layers are fully connected layers activated by the Leaky ReLU activation function, with 512, 1024 and 512 neurons respectively.
The vector space discriminator D_A in step (3) comprises a vector space discriminator input layer, three hidden layers and a nonlinear output layer; the three hidden layers are fully connected layers with batch normalization and the ReLU activation function (BN + ReLU), with 512, 1024 and 512 neurons respectively.
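The two fully connected networks just described can be sketched as follows. The input and output dimensions (latent_dim, the capsule-vector size vec_dim, and one-hot label conditioning) are not fixed by the text and are assumptions, and the final sigmoid of D_A is one reading of "nonlinear output layer".

```python
import torch
import torch.nn as nn

class VectorGenerator(nn.Module):
    """G_A: noise z (+ label) -> virtual vector space Z_a. Hidden sizes 512-1024-512, Leaky ReLU."""
    def __init__(self, latent_dim=128, num_classes=10, vec_dim=160):
        super().__init__()
        layers, sizes = [], [latent_dim + num_classes, 512, 1024, 512]
        for i, o in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(i, o), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers, nn.Linear(512, vec_dim))   # linear output layer

    def forward(self, z, label):
        return self.net(torch.cat([z, label], dim=1))

class VectorDiscriminator(nn.Module):
    """D_A: vector-space sample -> probability of being the real vector space Z_e. BN + ReLU hidden layers."""
    def __init__(self, vec_dim=160):
        super().__init__()
        layers, sizes = [], [vec_dim, 512, 1024, 512]
        for i, o in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(i, o), nn.BatchNorm1d(o), nn.ReLU()]
        self.net = nn.Sequential(*layers, nn.Linear(512, 1), nn.Sigmoid())

    def forward(self, v):
        return self.net(v)
```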
Further, the sample space generator G_B and the sample space discriminator D_B in the sample space self-adversarial module are shown in detail in Fig. 5 and Fig. 6. Their training loss functions L_{G_B} and L_{D_B} take the same adversarial form:

L_{G_B} = -\mathbb{E}_{z \sim P_z}[\log D_B(G_B(G_A(z)))]

L_{D_B} = -\mathbb{E}_{x \sim P_r}[\log D_B(x)] - \mathbb{E}_{z \sim P_z}[\log(1 - D_B(G_B(G_A(z))))]

where x is the real picture and z is random noise.
The sample space generator G_B in step (4) comprises a convolution layer with a 1 × 1 convolution kernel, a mixed attention module and a deconvolution layer, where the mixed attention module comprises a channel attention module and a spatial attention module. The specific operation steps are as follows (a sketch of the mixed attention module is given after the formulas below):
(2.1) the virtual vector space Z_a passes through the convolution layer to obtain a feature map F;
(2.2) the output of the channel attention module applied to the feature map F is multiplied element-wise (tensor multiplication) with F to obtain the feature map F_c; the output of the spatial attention module applied to F_c is multiplied with F_c to obtain F_s;
(2.3) the result obtained after reshaping the sample label L is multiplied with the feature map F, and the product is then added (matrix addition) to the feature maps F_c and F_s in turn to obtain the feature map F*;
(2.4) the feature map F* passes through the deconvolution layer to obtain a virtual image, which is input to the sample space discriminator D_B.
The operation formulas of step (2.2) are:

F_c = \sigma(M_c(F)) \otimes F

F_s = \sigma(M_s(F_c)) \otimes F_c

where ⊗ denotes tensor (element-wise) multiplication, M_c denotes the channel attention module, M_s denotes the spatial attention module, and σ denotes the sigmoid function. The operation formula of step (2.3) is:

F^* = (\mathrm{reshape}(L) \otimes F) \oplus F_c \oplus F_s

where ⊕ denotes matrix addition.
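The internal structure of the channel and spatial attention modules is not spelled out in the text; the sketch below assumes the widely used CBAM-style construction, which matches the σ, M_c and M_s notation above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):            # M_c
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(),
                                 nn.Linear(ch // reduction, ch))

    def forward(self, F_):
        avg = self.mlp(F_.mean(dim=(2, 3)))
        mx = self.mlp(F_.amax(dim=(2, 3)))
        w = torch.sigmoid(avg + mx)[..., None, None]   # sigma(M_c(F))
        return w * F_                                   # F_c = sigma(M_c(F)) (x) F

class SpatialAttention(nn.Module):            # M_s
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2)

    def forward(self, F_c):
        s = torch.cat([F_c.mean(dim=1, keepdim=True),
                       F_c.amax(dim=1, keepdim=True)], dim=1)
        w = torch.sigmoid(self.conv(s))                 # sigma(M_s(F_c))
        return w * F_c                                  # F_s = sigma(M_s(F_c)) (x) F_c
```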
The sample space discriminator D_B in step (5) comprises a first-stage convolution layer, a first-stage pooling layer, a second-stage convolution layer, a second-stage pooling layer, a fully connected layer and the sample space discriminator output layer. The specific operation steps are as follows (a sketch is given after the formulas below):
(3.1) the real picture or the virtual image passes through the first-stage convolution layer and the first-stage pooling layer in turn to obtain a feature map y*;
(3.2) the feature map y* passes through the second-stage convolution layer and the second-stage pooling layer in turn, and the result is passed through the sample space discriminator output layer to give the output y_b.
The operation formulas of step (3.1) and step (3.2) are respectively:

y^* = \mathrm{MaxPool}(\mathrm{Conv}(p))

y_b = \sigma(\mathrm{MaxPool}(\mathrm{Conv}(y^*)))

where MaxPool is the pooling operation and p is the input real picture or virtual image. The two stages of convolution and pooling improve the accuracy and effectiveness of the discrimination.
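A minimal LeNet-5-style reading of this discriminator follows. The channel counts, kernel sizes and the 28 × 28 input size follow the classic LeNet-5 layout and the MNIST example used later, not an explicit statement in the text.

```python
import torch
import torch.nn as nn

class SampleDiscriminator(nn.Module):
    """D_B: two conv + max-pool stages, a fully connected layer and a sigmoid output."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(1, 6, 5, padding=2), nn.ReLU(),
                                    nn.MaxPool2d(2))           # y* = MaxPool(Conv(p))
        self.stage2 = nn.Sequential(nn.Conv2d(6, 16, 5), nn.ReLU(),
                                    nn.MaxPool2d(2))           # MaxPool(Conv(y*))
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(16 * 5 * 5, 84), nn.ReLU(),
                                  nn.Linear(84, 1), nn.Sigmoid())

    def forward(self, p):                                       # p: (batch, 1, 28, 28)
        return self.head(self.stage2(self.stage1(p)))           # y_b
```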
The beneficial effects of the invention are as follows: applying the autoencoder module based on self-attention and a capsule network to the construction of the real vector space improves the accuracy and efficiency of image feature extraction for the image generation task and reduces the number of training samples required; the added vector space self-adversarial module and sample space self-adversarial module improve the robustness of the whole framework and the clarity and realism of the finally generated virtual images; the sample space generator introduces a channel attention module and a spatial attention module into the feature mapping process, attending respectively to the information of different channels of the convolutional layer and to its spatial information, thereby reducing information loss during model training and improving the stability and accuracy of feature extraction; the sample space discriminator uses a LeNet-5-style network, in which the first-stage and second-stage convolution-pooling operations improve the accuracy of discrimination.
Drawings
FIG. 1 is a block diagram of the dual generative adversarial learning method based on a capsule network and mixed attention;
FIG. 2 is the attention autoencoder E;
FIG. 3 is the latent vector space generator G_A;
FIG. 4 is the latent vector space discriminator D_A;
FIG. 5 is the sample space generator G_B;
FIG. 6 is the sample space discriminator D_B;
FIG. 7 shows the results of comparison experiments between the present invention and other advanced adversarial learning networks, using the MNIST dataset as an example.
Detailed Description
Example 1: the invention is further described with reference to the figures and to training on the MNIST dataset. A dual generative adversarial learning method based on a capsule network and mixed attention comprises an autoencoder module E based on self-attention and a capsule network, a vector space self-adversarial module and a sample space self-adversarial module;
the autoencoder module E based on self-attention and a capsule network comprises an autoencoder input layer, a parallel convolution layer, a self-attention layer, a primary capsule layer and a final capsule layer;
the vector space self-adversarial module comprises a vector space generator G_A and a vector space discriminator D_A;
the sample space self-adversarial module comprises a sample space generator G_B and a sample space discriminator D_B.
Fig. 1 is a framework diagram of the operation flow of an embodiment of the present invention; the method comprises the following steps:
(1) preprocessing a real picture input to the autoencoder input layer, and randomly sampling random noise z and a sample label L from the feature distribution of the real picture;
(2) the autoencoder module E encodes the real picture to obtain a real vector space Z_e composed of the final capsule layer, which is input to the vector space discriminator D_A; the vector space generator G_A generates a near-real virtual vector space Z_a from the random noise z and the sample label L extracted in step (1), which is input to the vector space discriminator D_A and to the sample space generator G_B;
(3) the vector space discriminator D_A judges whether the vector space input to it in step (2) is the real vector space Z_e or the virtual vector space Z_a, and feeds the judgment back to the vector space discriminator D_A and the vector space generator G_A;
(4) the sample space generator G_B generates a virtual image from the virtual vector space Z_a input in step (2) and the sample label L extracted in step (1), and inputs the virtual image to the sample space discriminator D_B; the real picture from step (1) is also input to the sample space discriminator D_B;
(5) the sample space discriminator D_B judges whether the image input to it in step (4) is a real picture or a virtual image, and feeds the judgment back to the sample space discriminator D_B and the sample space generator G_B.
Further, the autoencoder module E in step (2) is based on self-attention and a capsule network, and its architecture is shown in Fig. 2. Applying this autoencoder to the construction of the real vector space improves the accuracy and efficiency of image feature extraction for the image generation task and reduces the number of training samples required. The specific operation steps are:
(1.1) inputting a real picture at the autoencoder input layer;
(1.2) performing a parallel convolution operation on the real picture through the parallel convolution layer to obtain feature maps of the real picture;
(1.3) repeatedly extracting and compressing the information in the feature maps of step (1.2) through the self-attention layer, and outputting the result as the primary capsule layer;
(1.4) further compressing the primary capsule layer into the final capsule layer through the compression function squash, and feeding the real vector space Z_e formed by the final capsule layer to the vector space discriminator D_A.
In step (1.2), the parallel convolution layer of this module applies convolution kernels of size 3 × 3, 5 × 5, 7 × 7 and 9 × 9 in parallel; the parallel convolution with 4 different kernel sizes captures positional information of the real picture at different resolutions, speeds up model training and reduces the parameters and complexity of the network. The parallel convolution layer yields 256 feature maps of size 4 × 4, where each group of 64 feature maps carries the spatial position information obtained with one kernel size. The operation formula is:

T = \mathrm{ReLU}(\mathrm{Conv}_{k \times k}(x))

where x is the real picture, ReLU is the activation function, Conv is the convolution operation, k denotes the size of the convolution kernel, and T denotes the feature maps obtained after the parallel convolution operation.
Then, in step (1.3), the 64 feature maps obtained with the same convolution kernel form one group (4 groups in total). The feature maps of each group are passed n times through a self-attention module (Attention) for feature extraction and compression, and the resulting feature matrices from every pass are accumulated to obtain the primary capsule layer:

T_n = \mathrm{Attention}(T_{n-1})

U = \sum_{i=1}^{n} T_i

where Attention denotes the self-attention module, T_n denotes the result obtained after the n-th extraction and compression, T_{n-1} denotes the result obtained after the (n-1)-th extraction and compression, i = 1, 2, …, n, and U is the primary capsule layer.
Then, in step (1.4), the primary capsule layer is compressed to obtain the final capsule layer, which forms the output Z_e; each capsule in the final capsule layer stores high-level characteristics of the corresponding category, such as pose, orientation and stroke thickness. The compression (squash) operation is:

Z_e = \mathrm{squash}(U) = \frac{\|U\|^2}{1 + \|U\|^2} \cdot \frac{U}{\|U\|}
in the overall model, the task of the self-encoder module is to map the samples of each class into a potential vector space. The potential vector space can express the characteristics of each class sample to the maximum extent. To achieve this, the self-encoder model uses a method similar to training a classifier, i.e., minimizing the edge loss function LaTo train:
La=Tamax(0,m+-||va||)2+λ(1-Ta)max(0,||va||-m-)
Taindicating whether the predicted class is the current class, m+For the upper bound of the loss, 0.9 m is taken-To lower bound of loss, take 0.1, vaFor the representative vector of the a-th category, the length is taken to represent the existing probability, and finally, the ratio between the two is balanced by a hyperparameter lambda. Loss L from the encoderEFor the average loss for each class, the formula is as follows:
Figure BDA0003125886380000101
the invention also adds a vector space self-countermeasure module and a sample space self-countermeasure module, and improves the robustness of the whole framework and the definition and the reality of the finally generated virtual image. And, the basic architecture of the sample space generator and the sample space discriminator is improved: the sample space generator introduces a channel attention module and a space attention module in the feature mapping process, and focuses on information of different channels of the convolutional layer and space information of the convolutional layer respectively, so that the loss and the loss of the information in the model training process are reduced, and the stability and the accuracy of feature extraction are improved; the sample space discriminator uses LeNet-5 network for discrimination, wherein the first-stage and second-stage convolution pooling operations aim to improve the accuracy of discrimination.
Further, the vector space generator G_A in step (2) comprises a vector space generator input layer, three hidden layers and a linear output layer; the three hidden layers are fully connected layers activated by the Leaky ReLU activation function, with 512, 1024 and 512 neurons respectively.
The vector space discriminator D_A in step (3) comprises a vector space discriminator input layer, three hidden layers and a nonlinear output layer; the three hidden layers are fully connected layers with batch normalization and the ReLU activation function (BN + ReLU), with 512, 1024 and 512 neurons respectively.
The vector space generator G_A and the vector space discriminator D_A of the vector space self-adversarial module are shown in detail in Fig. 3 and Fig. 4; their training loss functions L_{G_A} and L_{D_A} are:

L_{G_A} = -\mathbb{E}_{z \sim P_z}[\log D_A(G_A(z))]

L_{D_A} = -\mathbb{E}_{x \sim P_r}[\log D_A(E(x))] - \mathbb{E}_{z \sim P_z}[\log(1 - D_A(G_A(z)))]

where x is the real picture, z is random noise, P_r is the distribution of the data samples, P_z is the distribution of the random noise, E(x) is the result of passing the real picture through the autoencoder module, E_{z~P_z} denotes the expectation with z drawn from P_z, and E_{x~P_r} denotes the expectation with x drawn from P_r.
Further, the sample space generator G_B in step (4) comprises a convolution layer with a 1 × 1 convolution kernel, a mixed attention module and a deconvolution layer, where the mixed attention module comprises a channel attention module and a spatial attention module. The specific operation steps are:
(2.1) the virtual vector space Z_a passes through the convolution layer to obtain a feature map F;
(2.2) the output of the channel attention module applied to the feature map F is multiplied element-wise (tensor multiplication) with F to obtain the feature map F_c; the output of the spatial attention module applied to F_c is multiplied with F_c to obtain F_s;
(2.3) the result obtained after reshaping the sample label L is multiplied with the feature map F, and the product is then added (matrix addition) to the feature maps F_c and F_s in turn to obtain the feature map F*;
(2.4) the feature map F* passes through the deconvolution layer to obtain a virtual image, which is input to the sample space discriminator D_B.
The operation formulas of step (2.2) are:

F_c = \sigma(M_c(F)) \otimes F

F_s = \sigma(M_s(F_c)) \otimes F_c

where ⊗ denotes tensor (element-wise) multiplication, M_c denotes the channel attention module, M_s denotes the spatial attention module, and σ denotes the sigmoid function. The operation formula of step (2.3) is:

F^* = (\mathrm{reshape}(L) \otimes F) \oplus F_c \oplus F_s

where ⊕ denotes matrix addition.
The sample space discriminator D_B in step (5) comprises a first-stage convolution layer, a first-stage pooling layer, a second-stage convolution layer, a second-stage pooling layer, a fully connected layer and the sample space discriminator output layer. The specific operation steps are:
(3.1) the real picture or the virtual image passes through the first-stage convolution layer and the first-stage pooling layer in turn to obtain a feature map y*;
(3.2) the feature map y* passes through the second-stage convolution layer and the second-stage pooling layer in turn, and the result is passed through the sample space discriminator output layer to give the output y_b.
The operation formulas of step (3.1) and step (3.2) are respectively:

y^* = \mathrm{MaxPool}(\mathrm{Conv}(p))

y_b = \sigma(\mathrm{MaxPool}(\mathrm{Conv}(y^*)))

where MaxPool is the pooling operation and p is the input real picture or virtual image. The two stages of convolution and pooling improve the accuracy and effectiveness of the discrimination.
The sample space generator G_B and the sample space discriminator D_B of the sample space self-adversarial module are shown in detail in Fig. 5 and Fig. 6; their training loss functions L_{G_B} and L_{D_B} are:

L_{G_B} = -\mathbb{E}_{z \sim P_z}[\log D_B(G_B(G_A(z)))]

L_{D_B} = -\mathbb{E}_{x \sim P_r}[\log D_B(x)] - \mathbb{E}_{z \sim P_z}[\log(1 - D_B(G_B(G_A(z))))]

where x is the real picture and z is random noise.
Finally, the total training loss function L of the overall model is:

L = L_E + L_{G_A} + L_{D_A} + L_{G_B} + L_{D_B}
the invention has wide application fields, for example, when the classification problem on a production line is processed, the number of samples provided by manufacturers is often insufficient, and a new sample image is required to be correspondingly generated according to the existing samples. The method has low requirement on the number of samples, and the added self-attention module reduces the number of capsules in a capsule network, even reduces the number of capsules to 2% of the original number in efficiency-CapsNet, thereby greatly improving the working efficiency of an encoder; the method has more control, pertinence and accuracy on the generation of the image, and can generate the image similar to a sample provided by a manufacturer better and more clearly.
In the experiments, the operating system is Ubuntu 18.04, the CPU is an AMD Ryzen 5 2600 Six-Core Processor at 3.85 GHz, the programming language is Python 3.6, the graphics card is an NVIDIA GeForce RTX 2070 Super, and the deep learning framework is PyTorch 1.2. The MNIST dataset is divided into a training set and a test set, each containing 10 classes of handwritten digits in the range 0 to 9 (which serve as the sample labels), and each sample image is 28 × 28 pixels. The results of the comparison experiments between the present invention and other advanced adversarial learning networks on the MNIST dataset are shown in Fig. 7; the evaluation metrics of the comparison experiments are as follows:
IS (Inception Score) is used to evaluate the quality (clarity) of the generated images, with larger values being better:

\mathrm{IS} = \exp\big(\mathbb{E}_{x}\, D_{KL}(p(y \mid x)\,\|\,p(y))\big)

FID (Frechet Inception Distance) is used to evaluate the diversity of the generated images, with smaller values being better:

\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)
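For reference, a hedged sketch of computing FID from Inception activation statistics (the means and covariances of real and generated features); the Inception feature extractor itself is assumed and omitted here.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """feats_*: (num_samples, feature_dim) arrays of Inception activations."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # (Sigma_r Sigma_g)^(1/2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))
```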
In summary, the dual generative adversarial learning method based on a capsule network and mixed attention according to the embodiment of the invention is a novel dual adversarial learning method with self-attention, a capsule network and mixed attention. Unlike previous methods, it uses a mixed attention module to weight the extracted features, so that the influence of non-transferable features can be effectively suppressed. By considering the transferability of different regions and resolutions, the method further extracts complex multimodal structural information from the features of the whole image, achieving finer image generation. Comparison and ablation experiments on the benchmark dataset also demonstrate the feasibility and effectiveness of the method.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (8)

1. A dual generative adversarial learning method based on a capsule network and mixed attention, characterized in that it comprises an autoencoder module E based on self-attention and a capsule network, a vector space self-adversarial module and a sample space self-adversarial module;
the autoencoder module E based on self-attention and a capsule network comprises an autoencoder input layer, a parallel convolution layer, a self-attention layer, a primary capsule layer and a final capsule layer;
the vector space self-adversarial module comprises a vector space generator G_A and a vector space discriminator D_A;
the sample space self-adversarial module comprises a sample space generator G_B and a sample space discriminator D_B.
The method comprises the following specific steps:
(1) preprocessing a real picture input to the autoencoder input layer, and randomly sampling random noise z and a sample label L from the feature distribution of the real picture;
(2) the autoencoder module E encodes the real picture to obtain a real vector space Z_e composed of the final capsule layer, which is input to the vector space discriminator D_A; the vector space generator G_A generates a near-real virtual vector space Z_a from the random noise z and the sample label L extracted in step (1), which is input to the vector space discriminator D_A and to the sample space generator G_B;
(3) the vector space discriminator D_A judges whether the vector space input to it in step (2) is the real vector space Z_e or the virtual vector space Z_a, and feeds the judgment back to the vector space discriminator D_A and the vector space generator G_A;
(4) the sample space generator G_B generates a virtual image from the virtual vector space Z_a input in step (2) and the sample label L extracted in step (1), and inputs the virtual image to the sample space discriminator D_B; the real picture from step (1) is also input to the sample space discriminator D_B;
(5) the sample space discriminator D_B judges whether the image input to it in step (4) is a real picture or a virtual image, and feeds the judgment back to the sample space discriminator D_B and the sample space generator G_B.
2. The dual generative adversarial learning method based on a capsule network and mixed attention according to claim 1, characterized in that the specific operation steps of the autoencoder module E based on self-attention and a capsule network are:
(1.1) inputting a real picture at the autoencoder input layer;
(1.2) performing a parallel convolution operation on the real picture through the parallel convolution layer to obtain feature maps of the real picture;
(1.3) repeatedly extracting and compressing the information in the feature maps of step (1.2) through the self-attention layer, and outputting the result as the primary capsule layer;
(1.4) further compressing the primary capsule layer into the final capsule layer through a compression operation S, and feeding the real vector space Z_e formed by the final capsule layer to the vector space discriminator D_A;
in step (1.2), convolution kernels of size 3 × 3, 5 × 5, 7 × 7 and 9 × 9 are applied in parallel, with the operation formula:

T = \mathrm{ReLU}(\mathrm{Conv}_{k \times k}(x))

where x is the real picture, ReLU is the activation function, Conv is the convolution operation, k denotes the size of the convolution kernel, and T denotes the feature maps obtained by the parallel convolution operation of step (1.2);
in step (1.3), the operation formulas for repeatedly extracting and compressing the feature maps obtained in step (1.2) through the self-attention layer are:

T_n = \mathrm{Attention}(T_{n-1})

U = \sum_{i=1}^{n} T_i

where Attention denotes the self-attention module, T_n denotes the result obtained after the n-th extraction and compression, T_{n-1} denotes the result obtained after the (n-1)-th extraction and compression, i = 1, 2, …, n, and U denotes the primary capsule layer;
the compression operation formula in step (1.4) is:

Z_e = \mathrm{squash}(U) = \frac{\|U\|^2}{1 + \|U\|^2} \cdot \frac{U}{\|U\|}
3. The dual generative adversarial learning method based on a capsule network and mixed attention according to claim 1, characterized in that the training loss functions L_{G_A} and L_{D_A} of the vector space generator G_A and the vector space discriminator D_A of the vector space self-adversarial module are respectively:

L_{G_A} = -\mathbb{E}_{z \sim P_z}[\log D_A(G_A(z))]

L_{D_A} = -\mathbb{E}_{x \sim P_r}[\log D_A(E(x))] - \mathbb{E}_{z \sim P_z}[\log(1 - D_A(G_A(z)))]

where x is the real picture, z is random noise, P_r is the distribution of the data samples, P_z is the distribution of the random noise, E(x) is the result of passing the real picture through the autoencoder module, E_{z~P_z} denotes the expectation with z drawn from P_z, and E_{x~P_r} denotes the expectation with x drawn from P_r.
4. The dual generative adversarial learning method based on a capsule network and mixed attention according to claim 1, characterized in that the vector space generator G_A comprises a vector space generator input layer, three hidden layers and a linear output layer; the three hidden layers are fully connected layers activated by the Leaky ReLU activation function, with 512, 1024 and 512 neurons respectively.
5. The dual generative adversarial learning method based on a capsule network and mixed attention according to claim 1, characterized in that the vector space discriminator D_A comprises a vector space discriminator input layer, three hidden layers and a nonlinear output layer; the three hidden layers are fully connected layers with batch normalization and the ReLU activation function (BN + ReLU), with 512, 1024 and 512 neurons respectively.
6. The dual generative adversarial learning method based on a capsule network and mixed attention according to claim 1, characterized in that the training loss functions L_{G_B} and L_{D_B} of the sample space generator G_B and the sample space discriminator D_B of the sample space self-adversarial module are respectively:

L_{G_B} = -\mathbb{E}_{z \sim P_z}[\log D_B(G_B(G_A(z)))]

L_{D_B} = -\mathbb{E}_{x \sim P_r}[\log D_B(x)] - \mathbb{E}_{z \sim P_z}[\log(1 - D_B(G_B(G_A(z))))]

where x is the real picture, z is random noise, P_r is the distribution of the data samples, P_z is the distribution of the random noise, E_{z~P_z} denotes the expectation with z drawn from P_z, and E_{x~P_r} denotes the expectation with x drawn from P_r.
7. The dual generative adversarial learning method based on a capsule network and mixed attention according to claim 1, characterized in that the sample space generator G_B comprises a convolution layer with a 1 × 1 convolution kernel, a mixed attention module and a deconvolution layer, the mixed attention module comprising a channel attention module and a spatial attention module, with the specific operation steps:
(2.1) the virtual vector space Z_a passes through the convolution layer to obtain a feature map F;
(2.2) the output of the channel attention module applied to the feature map F is multiplied element-wise (tensor multiplication) with F to obtain the feature map F_c; the output of the spatial attention module applied to F_c is multiplied with F_c to obtain F_s;
(2.3) the result obtained after reshaping the sample label L is multiplied with the feature map F, and the product is then added (matrix addition) to the feature maps F_c and F_s in turn to obtain the feature map F*;
(2.4) the feature map F* passes through the deconvolution layer to obtain a virtual image, which is input to the sample space discriminator D_B;
the operation formulas of step (2.2) are:

F_c = \sigma(M_c(F)) \otimes F

F_s = \sigma(M_s(F_c)) \otimes F_c

where ⊗ denotes tensor (element-wise) multiplication, M_c denotes the channel attention module, M_s denotes the spatial attention module, and σ denotes the sigmoid function;
the operation formula of step (2.3) is:

F^* = (\mathrm{reshape}(L) \otimes F) \oplus F_c \oplus F_s

where ⊕ denotes matrix addition.
8. The dual generative adversarial learning method based on a capsule network and mixed attention according to claim 1, characterized in that the sample space discriminator D_B comprises a first-stage convolution layer, a first-stage pooling layer, a second-stage convolution layer, a second-stage pooling layer, a fully connected layer and a sample space discriminator output layer, with the specific operation steps:
(3.1) the real picture or the virtual image passes through the first-stage convolution layer and the first-stage pooling layer in turn to obtain a feature map y*;
(3.2) the feature map y* passes through the second-stage convolution layer and the second-stage pooling layer in turn, and the result is passed through the sample space discriminator output layer to give the output y_b;
the operation formulas of step (3.1) and step (3.2) are respectively:

y^* = \mathrm{MaxPool}(\mathrm{Conv}(p))

y_b = \sigma(\mathrm{MaxPool}(\mathrm{Conv}(y^*)))

where MaxPool is the pooling operation and p is the input real picture or virtual image.
CN202110690163.7A 2021-06-22 2021-06-22 Dual generative adversarial learning method based on capsule network and mixed attention Pending CN113378949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110690163.7A CN113378949A (en) 2021-06-22 2021-06-22 Dual generative adversarial learning method based on capsule network and mixed attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110690163.7A CN113378949A (en) 2021-06-22 2021-06-22 Dual generative adversarial learning method based on capsule network and mixed attention

Publications (1)

Publication Number Publication Date
CN113378949A true CN113378949A (en) 2021-09-10

Family

ID=77578325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110690163.7A Pending CN113378949A (en) 2021-06-22 2021-06-22 Dual generative adversarial learning method based on capsule network and mixed attention

Country Status (1)

Country Link
CN (1) CN113378949A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633655A (en) * 2019-08-29 2019-12-31 河南中原大数据研究院有限公司 Attention-attack face recognition attack algorithm
CN113780468A (en) * 2021-09-28 2021-12-10 中国人民解放军国防科技大学 Robust model training method based on small number of neuron connections
CN113780468B (en) * 2021-09-28 2022-08-09 中国人民解放军国防科技大学 Robust image classification model training method based on small number of neuron connections
CN115937994A (en) * 2023-01-06 2023-04-07 南昌大学 Data detection method based on deep learning detection model

Similar Documents

Publication Publication Date Title
CN108537743B (en) Face image enhancement method based on generation countermeasure network
CN113378949A (en) Dual generative adversarial learning method based on capsule network and mixed attention
CN113674140B (en) Physical countermeasure sample generation method and system
CN110543846A (en) Multi-pose face image obverse method based on generation countermeasure network
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN112036260B (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
Wang et al. Sketchembednet: Learning novel concepts by imitating drawings
Singh et al. Steganalysis of digital images using deep fractal network
CN109800768B (en) Hash feature representation learning method of semi-supervised GAN
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN113642621A (en) Zero sample image classification method based on generation countermeasure network
CN116311483B (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
Hongmeng et al. A detection method for deepfake hard compressed videos based on super-resolution reconstruction using CNN
CN114724189A (en) Method, system and application for training confrontation sample defense model for target recognition
CN109508640A (en) Crowd emotion analysis method and device and storage medium
Sharma et al. Deepfakes Classification of Faces Using Convolutional Neural Networks.
Hoque et al. Bdsl36: A dataset for bangladeshi sign letters recognition
CN114170659A (en) Facial emotion recognition method based on attention mechanism
CN112560668A (en) Human behavior identification method based on scene prior knowledge
CN113658285B (en) Method for generating face photo to artistic sketch
CN115294424A (en) Sample data enhancement method based on generation countermeasure network
CN110188706B (en) Neural network training method and detection method based on character expression in video for generating confrontation network
Ge et al. Multi-grained cascade adaboost extreme learning machine for feature representation
Zhang Detect forgery video by performing transfer learning on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination