CN113379655B - Image synthesis method based on a dynamic self-attention generative adversarial network - Google Patents


Info

Publication number
CN113379655B
Authority
CN
China
Prior art keywords
attention
feature map
self
generator
network
Prior art date
Legal status
Active
Application number
CN202110537516.XA
Other languages
Chinese (zh)
Other versions
CN113379655A (en)
Inventor
王博文
潘力立
李宏亮
孟凡满
吴庆波
许林峰
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date: 2021-05-18
Filing date: 2021-05-18
Publication date: 2022-07-29
Application filed by University of Electronic Science and Technology of China
Priority to CN202110537516.XA
Publication of CN113379655A (2021-09-10)
Application granted
Publication of CN113379655B (2022-07-29)
Legal status: Active

Classifications

    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction (G PHYSICS; G06 COMPUTING; G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T 5/00 Image enhancement or restoration)
    • G06N 3/045: Combinations of networks (G PHYSICS; G06 COMPUTING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent (G06N 3/08 Learning methods)


Abstract

The invention discloses an image synthesis method based on a generative adversarial network with dynamic self-attention, and belongs to the field of computer vision. The method first adopts a generative adversarial network as the basic framework, normalizes the training pictures, and samples a normal distribution to obtain noise samples. Drawing on the Linformer algorithm and dynamic convolution, the invention improves the multi-head self-attention mechanism it uses and adds connections and constraints between the self-attention heads, so that the heads can learn knowledge of the image's various modes. The invention fully exploits the advantages of both the dynamic self-attention mechanism and the generative adversarial network: the proposed dynamic self-attention module greatly reduces the computational complexity of multi-head self-attention and alleviates problems such as mode collapse and unstable training in generative adversarial networks.

Description

Image synthesis method based on a dynamic self-attention generative adversarial network
Technical Field
The invention belongs to the field of computer vision and mainly relates to the problem of image synthesis; it is mainly applied in fields such as image restoration, editing, enhancement, and retrieval.
Background
Image synthesis is a technique that understands image content using computer vision and generates specified images as needed. It can generally be divided into two types: unsupervised image synthesis and supervised image synthesis. Unsupervised image synthesis learns a mapping function from a noise distribution to an image distribution and synthesizes images through that mapping. Supervised image synthesis learns the conditional distribution of the image data and then generates images under given conditions. As a hot problem in computer vision, image synthesis is the basis of image restoration, editing, and enhancement. It can compensate for missing visual data in fields such as the military, medicine, and security, and can also be applied in film and television entertainment, graphic design, and related fields.
Since humans are sensitive to details, edges, and similar image information, image synthesis algorithms must guarantee both the realism and the diversity of the synthesized images. To improve realism and diversity, many researchers have used deep generative techniques to improve earlier image synthesis algorithms. However, when the target data distribution is very complex, early deep generative methods often require heavy computation and are difficult to solve. The generative adversarial network (GAN) proposed by Goodfellow et al. in 2014 solves this problem well. Compared with earlier deep generative methods, GANs have the following clear advantages: 1. a GAN can generate higher-dimensional samples simply by increasing the output dimension of the generator and the input dimension of the discriminator; 2. a GAN makes no prior assumptions about the data distribution, so the model's distribution does not need to be designed by hand; 3. the data distribution synthesized by a GAN is very close to that of real samples, which ensures the realism and diversity of the synthesized images. Because of these clear advantages, the present invention performs the image synthesis task with a generative adversarial network.
At present, existing generative adversarial network methods still suffer from problems such as mode collapse and unstable training. To address them, Zhang, Goodfellow, et al. modeled long-range correlations between synthesized pixels by introducing a non-local self-attention mechanism, and their Self-Attention Generative Adversarial Network (SAGAN) achieved a major breakthrough in image synthesis tasks across various fields. Reference: H. Zhang, I. Goodfellow, D. Metaxas, et al. Self-attention generative adversarial networks [C]. International Conference on Machine Learning, 2019, 7354-7363. However, that model still has problems such as high computational complexity, low computational efficiency, and limited ability to model long-range correlations between pixels. Building on the SAGAN model and drawing on the Linformer algorithm and dynamic convolution, the invention proposes an image synthesis method based on a generative adversarial network with dynamic self-attention and obtains excellent results. Reference: Y. Chen, X. Dai, M. Liu, et al. Dynamic convolution: Attention over convolution kernels [C]. IEEE Conference on Computer Vision and Pattern Recognition, 2020, 11030-11039.
Disclosure of Invention
The invention relates to an image synthesis method based on a generative adversarial network with dynamic self-attention, which addresses the high computational complexity, low computational efficiency, and limited long-range pixel-correlation modeling of existing self-attention-based GAN methods. The method first adopts a generative adversarial network as the basic framework, normalizes the training pictures, and samples a normal distribution to obtain noise. Drawing on the Linformer algorithm and dynamic convolution, the invention improves the multi-head self-attention mechanism it uses and adds connections and constraints between the self-attention heads, so that the heads can learn knowledge of the image's various modes. During training, the noise and the pictures are fed into the network simultaneously, and the model is trained with the generative adversarial network algorithm. After training, the image synthesis task is completed by feeding noise into the generator network. In this way the advantages of the dynamic self-attention mechanism and the generative adversarial network are fully exploited: the proposed dynamic self-attention module greatly reduces the computational complexity of multi-head self-attention and alleviates mode collapse, unstable training, and related problems of generative adversarial networks.
For convenience in describing the present disclosure, certain terms are first defined.
Definition 1: Generative Adversarial Network (GAN). A GAN consists of a pair of adversarial neural networks, called the generator and the discriminator. The generator G takes as input a noise vector z randomly sampled from some data distribution p_z(z) and establishes a mapping from that distribution to the target data distribution. The input of the discriminator D is either a real sample or the output of the generator G, and the discriminator tries to distinguish the generator's outputs from real samples as well as possible. The output of the discriminator D is a scalar D(x) representing the probability that the input sample x comes from the real data rather than the synthesized data. In practice, the discriminator and the generator are usually trained alternately so that both advance toward the optimum. First, the parameters of the generator G are fixed and the discriminator D is trained by maximizing the objective function, optimizing its discrimination accuracy; then the parameters of the discriminator D are fixed and the objective function is minimized so that the generator's results approach the real data, reducing the discriminator's accuracy; this alternating process is repeated, and when the generator's results match the true data distribution the objective function reaches the global optimum. The objective of this optimization process can be written as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

In the formula, min and max are the usual minimum and maximum operators, $\mathbb{E}[\cdot]$ denotes the mathematical expectation over the given data distribution, x denotes real data, $p_{data}(x)$ is the real data distribution, z denotes a vector randomly sampled from some data distribution $p_z(z)$, and $\mathbb{E}_{x \sim p_{data}(x)}$ and $\mathbb{E}_{z \sim p_z(z)}$ denote the expectations over x and z respectively.
Definition 2: Non-local self-attention mechanism. The non-local self-attention mechanism typically comprises three components: query, key, and value. A correlation is first computed between the query and the key, and the result is then used to weight the value. Its core operator is

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$

where i indexes an output position and j enumerates all possible positions. x is the input image, f(·,·) computes the correlation between the pixel at position i and the pixels at all possible positions, g(x) is an arbitrary transformation of x, and C(x) normalizes the computed result.
Definition 3: Multi-head self-attention mechanism. Multi-head self-attention can directly model more complex long-range correlations between pixels, and each self-attention head can learn a correlation matrix of a different mode, which plays an important role in improving the generated results. Its core operator is:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

Each attention head in the formula is computed in the same way as the non-local attention mechanism of Definition 2, except that the results of all heads are concatenated and then projected back to the original size by the matrix $W^O$. Since parameters are not shared between the self-attention heads, the heads can be computed in parallel, which greatly reduces computation time.
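A small sketch of this concatenate-then-project pattern follows; it assumes the model dimension splits evenly across the heads and that W_O is a learned d_model × d_model matrix (illustrative only; the patent's own heads operate on feature-map group sets, as described in step 3.2 below).

```python
import torch
import torch.nn.functional as F

def multi_head(Q, K, V, heads: int, W_O: torch.Tensor) -> torch.Tensor:
    """MultiHead(Q,K,V) = Concat(head_1,...,head_h) W^O  (illustrative sketch).

    Q, K, V: (N, d_model) tensors; d_model is simply split across `heads`,
    which assumes d_model % heads == 0.
    """
    n, d = Q.shape
    dh = d // heads
    outputs = []
    for i in range(heads):
        q, k, v = (t[:, i * dh:(i + 1) * dh] for t in (Q, K, V))  # one head's slice
        attn = F.softmax(q @ k.T / dh ** 0.5, dim=-1)             # N x N weights
        outputs.append(attn @ v)                                  # head_i, N x dh
    return torch.cat(outputs, dim=-1) @ W_O                       # project back
```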
Definition 4: Neural network. A neural network typically includes an input layer, an output layer, and hidden layers. The input layer is the set of neurons that accepts a large amount of nonlinear input data; the output layer is the set of neurons producing the final result; a hidden layer is a layer of neurons between the input layer and the output layer.
Definition 5: Nonlinear activation function. The nonlinear activation function is an indispensable basic unit of a neural network; it strengthens the network's nonlinearity and improves its ability to model nonlinear data. Common activation functions include the Sigmoid function, the tanh function, and the rectified linear unit (ReLU).
Definition 6: Image convolution and transposed convolution. In deep learning, image convolution and transposed convolution are commonly used for feature extraction and image synthesis respectively, and can be viewed as operations in opposite directions. The convolution operation plays a role similar to the human eye by extracting local features of the image, while also providing parameter sharing and dimensionality reduction of the data. Transposed convolution, also called deconvolution, can turn low-dimensional image features into a high-dimensional image through a series of transposed convolution operations, so it is mostly used for image generation.
Definition 7: Convolutional Neural Network (CNN). A convolutional neural network typically consists of one or more convolutional layers together with a fully connected layer on top, and often also contains pooling layers. Compared with other deep models, convolutional neural networks give better results in image and speech recognition.
Definition 8: Residual Neural Network (ResNet). Compared with a traditional convolutional neural network, a residual network adds shortcut connections, which have been shown to surpass plain feed-forward convolutional networks in both efficiency and accuracy. Residual modules have a clear advantage during training: the back-propagated gradient can flow directly from high layers to low layers through the residual modules, so the network can choose which modules to adjust and the modules remain stable during training.
Definition 9: Normal distribution. Also known as the Gaussian distribution, it is a very common continuous probability distribution. The normal distribution is statistically important and is often used in the natural sciences and engineering to represent an unknown random variable. A random variable x is said to satisfy the normal distribution if its probability density function is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where μ is the mathematical expectation of the distribution and σ² is its variance; this is often written $x \sim N(\mu, \sigma^2)$.
Definition 10: Softmax function. The softmax function compresses a K-dimensional vector x of arbitrary real numbers into another K-dimensional real vector softmax(x) whose elements all lie in the range (0, 1) and sum to 1. It can be written as:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$
Definition 11: One-hot encoding. Because computers cannot interpret non-binary categorical data directly, one-hot encoding converts class-label data into a uniform binary numeric format, which facilitates processing and computation by machine learning algorithms. In the invention, each image label is converted into a one-hot vector of fixed dimension using this encoding. Most entries of a one-hot vector are 0, and this sparse data structure saves computer memory.
Therefore, the technical scheme of the invention is an image synthesis method based on a generative adversarial network with dynamic self-attention, comprising the following steps:
Step 1: preprocessing the data set;
After the CIFAR10 data set is obtained, the images are first grouped according to the data set's class labels, and the class labels are then encoded as one-hot vectors; finally, the image pixel values are normalized and the data is saved as tensors for use by the generative adversarial network;
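A minimal sketch of this preprocessing with torchvision follows (the batch size and file path are illustrative, not specified by the patent):

```python
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

# Normalize pixels to [-1, 1] (per step 1 of the detailed description)
transform = T.Compose([
    T.ToTensor(),                                   # scales pixels to [0, 1]
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # maps [0, 1] -> [-1, 1]
])
trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform)
loader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
onehot = F.one_hot(labels, num_classes=10).float()  # one-hot class labels
print(images.shape, onehot.shape)  # [64, 3, 32, 32] and [64, 10]
```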
Step 2: constructing a convolutional neural network;
The constructed convolutional neural network comprises two sub-networks: one is the generator and the other is the discriminator. The input of the generator is Gaussian noise and its output is an image; the input of the discriminator is an image and its output is a scalar. The first layer of the generator network is a linear fully connected layer, followed by three up-sampling residual network blocks in sequence and finally a standard convolution block. The discriminator network consists, in sequence, of two down-sampling residual network blocks, two standard residual network blocks, and a linear fully connected layer. The standard convolution block, up-sampling residual network block, down-sampling residual network block, and residual network block are shown in Fig. 4.
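The following PyTorch sketch shows one plausible realization of these two sub-networks. The internal layout of the residual blocks is specified only in Fig. 4, so the block internals, channel widths, and noise dimension here are assumptions; in the discriminator sketch, plain convolutions stand in for the residual blocks.

```python
import torch
import torch.nn as nn

class UpResBlock(nn.Module):
    """Assumed up-sampling residual block (exact layout is given in Fig. 4)."""
    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(cin), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(cin, cout, 3, padding=1),
            nn.BatchNorm2d(cout), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1),
        )
        self.skip = nn.Sequential(nn.Upsample(scale_factor=2),
                                  nn.Conv2d(cin, cout, 1))

    def forward(self, x):
        return self.body(x) + self.skip(x)

class Generator(nn.Module):
    """Linear layer -> three up-sampling residual blocks -> standard conv block,
    mapping a noise vector to a 32 x 32 image in [-1, 1]."""
    def __init__(self, z_dim: int = 128, ch: int = 256):
        super().__init__()
        self.ch = ch
        self.fc = nn.Linear(z_dim, 4 * 4 * ch)           # first layer: linear FC
        self.blocks = nn.Sequential(UpResBlock(ch, ch),   # 4x4  -> 8x8
                                    UpResBlock(ch, ch),   # 8x8  -> 16x16
                                    UpResBlock(ch, ch))   # 16x16 -> 32x32
        self.out = nn.Sequential(nn.BatchNorm2d(ch), nn.ReLU(),
                                 nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, z):
        x = self.fc(z).view(z.size(0), self.ch, 4, 4)
        return self.out(self.blocks(x))

class Discriminator(nn.Module):
    """Two down-sampling blocks, two standard blocks, then a linear layer
    producing the scalar D(x); residual blocks are simplified to plain convs."""
    def __init__(self, ch: int = 128):
        super().__init__()
        def down(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                                 nn.AvgPool2d(2))
        self.features = nn.Sequential(down(3, ch), down(ch, ch),
                                      nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(ch, 1)

    def forward(self, x):
        h = self.features(x).sum(dim=(2, 3))          # global sum pooling
        return torch.sigmoid(self.fc(h)).squeeze(1)   # scalar probability D(x)
```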
Step 3: constructing a dynamic multi-head self-attention module;
After the Gaussian noise is fed into the generator of the convolutional neural network, the feature map output by an up-sampling residual network block in the generator is X, of size H × W × C, where C is the number of channels of the feature map and H and W are its height and width; X is reshaped to N × C, where N = H × W.
First, the dynamic attention weight $z \in \mathbb{R}^M$ of X is computed, where M is the number of self-attention heads. Second, X is fed into grouped convolutions to obtain a query feature map group set, a key feature map group set, and a value feature map group set. Third, the dynamic attention weight z is used to select the corresponding optimal query, key, and value feature maps from the three feature map group sets. Fourth, the selected key and value feature maps are dimension-reduced by the transforms E and F respectively, and the feature map X is reconstructed from the query feature map and the two dimension-reduced feature maps; a code sketch of the whole module is given after step 3.4 below.
Step 4: designing the overall neural network;
The dynamic multi-head self-attention module of step 3 is embedded into the generator of step 2, immediately after the generator's last up-sampling residual network block. During training, after the Gaussian noise is fed into the generator, the feature map X is obtained from the output of an up-sampling residual network block in the generator; X passes through the dynamic multi-head self-attention module of step 3 to obtain the reconstructed feature map X*, which passes through the generator's output convolution layer to produce the output picture; the generator's output picture serves as the input of the discriminator.
Step 5: designing the loss function;
Denote the pictures obtained in step 1 by I, and let v be a vector obtained by randomly sampling the normal distribution N(0, I). Denote the generator network of step 2 by G and the discriminator network by D. The input of the generator G is v and its output is written G(v); the inputs of the discriminator D are I and G(v), and its outputs are written D(I) and D(G(v)) respectively. Consistent with the objective in Definition 1, the losses of the network are:

$$\mathcal{L}_D = -\mathbb{E}_{I}[\log D(I)] - \mathbb{E}_{v}[\log(1 - D(G(v)))]$$
$$\mathcal{L}_G = \mathbb{E}_{v}[\log(1 - D(G(v)))]$$

where $\mathcal{L}_D$ is the loss function of the discriminator, $\mathcal{L}_G$ is the loss function of the generator, and $\mathbb{E}_I[\cdot]$ and $\mathbb{E}_v[\cdot]$ denote the expectations over I and v respectively.
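As a sketch, these losses can be written directly from the formulas above, assuming a discriminator with sigmoid output as in Definition 1 (the small epsilon is only for numerical stability and is our addition):

```python
import torch

def d_loss(D, G, real, v, eps=1e-8):
    """Discriminator loss L_D: maximize log D(I) + log(1 - D(G(v)))."""
    fake = G(v).detach()                      # fix G's parameters for this step
    return -(torch.log(D(real) + eps).mean()
             + torch.log(1 - D(fake) + eps).mean())

def g_loss(D, G, v, eps=1e-8):
    """Generator loss L_G: minimize log(1 - D(G(v)))."""
    return torch.log(1 - D(G(v)) + eps).mean()
```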
Step 6: training the overall neural network;
Network training uses the loss function constructed in step 5: the parameters of D are fixed while G is updated, the parameters of G are fixed while D is updated, and the two updates alternate once per iteration;
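A minimal sketch of this alternating update, reusing the Generator, Discriminator, loader, and loss functions from the sketches above (the optimizer choice and learning rates are assumptions, not specified in this step):

```python
import torch

G, D = Generator(), Discriminator()
z_dim = 128
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))

for real, _ in loader:
    v = torch.randn(real.size(0), z_dim)      # noise sampled from N(0, I)
    # update D with G's parameters fixed
    opt_d.zero_grad()
    d_loss(D, G, real, v).backward()
    opt_d.step()
    # update G with D's parameters fixed
    opt_g.zero_grad()
    g_loss(D, G, torch.randn(real.size(0), z_dim)).backward()
    opt_g.step()
```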
Step 7: testing the overall neural network;
After the model is trained in step 6, only the generator G is kept; feeding different noise samples from the normal distribution into G yields a variety of different output pictures.
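A short inference sketch, reusing the trained generator G and z_dim from the sketches above:

```python
import torch

G.eval()
with torch.no_grad():
    v = torch.randn(64, z_dim)       # 64 different noise samples
    samples = G(v)                   # 64 x 3 x 32 x 32, values in [-1, 1]
    samples = (samples + 1) / 2      # map back to [0, 1] for saving or display
```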
The specific method of step 3 comprises the following sub-steps:
Step 3.1: computing the dynamic attention weight;
The dynamic attention weight $z \in \mathbb{R}^M$ gives the probability that each self-attention head is selected, where M is the number of self-attention heads. z is produced by the attention module π; $z_i$, the i-th component of the vector z, represents the probability that the i-th self-attention head is selected. The input of the attention module π is the feature map X, and the component $z_i$ is the output of the i-th neuron of the fully connected layer in the attention module π.
Step 3.2: computing the feature map group sets;
A query feature map, a key feature map, and a value feature map are computed for every self-attention head. The query feature maps of all the self-attention heads together form the query feature map group set, denoted Q, of size M × C × N, where M is the number of attention heads and C and N have the same meanings and values as for the feature map X. The key feature map group set and the value feature map group set are obtained in the same way, denoted K and V respectively, with the same size as Q. Q, K, and V are obtained by feeding the feature map X into three different grouped convolutions; $Q_i$, $K_i$, $V_i$ denote the query, key, and value feature maps of the i-th self-attention head.
Step 3.3: selecting the self-attention head;
The optimal self-attention head is selected according to the dynamic attention weight z. First the largest component of z is found; suppose the j-th component $z_j$ of z is the largest, meaning the j-th self-attention head has the highest probability of being selected. So that the gradient can be back-propagated and computed, the dynamic attention weight z is sparsified: the j-th component $z_j$ is set to 1 and the components of all other dimensions are set to 0. The sparsified dynamic attention weight $z^*$ then weights Q, K, and V separately; since only the j-th component $z^*_j$ of $z^*$ is 1 and the remaining components are 0, the weighting in effect selects the j-th head's $Q_j$, $K_j$, and $V_j$ from Q, K, and V. $Q_j$, $K_j$, $V_j$ denote the query, key, and value feature maps of the j-th attention head; their sizes are consistent with the feature map X.
Step 3.4: dimension-reducing and reconstructing the feature map;
The selected $K_j$ and $V_j$ are reduced in dimension using the dimension-reduction transforms; the reduction uses a common pooling operation or a strided (down-sampling) convolution. After reduction, $K_j^*$ and $V_j^*$ both have size D × C, where $D = H^* \times W^*$, and $H^*$ and $W^*$ are respectively the height and width of the dimension-reduced feature map. First $Q_j$ is matrix-multiplied with the transposed $(K_j^*)^T$ and the result is normalized by softmax to obtain the self-attention correlation matrix $B = \mathrm{softmax}(Q_j (K_j^*)^T)$, of size N × D. Then B is matrix-multiplied with $V_j^*$ to obtain the reconstructed feature map $X' = B V_j^*$, of size N × C, consistent with the original feature map X. X' and X are added to give the output reconstructed feature map, denoted $X^*$, whose size is consistent with X. Finally $X^*$ is reshaped back to the original input feature map shape H × W × C.
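Putting steps 3.1-3.4 together, the following PyTorch sketch shows one way the dynamic multi-head self-attention module could be realized. The grouped-convolution shapes, the pooling-based dimension-reduction transforms E and F, and the straight-through trick used to keep the hard head selection differentiable are assumptions on our part, not taken verbatim from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSelfAttention(nn.Module):
    """Sketch of the dynamic multi-head self-attention module (steps 3.1-3.4)."""

    def __init__(self, channels: int, heads: int = 4, reduce: int = 2):
        super().__init__()
        self.M = heads
        # step 3.1: attention module pi, a fully connected layer whose i-th
        # neuron outputs the (pre-softmax) score of the i-th head
        self.pi = nn.Linear(channels, heads)
        # step 3.2: grouped convolutions producing the M-head Q/K/V group sets
        # (assumes channels % heads == 0)
        self.q = nn.Conv2d(channels, channels * heads, 1, groups=heads)
        self.k = nn.Conv2d(channels, channels * heads, 1, groups=heads)
        self.v = nn.Conv2d(channels, channels * heads, 1, groups=heads)
        # step 3.4: dimension-reduction transforms E and F (pooling variant)
        self.E = nn.AvgPool2d(reduce)
        self.Fred = nn.AvgPool2d(reduce)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # step 3.1: dynamic attention weight z over the M heads
        z = F.softmax(self.pi(x.mean(dim=(2, 3))), dim=-1)          # B x M
        # step 3.3: sparsify z to z* (one-hot at the argmax); the
        # straight-through form keeps a gradient path back to pi
        hard = F.one_hot(z.argmax(dim=-1), self.M).float()
        z_star = hard - z.detach() + z                              # B x M
        # step 3.2: Q/K/V feature-map group sets, B x M x C x H x W
        Q = self.q(x).view(b, self.M, c, h, w)
        K = self.k(x).view(b, self.M, c, h, w)
        V = self.v(x).view(b, self.M, c, h, w)
        sel = z_star.view(b, self.M, 1, 1, 1)
        Qj = (Q * sel).sum(1)       # weighting selects the j-th head's maps
        Kj = (K * sel).sum(1)
        Vj = (V * sel).sum(1)
        # step 3.4: reduce K_j, V_j to D = H* x W* positions, then reconstruct
        Ks = self.E(Kj).flatten(2)                                  # B x C x D
        Vs = self.Fred(Vj).flatten(2)                               # B x C x D
        B_mat = F.softmax(
            torch.bmm(Qj.flatten(2).transpose(1, 2), Ks), dim=-1)   # B x N x D
        X_prime = torch.bmm(B_mat, Vs.transpose(1, 2))              # B x N x C
        return X_prime.transpose(1, 2).view(b, c, h, w) + x         # X* = X' + X
```

Because B is N × D rather than N × N, the correlation computation scales linearly in N for a fixed reduced size D, which is the source of the complexity reduction reported in step 7 of the detailed description.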
The innovations of the invention are:
(1) To address the high computational complexity of the existing non-local self-attention mechanism, the key and value feature maps in the self-attention computation are reduced in dimension by convolution transforms and pooling operations, as shown in Fig. 2.
(2) To address the high computational complexity of the existing multi-head self-attention mechanism and the lack of connections and constraints between different attention heads, the idea of dynamic convolution is introduced into the multi-head self-attention computation, and the computed dynamic self-attention weight is used to select a suitable self-attention head, as shown in Fig. 3.
(3) This mechanism is introduced into the generative adversarial network approach to carry out the image synthesis experiments, where it achieves excellent results.
The improvements in (1) and (2) further reduce the computational complexity and allow the different attention heads to connect and cooperate well; combined, the two improve the final image synthesis results.
Drawings
FIG. 1 is a diagram of the main network structure of the method of the present invention
FIG. 2 is a schematic view of the self-attention mechanism dimension reduction of the method of the present invention
FIG. 3 is a schematic diagram of the dynamic self-attention mechanism of the method of the present invention
FIG. 4 is a diagram of the standard convolution block, residual network block, up-sampling residual block, and down-sampling residual block of the method of the present invention
Detailed Description
Step 1: preprocessing the data set;
The CIFAR10 data set is obtained; it consists of 10 classes of 32 × 32 natural color images and their corresponding class labels, 60000 images and labels in total. First, the images are divided into the 10 categories according to the data set's class labels. The class labels are then encoded as one-hot vectors. Finally, the picture pixel values are normalized to the range [-1, 1] and the data is saved as tensors for use by the generative adversarial network.
Step 2: constructing a convolutional neural network;
The convolutional neural network is constructed exactly as in step 2 of the technical scheme above: a generator (a linear fully connected layer, three up-sampling residual network blocks, and a standard convolution block) whose input is Gaussian noise and whose output is an image, and a discriminator (two down-sampling residual network blocks, two standard residual network blocks, and a linear fully connected layer) whose input is an image and whose output is a scalar.
Steps 3, 4, and 5 (constructing the dynamic multi-head self-attention module, designing the overall neural network, and designing the loss function) proceed exactly as described in steps 3 to 5 of the technical scheme above.
Step 6: training the overall neural network;
Network training uses the loss function constructed in step 5: the parameters of D are fixed while G is updated and the parameters of G are fixed while D is updated, the two updates alternating once per iteration; 200000 iterations are used in actual training.
Step 7: testing the overall neural network;
The model is trained as in step 6 and only the generator G is kept. Feeding different noise samples from the normal distribution into G yields a variety of different output pictures, whose quality and diversity are then evaluated. Experimentally, on the CIFAR10 test data set, the Inception Score of the generated pictures improves by 0.17 points over the previous 8.31, reaching 8.48; the FID of the generated pictures improves by 0.95 points over the previous 12.02, reaching 11.07; and the computational complexity of the model improves from O(kn²) to O(n).
The specific method of step 3 (steps 3.1 to 3.4: computing the dynamic attention weight, computing the feature map group sets, selecting the self-attention head, and dimension-reducing and reconstructing the feature map) is identical to that described, and sketched in code, in the technical scheme above.

Claims (2)

1. An image synthesis method based on a dynamic self-attention generative adversarial network, the method comprising:
step 1: preprocessing the data set;
after the CIFAR10 data set is obtained, the images are first grouped according to the data set's class labels, and the class labels are then encoded as one-hot vectors; finally, the image pixel values are normalized and the data is saved as tensors for use by the generative adversarial network;
step 2: constructing a convolutional neural network;
the convolutional neural network comprises two sub-networks, one being a generator and the other a discriminator; the input of the generator is Gaussian noise and its output is an image; the input of the discriminator is an image and its output is a scalar; the first layer of the generator network is a linear fully connected layer, followed by three up-sampling residual network blocks in sequence and finally a standard convolution block; the discriminator network consists, in sequence, of two down-sampling residual network blocks, two standard residual network blocks, and a linear fully connected layer;
step 3: constructing a dynamic multi-head self-attention module;
after the Gaussian noise is fed into the generator of the convolutional neural network, the feature map output by an up-sampling residual network block in the generator is X, of size H × W × C, where C is the number of channels of the feature map and H and W are its height and width; X is reshaped to N × C, where N = H × W;
first, the dynamic attention weight $z \in \mathbb{R}^M$ of X is computed, where M is the number of self-attention heads; second, X is fed into grouped convolutions to obtain a query feature map group set, a key feature map group set, and a value feature map group set; third, the dynamic attention weight z is used to select the corresponding optimal query, key, and value feature maps from the three feature map group sets; fourth, the selected key and value feature maps are dimension-reduced by the transforms E and F respectively, and the feature map X is reconstructed from the query feature map and the two dimension-reduced feature maps;
step 4: designing the overall neural network;
the dynamic multi-head self-attention module of step 3 is embedded into the generator of step 2, immediately after the generator's last up-sampling residual network block; during training, after the Gaussian noise is fed into the generator, the feature map X is obtained from the output of an up-sampling residual network block in the generator, the feature map X passes through the dynamic multi-head self-attention module of step 3 to obtain a reconstructed feature map X*, the reconstructed feature map X* passes through the output convolution layer in the generator to obtain an output picture, and the output picture of the generator serves as the input of the discriminator;
step 5: designing the loss function;
the pictures obtained in step 1 are denoted I, and a vector v is obtained by randomly sampling the normal distribution N(0, I); the generator network of step 2 is denoted G and the discriminator network D; the input of the generator G is v and its output is denoted G(v); the inputs of the discriminator D are I and G(v) and its outputs are denoted D(I) and D(G(v)) respectively; the losses of the network are:

$$\mathcal{L}_D = -\mathbb{E}_{I}[\log D(I)] - \mathbb{E}_{v}[\log(1 - D(G(v)))]$$
$$\mathcal{L}_G = \mathbb{E}_{v}[\log(1 - D(G(v)))]$$

where $\mathcal{L}_D$ is the loss function of the discriminator, $\mathcal{L}_G$ is the loss function of the generator, and $\mathbb{E}_I[\cdot]$ and $\mathbb{E}_v[\cdot]$ denote the expectations over I and v respectively;
step 6: training the overall neural network;
network training is performed with the loss function constructed in step 5: the parameters of D are fixed while G is updated, the parameters of G are fixed while D is updated, and the two updates alternate once per iteration;
step 7: testing the overall neural network;
after the model is trained in step 6, only the generator G is kept; feeding different noise samples from the normal distribution into G yields a variety of different output pictures.
2. The image synthesis method based on a dynamic self-attention generative adversarial network according to claim 1, wherein the specific method of step 3 is:
step 3.1: computing the dynamic attention weight;
the dynamic attention weight $z \in \mathbb{R}^M$ gives the probability that each self-attention head is selected, where M is the number of self-attention heads; z is produced by the attention module π; $z_i$, the i-th component of the vector z, represents the probability that the i-th self-attention head is selected; the input of the attention module π is the feature map X, and the component $z_i$ is the output of the i-th neuron of the fully connected layer in the attention module π;
step 3.2: computing the feature map group sets;
a query feature map, a key feature map, and a value feature map are computed for every self-attention head; the query feature maps of all the self-attention heads together form the query feature map group set, denoted Q, of size M × C × N, where M is the number of attention heads and C and N have the same meanings and values as for the feature map X; the key feature map group set and the value feature map group set are obtained in the same way, denoted K and V respectively, with the same size as Q; Q, K, and V are obtained by feeding the feature map X into three different grouped convolutions; $Q_i$, $K_i$, $V_i$ denote the query, key, and value feature maps of the i-th self-attention head;
step 3.3: selecting the self-attention head;
the optimal self-attention head is selected according to the dynamic attention weight z; the largest component of z is found first; supposing the j-th component $z_j$ of z is the largest, the j-th self-attention head has the highest probability of being selected; so that the gradient can be back-propagated and computed, the dynamic attention weight z is sparsified, i.e. the j-th component $z_j$ of the weight z is set to 1 and the components of the other dimensions are set to 0; the sparsified dynamic attention weight $z^*$ then weights Q, K, and V separately; since only the j-th component $z^*_j$ of $z^*$ is 1 and the remaining components are 0, the weighting in effect selects the j-th head's $Q_j$, $K_j$, and $V_j$ from Q, K, and V; $Q_j$, $K_j$, $V_j$ denote the query, key, and value feature maps of the j-th attention head; their sizes are consistent with the feature map X;
step 3.4: dimension-reducing and reconstructing the feature map;
the selected $K_j$ and $V_j$ are reduced in dimension using the dimension-reduction transforms; the reduction uses a common pooling operation or a down-sampling convolution transform; after reduction, $K_j^*$ and $V_j^*$ both have size D × C, where $D = H^* \times W^*$, and $H^*$ and $W^*$ are respectively the height and width of the dimension-reduced feature map; first $Q_j$ is matrix-multiplied with the transposed $(K_j^*)^T$ and the result is normalized by softmax to obtain the self-attention correlation matrix $B = \mathrm{softmax}(Q_j (K_j^*)^T)$, of size N × D; then B is matrix-multiplied with $V_j^*$ to obtain the reconstructed feature map $X' = B V_j^*$, of size N × C, consistent with the original feature map X; X' and X are added to give the output reconstructed feature map, denoted $X^*$, whose size is consistent with X; finally $X^*$ is reshaped back to the original input feature map shape H × W × C.
CN202110537516.XA (priority and filing date: 2021-05-18) Image synthesis method based on a dynamic self-attention generative adversarial network, Active, granted as CN113379655B (en)

Priority Applications (1)

Application Number: CN202110537516.XA | Priority Date: 2021-05-18 | Filing Date: 2021-05-18 | Title: Image synthesis method based on a dynamic self-attention generative adversarial network
Publications (2)

Publication Number | Publication Date
CN113379655A (en) | 2021-09-10
CN113379655B (en) | 2022-07-29

Family

ID=77571206

Family Applications (1)

Application Number: CN202110537516.XA | Title: Image synthesis method based on a dynamic self-attention generative adversarial network | Priority Date: 2021-05-18 | Filing Date: 2021-05-18

Country Status (1)

Country: CN | Link: CN113379655B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022506B (en) * 2021-11-16 2024-05-17 天津大学 Image restoration method for edge prior fusion multi-head attention mechanism
CN114494814A (en) * 2022-01-27 2022-05-13 北京百度网讯科技有限公司 Attention-based model training method and device and electronic equipment
CN114758145A (en) * 2022-03-08 2022-07-15 深圳集智数字科技有限公司 Image desensitization method and device, electronic equipment and storage medium
CN114677515B (en) * 2022-04-25 2023-05-26 电子科技大学 Weak supervision semantic segmentation method based on similarity between classes


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10997464B2 (en) * 2018-11-09 2021-05-04 Adobe Inc. Digital image layout training using wireframe rendering within a generative adversarial network (GAN) system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969589A (en) * 2019-12-03 2020-04-07 重庆大学 Dynamic scene fuzzy image blind restoration method based on multi-stream attention countermeasure network
CN111429433A (en) * 2020-03-25 2020-07-17 北京工业大学 Multi-exposure image fusion method based on attention generation countermeasure network
CN111476717A (en) * 2020-04-07 2020-07-31 西安电子科技大学 Face image super-resolution reconstruction method based on self-attention generation countermeasure network
CN111583210A (en) * 2020-04-29 2020-08-25 北京小白世纪网络科技有限公司 Automatic breast cancer image identification method based on convolutional neural network model integration
CN111696027A (en) * 2020-05-20 2020-09-22 电子科技大学 Multi-modal image style migration method based on adaptive attention mechanism
CN111798369A (en) * 2020-06-29 2020-10-20 电子科技大学 Face aging image synthesis method for generating confrontation network based on circulation condition
CN112561838A (en) * 2020-12-02 2021-03-26 西安电子科技大学 Image enhancement method based on residual self-attention and generation countermeasure network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Cross-modal Feature Alignment based Hybrid Attentional Generative Adversarial Networks for text-to-image synthesis; Qingrong Cheng et al.; Digital Signal Processing; 2020-09-30; 1-17 *
Missing Data Repairs for Traffic Flow With Self-Attention Generative Adversarial Imputation Net; Weibin Zhang et al.; IEEE Transactions on Intelligent Transportation Systems; 2021-05-04; 1-12 *
Improvement of generative adversarial networks and research on their applications (in Chinese); Wang Bowen; China Master's Theses Full-text Database, Information Science and Technology; 2022-01-15; No. 1; I138-2322 *
Cross-modal PET image synthesis method fusing residual and adversarial networks (in Chinese); Xiao Chenchen et al.; Computer Engineering and Applications; 2021-02-23; Vol. 58, No. 1; 218-223 *

Also Published As

Publication number Publication date
CN113379655A (en) 2021-09-10


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant