CN113298235A - Neural network architecture of multi-branch depth self-attention transformation network and implementation method - Google Patents

Neural network architecture of multi-branch depth self-attention transformation network and implementation method

Info

Publication number
CN113298235A
Authority
CN
China
Prior art keywords
branch
layer
channel
characteristic
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110648214.XA
Other languages
Chinese (zh)
Inventor
李云响 (Li Yunxiang)
王亚奇 (Wang Yaqi)
章一帆 (Zhang Yifan)
夏能 (Xia Neng)
彭睿孜 (Peng Ruizi)
唐凯 (Tang Kai)
俞定国 (Yu Dingguo)
张随雨 (Zhang Suiyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Media and Communications
Original Assignee
Zhejiang University of Media and Communications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Media and Communications filed Critical Zhejiang University of Media and Communications
Priority to CN202110648214.XA priority Critical patent/CN113298235A/en
Publication of CN113298235A publication Critical patent/CN113298235A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a neural network architecture of a multi-branch deep self-attention transformation network and an implementation method. The architecture comprises the first 4 stages of ResNeXt, two branches, and a branch fusion module, where the two branches are a local feature branch and a global feature branch. An input image first passes through the first 4 basic stages of ResNeXt; the feature layers obtained from the two branches are then merged and passed through the branch fusion module to obtain a final feature layer, which is classified by a fully connected layer. For an image, the network architecture attends to the information of the current state of the network in the local feature branch and extracts global information of the network in the global feature branch. The multi-branch structure greatly improves the network's ability to extract information from the image, adding channel weight information in the branch fusion module improves the accuracy of the network, and the simple network structure is easy to customize and modify, increasing robustness on related image tasks.

Description

Neural network architecture of multi-branch depth self-attention transformation network and implementation method
Technical Field
The invention relates to the field of deep learning network architectures, in particular to a neural network architecture of a multi-branch deep self-attention transformation network and an implementation method.
Background
In the past few years, convolutional neural networks (CNNs) have become the predominant machine learning method for various tasks in the field of computer vision, including image recognition, object detection and semantic segmentation. A good neural network framework can generally maintain or even improve the accuracy of the results while reducing computational effort.
The deep learning network architecture is one of the main research topics for solving problems such as image classification, object detection, semantic segmentation and human pose estimation in the computer field. In the present invention, the proposed portable block is part of a deep learning network architecture.
The invention and optimization of deep learning network architectures is one of the research hotspots of current deep learning, with applications in fields such as medical treatment, autonomous vehicles and speech recognition. A good deep learning network framework helps to improve detection accuracy, accelerate network operation, and so on.
Disclosure of Invention
The network architecture provided by the invention fully extracts local information and global information in the network through a multi-branch structure, improving the information extraction capability of the network, and adds a weight representation to the channels of the feature layer to enrich its information expression capability.
The invention is mainly directed at further improving and optimizing the deep learning network architecture model, and provides a neural network architecture of a multi-branch deep self-attention transformation network and an implementation method. The backbone of the network architecture is a ResNeXt network, and the global feature branch and local feature branch of the multi-branch structure effectively optimize the network without damaging the mainstream network architecture, thereby improving it.
A neural network architecture for a multi-branch deep self-attention transformation network, comprising:
a convolution information extraction structure that receives an image;
a local feature extraction branch and a global feature extraction branch that receive the output of the convolution information extraction structure, wherein the local feature extraction branch and the global feature extraction branch are arranged in parallel;
a branch fusion module that receives the outputs of the local feature extraction branch and the global feature extraction branch;
and a fully connected layer connected with the branch fusion module.
In the invention, the convolution information extraction structure adopts the first 4 stages of the ResNeXt network.
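As an illustration, the first 4 stages can be taken from a standard ResNeXt implementation. The sketch below is a minimal example assuming torchvision's resnext50_32x4d as the variant (the patent does not name one) and cutting after layer3, which outputs the 1024-channel feature layer described later.

```python
# A minimal sketch, assuming torchvision's resnext50_32x4d as the ResNeXt
# variant (the patent does not name one). The first 4 stages are taken to be
# the stem plus layer1..layer3, since layer3 outputs the 1024 channels the
# text describes.
import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d

backbone = resnext50_32x4d(weights=None)
first4_stages = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,  # stage 1 (stem)
    backbone.layer1,  # stage 2
    backbone.layer2,  # stage 3
    backbone.layer3,  # stage 4 -> 1024 channels
)

image = torch.randn(1, 3, 512, 512)   # illustrative input size
feature_layer = first4_stages(image)  # -> torch.Size([1, 1024, 32, 32])
print(feature_layer.shape)
```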
The local feature extraction branch comprises grouped convolution single-channel modules. Each grouped convolution single-channel module comprises a plurality of local information extraction units, where each local information extraction unit consists of a 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a 1 × 1 convolutional layer. There are 32 grouped convolution single-channel modules connected in parallel. The grouped convolution single-channel modules effectively reduce the parameter count of the network and give each group a learnable weight. The local feature extraction branch mainly attends to and extracts from the current feature layer, extracting new features from the previous feature layer and building a new feature layer. A sketch of this branch follows.
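A minimal sketch under the description above: 32 parallel single-channel modules, each stacking three 1 × 1 / 3 × 3 / 1 × 1 units, merged with one learnable weight per group. The per-group width of 32 channels follows from the 1024-channel input stated later; batch-normalization and ReLU placement are illustrative assumptions.

```python
# A minimal sketch of the local feature branch: 32 parallel grouped-convolution
# single-channel modules, each stacking three 1x1 -> 3x3 -> 1x1 local
# information extraction units, merged with one learnable weight per group.
import torch
import torch.nn as nn

def extraction_unit(ch: int) -> nn.Sequential:
    """One local information extraction unit: 1x1, 3x3, 1x1 convolutions."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
        nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
        nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch),
    )

class LocalFeatureBranch(nn.Module):
    def __init__(self, in_ch: int = 1024, groups: int = 32, units: int = 3):
        super().__init__()
        ch = in_ch // groups  # 32 channels per group, as stated in the text
        self.groups = groups
        self.paths = nn.ModuleList(
            [nn.Sequential(*[extraction_unit(ch) for _ in range(units)])
             for _ in range(groups)])
        self.weights = nn.Parameter(torch.ones(groups))  # learnable per-group weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = x.chunk(self.groups, dim=1)  # split channels evenly over the groups
        outs = [w * path(c) for w, path, c in zip(self.weights, self.paths, chunks)]
        return torch.cat(outs, dim=1)         # weighted combination back to in_ch

feature_layer = torch.randn(1, 1024, 32, 32)
print(LocalFeatureBranch()(feature_layer).shape)  # -> torch.Size([1, 1024, 32, 32])
```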
The global feature extraction branch comprises:
a downsampling convolutional layer connected with the output end of the convolutional information extraction structure;
a global feature extraction unit connected with the downsampling convolution layer;
and the up-sampling module is connected with the global feature extraction unit.
The global feature extraction unit comprises a plurality of bottleneck depth self-attention transformation modules, each comprising a 1 × 1 convolutional layer, a multi-head self-attention module (MHSA) and a 1 × 1 convolutional layer connected in this order. The bottleneck depth self-attention transformation module can produce a more interpretable model, and each attention head can learn to perform a different task. The global feature branch can model interactions between remote information in the network, improving the network's attention to and extraction of global information.
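A minimal sketch of one such module is given below, with PyTorch's nn.MultiheadAttention standing in for the relative-position MHSA detailed later in this document; the channel widths and head count are illustrative assumptions, and the residual shortcut reflects the statement below that the ResNeXt shortcut structure is maintained.

```python
# A minimal sketch of one bottleneck depth self-attention transformation
# module: 1x1 conv -> MHSA -> 1x1 conv, with a residual shortcut.
# nn.MultiheadAttention is a stand-in for the relative-position MHSA
# described later; widths and head count are assumptions.
import torch
import torch.nn as nn

class BottleneckSelfAttention(nn.Module):
    def __init__(self, ch: int = 1024, mid_ch: int = 512, heads: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(ch, mid_ch, 1)   # first 1x1 convolutional layer
        self.attn = nn.MultiheadAttention(mid_ch, heads, batch_first=True)
        self.expand = nn.Conv2d(mid_ch, ch, 1)   # second 1x1 convolutional layer
        self.norm = nn.BatchNorm2d(ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        y = self.reduce(x)
        tokens = y.flatten(2).transpose(1, 2)          # [B, H*W, mid_ch] sequence
        tokens, _ = self.attn(tokens, tokens, tokens)  # multi-head self-attention
        y = tokens.transpose(1, 2).reshape(b, -1, h, w)
        return torch.relu(self.norm(x + self.expand(y)))  # residual shortcut

x = torch.randn(1, 1024, 16, 16)
print(BottleneckSelfAttention()(x).shape)  # -> torch.Size([1, 1024, 16, 16])
```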
The branch fusion module comprises:
a branch characteristic connection module;
the channel relation learning branch and the reference branch are connected with the branch characteristic connection module;
the channel reweighting module is connected with the channel relation learning branch and the reference branch;
and the channel probability discarding layer is connected with the channel re-weighting module.
The channel relation learning branch comprises a pooling layer, a 1 × 1 convolutional layer, a ReLU layer, a 1 × 1 convolutional layer and a Sigmoid layer connected in sequence. The branch fusion module combines two feature layers into one, and through these layers each channel of the feature layer is given a learnable weight, optimizing the network's attention to the channels of the feature layer.
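Below is a minimal sketch of the channel relation learning branch in that layer order, applied as the channel re-weighting step; the reduction ratio (16 here) is an assumption, since the text fixes only the sequence of layers.

```python
# A minimal sketch of the channel relation learning branch: pooling, 1x1 conv,
# ReLU, 1x1 conv, Sigmoid, followed by channel re-weighting. The reduction
# ratio of 16 is an assumption.
import torch
import torch.nn as nn

class ChannelRelationBranch(nn.Module):
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.learn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # pooling layer -> [B, C, 1, 1]
            nn.Conv2d(ch, ch // reduction, 1),  # 1x1 convolutional layer
            nn.ReLU(inplace=True),              # ReLU layer
            nn.Conv2d(ch // reduction, ch, 1),  # 1x1 convolutional layer
            nn.Sigmoid(),                       # Sigmoid layer -> weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # reference branch: the feature layer itself, re-weighted per channel
        return x * self.learn(x)

fused = torch.randn(1, 4096, 32, 32)
print(ChannelRelationBranch(4096)(fused).shape)  # -> torch.Size([1, 4096, 32, 32])
```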
A method for realizing a neural network architecture of a multi-branch depth self-attention transformation network comprises the following steps:
S1, inputting the image into the first 4 stages of ResNeXt to obtain a feature layer;
S2, down-sampling the feature layer obtained in the step S1, and carrying out a batch normalization operation on the obtained feature layer;
S3, passing the new feature layer obtained in the step S2 through three bottleneck depth self-attention transformation modules, wherein each bottleneck depth self-attention transformation module comprises a 1 × 1 convolutional layer, a multi-head self-attention module and a 1 × 1 convolutional layer;
S4, up-sampling the new feature layer obtained in the step S3 to obtain the feature layer of the global feature branch;
S5, passing the feature layer obtained in the step S1 through the 5th stage of the ResNeXt network, evenly dividing the channels of the feature layer among 32 parallel grouped convolution single-channel modules;
in step S5, since the total number of channels of the feature layer is 1024, the number of channels of each new feature layer is 32.
The grouped convolution single-channel module comprises a plurality of local information extraction units, where each local information extraction unit consists of a 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a 1 × 1 convolutional layer.
S6, weighting and combining the 32 feature layers obtained in the step S5 to obtain the feature layer of the local feature branch.
S7, merging the feature layer of the global feature branch obtained in the step S4 with the feature layer of the local feature branch obtained in the step S6 to obtain a new feature layer.
S8, passing the feature layer obtained in the step S7 through the branch fusion module so that each channel in the feature layer has a different weight, and finally through the channel probability discarding layer (dropout layer) at the end of the module to obtain a new feature layer.
S9, passing the feature layer obtained in the step S8 through a fully connected layer to obtain the result.
In step S2, the convolution kernel used has a step size of 2, a size of 3 × 3, a filling pattern of (1,1) and a kernel count of 1024. The feature layer obtained after this convolution is 32 × 32 × 1024, where 1024 is the number of channels and 32 is the length and the width, respectively.
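A minimal sketch of this down-sampling convolution plus batch normalization follows; the 64 × 64 input spatial size is an illustrative assumption chosen so that the output matches the 32 × 32 × 1024 feature layer above.

```python
# A minimal sketch of the step S2 down-sampling convolution plus batch
# normalization. The 64 x 64 input size is an illustrative assumption.
import torch
import torch.nn as nn

downsample = nn.Sequential(
    nn.Conv2d(1024, 1024, kernel_size=3, stride=2, padding=(1, 1)),
    nn.BatchNorm2d(1024),
)

feat = torch.randn(1, 1024, 64, 64)
print(downsample(feat).shape)  # halves length and width -> torch.Size([1, 1024, 32, 32])
```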
The stage-5 group convolution of the ResNeXt network participates in the multi-branch structure as the local feature branch. A branch with the same network structure, but with the 3 × 3 convolutional layer replaced by a multi-head self-attention layer (MHSA) to form a new bottleneck depth self-attention network, and with down-sampling and up-sampling operations added while everything else remains unchanged, is called the global feature branch. Because the multi-head self-attention branch can extract information across the whole network, the extent of image information extraction is improved compared with the local information extracted by the local feature branch alone.
In step S3, the obtained feature layer is taken as input; a 1 × 1 convolution changes the number of channels, and further 1 × 1 convolutions then produce the query feature layer (q), the key feature layer (k) and the value feature layer (v), respectively.
Firstly, relative position coding is carried out in the two-dimensional space, yielding a relative position coding layer with the same size and channel number as the query, key and value feature layers.
Secondly, the query feature layer and the key feature layer are dot-multiplied to obtain qk^T (k^T is the transpose of k); to prevent the softmax operation from over-amplifying keys with larger values, qk^T is divided by √C. The query feature layer is dot-multiplied with the relative position coding layer to obtain qr^T (r^T is the transpose of r); the two results are added as matrices, and a softmax operation is then performed.
Finally, the resulting feature layer is dot-multiplied with the value feature layer to obtain an output feature layer (z) with the same size as the input feature layer.
Because of the multi-head self-attention mechanism, an input feature layer goes through the above steps several times with different parameters. The resulting multiple z are combined into one feature layer, and a convolution operation keeps the obtained feature layer the same size as the input feature layer.
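A minimal single-head sketch of this computation is given below: q, k and v come from 1 × 1 convolutions, a learned position layer r matches their size, and the output is softmax(qk^T/√C + qr^T)·v with the scaling placed on qk^T as the text specifies. Using one dense learned r, rather than separate height and width embeddings as in Bottleneck Transformer implementations, is a simplifying assumption.

```python
# A minimal single-head sketch of the attention computation described above:
# z = softmax(q.k^T / sqrt(C) + q.r^T) . v, where r is a learned 2D relative
# position coding layer with the same size and channel count as q, k and v.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention2d(nn.Module):
    def __init__(self, ch: int, h: int, w: int):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        # relative position coding layer, same channel count and size as q/k/v
        self.r = nn.Parameter(torch.randn(1, ch, h, w))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)  # [B, C, N], N = H*W positions
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        r = self.r.flatten(2)     # [1, C, N]
        content = torch.einsum('bcn,bcm->bnm', q, k) / math.sqrt(c)  # q.k^T / sqrt(C)
        position = torch.einsum('bcn,xcm->bnm', q, r)                # q.r^T
        attn = F.softmax(content + position, dim=-1)                 # feature key
        z = torch.einsum('bnm,bcm->bcn', attn, v)  # weight the value feature layer
        return z.reshape(b, c, h, w)               # same size as the input

x = torch.randn(2, 512, 32, 32)
print(RelPosSelfAttention2d(512, 32, 32)(x).shape)  # -> torch.Size([2, 512, 32, 32])
```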
In step S3, since the self-attention mechanism cannot itself perform strided down-sampling, a mean pooling layer with a step size of 2 and a size of 2 × 2 is used.
In step S4, the up-sampling uses a bilinear interpolation algorithm so that the new feature layer has the same length and width as the feature layer of the other branch with which it is merged.
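As a minimal sketch of this step, bilinear up-sampling can be done with F.interpolate; the tensor sizes here are illustrative assumptions.

```python
# A minimal sketch of the step S4 bilinear up-sampling; sizes are assumptions.
import torch
import torch.nn.functional as F

global_feat = torch.randn(1, 2048, 16, 16)  # after down-sampling + attention blocks
target_hw = (32, 32)                        # spatial size of the local-branch feature layer
upsampled = F.interpolate(global_feat, size=target_hw,
                          mode='bilinear', align_corners=False)
print(upsampled.shape)  # -> torch.Size([1, 2048, 32, 32])
```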
In steps S2 through S6, the residual (shortcut) structure of the ResNeXt network is still maintained.
In step S8, the branch fusion module includes:
a branch characteristic connection module;
the channel relation learning branch and the reference branch are connected with the branch characteristic connection module;
the channel reweighting module is connected with the channel relation learning branch and the reference branch;
and the channel probability discarding layer is connected with the channel re-weighting module.
The channel relation learning branch comprises a pooling layer, a 1 × 1 convolution layer, a ReLU layer, a 1 × 1 convolution layer and a Sigmoid layer which are sequentially connected.
The branch fusion module comprises: a global average pooling layer with an output size of 1 × 1 and 4096 channels; a fully connected layer with 256 output channels; a ReLU activation function; a fully connected layer with 4096 output channels; and a Sigmoid activation function. These yield the weight of each corresponding channel of the feature layer, and a new feature layer is finally obtained through the dropout layer with a random probability of 0.5.
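The following minimal sketch wires the whole branch fusion module together with the concrete sizes above (concatenation to 4096 channels, fully connected layers 4096 → 256 → 4096 with ReLU then Sigmoid, channel re-weighting, dropout 0.5); treating the concatenated tensor itself as the reference branch for re-weighting is an assumption.

```python
# A minimal sketch of the branch fusion module: branch feature connection,
# global average pooling, FC 4096 -> 256 -> 4096 with ReLU then Sigmoid,
# channel re-weighting, and a dropout layer with probability 0.5.
import torch
import torch.nn as nn

class BranchFusion(nn.Module):
    def __init__(self, ch: int = 4096, hidden: int = 256, p: float = 0.5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling, output 1x1
        self.fc = nn.Sequential(
            nn.Linear(ch, hidden), nn.ReLU(inplace=True),  # 4096 -> 256
            nn.Linear(hidden, ch), nn.Sigmoid(),           # 256 -> 4096 weights
        )
        self.drop = nn.Dropout(p)            # channel probability discarding layer

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([local_feat, global_feat], dim=1)  # branch feature connection
        w = self.fc(self.pool(x).flatten(1))             # per-channel weights
        x = x * w.unsqueeze(-1).unsqueeze(-1)            # channel re-weighting
        return self.drop(x)

local_feat = torch.randn(1, 2048, 32, 32)
global_feat = torch.randn(1, 2048, 32, 32)
print(BranchFusion()(local_feat, global_feat).shape)  # -> torch.Size([1, 4096, 32, 32])
```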
The method comprises the following steps: 1) the first 4 stages of ResNeXt, two branches and a branch fusion module, where the two branches are the local feature branch and the global feature branch, respectively. 2) The local feature branch in the step 1) is built on the group convolution structure in ResNeXt and comprises 32 convolution channels; each convolution channel comprises three identical bottleneck convolution blocks, and each bottleneck convolution block comprises three convolutional layers; the feature layers obtained by the convolution channels are finally weighted and combined to obtain the feature layer of the local feature branch. 3) The global feature branch in the step 1) is constructed on the basis of the Bottleneck Transformer network and comprises a first down-sampling, three bottleneck depth self-attention transformation network blocks, and a final up-sampling, yielding the feature layer of the global feature branch. 4) The branch fusion module in the step 1) is designed on the basis of the Squeeze-and-Excitation block; its structure gives each channel in the feature layer a different weight, and a dropout layer is added at the end of the module to reduce the calculation amount of the network. 5) The image is first input into the first 4 basic stages of ResNeXt in the step 1); the feature layers obtained in the step 2) and the step 3) are then merged, the final feature layer is obtained through the branch fusion module, and classification is performed through the fully connected layer.
Compared with the prior art, the invention has the following advantages:
the invention optimizes the information extraction capability of the network through a multi-branch structure, focuses on global network information and relatively balances local network information, and strengthens the channel information expression of the characteristic layer by adding weight to the channel of the characteristic layer.
The network architecture of the invention focuses on the information of the current state of the network in the local characteristic branch aiming at the image and extracts the global information of the network in the global characteristic branch, the multi-branch structure greatly improves the information extraction capability of the network to the image, the accuracy of the network is improved by adding channel weight information in the branch fusion module, and the simple network structure is easy to self-define and modify, and the robustness to related image tasks is increased.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a diagram of a network architecture according to the present invention;
FIG. 3 is a schematic diagram of the multi-head self-attention mechanism according to the present invention;
fig. 4 is a comparison of the results of the inventive network and other classical networks.
Detailed Description
As shown in fig. 1, an implementation method of a neural network architecture of a multi-branch deep self-attention transformation network includes the following steps:
and S1, inputting the image into the first 4 stages of ResNeXt to obtain the characteristic layer.
S2, the feature layer obtained in step S1 is down-sampled, and the obtained feature layer is subjected to batch normalization.
And S3, passing the new feature layer obtained in the step S2 through three bottleneck depth self-attention transformation network blocks, wherein each bottleneck depth self-attention transformation network block comprises a 1 × 1 convolutional layer, a multi-head self-attention layer and a 1 × 1 convolutional layer.
And S4, upsampling the new feature layer obtained in the step S3 to obtain a feature layer of a global feature branch.
S5, passing the feature layer obtained in step S1 through stage 5 of the resenext network, i.e. dividing the channels of the feature layer into 32 channels, because the total channel number of the feature layer is 1024, the channel number of each new feature layer is 32.
And S6, passing the feature layer obtained in the step S4 through a convolution channel, wherein the convolution channel comprises three identical bottleneck convolution blocks, and each bottleneck convolution block comprises three convolution layers.
And S7, weighting and combining the 32 characteristic layers obtained in the step S5 to obtain the characteristic layer of the local characteristic branch.
And S8, merging the feature layer of the global feature branch obtained in the steps S4 and S7 with the feature layer of the local feature branch to obtain a new feature layer.
And S9, passing the feature layer obtained in the step S8 through a branch fusion module, wherein the branch fusion module comprises a pooling layer, a full connection layer, a ReLU layer, a full connection layer and a sigmoid layer, so that each channel in the feature layer has different weights, and finally passing through a dropout layer at the tail end of the module to obtain a new feature layer.
And S10, passing the characteristic layer obtained in the step S9 through a full connection layer to obtain a result.
In step 2), the convolution kernel used has a step size of 2, a size of 3 × 3, a filling pattern of (1,1) and a kernel count of 1024. The feature layer obtained after this convolution is 32 × 32 × 1024, where 1024 is the number of channels and 32 is the length and the width, respectively.
In step 3), the obtained feature layer is taken as input; a 1 × 1 convolution changes the number of channels, and further 1 × 1 convolutions then produce the query feature layer (q), the key feature layer (k) and the value feature layer (v), respectively.
Firstly, relative position coding is carried out in the two-dimensional space, yielding a relative position coding layer with the same size and channel number as the query, key and value feature layers.
Secondly, the query feature layer and the key feature layer are dot-multiplied to obtain qk^T (k^T is the transpose of k); to prevent the softmax operation from over-amplifying keys with larger values, qk^T is divided by √C. The query feature layer is dot-multiplied with the relative position coding layer to obtain qr^T (r^T is the transpose of r); the two results are added as matrices, and a softmax operation is then performed.
Finally, the resulting feature layer is dot-multiplied with the value feature layer to obtain an output feature layer (z) with the same size as the input feature layer.
Because of the multi-head self-attention mechanism, an input feature layer goes through the above steps several times with different parameters. The resulting multiple z are combined into one feature layer, and a convolution operation keeps the obtained feature layer the same size as the input feature layer.
In step 3), since the self-attention mechanism cannot itself perform strided down-sampling, a mean pooling layer with a step size of 2 and a size of 2 × 2 is used.
The stage-5 group convolution of the ResNeXt network participates in the multi-branch structure as the local feature branch. A branch with the same network structure, but with the 3 × 3 convolutional layer replaced by a multi-head self-attention layer (MHSA) to form a new bottleneck depth self-attention network, and with down-sampling and up-sampling operations added while everything else remains unchanged, is called the global feature branch. Because the multi-head self-attention branch can extract information across the whole network, the extent of image information extraction is improved compared with the local information extracted by the local feature branch alone.
In step 4), the up-sampling uses a bilinear interpolation algorithm so that the new feature layer has the same length and width as the feature layer of the other branch with which it is merged.
In steps 2) to 7), the residual (shortcut) structure of the ResNeXt network is still maintained.
In step 9), the feature layer obtained in step 8) passes through the branch fusion module, which comprises a pooling layer, a fully connected layer, a ReLU layer, a fully connected layer and a Sigmoid layer, so that each channel in the feature layer has a different weight; a new feature layer is finally obtained through the dropout layer at the end of the module.
The branch fusion module comprises: a global average pooling layer with an output size of 1 × 1 and 4096 channels; a fully connected layer with 256 output channels; a ReLU activation function; a fully connected layer with 4096 output channels; and a Sigmoid activation function. These yield the weight of each corresponding channel of the feature layer, and a new feature layer is finally obtained through the dropout layer with a random probability of 0.5.
As shown in fig. 2, the network block is implemented specifically as follows:
1) ResNeXt serves as the basic backbone. ResNeXt usually has 5 block groups; the first 4 block groups are left unchanged, and the multi-branch structure is added at the 5th block group.
2) The multi-branch structure is divided into a local feature branch and a global feature branch. The local feature branch is the 5th block group of ResNeXt, whose group convolution is left unchanged. The size of the feature layer finally output by the 4th block group is (1024, 1024), where the first 1024 is the product of the length and the width of the feature layer and the second 1024 is the number of channels. In the global feature branch, the feature layer is first down-sampled: the convolution kernel used has a step size of 2, a size of 3 × 3, a kernel count of 1024 and a filling mode of (1,1); the size of the new feature layer obtained after convolution is (1024, 1024).
3) The new feature layer is convolved with kernels of step size 1, size 1 × 1 and count 512, and normalized after convolution; the size of the resulting new feature layer is (1024, 512).
4) Groups of feature layers are obtained through 3a convolutions (a is the number of heads) with kernel size 1 × 1, step length 1 and a kernel count of C; each group contains a query feature layer (q), a key feature layer (k) and a value feature layer (v), each of size (1024, C).
Secondly, the query feature layer and the key feature layer are dot-multiplied to obtain qk^T of size 32 × 32; to prevent the softmax operation from over-amplifying keys with larger values, qk^T is divided by √C. The query feature layer is dot-multiplied with the relative position coding layer to obtain qr^T (r^T is the transpose of r); the two are added as matrices to obtain the feature key, and a softmax operation is then performed.
Finally, the resulting feature layer is dot-multiplied with the value feature layer to obtain an output feature layer (z) with the same size as the input feature layer. Since the self-attention mechanism cannot itself perform strided down-sampling, an average pooling layer of size 2 × 2 with a step size of 2 × 2 is used. Since a heads are used, a numerically different feature layers z are generated. The a feature layers are combined into one feature layer with a × 512 channels, and a convolution with 512 kernels reduces the channel number of the feature layer back to 512.
5) The feature layer is convolved with kernels of step size 1, size 1 × 1 and count 2048, and normalized after convolution; the size of the resulting new feature layer is (512, 2048).
6) Steps 3) to 5) form one block, and the operation is repeated 2 more times to obtain a new feature layer.
7) The new feature layer is up-sampled so that it has the same length and width as the feature layer it joins in the original network. Since this cannot be accomplished by an ordinary convolution operation, a bilinear interpolation algorithm is used to fill in the linear relationships among pixels and obtain the new feature layer.
8) The feature layer obtained in step 7) and the feature layer from the local feature branch enter the branch fusion module. They are first merged to obtain a feature layer of size (1024, 4096); a global average pooling layer yields a feature layer of size (1, 4096); a fully connected layer yields a feature layer of size (1, 256); after a ReLU activation function, a fully connected layer yields a feature layer of size (1, 4096), and a Sigmoid activation function then yields the channel weights. Because the oversized feature layer increases the amount of calculation, a dropout layer with a random probability of 0.5 follows, reducing the calculation amount and producing the new feature layer.
9) The new feature layer finally passes through the fully connected layer to obtain the required result.
Fig. 4 shows the results of the network of the present invention compared with other classical networks, where AGMB-Transformer (Anatomy-Guided Multi-Branch Transformer) is the abbreviated name of the multi-branch depth self-attention transformation network. The network models compared are ResNet50, SE-ResNet50, SE-ResNeXt50, InceptionV3, ViT and the model of the invention (AGMB-Transformer). The evaluation criteria are accuracy (ACC), area under the receiver operating characteristic curve (AUC), sensitivity (SEN), specificity (SPC) and F1 score. The training and testing data set is derived from a root canal treatment data set with a total of 245 root canal images. It can be seen from the figure that all scores of the model of the present invention are greater or significantly greater than those of the other classical networks.
The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (10)

1. A neural network architecture for a multi-branch deep self-attention transformation network, comprising:
a convolution information extraction structure that receives an image;
a local feature extraction branch and a global feature extraction branch that receive the output of the convolution information extraction structure, wherein the local feature extraction branch and the global feature extraction branch are arranged in parallel;
a branch fusion module that receives the outputs of the local feature extraction branch and the global feature extraction branch;
and a fully connected layer connected with the branch fusion module.
2. The neural network architecture of the multi-branch deep self-attention transforming network of claim 1, wherein the convolution information extraction structure employs the first 4 stages of the ResNeXt network.
3. The neural network architecture of the multi-branch deep self-attention transforming network according to claim 1, wherein the local feature extraction branch comprises grouped convolution single-channel modules, each grouped convolution single-channel module comprising a plurality of local information extraction units, wherein each local information extraction unit consists of a 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a 1 × 1 convolutional layer.
4. The neural network architecture of the multi-branch deep self-attention transform network of claim 3, wherein the number of the grouped convolution single channel modules is 32 connected in parallel.
5. The neural network architecture of the multi-branch deep self-attention transforming network according to claim 1, wherein the global feature extraction branch comprises:
a downsampling convolutional layer connected with the output end of the convolutional information extraction structure;
a global feature extraction unit connected with the downsampling convolution layer;
and the up-sampling module is connected with the global feature extraction unit.
6. The neural network architecture of the multi-branch deep self-attention transforming network according to claim 5, wherein the global feature extraction unit comprises a plurality of bottleneck depth self-attention transformation modules, each bottleneck depth self-attention transformation module comprising a 1 × 1 convolutional layer, a multi-head self-attention module and a 1 × 1 convolutional layer which are connected in sequence.
7. The neural network architecture of the multi-branch deep self-attention transforming network according to claim 1, wherein the branch fusion module comprises:
a branch characteristic connection module;
the channel relation learning branch and the reference branch are connected with the branch characteristic connection module;
the channel reweighting module is connected with the channel relation learning branch and the reference branch;
and the channel probability discarding layer is connected with the channel re-weighting module.
8. The neural network architecture of the multi-branch deep self-attention transforming network of claim 7, wherein the channel relationship learning branch comprises a pooling layer, a 1 x 1 convolutional layer, a ReLU layer, a 1 x 1 convolutional layer, and a Sigmoid layer, which are connected in sequence.
9. The method for implementing the neural network architecture of the multi-branch deep self-attention transformation network according to any one of claims 1 to 8, comprising the following steps:
S1, inputting the image into the first 4 stages of ResNeXt to obtain a feature layer;
S2, down-sampling the feature layer obtained in the step S1, and carrying out a batch normalization operation on the obtained feature layer;
S3, passing the new feature layer obtained in the step S2 through three bottleneck depth self-attention transformation modules;
S4, up-sampling the new feature layer obtained in the step S3 to obtain the feature layer of the global feature branch;
S5, passing the feature layer obtained in the step S1 through the 5th stage of the ResNeXt network, evenly dividing the channels of the feature layer among 32 parallel grouped convolution single-channel modules;
S6, weighting and combining the 32 feature layers obtained in the step S5 to obtain the feature layer of the local feature branch;
S7, merging the feature layer of the global feature branch obtained in the step S4 with the feature layer of the local feature branch obtained in the step S6 to obtain a new feature layer;
S8, passing the feature layer obtained in the step S7 through the branch fusion module so that each channel in the feature layer has a different weight, and finally through the channel probability discarding layer at the end of the module to obtain a new feature layer;
and S9, passing the feature layer obtained in the step S8 through a fully connected layer to obtain the result.
10. The method according to claim 9, wherein in the step S3, each bottleneck depth self-attention transformation module comprises a 1 × 1 convolutional layer, a multi-head self-attention module and a 1 × 1 convolutional layer;
in the step S5, each grouped convolution single-channel module comprises a plurality of local information extraction units, wherein each local information extraction unit consists of a 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a 1 × 1 convolutional layer;
in step S8, the branch fusion module includes:
a branch characteristic connection module;
the channel relation learning branch and the reference branch are connected with the branch characteristic connection module;
the channel reweighting module is connected with the channel relation learning branch and the reference branch;
a channel probability discarding layer connected to the channel re-weighting module;
the channel relation learning branch comprises a pooling layer, a 1 × 1 convolution layer, a ReLU layer, a 1 × 1 convolution layer and a Sigmoid layer which are sequentially connected.
CN202110648214.XA 2021-06-10 2021-06-10 Neural network architecture of multi-branch depth self-attention transformation network and implementation method Pending CN113298235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110648214.XA CN113298235A (en) 2021-06-10 2021-06-10 Neural network architecture of multi-branch depth self-attention transformation network and implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110648214.XA CN113298235A (en) 2021-06-10 2021-06-10 Neural network architecture of multi-branch depth self-attention transformation network and implementation method

Publications (1)

Publication Number Publication Date
CN113298235A true CN113298235A (en) 2021-08-24

Family

ID=77327903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110648214.XA Pending CN113298235A (en) 2021-06-10 2021-06-10 Neural network architecture of multi-branch depth self-attention transformation network and implementation method

Country Status (1)

Country Link
CN (1) CN113298235A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710826A (en) * 2018-04-13 2018-10-26 燕山大学 A kind of traffic sign deep learning mode identification method
CN109063728A (en) * 2018-06-20 2018-12-21 燕山大学 A kind of fire image deep learning mode identification method
CN112447265A (en) * 2020-11-25 2021-03-05 太原理工大学 Lysine acetylation site prediction method based on modular dense convolutional network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710826A (en) * 2018-04-13 2018-10-26 燕山大学 A kind of traffic sign deep learning mode identification method
CN109063728A (en) * 2018-06-20 2018-12-21 燕山大学 A kind of fire image deep learning mode identification method
CN112447265A (en) * 2020-11-25 2021-03-05 太原理工大学 Lysine acetylation site prediction method based on modular dense convolutional network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHANGQIAN YU ET AL.: "BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation", 《ARXIV:1808.00897》, pages 1 - 17 *
SAINING XIE ET AL.: "Aggregated Residual Transformations for Deep Neural Networks", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, pages 5987 - 5995 *
YUNXIANG LI ET AL.: "AGMB-Transformer: Anatomy-Guided Multi-Branch Transformer Network for Automated Evaluation of Root Canal Therapy", 《IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS 》, vol. 26, no. 4, pages 1684, XP011906040, DOI: 10.1109/JBHI.2021.3129245 *
YUNXIANG LI ET AL.: "Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy", 《ARXIV:2105.00381V1》, pages 1 - 15 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299542A (en) * 2021-12-29 2022-04-08 北京航空航天大学 Video pedestrian re-identification method based on multi-scale feature fusion
CN114092833A (en) * 2022-01-24 2022-02-25 长沙理工大学 Remote sensing image classification method and device, computer equipment and storage medium
WO2024055952A1 (en) * 2022-09-16 2024-03-21 华为技术有限公司 Data processing method and apparatus thereof
CN116704328A (en) * 2023-04-24 2023-09-05 中国科学院空天信息创新研究院 Ground object classification method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111767979B (en) Training method, image processing method and image processing device for neural network
CN112052886B (en) Intelligent human body action posture estimation method and device based on convolutional neural network
Gao et al. Global second-order pooling convolutional networks
CN113298235A (en) Neural network architecture of multi-branch depth self-attention transformation network and implementation method
CN112580694B (en) Small sample image target recognition method and system based on joint attention mechanism
CN111696027A (en) Multi-modal image style migration method based on adaptive attention mechanism
CN111401294B (en) Multi-task face attribute classification method and system based on adaptive feature fusion
CN112613479B (en) Expression recognition method based on light-weight streaming network and attention mechanism
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN112330719A (en) Deep learning target tracking method based on feature map segmentation and adaptive fusion
CN112818764A (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN113516133A (en) Multi-modal image classification method and system
CN110956575A (en) Method and device for converting image style and convolution neural network processor
CN112733716A (en) SROCRN network-based low-resolution text image identification method
CN115482518A (en) Extensible multitask visual perception method for traffic scene
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN114519383A (en) Image target detection method and system
Dogan A new global pooling method for deep neural networks: Global average of top-k max-pooling
CN112801029A (en) Multi-task learning method based on attention mechanism
CN115171052B (en) Crowded crowd attitude estimation method based on high-resolution context network
CN117033985A (en) Motor imagery electroencephalogram classification method based on ResCNN-BiGRU
CN116246110A (en) Image classification method based on improved capsule network
CN116758415A (en) Lightweight pest identification method based on two-dimensional discrete wavelet transformation
CN113688946B (en) Multi-label image recognition method based on spatial correlation
CN115861841A (en) SAR image target detection method combined with lightweight large convolution kernel

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination