CN113537243A - Image classification method based on SE module and self-attention mechanism network

Info

Publication number
CN113537243A
Authority
CN
China
Prior art keywords
patch
module
attention mechanism
output
layer
Prior art date
Legal status
Withdrawn
Application number
CN202110839024.6A
Other languages
Chinese (zh)
Inventor
梁俊雄
肖明
郑坚燚
曾旺旺
廖泽宇
陈俊文
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
2021-07-23 Application filed by Guangdong University of Technology
2021-07-23 Priority to CN202110839024.6A
2021-10-22 Publication of CN113537243A

Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Abstract

The invention discloses an image classification method based on an SE module and a self-attention mechanism network. A picture is first sliced into a plurality of Patch, and Patch0 is added as a classification feature; because the same patches at different positions can correspond to pictures of different categories, position information is also added. A convolutional neural network with an SE module extracts the internal features of each Patch, a self-attention mechanism then extracts the global features between the Patch, and the result is input into a multi-layer perceptron. The SE module, the self-attention mechanism and the multi-layer perceptron are stacked in series into a minimum unit layer, and L minimum unit layers are stacked to extract higher-level global features, giving the picture a richer feature representation. Finally, the Patch0 output vector of each minimum unit layer is taken, the output vectors are given different weights, and the weighted Patch0 vectors are fused, so that higher-level local and global features of the whole picture are extracted; these features are used to classify the picture and obtain its corresponding category label.

Description

Image classification method based on SE module and self-attention mechanism network
Technical Field
The invention relates to the technical field of image recognition and deep learning, in particular to an image classification method based on an SE module and a self-attention mechanism network.
Background
Image classification is a very active research direction in the fields of computer vision, machine learning and deep learning, and is widely applied to face recognition, pedestrian detection, traffic scene object recognition, license plate recognition, automatic album classification and the like.
Image classification is an important basic task in the field of artificial-intelligence computer vision and is also the basis of target detection; the accuracy of image classification influences the performance of subsequent tasks. At present there are support vector machine image classification methods based on machine learning, while mainstream deep learning image classification methods fall into two main categories. One category is based on convolutional neural networks, such as the classical networks AlexNet, VGG, GoogLeNet and ResNet; the other category is based on self-attention mechanisms, such as Vision Transformer and Transformer in Transformer.
The method most similar to the present invention is the Vision Transformer, based on the self-attention mechanism: a whole picture is first sliced into a number of Patch, the global features between the Patch are then extracted by the self-attention mechanism, and the features are passed onward through a multi-layer perceptron. The self-attention mechanism and the multi-layer perceptron are stacked to form an encoder layer, several encoder layers are stacked to form the Vision Transformer architecture, and the Patch0 output of the last encoder layer is input to the softmax layer to obtain the image category prediction.
The invention is also relatively similar to the Transformer in Transformer, based on the self-attention mechanism, which uses the encoder layers of two Transformers to extract the features inside and between the Patch respectively; the two encoder layers form a module, and several modules are stacked to form the Transformer in Transformer architecture.
Disclosure of Invention
The invention aims to utilize the local features of each Patch as well as the global features of the Patch0 of each minimum unit layer, so that more features are used to improve classification accuracy, and provides an image classification method based on an SE module and a self-attention mechanism network.
In order to realize the above purpose of the invention, the following technical scheme is adopted:
an image classification method based on an SE module and a self-attention mechanism network comprises the following steps:
S1: the input picture is converted into a matrix of a specified size, the data is then converted into a tensor data type, and the data is input into the model.
S2: a picture is sliced into a plurality of Patch, Patch0 is added as a classification feature, and position information is added to enrich the feature representation.
S3: the SE module is used to extract the features inside each Patch.
S4: a self-attention mechanism is used to extract the features between the Patch.
S5: the output data of the self-attention mechanism is fed into a two-layer MLP.
S6: the module formed by S3, S4 and S5 is stacked in series as the minimum unit layer of the method, and L minimum unit layers are stacked, so that higher-level local and global features are extracted.
S7: the global features obtained in the above steps are used for classification.
Preferably, step S1 pre-processes the pictures; if the input pictures are few, data enhancement is applied before the pictures are converted into the tensor data type.
Preferably, in step S2 a picture with C × H × W pixels is sliced into N = (H × W)/(H₁ × W₁) Patch, each of size C × H₁ × W₁, and each Patch is then flattened into a 1 × CH₁W₁ vector. The slicing is implemented with convolution and Flatten operations, with input dimension (B, C, H, W) and output dimension (B, N, CH₁W₁), where C is the number of picture channels and B is the Batch Size. In addition, Patch0 is added as a classification feature, i.e. M = N + 1 Patch, so the output dimension of this step is (B, M, CH₁W₁). Position information is added to each Patch (including Patch0), so that the self-attention mechanism can better learn that, even for the same picture, patches at different positions give different classification results.
Preferably, the internal feature extraction of the Patch in step S3 uses a convolutional neural network, so the output dimension (B, M, CH₁W₁) of the previous step is reshaped to (B, M, C, H₁, W₁). To make the input and output width and height the same, 0 padding is used together with CH₁W₁ convolution kernels, and internal Patch features are extracted in CH₁W₁ dimensions, giving CH₁W₁ feature maps of size H₁ × W₁, i.e. dimension (CH₁W₁, H₁, W₁). Global average pooling is then applied to each feature map, giving (CH₁W₁, 1, 1). The first linear layer sets the output dimension to dim = CH₁W₁/β, where β is a scaling factor, with activation function Relu; the formula is expressed as formula (1),

X₂ = Relu(X₁W₁ + b₁)    formula (1)

where W₁ ∈ R^(CH₁W₁ × CH₁W₁/β) and b₁ are trainable parameters, X₁ is the input and X₂ is the output. The second linear layer has input dimension dim = CH₁W₁/β and output dimension dim = CH₁W₁, with activation function softmax, giving the weight of each channel; the formula is expressed as formula (2),

X₃ = softmax(X₂W₂ + b₂)    formula (2)

where W₂ ∈ R^(CH₁W₁/β × CH₁W₁) and b₂ are trainable parameters, X₂ is the input and X₃ is the output. The feature map of each channel is multiplied by its weight and all feature maps are then added, giving a feature map f of size 1 × H₁ × W₁ that contains the fusion of the internal Patch feature information extracted in the CH₁W₁ dimensions; the formula is expressed as formula (3),

f = x₃₁c₁ + x₃₂c₂ + … + x₃ᵢcᵢ,  i = 1, 2, …, CH₁W₁    formula (3)

where x₃ᵢ is an element of X₃ and cᵢ is a feature map extracted inside the Patch by the convolutional neural network. After Flatten, the dimension is raised back to CH₁W₁ through a linear layer. It can be seen that the input and output dimensions of the SE module are both (B, M, CH₁W₁); every picture in the Batch shares the SE module, which reduces the number of parameters.
Preferably, step S4 uses a multi-head self-attention mechanism to extract global features from different dimensions, and the self-attention process can be represented as follows:
First, three tensors Q, K, V are initialized through linear-layer and dimension-conversion operations, with the aim of training these three tensors; their dimensions are all (B, H, M, D/H), where B denotes the Batch Size, H denotes the number of heads of the multi-head attention mechanism, M denotes the number of Patch input to the attention mechanism (including Patch0), and D = CH₁W₁ represents the dimension of each Patch.

W = softmax(QKᵀ / √(D/H))    formula (4)

The dimension of W is therefore (B, H, M, M), where the element in row i, column j of the last two dimensions represents the weight of the i-th Patch with respect to the j-th Patch.

A = WV    formula (5)

The dimension of A is therefore (B, H, M, D/H); A aggregates the feature information of the whole picture, and its dimension is then converted to (B, M, D) and output to the next layer.
Preferably, the two-layer perceptron of step S5 can be expressed by the following formulas:
First layer:

X₅ = σ(X₄W₄ + b₄)    formula (6)

where X₄ is the input, X₅ is the output, σ is the activation function, W₄ ∈ R^(D × D/α) and b₄ is an offset, α is a reduction factor, and W₄ and b₄ are training parameters.
Second layer:

X₆ = X₅W₅ + b₅    formula (7)

where X₅ is the input, X₆ is the output, W₅ ∈ R^(D/α × D) and b₅ is an offset, and W₅ and b₅ are training parameters. The output dimension of this step is also (B, M, D).
Preferably, in step S6 a normalization operation is applied before the input of each module, and each module then adds a Shortcut connection.
Preferably, in step S7 the range of attention paid by the Patch0 of each minimum unit layer is different, so the extracted global information is different; therefore the Patch0 output by the multi-layer perceptron of each minimum unit layer is extracted and recorded as uᵢ ∈ R^(1×D), i = 1, 2, …, L. Then:

kᵢ = exp(uᵢe) / Σⱼ exp(uⱼe),  j = 1, 2, …, L    formula (8)

P = k₁u₁ + k₂u₂ + … + k_L u_L    formula (9)

out = softmax(P)    formula (10)

where kᵢ represents the weight of uᵢ, e ∈ R^(D×1) is a training parameter, and P represents the global features of the Patch0 output of each minimum unit layer fused according to the different weights. Finally, P is input to a softmax layer to obtain the classification confidence, and the class with the highest confidence is taken as the prediction result.
Compared with the prior art, the invention has the following beneficial effects:
Firstly, before the input of the self-attention mechanism, the SE module extracts the internal features of each Patch, so that the Patch vector representation input to the self-attention mechanism is richer; more features are utilized, classification accuracy improves, and the amount of computation is smaller than that of the Transformer in Transformer architecture.
Secondly, the Patch0 output of each minimum unit layer is taken out and assigned a corresponding weight obtained by automatic learning; the Patch0 output of each minimum unit layer is multiplied by its weight and the results are added, so that the global features extracted by every minimum unit layer are utilized, the features input to the softmax layer are richer, and classification accuracy improves.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and the drawings are only for illustrative purposes and should not be construed as limiting the invention.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow diagram of the SE module of the present invention;
FIG. 3 is a flow chart of the minimum cell level of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example:
An image classification method based on an SE module and a self-attention mechanism network is shown in FIG. 1; the model comprises 3 parts. The first part slices a picture into several Patch and adds Patch0 as a classification feature, together with position information. The second part is the minimum unit layer, which includes the SE module to extract the local features inside each Patch, the self-attention mechanism to extract the global features between the Patch, and the multi-layer perceptron; L minimum unit layers are stacked as needed. The third part takes the Patch0 output of each minimum unit layer, gives each a different weight, fuses them, and inputs the result into the softmax layer to obtain the prediction result.
A first part: assume that a color picture has 3 × 224 × 224 pixels and each Patch has 3 × 16 × 16 pixels, giving N = (224 × 224)/(16 × 16) = 196 Patch. The slicing is done with a convolution operation whose parameters are set to: convolution kernel 3 × 16 × 16, step size (16,16), no offset, and number of convolution kernels CH₁W₁ = 768. The input dimension is (B,3,224,224), the feature map dimension obtained through the convolution operation is (B,768,14,14), the dimension after the Flatten operation is (B,768,196), and after exchanging the 1st and 2nd dimensions it is (B,196,768). Here C is the number of picture channels (3 for a color picture), H₁ and W₁ are the height and width of each Patch (16 in this embodiment), and B is the Batch Size. Patch0 is added, i.e. M = N + 1 = 197 Patch; Patch0 begins as an all-0 vector and is trained to become a vector representing global features, so the output dimension of this part is (B,197,768). Position information is added to each Patch (including Patch0), so that the self-attention mechanism can better learn that, even for the same picture, patches at different positions give different classification results.
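As a concrete illustration of this first part, the following is a minimal PyTorch sketch of the slicing, Patch0 and position-information step under the dimensions of this embodiment; the class and attribute names (PatchEmbedding, proj, patch0, pos) are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Slice a 3x224x224 picture into 196 Patch of 3x16x16, prepend Patch0,
    and add learnable position information (dimensions from the embodiment)."""
    def __init__(self, channels=3, img_size=224, patch_size=16):
        super().__init__()
        dim = channels * patch_size * patch_size        # CH1W1 = 768
        num_patches = (img_size // patch_size) ** 2     # N = 196
        # A convolution with kernel = stride = patch size performs the slicing.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size,
                              stride=patch_size, bias=False)
        self.patch0 = nn.Parameter(torch.zeros(1, 1, dim))             # starts as all 0
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # M = N + 1 = 197

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # Flatten, swap dims 1 and 2: (B, 196, 768)
        cls = self.patch0.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend Patch0: (B, 197, 768)
        return x + self.pos                  # add position information
```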
A second part: first the SE module, whose flowchart is shown in FIG. 2. Because a convolutional neural network is used, the output dimension (B,197,768) of the previous step is reshaped into (B,197,3,16,16). The convolution parameters are set as: convolution kernels 3 × 3 × 3, step size (1,1), with bias, 0 padding of one row up and down and one column left and right (so that the input and output width and height are the same), and 768 convolution kernels; internal Patch features are extracted in 768 dimensions, giving 768 feature maps of size 16 × 16, i.e. dimension (768,16,16). Global average pooling is then applied to each feature map, giving (768,1,1). The first linear layer sets the output dimension to dim = 768/16 = 48, where the scaling factor is 16, with activation function Relu; the formula is expressed as formula (1),

X₂ = Relu(X₁W₁ + b₁)    formula (1)

where W₁ ∈ R^(768×48) and b₁ are trainable parameters, X₁ is the input and X₂ is the output. The second linear layer has input dimension dim = 48 and output dimension dim = 768, with activation function softmax, giving the weight of each channel; the formula is expressed as formula (2),

X₃ = softmax(X₂W₂ + b₂)    formula (2)

where W₂ ∈ R^(48×768) and b₂ are trainable parameters, X₂ is the input and X₃ is the output. The feature map of each channel is multiplied by its weight and all feature maps are then added, giving a feature map f of size 1 × 16 × 16 that contains the fusion of the internal Patch feature information extracted in the 768 dimensions; the formula is expressed as formula (3),

f = x₃₁c₁ + x₃₂c₂ + … + x₃ᵢcᵢ,  i = 1, 2, …, 768    formula (3)

where x₃ᵢ is an element of X₃ and cᵢ is a feature map extracted inside the Patch by the convolutional neural network. After Flatten, the dimension is raised back to 768 through a linear layer. It can be seen that the input and output dimensions of the SE module are both (B,197,768); every picture in the Batch shares the SE module, which reduces the number of parameters.
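A minimal PyTorch sketch of this SE module, assuming the dimensions of the embodiment (197 Patch of 3 × 16 × 16, 768 channels, scaling factor 16); the module and layer names are illustrative:

```python
import torch
import torch.nn as nn

class SEPatchModule(nn.Module):
    """SE module of the embodiment: a 3x3 convolution gives 768 feature maps per
    Patch, softmax channel weights fuse them into one 16x16 map f (formula (3)),
    and a final linear layer raises the dimension back to 768."""
    def __init__(self, channels=3, patch_size=16, dim=768, beta=16):
        super().__init__()
        self.channels, self.patch_size, self.dim = channels, patch_size, dim
        self.conv = nn.Conv2d(channels, dim, kernel_size=3, padding=1)  # same H, W
        self.fc1 = nn.Linear(dim, dim // beta)   # 768 -> 48, Relu: formula (1)
        self.fc2 = nn.Linear(dim // beta, dim)   # 48 -> 768, softmax: formula (2)
        self.out = nn.Linear(patch_size * patch_size, dim)  # 256 -> 768 after Flatten

    def forward(self, x):                        # x: (B, M, 768)
        B, M, _ = x.shape
        p = x.reshape(B * M, self.channels, self.patch_size, self.patch_size)
        c = self.conv(p)                         # (B*M, 768, 16, 16)
        s = c.mean(dim=(2, 3))                   # global average pooling: (B*M, 768)
        w = torch.softmax(self.fc2(torch.relu(self.fc1(s))), dim=-1)  # channel weights
        f = (c * w[:, :, None, None]).sum(dim=1)        # formula (3): (B*M, 16, 16)
        return self.out(f.flatten(1)).reshape(B, M, self.dim)  # (B, M, 768)
```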
Then a multi-head self-attention mechanism extracts global features from different dimensions. The number of heads is set to 8, and three tensors Q, K, V are initialized through linear mapping and dimension conversion, each with dimension (B,8,197,96), the aim being to train these three tensors, where B denotes the Batch Size. Through formula (4),

W = softmax(QKᵀ / √96)    formula (4)

a weight tensor W of dimension (B,8,197,197) is obtained, where the element in row i, column j of the last two dimensions represents the weight of the i-th Patch with respect to the j-th Patch.

A = WV    formula (5)

The dimension of A is (B,8,197,96); A aggregates the local and global features of the whole picture, and its dimension is then converted to (B,197,768) and output to the next layer.
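A minimal PyTorch sketch of this multi-head self-attention step; computing Q, K, V with a single fused linear layer and adding an output projection are implementation assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention over the 197 Patch vectors: 8 heads of 96
    dimensions, W = softmax(QK^T / sqrt(96)) (formula (4)), A = WV (formula (5))."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads   # 8, 96
        self.qkv = nn.Linear(dim, dim * 3, bias=False)    # initialises Q, K, V
        self.proj = nn.Linear(dim, dim)                   # maps A back to (B, M, D)

    def forward(self, x):                                 # x: (B, 197, 768)
        B, M, D = x.shape
        q, k, v = (self.qkv(x)
                   .view(B, M, 3, self.heads, self.head_dim)
                   .permute(2, 0, 3, 1, 4))               # each: (B, 8, 197, 96)
        w = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5,
                          dim=-1)                         # formula (4): (B, 8, 197, 197)
        a = w @ v                                         # formula (5): (B, 8, 197, 96)
        return self.proj(a.transpose(1, 2).reshape(B, M, D))  # (B, 197, 768)
```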
Then a two-layer multi-layer perceptron follows. First layer:

X₅ = σ(X₄W₄ + b₄)    formula (6)

where X₄ is the input, X₅ is the output, σ is the activation function, W₄ ∈ R^(768×48), b₄ is an offset, the reduction factor is set to 16, and W₄ and b₄ are training parameters. Second layer:

X₆ = X₅W₅ + b₅    formula (7)

where X₅ is the input, X₆ is the output, W₅ ∈ R^(48×768), b₅ is an offset, and W₅ and b₅ are training parameters. The output dimension of this step is also (B,197,768).
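A minimal PyTorch sketch of the two-layer perceptron; the patent does not name the activation function of the first layer, so GELU is assumed here:

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerMLP(nn.Module):
    """Two-layer perceptron with reduction factor 16: 768 -> 48 -> 768
    (formulas (6) and (7)); the GELU activation is an assumption."""
    def __init__(self, dim=768, alpha=16):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // alpha)   # W4: 768 -> 48, formula (6)
        self.fc2 = nn.Linear(dim // alpha, dim)   # W5: 48 -> 768, formula (7)

    def forward(self, x):                         # x: (B, 197, 768)
        return self.fc2(F.gelu(self.fc1(x)))      # output: (B, 197, 768)
```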
As shown in FIG. 3, a normalization operation is applied before the input to the SE module, the self-attention mechanism and the multi-layer perceptron, and a Shortcut connection is then added around each; their serial stacking constitutes the minimum unit layer provided by the invention, and different numbers of layers can be stacked according to requirements.
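Composing the three sketches above gives one minimum unit layer; pre-normalization with LayerNorm is an assumption, since the patent only specifies "a normalization operation" and a Shortcut connection:

```python
import torch.nn as nn

class MinimumUnitLayer(nn.Module):
    """One minimum unit layer: normalization before each sub-module and a
    Shortcut (residual) connection around each, stacked SE -> attention -> MLP."""
    def __init__(self, dim=768):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.se = SEPatchModule(dim=dim)             # from the SE sketch above
        self.attn = MultiHeadSelfAttention(dim=dim)  # from the attention sketch above
        self.mlp = TwoLayerMLP(dim=dim)              # from the MLP sketch above

    def forward(self, x):                            # x: (B, 197, 768)
        x = x + self.se(self.norm1(x))               # SE module with Shortcut
        x = x + self.attn(self.norm2(x))             # self-attention with Shortcut
        x = x + self.mlp(self.norm3(x))              # perceptron with Shortcut
        return x
```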
And a third part: because the attention range of the Patch0 of each minimum unit layer is different, the extracted global features are different, and these features need to be fused before use. 6 minimum unit layers are stacked, and the Patch0 output by the multi-layer perceptron of each minimum unit layer is extracted and recorded as uᵢ ∈ R^(1×768), i = 1, 2, …, 6; the weights are obtained with a softmax function, where a higher weight indicates that the global features of that minimum unit layer are more important. The formulas are as follows:

kᵢ = exp(uᵢe) / Σⱼ exp(uⱼe),  j = 1, 2, …, 6    formula (8)

P = k₁u₁ + k₂u₂ + k₃u₃ + k₄u₄ + k₅u₅ + k₆u₆    formula (9)

out = softmax(P)    formula (10)

where kᵢ represents the weight of uᵢ, e ∈ R^(768×1) is a training parameter, and P represents the global features of the Patch0 output of each minimum unit layer fused according to the different weights. Finally, P is input to a softmax layer to obtain the classification confidence, and the class with the highest confidence is taken as the prediction result.
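A minimal PyTorch sketch of this third part; the linear classifier ahead of the final softmax and the class count are assumptions, since the patent only states that P is input to a softmax layer:

```python
import torch
import torch.nn as nn

class WeightedPatch0Head(nn.Module):
    """Weight the Patch0 of each minimum unit layer with softmax(u_i e)
    (formula (8)), fuse them into P (formula (9)), and classify (formula (10))."""
    def __init__(self, dim=768, num_classes=10):        # class count is an assumption
        super().__init__()
        self.e = nn.Parameter(torch.randn(dim, 1) * 0.02)  # e in R^(768x1)
        self.classifier = nn.Linear(dim, num_classes)      # assumed linear layer

    def forward(self, patch0_list):                 # list of L tensors of shape (B, 768)
        u = torch.stack(patch0_list, dim=1)         # (B, L, 768)
        k = torch.softmax(u @ self.e, dim=1)        # formula (8): (B, L, 1)
        p = (k * u).sum(dim=1)                      # formula (9): fused feature P, (B, 768)
        return torch.softmax(self.classifier(p), dim=-1)  # formula (10): confidences

# Hypothetical end-to-end use of the sketches above:
embed = PatchEmbedding()
layers = nn.ModuleList(MinimumUnitLayer() for _ in range(6))
head = WeightedPatch0Head()
x = embed(torch.randn(2, 3, 224, 224))              # (2, 197, 768)
patch0s = []
for layer in layers:
    x = layer(x)
    patch0s.append(x[:, 0])                         # Patch0 after each minimum unit layer
probs = head(patch0s)                               # (2, 10) class confidences
```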
The above embodiment is merely illustrative, serving to clearly explain the present invention, and is not intended to limit the embodiments of the invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (8)

1. An image classification method based on an SE module and a self-attention mechanism network is characterized by comprising the following steps:
S1: the input picture is converted into a matrix of a specified size, the data is then converted into a tensor data type, and the data is input into the model.
S2: a picture is sliced into a plurality of Patch, Patch0 is added as a classification feature, and position information is added to enrich the feature representation.
S3: the SE module is used to extract the features inside each Patch.
S4: a self-attention mechanism is used to extract the features between the Patch.
S5: the output data of the self-attention mechanism is fed into a two-layer MLP (multi-layer perceptron).
S6: the module formed by S3, S4 and S5 is stacked in series as the minimum unit layer of the method, and L minimum unit layers are stacked, so that higher-level local and global features are extracted.
S7: the global features obtained in the above steps are used for classification.
2. The image classification method based on the SE module and the self-attention mechanism network as claimed in claim 1, wherein step S1 pre-processes the pictures; if the input pictures are few, data enhancement is applied before the pictures are converted into the tensor data type.
3. The image classification method based on the SE module and the self-attention mechanism network as claimed in claim 1, wherein in step S2 a picture with C × H × W pixels is sliced into N = (H × W)/(H₁ × W₁) Patch, each of size C × H₁ × W₁, and each Patch is then flattened into a 1 × CH₁W₁ vector; the slicing is implemented with convolution and Flatten operations, with input dimension (B, C, H, W) and output dimension (B, N, CH₁W₁), where C is the number of picture channels and B is the Batch Size; in addition, Patch0 is added as a classification feature, i.e. M = N + 1 Patch, so the output dimension of this step is (B, M, CH₁W₁); position information is added to each Patch (including Patch0), so that the self-attention mechanism can better learn that, even for the same picture, patches at different positions give different classification results.
4. The image classification method based on the SE module and the self-attention mechanism network as claimed in claim 1, wherein the Patch internal feature extraction in step S3 adopts a convolutional neural network, so the output dimension (B, M, CH₁W₁) of the previous step is reshaped to (B, M, C, H₁, W₁); to make the input and output width and height the same, 0 padding is used together with CH₁W₁ convolution kernels, and internal Patch features are extracted in CH₁W₁ dimensions, giving CH₁W₁ feature maps of size H₁ × W₁, i.e. dimension (CH₁W₁, H₁, W₁); global average pooling is then applied to each feature map, giving (CH₁W₁, 1, 1); the first linear layer sets the output dimension to dim = CH₁W₁/β, where β is a scaling factor, with activation function Relu, the formula being expressed as formula (1),

X₂ = Relu(X₁W₁ + b₁)    formula (1)

where W₁ ∈ R^(CH₁W₁ × CH₁W₁/β) and b₁ are trainable parameters, X₁ is the input and X₂ is the output; the second linear layer has input dimension dim = CH₁W₁/β and output dimension dim = CH₁W₁, with activation function softmax, giving the weight of each channel, the formula being expressed as formula (2),

X₃ = softmax(X₂W₂ + b₂)    formula (2)

where W₂ ∈ R^(CH₁W₁/β × CH₁W₁) and b₂ are trainable parameters, X₂ is the input and X₃ is the output; the feature map of each channel is multiplied by its weight and all feature maps are then added, giving a feature map f of size 1 × H₁ × W₁ that contains the fusion of the internal Patch feature information extracted in the CH₁W₁ dimensions, the formula being expressed as formula (3),

f = x₃₁c₁ + x₃₂c₂ + … + x₃ᵢcᵢ,  i = 1, 2, …, CH₁W₁    formula (3)

where x₃ᵢ is an element of X₃ and cᵢ is a feature map extracted inside the Patch by the convolutional neural network; after Flatten, the dimension is raised back to CH₁W₁ through a linear layer; the input and output dimensions of the SE module are therefore both (B, M, CH₁W₁), and every picture in the Batch shares the SE module, which reduces the number of parameters.
5. The image classification method based on the SE module and the self-attention mechanism network as claimed in claim 1, wherein step S4 uses a multi-head self-attention mechanism to extract global features from different dimensions, and the self-attention process can be represented as follows:
first, three tensors Q, K, V are initialized through linear-layer and dimension-conversion operations, with the aim of training these three tensors; their dimensions are all (B, H, M, D/H), where B denotes the Batch Size, H denotes the number of heads of the multi-head attention mechanism, M denotes the number of Patch input to the attention mechanism (including Patch0), and D = CH₁W₁ represents the dimension of each Patch;

W = softmax(QKᵀ / √(D/H))    formula (4)

the dimension of W is therefore (B, H, M, M), where the element in row i, column j of the last two dimensions represents the weight of the i-th Patch with respect to the j-th Patch;

A = WV    formula (5)

the dimension of A is therefore (B, H, M, D/H); A aggregates the feature information of the whole picture, and its dimension is then converted to (B, M, D) and output to the next layer.
6. The image classification method based on the SE module and the self-attention mechanism network as claimed in claim 1, wherein the two-layer perceptron of step S5 can be expressed by the following formulas:
first layer:

X₅ = σ(X₄W₄ + b₄)    formula (6)

where X₄ is the input, X₅ is the output, σ is the activation function, W₄ ∈ R^(D × D/α) and b₄ is an offset, α is a reduction factor, and W₄ and b₄ are training parameters;
second layer:

X₆ = X₅W₅ + b₅    formula (7)

where X₅ is the input, X₆ is the output, W₅ ∈ R^(D/α × D) and b₅ is an offset, and W₅ and b₅ are training parameters; the output dimension of this step is also (B, M, D).
7. The image classification method based on the SE module and the self-attention mechanism network as claimed in claim 1, wherein in step S6 a normalization operation is applied before the input of each module, and each module then adds a Shortcut connection.
8. The image classification method based on the SE module and the self-attention mechanism network as claimed in claim 1, wherein in step S7 the range of attention of the Patch0 of each minimum unit layer is different, so the extracted global information is different; therefore the Patch0 output by the multi-layer perceptron of each minimum unit layer is extracted and recorded as uᵢ ∈ R^(1×D), i = 1, 2, …, L; then

kᵢ = exp(uᵢe) / Σⱼ exp(uⱼe),  j = 1, 2, …, L    formula (8)

P = k₁u₁ + k₂u₂ + … + k_L u_L    formula (9)

out = softmax(P)    formula (10)

where kᵢ represents the weight of uᵢ, e ∈ R^(D×1) is a training parameter, and P represents the global features of the Patch0 output of each minimum unit layer fused according to the different weights; finally, P is input to a softmax layer to obtain the classification confidence, and the class with the highest confidence is taken as the prediction result.
CN202110839024.6A 2021-07-23 2021-07-23 Image classification method based on SE module and self-attention mechanism network Withdrawn CN113537243A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110839024.6A CN113537243A (en) 2021-07-23 2021-07-23 Image classification method based on SE module and self-attention mechanism network


Publications (1)

Publication Number Publication Date
CN113537243A 2021-10-22

Family

ID=78089425


Country Status (1)

Country Link
CN (1) CN113537243A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113951866A (en) * 2021-10-28 2022-01-21 北京深睿博联科技有限责任公司 Deep learning-based uterine fibroid diagnosis method and device
CN114998653A (en) * 2022-05-24 2022-09-02 电子科技大学 ViT network-based small sample remote sensing image classification method, medium and equipment
CN114998653B (en) * 2022-05-24 2024-04-26 电子科技大学 ViT network-based small sample remote sensing image classification method, medium and equipment
CN117173562A (en) * 2023-08-23 2023-12-05 哈尔滨工程大学 SAR image ship identification method based on latent layer diffusion model technology
CN117173562B (en) * 2023-08-23 2024-06-04 哈尔滨工程大学 SAR image ship identification method based on latent layer diffusion model technology
CN117496225A (en) * 2023-10-17 2024-02-02 南昌大学 Image data evidence obtaining method and system
CN117496225B (en) * 2023-10-17 2024-09-06 南昌大学 Image data evidence obtaining method and system
CN117746209A (en) * 2023-12-13 2024-03-22 山东浪潮超高清智能科技有限公司 Image recognition method and device based on efficient multi-type convolution aggregation convolution

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20211022