CN113408577A - Image classification method based on attention mechanism - Google Patents

Image classification method based on attention mechanism

Info

Publication number
CN113408577A
Authority
CN
China
Prior art keywords
channel
attention
attention mechanism
feature map
image classification
Prior art date
Legal status
Pending
Application number
CN202110517855.1A
Other languages
Chinese (zh)
Inventor
徐智
宁文昌
李智
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202110517855.1A
Publication of CN113408577A

Classifications

    • G06F18/241 Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/048 Neural networks: activation functions
    • G06N3/08 Neural networks: learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image data processing and discloses an image classification method based on an attention mechanism. The method performs frequency decomposition on each channel of a feature map using the discrete cosine transform, represents the channel's global information jointly with several frequency components, and then computes channel attention weight information; each channel of the feature map is weighted with this information to obtain a channel attention mechanism. A spatial attention weight is then computed for each pixel of the feature map, and the spatial pixels of the feature map are weighted and summed to obtain a spatial attention mechanism. The channel attention mechanism and the spatial attention mechanism are embedded into ResNet to obtain an image classification convolutional neural network, which is then trained. By combining multiple frequency components in the channel attention, the invention represents the channel's global information better; in the spatial attention, a self-attention mechanism acquires global information along the spatial dimension of the feature map, yielding a spatial weight distribution superior to spatial attention implemented with conventional convolutions.

Description

Image classification method based on attention mechanism
Technical Field
The invention relates to the field of image data processing, in particular to an attention mechanism-based image classification method.
Background
The invention designs an image classification method based on a convolutional neural network and embeds a novel attention mechanism in it. The backbone of the convolutional neural network is implemented with a residual network, and the attention mechanism comprises channel attention and spatial attention, so the related background art mainly comprises three items: the residual network, the channel attention mechanism, and the spatial attention mechanism.
A residual network is a neural network characterized by shortcut connections, where a shortcut connection adds the outputs of layers at different depths and uses the sum as the input of a subsequent layer. This connection pattern makes it easier for the network to fit complex functions and also allows identity mappings to be realized, so that performance does not degrade as depth increases and deeper network structures can be trained. Because residual networks have good feature-extraction ability, many deep-learning tasks, such as object detection, image classification and video understanding, use them as the backbone network for feature extraction. The residual network used in the present invention is ResNet.
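The shortcut connection described above can be sketched in a few lines (a minimal illustration with a toy linear layer, not the patent's actual network):

```python
import numpy as np

def layer(x, w):
    """A stand-in for a learned layer: linear map followed by ReLU."""
    return np.maximum(w @ x, 0.0)

def residual_block(x, w1, w2):
    """Shortcut connection: the block outputs F(x) + x, so the
    layers only have to learn the residual F(x)."""
    out = layer(x, w1)
    out = w2 @ out                   # second layer, no activation yet
    return np.maximum(out + x, 0.0)  # add the shortcut, then activate

# If both weight matrices are zero, the block reduces to an identity
# mapping (for non-negative inputs), so adding depth cannot hurt.
x = np.array([1.0, 2.0, 3.0])
w_zero = np.zeros((3, 3))
print(residual_block(x, w_zero, w_zero))  # → [1. 2. 3.]
```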
The feature map output by each layer of a convolutional neural network comprises a plurality of channels, and each channel captures one visual feature of the input image. For many deep-learning tasks, including image classification, different visual features in the input image contribute differently to the task. If a convolutional neural network can give more attention to important features, it can handle more complex learning tasks with limited network capacity. The channel attention mechanism gives different weights to different channels of the feature map, thereby realizing different degrees of attention to different visual features. The mainstream channel attention mechanism generally calculates the global information of each channel, models the importance of each channel based on this global information, computes weights for the channels according to their importance, and finally weights the channels, thereby highlighting important features and suppressing irrelevant ones.
The feature map output by each layer of a convolutional neural network also contains spatial information: each "pixel" of the feature map corresponds to a region of the input image. When a visual feature appears in some region of the input image, a large activation value appears at the corresponding "pixel" of the corresponding channel of the feature map, so different spatial positions of the feature map reflect features at different spatial positions of the input image. As with channel attention, features at different spatial locations of the input image have different importance to the learning task, and if the convolutional neural network can give more attention to important regions of the image, it can handle more complex learning tasks with limited network capacity. The spatial attention mechanism gives different weights to different spatial positions of the feature map, thereby giving different degrees of attention to different regions of the input image. The mainstream spatial attention mechanism generally computes the global information of each spatial position of the feature map along the channel dimension, uses an additional convolutional layer to generate a spatial attention distribution map in which each pixel represents the weight of one spatial position, and finally weights the different spatial positions of the feature map with this distribution map, thereby highlighting the features of important regions of the image and weakening those of irrelevant regions.
There are many image classification methods to improve the classification effect of the model by embedding the channel attention mechanism and the spatial attention mechanism into the neural network. In the existing channel attention mechanism, a common method for extracting global information is global average pooling or global maximum pooling, but both methods have information loss and cannot sufficiently extract the global information of one channel, so that a weight distribution scheme of the channel attention mechanism is not optimal, and the expression capability of the features extracted by the convolutional neural network is limited.
In the existing spatial attention mechanism, a common convolutional layer is commonly used for calculating spatial attention distribution, but the spatial attention distribution is limited by the size of a convolutional kernel, and global information on a spatial dimension cannot be extracted, so that a weight distribution scheme of the spatial attention mechanism is not globally optimal, and the expression capability of features extracted by a convolutional neural network is also limited.
Disclosure of Invention
The invention aims to provide an attention-mechanism-based image classification method that designs better global-information representation methods for the channel attention mechanism and the spatial attention mechanism respectively, embeds the two attention mechanisms into ResNet simultaneously to improve the image classification performance of ResNet, and balances the improvement of classification performance against the increase in computation by optimizing the way the attention mechanisms are embedded.
In order to achieve the above object, the present invention provides an attention-based image classification method, including: carrying out frequency decomposition on each channel of the characteristic diagram based on discrete cosine transform to obtain a plurality of frequency components, and jointly representing channel global information by using the frequency components;
calculating channel attention weight information based on the channel global information, and weighting each channel of the feature map based on the weight information to obtain a channel attention mechanism;
calculating a spatial attention weight of each pixel of the feature map based on a self-attention mechanism, and weighting and summing the spatial pixels of the feature map to obtain a spatial attention mechanism;
embedding a channel attention mechanism and a space attention mechanism into ResNet to obtain an image classification convolutional neural network, and training the image classification convolutional neural network.
The specific steps of performing frequency decomposition on each channel of the feature map based on the discrete cosine transform to obtain a plurality of frequency components, and jointly representing the channel's global information with the plurality of frequency components, are as follows:
calculating two-dimensional discrete cosine transform for each channel of the characteristic diagram to obtain a plurality of frequency components;
the 3 frequency components are selected to be spliced into a vector.
The specific steps of calculating the channel attention weight information based on the channel global information and weighting each channel of the feature map based on the weight information to obtain the channel attention mechanism are as follows:
reducing the dimension of the vector by using one-dimensional convolution;
performing dimension reduction on the vector obtained by the one-dimensional convolution again by using the full-connection layer;
processing the vectors subjected to dimension reduction of the full-connection layer by a nonlinear activation function;
the vector output by the nonlinear activation function is raised, through a fully-connected layer, to a dimensionality equal to the number of feature-map channels, and is normalized with the sigmoid function, so as to obtain the channel attention distribution;
and weighting the characteristic diagram according to the channel attention distribution to obtain the output of the channel attention module.
The specific steps of calculating a spatial attention weight for each pixel of the feature map based on the self-attention mechanism, and then weighting and summing the spatial pixels of the feature map to obtain the spatial attention mechanism, are as follows:
calculating three vectors of query, key and value for each pixel of the feature map;
traversing each pixel of the input feature map, and calculating the correlation between each query vector and the key vectors of all pixels of the input feature map to obtain a correlation distribution map;
and carrying out weighted summation on the value vectors of all the pixels of the input feature map based on the correlation distribution map to obtain the pixel value at the corresponding position in the output feature map.
The query is a query vector and represents information related to a learning task, the key is a key vector and represents the attribute of the pixel, and the value is a value vector and represents the feature representation of the pixel.
Embedding a channel attention mechanism and a space attention mechanism into ResNet to obtain an image classification convolutional neural network, and training the image classification convolutional neural network specifically comprises the following steps:
embedding channel attention into the shallow building-block groups of the network (conv2_x, conv3_x, conv4_x) and spatial attention into the deep building-block group (conv5_x);
the channel attention is connected after the convolution module of the residual block, and the spatial attention replaces the 3×3 convolutional layer in the convolution module of the residual block, so as to obtain the image classification convolutional neural network;
and training the image classification convolutional neural network.
The invention discloses an attention mechanism-based image classification method, which comprises the following steps: carrying out frequency decomposition on each channel of the characteristic diagram based on discrete cosine transform to obtain a plurality of frequency components, and jointly representing channel global information by using the frequency components; calculating channel attention weight information based on the channel global information, and weighting each channel of the feature map based on the weight information to obtain a channel attention mechanism; calculating a spatial attention weight of each pixel of the feature map based on a self-attention mechanism, and weighting and summing the spatial pixels of the feature map to obtain a spatial attention mechanism; embedding a channel attention mechanism and a space attention mechanism into ResNet to obtain an image classification convolutional neural network, and training the image classification convolutional neural network.
Thereby having the following advantages:
1. in the channel attention, a plurality of frequency components are obtained through discrete cosine transform, and due to complementarity among the frequency components, the global information of the channel can be better represented by combining the frequency components;
2. in spatial attention, a self-attention mechanism is employed to obtain global information in the feature map spatial dimension. Since each output neuron of the self-attention mechanism has a global receptive field, a spatial weight distribution that is superior to the spatial attention of conventional convolution implementations can be obtained.
3. Channel attention and spatial attention are embedded into the shallow and deep layers of the convolutional neural network, respectively. Because the shallow layers of the network have few channels and the deep layers have small spatial dimensions, embedding the two attention mechanisms in this way does not add much extra computation, while the network can still exploit the advantages of both mechanisms, improving its image classification performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a channel attention calculation method of the present invention;
FIG. 2 is a block diagram of the spatial attention of the present invention;
FIG. 3 is a graph of the comparison between the residual block of the present invention after embedding the attention module and the ResNet original residual block;
FIG. 4 is a schematic illustration of the embedded position of the channel attention and spatial attention in ResNet of the present invention;
FIG. 5 is a flow chart of an attention-based image classification method of the present invention;
FIG. 6 is a flowchart of the present invention, in which each channel of the feature map is frequency-decomposed based on discrete cosine transform to obtain a plurality of frequency components, and the global information of the channel is jointly represented by the plurality of frequency components;
FIG. 7 is a flowchart of the present invention for computing channel attention weight information based on channel global information, and weighting each channel of a feature map based on the weight information to obtain a channel attention mechanism;
FIG. 8 is a flow chart of computing a spatial attention weight for each pixel of the feature map based on a self-attention mechanism and then weighting and summing the spatial pixels of the feature map to obtain a spatial attention mechanism according to the present invention;
FIG. 9 is a flow chart of embedding a channel attention mechanism and a spatial attention mechanism into ResNet to obtain an image classification convolutional neural network and training the image classification convolutional neural network according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Referring to fig. 1 to 9, the present invention provides an image classification method based on attention mechanism, including:
s101, carrying out frequency decomposition on each channel of the characteristic diagram based on discrete cosine transform to obtain a plurality of frequency components, and jointly representing channel global information by using the frequency components;
the method comprises the following specific steps:
s201, calculating two-dimensional discrete cosine transform for each channel of the characteristic diagram to obtain a plurality of frequency components;
in order to calculate the attention of the channels, global information in one channel needs to be acquired firstly, and the invention performs discrete cosine transform on each channel of the characteristic diagram and then jointly represents the global information of one channel by using a plurality of frequency components. The two-dimensional discrete cosine transform can be written as:
Figure BDA0003062452780000051
wherein F represents a feature map, C, W, H represents the number of channels, width and height of the feature map, respectively, Fk(i, j) is the ith, j position in the kth channel of the feature map,
Figure BDA0003062452780000052
then represents channel FkThe h, w components in the spectrum of the discrete cosine transform of (1).
In existing channel attention mechanisms, global average pooling is commonly used to obtain the global information of a channel. Global average pooling is defined as:

$$\mathrm{GAP}(F_k) = \frac{1}{HW}\sum_{i=0}^{H-1}\sum_{j=0}^{W-1} F_k(i,j) \tag{2}$$

Combining equations (1) and (2), the lowest frequency component of the discrete cosine transform (the $(h,w)=(0,0)$ component, written $X_k^{0,0}$) is:

$$X_k^{0,0} = \sum_{i=0}^{H-1}\sum_{j=0}^{W-1} F_k(i,j) = HW \cdot \mathrm{GAP}(F_k) \tag{3}$$
as can be seen from equation (3), the lowest frequency component of the discrete cosine transform is proportional to the global average pooling result, which means that the global information extracted from each channel of the feature map by the existing channel attention mechanism is only the lowest frequency component of the channel.
S202, selecting 3 frequency components to splice into a vector;
in calculating the channel attention, the optimal result will be theoretically obtained if all frequency components are taken into account. But the number of frequency components resulting from the discrete cosine transform is the same as the number of dimensions of the original signal, i.e. for one F e RC×H×WThe obtained frequency components will have the number of C × H × W, the calculation of all frequency components will make the calculation complexity too high, and many frequency components are small in the signal and can be ignored when calculating the attention of the channel.
And because the low-frequency component of the image contributes more to the classification than the high-frequency component, when the method is used for calculating the attention of channels, each channel only uses the component with lower frequency in discrete cosine transform
Figure BDA0003062452780000063
And
Figure BDA0003062452780000064
the channel attention calculation method is shown in fig. 1. Feature F is the input to the attention module and feature a is the output of the attention module. The channel attention is divided into two steps of global information extraction and attention distribution calculation. When global information is extracted, the k channel of the feature map is calculated according to the formula (1)
Figure BDA0003062452780000065
And
Figure BDA0003062452780000066
three frequency components, and the three frequency components of all channels are combined into three vectors, which are denoted as T0,0、T0,1And T1,0Then, againWill T0,0、T0,1And T1,0And (5) splicing to obtain the output with the dimension of 3 × C.
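The global-information extraction step can be sketched as follows (an illustrative implementation of equation (1) restricted to the three components used by the method; the function names are hypothetical):

```python
import numpy as np

def dct2_component(F, h, w):
    """Equation (1) applied to every channel of F (shape C×H×W) at once."""
    C, H, W = F.shape
    i = np.arange(H)[:, None]
    j = np.arange(W)[None, :]
    basis = (np.cos(np.pi * h / H * (i + 0.5)) *
             np.cos(np.pi * w / W * (j + 0.5)))
    return (F * basis).sum(axis=(1, 2))   # one scalar per channel -> length C

def channel_global_info(F):
    """Concatenate the three lowest-frequency components of every channel."""
    T00 = dct2_component(F, 0, 0)   # vector T^{0,0}, length C
    T01 = dct2_component(F, 0, 1)   # vector T^{0,1}
    T10 = dct2_component(F, 1, 0)   # vector T^{1,0}
    return np.concatenate([T00, T01, T10])  # dimension 3*C

F = np.random.default_rng(1).standard_normal((16, 8, 8))
print(channel_global_info(F).shape)  # → (48,)
```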
S102, calculating channel attention weight information based on the channel global information; weighting each channel of the feature map based on the weight information to obtain a channel attention mechanism;
the method comprises the following specific steps:
s301, reducing the dimension of the vector by using one-dimensional convolution;
wherein the kernel size of the one-dimensional convolutional layer is $C$, its stride is also $C$, and it has $C/r_1$ filters in total, where $r_1$ is a hyperparameter with $r_1 > 1$. Applied to the $3 \times C$ input vector, it produces a one-dimensional vector of dimension $3C/r_1$. This step reduces the redundancy of the channel information.
S302, performing dimension reduction again on the vector obtained from the one-dimensional convolution, using a fully-connected layer;
A fully-connected layer further reduces the vector to dimension $3C/(r_1 r_2)$, where $r_2$ is also a hyperparameter and is a multiple of 3.
S303, processing the dimension-reduced vector with a nonlinear activation function;
The vector output by the fully-connected dimension-reduction step is processed by the nonlinear activation function ReLU.
S304, raising the dimensionality of the vector output by the nonlinear activation function, through a fully-connected layer, to the number of feature-map channels, and normalizing it with the sigmoid function to obtain the channel attention distribution;
The vector output by the nonlinear activation function is raised to dimension $1 \times C$ by a fully-connected layer, and all of its elements are normalized to $[0, 1]$ with a sigmoid function. The normalized vector is the channel attention distribution, and each of its elements represents the weight of one channel of the feature map.
S305, weighting the characteristic diagram according to the channel attention distribution to obtain the output of the channel attention module.
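Steps S301 to S305 can be sketched with randomly initialized weights standing in for the learned parameters (a minimal illustration, not the trained module; r1 = 4 and r2 = 3 are example values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, v, r1=4, r2=3, rng=np.random.default_rng(2)):
    """F: feature map (C, H, W); v: global-information vector of length 3*C."""
    C = F.shape[0]
    # S301: 1-D convolution, kernel size C, stride C, C/r1 filters.
    # The length-3C vector is cut into 3 windows of length C, each mapped
    # through the same (C/r1, C) filter bank -> a vector of length 3C/r1.
    W_conv = rng.standard_normal((C // r1, C)) * 0.1
    h = (W_conv @ v.reshape(3, C).T).T.reshape(-1)       # length 3C/r1
    # S302 + S303: fully-connected reduction, then ReLU.
    W_fc1 = rng.standard_normal((h.size // r2, h.size)) * 0.1
    h = np.maximum(W_fc1 @ h, 0.0)
    # S304: fully-connected layer back up to C, then sigmoid.
    W_fc2 = rng.standard_normal((C, h.size)) * 0.1
    a = sigmoid(W_fc2 @ h)                               # weights in (0, 1)
    # S305: weight each channel of the feature map.
    return F * a[:, None, None], a

C, H, W = 16, 8, 8
F = np.random.default_rng(3).standard_normal((C, H, W))
v = np.random.default_rng(4).standard_normal(3 * C)
out, a = channel_attention(F, v)
print(out.shape, a.min() > 0, a.max() < 1)  # → (16, 8, 8) True True
```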
S103, calculating a spatial attention weight of each pixel of the feature map based on the self-attention mechanism, and weighting and summing the spatial pixels of the feature map to obtain a spatial attention mechanism;
the invention adopts a self-attention mechanism to realize a space attention mechanism, and the method comprises the following specific steps:
s401, calculating three vectors of query, key and value for each pixel of the feature map based on a self-attention mechanism;
the query is a query vector representing information related to a learning task, the key is a key vector representing an attribute of the pixel itself, and the value is a value vector representing a feature representation of the pixel.
Let $F$ be the input feature map of the self-attention mechanism and $A$ the output feature map, where $F_j$ is the feature vector at the $j$-th spatial position of $F$. To compute the query, key and value vectors of each feature vector, each feature vector is multiplied by three matrices $W_\theta$, $W_\phi$ and $W_g$, as shown in equation (5):

$$\theta(F_j) = W_\theta F_j, \quad \phi(F_j) = W_\phi F_j, \quad g(F_j) = W_g F_j \tag{5}$$

where $\theta(F_j)$, $\phi(F_j)$ and $g(F_j)$ denote the query, key and value vectors of $F_j$, respectively. $W_\theta$, $W_\phi$ and $W_g$ are all learnable matrices, implemented as $1 \times 1$ convolutions in the convolutional neural network.
S402, traversing each pixel of the input feature map, and calculating the correlation between the query vector of each pixel and the key vectors of all pixels of the input feature map to obtain a correlation distribution map;
as shown in equation (6), the correlation distribution map is denoted as M for the pixel at the ith position of the feature mapi∈RH×WThen of MThe j-th position is:
Figure BDA0003062452780000081
in the formula (6), exp (θ (F))i)Tφ(Fj) Compute the correlation between the query vector for the ith location and the key vector for the jth location in the feature map.
Figure BDA0003062452780000083
It is the normalized coefficient of the correlation.
S403, carrying out weighted summation on the value vectors of all the pixels of the input feature map based on the correlation distribution map to obtain pixel values at corresponding positions in the output feature map.
As shown in equation (7), the pixel value at the $i$-th position of the output feature map is the weighted sum of the value vectors of all pixels:

$$A_i = \sum_{j=1}^{HW} M_i(j)\, g(F_j) \tag{7}$$
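Equations (5) to (7) together amount to a softmax-weighted sum over all pixels, which can be sketched as follows (random matrices stand in for the learned 1×1 convolutions; the query/key dimension d is an illustrative choice):

```python
import numpy as np

def spatial_self_attention(F, rng=np.random.default_rng(5)):
    """F: feature map (C, H, W). W_theta/W_phi/W_g play the role of the
    learnable 1x1 convolutions of equation (5)."""
    C, H, W = F.shape
    X = F.reshape(C, H * W).T                 # one C-dim vector per pixel
    d = C // 2                                # query/key dimension (illustrative)
    W_theta = rng.standard_normal((d, C)) * 0.1
    W_phi = rng.standard_normal((d, C)) * 0.1
    W_g = rng.standard_normal((C, C)) * 0.1
    Q = X @ W_theta.T                          # queries, shape (HW, d)
    K = X @ W_phi.T                            # keys,    shape (HW, d)
    V = X @ W_g.T                              # values,  shape (HW, C)
    S = Q @ K.T                                # theta(F_i)^T phi(F_j)
    M = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)  # equation (6)
    A = M @ V                                  # equation (7), shape (HW, C)
    return A.T.reshape(C, H, W)

F = np.random.default_rng(6).standard_normal((8, 4, 4))
print(spatial_self_attention(F).shape)  # → (8, 4, 4)
```

Every output pixel depends on every input pixel through the rows of M, which is what gives each output neuron a global receptive field.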
S105, embedding a channel attention mechanism and a spatial attention mechanism into ResNet to obtain an image classification convolutional neural network, and training the image classification convolutional neural network.
The method comprises the following specific steps:
s601 embeds channel attention into the shallow building block set of the network: conv2_ x, conv3_ x, conv4_ x, set of deep building blocks that embed spatial attention into the network: conv5_ x;
the location of the attention mechanism embedding of the present invention into the ResNet network is shown in FIG. 4. conv1 is the first convolutional layer of ResNet, the shallow blocks in the figure refer to conv2_ x, conv3_ x and conv4_ x in ResNet, and the deep blocks refer to conv5_ x, XNsRepresenting the structural block repetition N within the dotted linesSub, NsThe value of (c) is determined by the ordinal number of the structure block group and the depth of ResNet, such as N in ResNet50, conv2_ x, conv3_ x, conv4_ x and conv5_ xsTake 3, 4, 6 and 3, respectively.
S602, the channel attention is connected after the convolution module of the residual block, while the spatial attention replaces the 3×3 convolutional layer in the convolution module of the residual block.
The manner in which the attention mechanisms of the present invention are embedded in the ResNet network is shown in FIG. 3, which compares the residual block structures after embedding channel attention and spatial attention, respectively, with the original residual block: (a) is the original residual block, (b) is the residual block after embedding channel attention, and (c) is the residual block after embedding spatial attention.
S603, training the image classification convolutional neural network obtained in the preceding steps.
The ResNet with the embedded attention mechanisms is trained using a training method commonly used for convolutional neural networks in image classification, thereby realizing the attention-mechanism-based image classification method.
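The embedding scheme of steps S601 and S602 can be sketched structurally as follows (the convolution and attention modules are shape-preserving stand-ins, not the real layers; only the placement logic follows the method):

```python
import numpy as np

def conv1x1(x, C_out, rng):
    """Placeholder learned layer: a 1x1 convolution as a channel mixing."""
    w = rng.standard_normal((C_out, x.shape[0])) * 0.1
    return np.einsum('oc,chw->ohw', w, x)

def channel_attention(x, rng):
    """Stand-in: weight each channel by a value in (0, 1)."""
    a = 1.0 / (1.0 + np.exp(-rng.standard_normal(x.shape[0])))
    return x * a[:, None, None]

def spatial_attention(x, rng):
    """Stand-in for the self-attention module (shape-preserving)."""
    return x

def attn_residual_block(x, rng, deep=False):
    C = x.shape[0]
    out = conv1x1(x, C, rng)
    if deep:
        # deep blocks (conv5_x): spatial attention replaces the 3x3 conv (Fig. 3c)
        out = spatial_attention(out, rng)
    else:
        # shallow blocks: keep the convs, channel attention follows them (Fig. 3b)
        out = conv1x1(out, C, rng)
        out = channel_attention(out, rng)
    return out + x  # shortcut connection

rng = np.random.default_rng(7)
x = rng.standard_normal((16, 8, 8))
print(attn_residual_block(x, rng, deep=False).shape,
      attn_residual_block(x, rng, deep=True).shape)  # → (16, 8, 8) (16, 8, 8)
```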
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. An attention mechanism-based image classification method is characterized in that,
the method comprises the following steps: carrying out frequency decomposition on each channel of the characteristic diagram based on discrete cosine transform to obtain a plurality of frequency components, and jointly representing channel global information by using the frequency components;
calculating channel attention weight information based on the channel global information, and weighting each channel of the feature map based on the weight information to obtain a channel attention mechanism;
calculating a spatial attention weight of each pixel of the feature map based on a self-attention mechanism, and weighting and summing the spatial pixels of the feature map to obtain a spatial attention mechanism;
embedding a channel attention mechanism and a space attention mechanism into ResNet to obtain an image classification convolutional neural network, and training the image classification convolutional neural network.
2. The attention-mechanism-based image classification method according to claim 1, wherein the specific steps of performing frequency decomposition on each channel of the feature map based on the discrete cosine transform to obtain a plurality of frequency components, and jointly representing the channel global information by the plurality of frequency components, are:
computing a two-dimensional discrete cosine transform for each channel of the feature map to obtain a plurality of frequency components;
selecting 3 frequency components and concatenating them into a vector.
3. The attention-mechanism-based image classification method according to claim 2, wherein the specific steps of calculating the channel attention weight information based on the channel global information and weighting each channel of the feature map based on the weight information to obtain the channel attention mechanism are:
reducing the dimension of the vector with a one-dimensional convolution;
reducing the dimension of the vector produced by the one-dimensional convolution again with a fully connected layer;
processing the vector output by the fully connected layer with a nonlinear activation function;
raising the dimension of the vector output by the nonlinear activation function through another fully connected layer to match the number of channels of the feature map, and normalizing it with a sigmoid function to obtain the channel attention distribution;
weighting the feature map according to the channel attention distribution to obtain the output of the channel attention module.
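The weighting pipeline of claim 3 can be sketched in numpy as below. This is a hedged illustration, not the patented implementation: the one-dimensional convolution step is folded into the first dimension-reducing matrix `w1` for brevity, ReLU stands in for the unspecified nonlinear activation, and all weights are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, desc, w1, w2):
    """Weight each channel of feat (C x H x W) by an attention score computed
    from the per-channel descriptor vector desc.

    w1 (C*k -> r) plays the role of the dimension-reducing layers and
    w2 (r -> C) the dimension-raising fully connected layer of claim 3."""
    hidden = np.maximum(0.0, w1 @ desc)    # dimension reduction + ReLU
    weights = sigmoid(w2 @ hidden)         # raise to C dims, normalize to (0, 1)
    return feat * weights[:, None, None]   # one weight per channel, broadcast

rng = np.random.default_rng(0)
C, H, W, k, r = 4, 8, 8, 3, 2
feat = rng.random((C, H, W))
desc = rng.random(C * k)                   # e.g. 3 DCT components per channel
out = channel_attention(feat, desc,
                        rng.random((r, C * k)), rng.random((C, r)))
print(out.shape)                           # (4, 8, 8)
```

Because the sigmoid keeps every weight in (0, 1), each channel of the output is a scaled-down copy of the corresponding input channel, which is the weighting behavior the claim describes.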
4. The attention-mechanism-based image classification method according to claim 1, wherein the specific steps of calculating the spatial attention weight of each pixel of the feature map based on the self-attention mechanism and then performing a weighted summation over the spatial pixels of the feature map to obtain the spatial attention mechanism are:
computing three vectors, query, key and value, for each pixel of the feature map;
traversing each pixel of the input feature map, and computing the correlation between its query vector and the key vectors of all pixels of the input feature map to obtain a correlation distribution map;
performing a weighted summation over the value vectors of all pixels of the input feature map based on the correlation distribution map to obtain the pixel value at the corresponding position of the output feature map.
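The self-attention computation of claim 4 (query/key correlations followed by a weighted sum of value vectors) can be sketched in numpy as follows. The 1×1-convolution weight matrices `wq`, `wk`, `wv`, the softmax normalization and the scaling by the key dimension are common conventions assumed here, not details fixed by the claim.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(feat, wq, wk, wv):
    """Self-attention over the spatial positions of feat (C x H x W): every
    output pixel is the correlation-weighted sum of the value vectors of
    all input pixels."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T                    # N x C, one row per pixel
    q, k, v = x @ wq, x @ wk, x @ wv                # query/key/value, N x d each
    corr = softmax(q @ k.T / np.sqrt(k.shape[1]))   # N x N correlation map
    out = corr @ v                                  # weighted sum of values
    return out.T.reshape(-1, h, w)

rng = np.random.default_rng(0)
feat = rng.random((4, 8, 8))
out = spatial_self_attention(feat, *(rng.random((4, 4)) for _ in range(3)))
print(out.shape)                                    # (4, 8, 8)
```

Each row of the correlation map sums to 1, so every output pixel is a convex combination of the value vectors, matching the weighted-summation step of the claim.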
5. The attention-mechanism-based image classification method according to claim 4, wherein the query is a query vector representing information relevant to the learning task, the key is a key vector representing attributes of the pixel itself, and the value is a value vector representing the feature representation of the pixel.
6. The attention-mechanism-based image classification method according to claim 1, wherein the specific steps of embedding the channel attention mechanism and the spatial attention mechanism into ResNet to obtain the image classification convolutional neural network, and training the image classification convolutional neural network, are:
embedding channel attention into the shallow building-block stages of the network, conv2_x, conv3_x and conv4_x, and embedding spatial attention into the deep building-block stage of the network, conv5_x;
connecting the channel attention after the convolution module of the residual block, and replacing the 3×3 convolutional layer in the convolution module of the residual block with the spatial attention, to obtain the image classification convolutional neural network;
training the image classification convolutional neural network.
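The placement rule of claim 6 can be illustrated with a small symbolic sketch in which residual-block contents are just lists of layer names; the stage dictionary and layer labels below are hypothetical, standing in for ResNet's actual modules.

```python
def build_attended_resnet(stages):
    """Apply the claim's placement rule to symbolic residual-block stages:
    shallow stages get channel attention appended after the convolution
    module, while in the deep stage the 3x3 convolution is replaced by
    spatial attention."""
    shallow = {"conv2_x", "conv3_x", "conv4_x"}
    deep = {"conv5_x"}
    out = {}
    for name, block in stages.items():
        if name in shallow:
            out[name] = block + ["channel_attention"]
        elif name in deep:
            out[name] = ["spatial_attention" if layer == "conv3x3" else layer
                         for layer in block]
        else:
            out[name] = list(block)
    return out

# Hypothetical bottleneck blocks, one per stage, named after ResNet's stages.
stages = {"conv2_x": ["conv1x1", "conv3x3", "conv1x1"],
          "conv5_x": ["conv1x1", "conv3x3", "conv1x1"]}
attended = build_attended_resnet(stages)
print(attended["conv2_x"])  # ['conv1x1', 'conv3x3', 'conv1x1', 'channel_attention']
print(attended["conv5_x"])  # ['conv1x1', 'spatial_attention', 'conv1x1']
```

This division mirrors the intuition stated in the claim: channel attention augments the shallower stages, while self-attention, which is quadratic in the number of pixels, is only substituted into the deepest, lowest-resolution stage.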
CN202110517855.1A 2021-05-12 2021-05-12 Image classification method based on attention mechanism Pending CN113408577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110517855.1A CN113408577A (en) 2021-05-12 2021-05-12 Image classification method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110517855.1A CN113408577A (en) 2021-05-12 2021-05-12 Image classification method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN113408577A true CN113408577A (en) 2021-09-17

Family

ID=77678423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110517855.1A Pending CN113408577A (en) 2021-05-12 2021-05-12 Image classification method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113408577A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084794A (en) * 2019-04-22 2019-08-02 华南理工大学 A kind of cutaneum carcinoma image identification method based on attention convolutional neural networks
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
CN110309800A (en) * 2019-07-05 2019-10-08 中国科学技术大学 A kind of forest fires smoke detection method and device
CN111353539A (en) * 2020-02-29 2020-06-30 武汉大学 Cervical OCT image classification method and system based on double-path attention convolutional neural network
CN111651504A (en) * 2020-06-03 2020-09-11 湖南大学 Multi-element time sequence multilayer space-time dependence modeling method based on deep learning
CN111898709A (en) * 2020-09-30 2020-11-06 中国人民解放军国防科技大学 Image classification method and device
CN112767451A (en) * 2021-02-01 2021-05-07 福州大学 Crowd distribution prediction method and system based on double-current convolutional neural network


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ARAVIND SRINIVAS et al.: "Bottleneck Transformers for Visual Recognition", HTTPS://ARXIV.ORG/ABS/2101.11605V1 *
朱迎新: "Research on Person Re-identification Technology Based on Spatial and Channel Attention Mechanisms", China Master's Theses Full-text Database, Information Science and Technology *
李娜 et al.: "Pedestrian Attribute Recognition Algorithm Based on Multi-scale Attention Network", Laser & Optoelectronics Progress *
湃森: "Rethinking the Attention Mechanism from a Frequency-Domain Perspective: FcaNet", HTTPS://ZHUANLAN.ZHIHU.COM/P/339215696 *
陶威: "Research on EEG Emotion Recognition Methods Based on Attention Mechanism", China Doctoral and Master's Dissertations Full-text Database (Master's), Medicine and Health Sciences *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118140A (en) * 2021-10-29 2022-03-01 新黎明科技股份有限公司 Multi-view intelligent fault diagnosis method and system for explosion-proof motor bearing
CN113822246A (en) * 2021-11-22 2021-12-21 山东交通学院 Vehicle re-identification method based on global reference attention mechanism
CN113822246B (en) * 2021-11-22 2022-02-18 山东交通学院 Vehicle re-identification method based on global reference attention mechanism
CN114064954A (en) * 2022-01-18 2022-02-18 北京中科开迪软件有限公司 Method and system for cleaning images in optical disk library
CN114064954B (en) * 2022-01-18 2022-05-10 北京中科开迪软件有限公司 Method and system for cleaning images in optical disk library
CN115067945A (en) * 2022-08-22 2022-09-20 深圳市海清视讯科技有限公司 Fatigue detection method, device, equipment and storage medium
CN117422939A (en) * 2023-12-15 2024-01-19 武汉纺织大学 Breast tumor classification method and system based on ultrasonic feature extraction
CN117422939B (en) * 2023-12-15 2024-03-08 武汉纺织大学 Breast tumor classification method and system based on ultrasonic feature extraction
CN117635962A (en) * 2024-01-25 2024-03-01 云南大学 Multi-frequency fusion-based channel attention image processing method
CN117635962B (en) * 2024-01-25 2024-04-12 云南大学 Multi-frequency fusion-based channel attention image processing method

Similar Documents

Publication Publication Date Title
CN113408577A (en) Image classification method based on attention mechanism
CN106991646B (en) Image super-resolution method based on dense connection network
CN109949255B (en) Image reconstruction method and device
CN112132023A (en) Crowd counting method based on multi-scale context enhanced network
US11132392B2 (en) Image retrieval method, image retrieval apparatus, image retrieval device and medium
CN110796166B (en) Attention mechanism-based multitask image processing method
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN108197669B (en) Feature training method and device of convolutional neural network
WO2021164725A1 (en) Method and device for removing moiré patterns
CN109685772B (en) No-reference stereo image quality evaluation method based on registration distortion representation
CN113610146A (en) Method for realizing image classification based on knowledge distillation enhanced by interlayer feature extraction
CN111340077A (en) Disparity map acquisition method and device based on attention mechanism
CN111339862A (en) Remote sensing scene classification method and device based on channel attention mechanism
CN111353988A (en) KNN dynamic self-adaptive double-image convolution image segmentation method and system
JP2019197445A (en) Image recognition device, image recognition method, and program
CN105160679A (en) Local three-dimensional matching algorithm based on combination of adaptive weighting and image segmentation
CN117058235A (en) Visual positioning method crossing various indoor scenes
CN111598781B (en) Image super-resolution method based on hybrid high-order attention network
CN114830168A (en) Image reconstruction method, electronic device, and computer-readable storage medium
CN111695470A (en) Visible light-near infrared pedestrian re-identification method based on depth feature orthogonal decomposition
CN112634161B (en) Reflected light removing method based on two-stage reflected light eliminating network and pixel loss
CN114119698B (en) Unsupervised monocular depth estimation method based on attention mechanism
CN116486203B (en) Single-target tracking method based on twin network and online template updating
CN111223120B (en) Point cloud semantic segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210917
