CN113111970B - Method for classifying images by constructing global embedded attention residual network - Google Patents


Info

Publication number
CN113111970B
Authority
CN
China
Prior art keywords
feature matrix
global
attention
transformation
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110487497.4A
Other languages
Chinese (zh)
Other versions
CN113111970A (en)
Inventor
裴炤
万志杨
张艳宁
马苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN202110487497.4A priority Critical patent/CN113111970B/en
Publication of CN113111970A publication Critical patent/CN113111970A/en
Application granted granted Critical
Publication of CN113111970B publication Critical patent/CN113111970B/en
Status: Active
Anticipated expiration

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/24 — Classification techniques
    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods
    • G06N3/084 — Backpropagation, e.g. using gradient descent


Abstract

The present disclosure provides a method of classifying images by building a global embedded attention residual network, comprising: preprocessing the image data to be classified; constructing a global embedded attention residual network containing a global embedded attention module, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates; and inputting the preprocessed image data to be classified into the global embedded attention residual network for classification.

Description

Method for classifying images by constructing global embedded attention residual network
Technical Field
The present disclosure relates to an image classification method, and more particularly, to a method of classifying images by constructing a global embedded attention residual network.
Background
Image classification is an important task in the field of computer vision. At present, many scholars improve network structures by adding attention mechanisms so that images can be classified better. The most classical example, the squeeze-and-excitation network, has come to be regarded as a milestone of the attention mechanism. It operates in two steps, squeeze and excitation: it first squeezes global spatial features into channel descriptors using global average pooling, then applies a simple gating mechanism with a sigmoid excitation function, and finally multiplies each channel by its corresponding weight. By modeling the inter-dependencies between channels with the aid of 2D global pooling, the method adaptively recalibrates the feature response of each channel, providing significant performance improvements at relatively low computational cost. However, it only considers the encoding of inter-channel information and ignores the importance of position information, which is critical for capturing object structure in computer vision tasks. Later scholars tried to combine spatial attention information with channel attention information, but using only the position information of the local space brings limited benefit; it is therefore desirable to add global position information to the neural network while effectively exploiting the local position information of each channel.
Disclosure of Invention
Aiming at the defects in the prior art, the present disclosure provides a method for classifying images by constructing a global embedded attention residual network, which improves image classification by embedding global position information into channel information so that image features are extracted effectively.
In order to achieve the above object, the present disclosure provides the following technical solutions:
a method of classifying images by constructing a global embedded attention residual network, comprising the steps of:
s100: preprocessing the image data to be classified;
s200: constructing a global embedded attention residual network containing a global embedded attention module, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates;
s300: and inputting the preprocessed image data to be classified into a global embedded attention residual error network for classification.
Preferably, after the global embedded attention residual network is built, training samples are selected and preprocessed to train the network, verification samples are selected and preprocessed to adjust the parameters of the trained network, and test samples are selected to test the performance of the trained network.
Preferably, in step S200, the spatial attention submodule based on the global context includes:
the first subunit is used for inputting the preprocessed training samples, verification samples and test samples into the convolution layer and the pooling layer for processing and then performing global average pooling operation so as to obtain a feature matrix containing global information;
the second subunit is used for performing linear transformation on the feature matrix containing global information by adopting convolution and reshape functions with convolution kernel size of 1×1 so as to obtain a feature matrix subjected to dimension transformation processing;
the third subunit is configured to perform adaptive selection on the feature matrix subjected to the dimensional transformation processing by using a softmax function, obtain a corresponding weight of each different element on the feature matrix, and multiply the corresponding weight of each different element with the feature matrix containing global information to obtain a feature matrix containing global context feature information;
and a fourth subunit, configured to perform nonlinear transformation on the feature matrix containing the global context feature information by using batch normalization and a ReLU activation function and perform dimensional transformation by using 1×1 convolution.
Preferably, the global context-based spatial attention submodule is expressed as:
$$x = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X(i,j)$$

$$y = K\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\sum_{j=1}^{N}\frac{e^{tx_j}}{\sum_{m=1}^{N}e^{tx_m}}\,x_j\right)\right)\right)$$

where x represents the output of global average pooling, y represents the output of the global context features, H and W represent the height and width of the input image, X represents the input image, K represents a 1×1 convolution, ReLU represents the ReLU activation function, BN represents the batch normalization function, N represents the number of elements in the feature matrix, e represents the base of the natural logarithm, i, j and m index the possible positions of elements in the feature matrix, x_j and x_m represent the values of element information in the feature matrix, t represents the weight of the x matrix, and tx_j and tx_m represent the output values computed from the feature matrix after the global average pooling operation.
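For illustration, a minimal PyTorch sketch of this submodule is given below. The patent discloses only the operations (softmax-weighted global context pooling via a 1×1 convolution, batch normalization, ReLU and 1×1 dimensional transforms), so the class name, layer names, exact operator ordering and the residual-style fusion at the end are assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class GlobalContextSpatialAttention(nn.Module):
    """Sketch of the global-context spatial attention submodule (assumed layout)."""

    def __init__(self, channels: int):
        super().__init__()
        # t in tx_j: a 1x1 convolution that scores every spatial element
        self.score = nn.Conv2d(channels, 1, kernel_size=1)
        # fourth subunit: BN + ReLU nonlinearity plus 1x1 dimensional transforms
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w                                   # N elements in the feature matrix
        feats = x.view(b, c, n)                     # the x_j values, reshaped
        # softmax over the N positions gives the weight of each element
        weights = torch.softmax(self.score(x).view(b, 1, n), dim=-1)
        # weighted sum over all positions -> feature matrix with global context info
        context = torch.bmm(feats, weights.transpose(1, 2)).view(b, c, 1, 1)
        # broadcast the context back onto the input (assumed residual-style fusion)
        return x + self.transform(context)
```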
Preferably, in step S200, the coordinate-based channel attention submodule includes:
a fifth subunit, configured to decompose the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation by adopting average pooling along the W 'direction and the H' direction, to obtain a one-dimensional feature matrix along the W 'direction and a one-dimensional feature matrix along the H' direction, where the one-dimensional feature matrix along the W 'direction includes local position information of the channel, and the one-dimensional feature matrix along the H' direction includes long-term dependency information;
a sixth subunit, configured to concatenate the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction, and perform feature transformation with a convolution of 1×1 to obtain a feature matrix subjected to dimension transformation;
and a seventh subunit, configured to perform weight distribution on the feature matrix subjected to the dimension transformation by using a softmax function to obtain feature matrices with different weights, and perform feature transformation on the feature matrices with different weights by using 1×1 convolution, so as to obtain an output of the global embedded attention module.
Preferably, the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation is decomposed by adopting average pooling along the W 'direction and the H' direction, and the obtained one-dimensional feature matrix along the W 'direction and the obtained one-dimensional feature matrix along the H' direction are respectively expressed as:
$$z^{H}(i) = \frac{1}{W'}\sum_{j=1}^{W'} y(i,j), \qquad z^{W}(j) = \frac{1}{H'}\sum_{i=1}^{H'} y(i,j)$$

where H′ and W′ represent the height and width of the image output by the global-context spatial attention submodule, z^H and z^W represent the one-dimensional feature matrices along the H′ direction and the W′ direction respectively, and i and j index the ith row along H′ and the jth column along W′.
Preferably, the cascading of the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction is performed by:
g=K(z w +z h )
where g represents the output of the cascade operation and K represents a 1 x 1 convolution.
Preferably, the output of the global embedded attention module is expressed as:
z = X(i,j) × a_c + X(i,j) × b_c

with

$$a_c = \frac{e^{A\,g^{H}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}, \qquad b_c = \frac{e^{B\,g^{W}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}$$

where A and B represent random numbers whose initial values are A_c and B_c, a_c and b_c represent the weights corresponding to the feature matrices with different weights, e represents the base of the natural logarithm, and g^H and g^W are the two matrices obtained by passing the dimension-transformed feature matrix of the sixth subunit through the ReLU activation function and splitting it along the spatial dimension; the dimension-transformed feature matrix has dimension R^{C×(W+H)}, and after splitting the g^H and g^W feature matrices have dimensions R^{C×H} and R^{C×W} respectively.
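A companion PyTorch sketch of the coordinate-based channel attention submodule follows. The directional pooling, the cascade g = K(z_w + z_h), the split into g^H and g^W, and the weighted output z = X(i,j)×a_c + X(i,j)×b_c are taken from the formulas above; the softmax axis and the per-branch 1×1 transforms are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateChannelAttention(nn.Module):
    """Sketch of the coordinate-based channel attention submodule (assumed layout)."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over W' -> z^H
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over H' -> z^W
        self.concat = nn.Conv2d(channels, channels, kernel_size=1)  # K in g = K(z_w + z_h)
        self.relu = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=1)  # branch transforms
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                         # (B, C, H, 1): long-term dependencies along H'
        z_w = self.pool_w(x).permute(0, 1, 3, 2)     # (B, C, W, 1): local channel position info
        # cascade the two 1-D feature matrices into R^{C x (H+W)}, transform with 1x1 conv
        g = self.relu(self.concat(torch.cat([z_h, z_w], dim=2)))
        g_h, g_w = torch.split(g, [h, w], dim=2)     # split back into R^{C x H} and R^{C x W}
        g_w = g_w.permute(0, 1, 3, 2)                # restore shape (B, C, 1, W)
        # softmax weight distribution (channel axis is an assumption),
        # followed by a 1x1 feature transformation per branch
        a_c = self.conv_h(torch.softmax(g_h, dim=1))  # weights from the H branch
        b_c = self.conv_w(torch.softmax(g_w, dim=1))  # weights from the W branch
        return x * a_c + x * b_c                      # z = X(i,j)*a_c + X(i,j)*b_c
```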
Preferably, the image data to be classified and the training sample, the verification sample and the test sample are preprocessed according to the following steps:
s201: performing horizontal and vertical overturning on the image data to be classified and the image data in the training sample, the verification sample and the test sample;
s202: rotating the flipped image data clockwise or anticlockwise;
s203: scaling the rotated image data;
s204: and carrying out average reduction processing on the zoomed image data.
Preferably, in step S204, the mean-subtraction processing is performed on the scaled image data by the following formula:

$$Z = v - \frac{1}{n}\sum_{i=1}^{n} v_i$$

where Z is the image after the mean is subtracted, v is the pixel matrix of the current image, v_i is the pixel matrix of the ith of the n images, and n is the number of images, taken as the integer 100000.
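As an illustration of steps S201-S204, a torchvision-based preprocessing sketch is shown below; the 20-degree rotation and the 224×224 target size are taken from specific example 1 later in the text, while the random flip probabilities are assumptions.

```python
import torch
from torchvision import transforms

# S201-S203: flips, rotation, scaling (parameter values partly assumed)
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),          # S201: horizontal flip
    transforms.RandomVerticalFlip(),            #       vertical flip
    transforms.RandomRotation(degrees=20),      # S202: rotate up to 20 deg either way
    transforms.Resize((224, 224)),              # S203: scale to 224 x 224
    transforms.ToTensor(),
])

def subtract_mean(batch: torch.Tensor) -> torch.Tensor:
    """S204: Z = v - (1/n) * sum_i v_i, subtracting the per-pixel mean of the n images."""
    mean = batch.mean(dim=0, keepdim=True)      # average pixel matrix over the sample set
    return batch - mean
```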
Compared with the prior art, the beneficial effects that this disclosure brought are:
the method for embedding the global position information into the channel information is provided for constructing the global embedded attention residual error network, and the memory burden is greatly reduced through effective data preprocessing.
Drawings
FIG. 1 is a flow chart of a method of classifying images by building a global embedded attention residual network, provided by one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a global embedded attention residual network provided by another embodiment of the present disclosure;
fig. 3 (a), fig. 3 (b), fig. 3 (c), fig. 3 (d) are schematic diagrams illustrating comparison of a global embedded attention residual network and an existing classification method according to another embodiment of the present disclosure.
Detailed Description
Specific embodiments of the present disclosure will be described in detail below with reference to fig. 1 to 3 (d). While specific embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will understand that the same component may be referred to by different names; the description and claims distinguish components not by name but by function. As used throughout the specification and claims, the terms "include" and "comprise" are open-ended and should therefore be interpreted as "including, but not limited to". The description hereinafter sets forth preferred embodiments for practicing the invention in accordance with its general principles, but is not intended to limit the scope of the invention. The scope of the present disclosure is defined by the appended claims.
For the purposes of promoting an understanding of the embodiments of the disclosure, reference will now be made to the embodiments illustrated in the drawings and to specific examples, without any intention of limiting the embodiments of the disclosure.
In one embodiment, as shown in fig. 1, an image classification method based on a global embedded attention residual network includes the following steps:
s100: preprocessing the image data to be classified;
s200: constructing a global embedded attention residual network containing a global embedded attention module, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates;
s300: and inputting the preprocessed image data to be classified into a global embedded attention residual error network for classification.
Compared with existing methods, the present disclosure reduces the model parameters of the deep neural network, makes full use of context information by embedding it into channel information, realizes global feature modeling, improves the classification effect, and solves the problems that existing networks suffer from low classification accuracy and difficulty in combining position information with channel information.
In another embodiment, after the global embedded attention residual network is built, a training sample is required to be selected and preprocessed to train the network, a verification sample is required to be selected and preprocessed to adjust parameters of the trained network, and a test sample is required to test performance of the trained network.
In this embodiment, a plurality of images are first selected from any image dataset, including COCO, ImageNet and ADNI, and sorted into different subsets used respectively as training samples, verification samples and test samples. The selected training, verification and test samples are then preprocessed. Finally, the preprocessed training samples are input into the global embedded attention residual network for training; after training is completed, the preprocessed verification samples are input into the trained network to adjust the network parameters, and the preprocessed test samples are input into the trained network to test its performance, thereby realizing image classification.
In another embodiment, in step S200, the global context-based spatial attention submodule includes:
the first subunit is used for inputting the preprocessed training samples, verification samples and test samples into the convolution layer and the pooling layer for processing and then performing global average pooling operation so as to obtain a feature matrix containing global information;
the second subunit performs linear transformation on the feature matrix containing global information by adopting convolution and reshape functions with convolution kernel size of 1 multiplied by 1 so as to obtain a feature matrix subjected to dimension transformation processing;
the third subunit performs self-adaptive selection on the feature matrix subjected to dimension transformation processing by using a softmax function to obtain the corresponding weight of each different element on the feature matrix, and multiplies the corresponding weight of each different element by the feature matrix containing global information to obtain the feature matrix containing global context feature information;
and a fourth subunit, configured to perform nonlinear transformation on the feature matrix containing the global context feature information by using batch normalization and a ReLU activation function and perform dimensional transformation by using 1×1 convolution.
In another embodiment, the global context based spatial attention submodule is expressed as:
$$x = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X(i,j)$$

$$y = K\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\sum_{j=1}^{N}\frac{e^{tx_j}}{\sum_{m=1}^{N}e^{tx_m}}\,x_j\right)\right)\right)$$

where x represents the output of global average pooling, y represents the output of the global context features, H and W represent the height and width of the input image, X represents the input image, K represents a 1×1 convolution, ReLU represents the ReLU activation function, BN represents the batch normalization function, N represents the number of elements in the feature matrix, e represents the base of the natural logarithm, i, j and m index the possible positions of elements in the feature matrix, x_j and x_m represent the values of element information in the feature matrix, t represents the weight of the x matrix, and tx_j and tx_m represent the output values computed from the feature matrix after the global average pooling operation.
In another embodiment, in step S200, the coordinate-based channel attention submodule includes:
a fifth subunit, configured to decompose the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation by adopting average pooling along the W 'direction and the H' direction, to obtain a one-dimensional feature matrix along the W 'direction and a one-dimensional feature matrix along the H' direction, where the one-dimensional feature matrix along the W 'direction includes local position information of the channel, and the one-dimensional feature matrix along the H' direction includes long-term dependency information;
a sixth subunit, configured to concatenate the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction, and perform feature transformation with a convolution of 1×1 to obtain a feature matrix subjected to dimension transformation;
and a seventh subunit, wherein the feature matrixes with different weights are obtained by carrying out weight distribution on the feature matrixes subjected to dimension transformation by utilizing a softmax function, and the feature matrixes with different weights are respectively subjected to feature transformation by utilizing 1×1 convolution, so that the output of the global embedded attention module is obtained.
In another embodiment, the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation is decomposed by adopting average pooling along the W 'direction and the H' direction, and the obtained one-dimensional feature matrix along the W 'direction and the obtained one-dimensional feature matrix along the H' direction are respectively expressed as:
$$z^{H}(i) = \frac{1}{W'}\sum_{j=1}^{W'} y(i,j), \qquad z^{W}(j) = \frac{1}{H'}\sum_{i=1}^{H'} y(i,j)$$

where H′ and W′ represent the height and width of the image output by the global-context spatial attention submodule, z^H and z^W represent the one-dimensional feature matrices along the H′ direction and the W′ direction respectively, and i and j index the ith row along H′ and the jth column along W′.
In another embodiment, the cascading of the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction is performed by:
g=K(z w +z h )
where g represents the output of the cascade operation and K represents a 1 x 1 convolution.
In another embodiment, the output of the global embedded attention module is expressed as:
z = X(i,j) × a_c + X(i,j) × b_c

with

$$a_c = \frac{e^{A\,g^{H}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}, \qquad b_c = \frac{e^{B\,g^{W}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}$$

where A and B represent random numbers whose initial values are A_c and B_c, a_c and b_c represent the weights corresponding to the feature matrices with different weights, e represents the base of the natural logarithm, and g^H and g^W are the two matrices obtained by passing the dimension-transformed feature matrix of the sixth subunit through the ReLU activation function and splitting it along the spatial dimension; the dimension-transformed feature matrix has dimension R^{C×(W+H)}, and after splitting the g^H and g^W feature matrices have dimensions R^{C×H} and R^{C×W} respectively.
In another embodiment, the image data to be classified and the training sample, the verification sample and the test sample are preprocessed according to the following steps:
s201: performing horizontal and vertical overturning on the image data to be classified and the image data in the training sample, the verification sample and the test sample;
s202: rotating the flipped image data clockwise or anticlockwise;
in this step of the process, the process is carried out,
s203: scaling the rotated image data;
s204: and carrying out average reduction processing on the zoomed image data.
In this step, the mean value reduction processing is performed on the scaled image data by the following formula:
$$Z = v - \frac{1}{n}\sum_{i=1}^{n} v_i$$

where Z is the image after the mean is subtracted, v is the pixel matrix of the current image, v_i is the pixel matrix of the ith of the n images, and n is the number of images, taken as the integer 100000.
In another embodiment, the training of the preprocessed training samples by inputting them into the global embedded attention residual network is performed by:
s501: performing linear and nonlinear operation on the preprocessed training samples in a forward propagation mode;
s502: and carrying out chain derivation on the training samples subjected to linear and nonlinear operation in a counter-propagation mode, and updating the weight information of the network according to a preset learning rate until the maximum iteration number is reached.
In this embodiment, the training samples are processed sequentially through the input layer, the convolution layer, the max pooling layer, the global embedded attention module, the fully connected layers and the output layer in a forward propagation manner; chain derivation is then performed in a back-propagation manner, and the weight information of the network is updated from the output layer back through the fully connected layers, the global embedded attention module, the max pooling layer, the convolution layer and the input layer according to a preset learning rate a (a = 0.01, decreasing stepwise as training proceeds: a = a/5 after every additional 20 iterations). This cycle repeats until the maximum number of iterations is reached.
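A minimal PyTorch sketch of this training loop (steps S501-S502) follows; torchvision's resnet50 stands in for the global embedded attention residual network, and the placeholder data loader is an assumption for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(num_classes=1000)     # stand-in for the GEA residual network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # preset learning rate a = 0.01
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=20, gamma=0.2)                   # a = a/5 every 20 iterations

# Placeholder loader: one random batch standing in for preprocessed training samples
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))]

max_iterations = 120                   # the examples train for 70-120 iterations
for iteration in range(max_iterations):
    for images, labels in train_loader:
        logits = model(images)         # S501: linear and nonlinear ops via forward propagation
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()                # S502: chain derivation via back-propagation
        optimizer.step()               # update the weight information of the network
    scheduler.step()                   # decay the learning rate stepwise
```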
The method of the present disclosure is further described below in connection with specific examples.
Specific example 1:
1. 100000 sample images are selected from the ImageNet image classification dataset as training samples, 10000 sample images as verification samples and 30000 sample images as test samples; the images in the training samples and the test samples do not overlap.
2. 100000 images in the training sample are preprocessed, which comprises the following steps:
a. horizontally and vertically flipping the image;
b. rotating the flipped image by 20 degrees clockwise or counterclockwise;
c. scaling the rotated image to obtain a training sample image of size 224×224;
d. performing mean subtraction on the training sample image through formula (1), where formula (1) is expressed as:

$$Z = v - \frac{1}{n}\sum_{i=1}^{n} v_i \tag{1}$$

where Z is the image after the mean is subtracted, v is the pixel matrix of the current image, v_i is the pixel matrix of the ith of the n images, and n is the number of images, taken as the integer 100000.
The preprocessing steps of the images in the verification sample and the test sample are the same as the above steps, and will not be repeated here.
3. Constructing a global embedded attention residual network containing a global embedded attention module, as shown in fig. 2, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates;
wherein the global context based spatial attention submodule comprises:
the first subunit is used for inputting the preprocessed training samples, verification samples and test samples into the convolution layer and the pooling layer for processing and then performing global average pooling operation so as to obtain a feature matrix containing global information;
the second subunit performs linear transformation on the feature matrix containing global information by adopting convolution and reshape functions with convolution kernel size of 1 multiplied by 1 so as to obtain a feature matrix subjected to dimension transformation processing;
the third subunit performs self-adaptive selection on the feature matrix subjected to dimension transformation processing by using a softmax function to obtain the corresponding weight of each different element on the feature matrix, and multiplies the corresponding weight of each different element by the feature matrix containing global information to obtain the feature matrix containing global context feature information;
and a fourth subunit, configured to perform nonlinear transformation on the feature matrix containing the global context feature information by using batch normalization and a ReLU activation function and perform dimensional transformation by using 1×1 convolution.
Wherein the first subunit is implemented by formula (2), expressed as:

$$x = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X(i,j) \tag{2}$$

The second to fourth subunits are implemented by formula (3), expressed as:

$$y = K\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\sum_{j=1}^{N}\frac{e^{tx_j}}{\sum_{m=1}^{N}e^{tx_m}}\,x_j\right)\right)\right) \tag{3}$$

where x represents the output of global average pooling, y represents the output of the global context features, H and W represent the height and width of the input image, X represents the input image, K represents a 1×1 convolution, ReLU represents the ReLU activation function, BN represents the batch normalization function, N represents the number of elements in the feature matrix, e represents the base of the natural logarithm, i, j and m index the possible positions of elements in the feature matrix, x_j and x_m represent the values of element information in the feature matrix, t represents the weight of the x matrix, and tx_j and tx_m represent the output values computed from the feature matrix after the global average pooling operation.
The coordinate-based channel attention submodule includes:
a fifth subunit, configured to decompose the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation by adopting average pooling along the W 'direction and the H' direction, to obtain a one-dimensional feature matrix along the W 'direction and a one-dimensional feature matrix along the H' direction, where the one-dimensional feature matrix along the W 'direction includes local position information of the channel, and the one-dimensional feature matrix along the H' direction includes long-term dependency information;
a sixth subunit, configured to concatenate the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction, and perform feature transformation with a convolution of 1×1 to obtain a feature matrix subjected to dimension transformation;
and a seventh subunit, wherein the feature matrixes with different weights are obtained by carrying out weight distribution on the feature matrixes subjected to dimension transformation by utilizing a softmax function, and the feature matrixes with different weights are respectively subjected to feature transformation by utilizing 1×1 convolution, so that the output of the global embedded attention module is obtained.
Wherein the fifth subunit is implemented by formulas (4) and (5), expressed as:

$$z^{H}(i) = \frac{1}{W'}\sum_{j=1}^{W'} y(i,j) \tag{4}$$

$$z^{W}(j) = \frac{1}{H'}\sum_{i=1}^{H'} y(i,j) \tag{5}$$

The sixth subunit is implemented by formula (6), expressed as:

$$g = K(z^{w} + z^{h}) \tag{6}$$

The seventh subunit is implemented by formulas (7)-(9), expressed as:

$$a_c = \frac{e^{A\,g^{H}}}{e^{A\,g^{H}} + e^{B\,g^{W}}} \tag{7}$$

$$b_c = \frac{e^{B\,g^{W}}}{e^{A\,g^{H}} + e^{B\,g^{W}}} \tag{8}$$

$$z = X(i,j) \times a_c + X(i,j) \times b_c \tag{9}$$

where H′ and W′ represent the height and width of the image output by the global-context spatial attention submodule, z^H and z^W represent the one-dimensional feature matrices along the H′ direction and the W′ direction respectively, i and j index the ith row along H′ and the jth column along W′, g represents the output of the cascade operation, K represents a 1×1 convolution, A and B represent random numbers whose initial values are A_c and B_c, a_c and b_c represent the weights corresponding to the feature matrices with different weights, e represents the base of the natural logarithm, and g^H and g^W are the two matrices obtained by passing the dimension-transformed feature matrix of the sixth subunit through the ReLU activation function and splitting it along the spatial dimension; the dimension-transformed feature matrix has dimension R^{C×(W+H)}, and after splitting the g^H and g^W feature matrices have dimensions R^{C×H} and R^{C×W} respectively.
It should be noted that the number of global embedded attention modules is 16, each embedded in a corresponding residual structure; the convolution layers outside the global embedded attention mechanism may optionally apply batch normalization and an activation function for nonlinear transformation, and the output layer uses the fully connected layer and a softmax function to output the probability that each input image belongs to each category, with the most probable category taken as the predicted category.
It should be further noted that 16 global embedded attention modules are selected because the attention module can be embedded into various networks, such as ResNet-50; the present disclosure is described based on the ResNet-50 structure, whose (3, 4, 6, 3) layout contains 16 residual blocks in total, hence the choice of 16 global embedded attention modules.
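The following sketch illustrates one way such an embedding could look on the torchvision ResNet-50, reusing the two submodule classes sketched earlier; the exact insertion point of the GEA module relative to each residual block is an assumption.

```python
import torch.nn as nn
from torchvision.models import resnet50

class GEAModule(nn.Module):
    """Assumed composition: global-context spatial attention, then coordinate channel attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = GlobalContextSpatialAttention(channels)  # sketched above
        self.channel = CoordinateChannelAttention(channels)     # sketched above

    def forward(self, x):
        return self.channel(self.spatial(x))

def build_gea_resnet50(num_classes: int = 1000) -> nn.Module:
    net = resnet50(num_classes=num_classes)
    # ResNet-50 layout (3, 4, 6, 3): 16 bottleneck blocks in total
    for layer in (net.layer1, net.layer2, net.layer3, net.layer4):
        for i, block in enumerate(layer):
            out_channels = block.conv3.out_channels
            # wrap each residual block so a GEA module follows it
            layer[i] = nn.Sequential(block, GEAModule(out_channels))
    return net
```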
4. The preprocessed training sample containing 100000 images is input into the global embedded attention residual network for training, and the network weights are updated repeatedly through the two steps of forward propagation and back propagation until the maximum number of iterations (70-120) is reached, ending the training process and yielding a trained residual network model. The preprocessed verification sample containing 10000 images and the preprocessed test sample containing 30000 images are then input into the trained network for verification and testing, realizing image classification.
The test samples were used to compare the Top-1 and Top-5 accuracy of the proposed network on the COCO and ImageNet datasets against existing classification methods, including the CA, SE, BAM and CBAM attention mechanisms; the results are shown in Tables 1 and 2:
TABLE 1
TABLE 2
As can be seen from Table 1, the global embedded attention residual network with the added global embedded attention module performs well on both the COCO and ImageNet datasets. With ResNet-50 as the baseline, the highest Top-1 accuracy reaches 75.9 and the highest Top-5 accuracy reaches 86.6 on COCO, while on ImageNet the highest Top-1 and Top-5 accuracies reach 75.8 and 83.1 respectively, an improvement over the other models.
As can be seen from Table 2, using ResNet-101 brings a further improvement in Top-1 and Top-5 accuracy, indicating that the disclosed method generalizes well and can classify images better.
Specific example 2:
327 structural magnetic resonance images were selected from the ADNI dataset, including 119 brain MRIs of mild-cognitive-impairment patients, 101 brain MRIs of Alzheimer's patients and 107 brain MRIs of normal subjects. The data were split 7:3 into training sample data and test sample data, and the training sample data were trained and validated using 10-fold cross-validation. Specifically, the training sample data are divided into 10 parts numbered 0-9; first, parts 0-8 are taken as the training set and part 9 as the validation set; after that round of training and validation, part 8 is taken as the validation set and the remaining parts as the training set; and so on, for 10 rounds in total. Finally, the test samples are used for testing, and the accuracy, recall and precision on the ADNI dataset are compared with existing classification methods, including the CA, SE, BAM and CBAM attention mechanisms; the comparison results are shown in Tables 3 and 4:
TABLE 3
TABLE 4
As can be seen from Tables 3 and 4, the accuracy of the global embedded attention residual network with the added global embedded attention module reaches up to 88.5 with ResNet-50 as the base model, and up to 90.5 with ResNet-101 as the base model.
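A sketch of the data partitioning used in this example, written with scikit-learn, is shown below; the stratification and shuffling choices are assumptions beyond the stated 7:3 ratio and 10 folds, and the arrays are placeholders for the ADNI volumes.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Placeholder data standing in for the 327 ADNI MRI volumes and their labels
images = np.random.rand(327, 1, 96, 96)        # assumed array shape, illustration only
labels = np.random.randint(0, 3, size=327)     # 0 = MCI, 1 = AD, 2 = normal (assumed coding)

# 7:3 split into training sample data and test sample data
train_x, test_x, train_y, test_y = train_test_split(
    images, labels, test_size=0.3, stratify=labels)

# 10-fold cross-validation on the training portion: parts numbered 0-9,
# nine parts train the network and the held-out part validates, for 10 rounds
kfold = KFold(n_splits=10, shuffle=True)
for fold, (fit_idx, val_idx) in enumerate(kfold.split(train_x)):
    fit_x, fit_y = train_x[fit_idx], train_y[fit_idx]
    val_x, val_y = train_x[val_idx], train_y[val_idx]
    # ... train and validate the network here, as in specific example 1 ...
```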
To further verify the technical effect of the disclosed method, the method was applied to the three datasets and a portion of the test results was selected for visual display, as shown in fig. 3 (a) to 3 (d), where fig. 3 (a) shows no attention mechanism, fig. 3 (b) the SE attention mechanism, fig. 3 (c) the CA attention mechanism, and fig. 3 (d) the GEA attention mechanism (i.e., the global embedded attention residual network). The results show that the proposed GEA mechanism constrains the network more effectively, so that it concentrates on the most salient features of an image rather than attending to the whole image; the network thus focuses more prominently on the region of interest, finds the most discriminative features of the image, and greatly improves the classification accuracy of the network.
The basic principles of the present disclosure have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.

Claims (7)

1. A method of classifying images by constructing a global embedded attention residual network, comprising the steps of:
s100: preprocessing the image data to be classified;
s200: constructing a global embedded attention residual network containing a global embedded attention module, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates;
the global context-based spatial attention submodule includes:
the first subunit is used for inputting the preprocessed training samples, verification samples and test samples into the convolution layer and the pooling layer for processing and then performing global average pooling operation so as to obtain a feature matrix containing global information;
the second subunit performs linear transformation on the feature matrix containing global information by adopting convolution and reshape functions with convolution kernel size of 1 multiplied by 1 so as to obtain a feature matrix subjected to dimension transformation processing;
the third subunit performs self-adaptive selection on the feature matrix subjected to dimension transformation processing by using a softmax function to obtain the corresponding weight of each different element on the feature matrix, and multiplies the corresponding weight of each different element by the feature matrix containing global information to obtain the feature matrix containing global context feature information;
a fourth subunit, configured to perform nonlinear transformation on the feature matrix including the global context feature information by using batch normalization and a ReLU activation function, and perform dimensional transformation by using 1×1 convolution; the global context based spatial attention submodule is expressed as:
wherein X is represented as the output of global average pooling, y is represented as the output of global context features, H and W are represented as the height and width of the input image, respectively, X is represented as the input image, K is represented as a 1X 1 convolution, reLU is represented as a ReLU activation function, BN is represented as a batch normalization function, N is represented as the number of elements in the feature matrix, e is represented as the base of a natural logarithmic function, i, j, m are represented as the possible positions of all elements in the feature matrix, respectively, X j 、x m Respectively representing the values of element information in the feature matrix, t represents the weight of the x matrix, tx j And tx m Representing the output value obtained by calculating the feature matrix after global average pooling operation;
The coordinate-based channel attention submodule includes:
a fifth subunit, configured to decompose the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation by adopting average pooling along the W 'direction and the H' direction, to obtain a one-dimensional feature matrix along the W 'direction and a one-dimensional feature matrix along the H' direction, where the one-dimensional feature matrix along the W 'direction includes local position information of the channel, and the one-dimensional feature matrix along the H' direction includes long-term dependency information;
a sixth subunit, configured to concatenate the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction, and perform feature transformation with a convolution of 1×1 to obtain a feature matrix subjected to dimension transformation;
a seventh subunit, performing weight distribution on the feature matrix subjected to the dimension transformation by using a softmax function to obtain feature matrices with different weights, and performing feature transformation on the feature matrices with different weights by using 1×1 convolution respectively to obtain the output of the global embedded attention module;
s300: and inputting the preprocessed image data to be classified into a global embedded attention residual error network for classification.
2. The method of claim 1, wherein after the global embedded attention residual network construction is completed, a training sample is selected and preprocessed to train the network, a verification sample is selected and preprocessed to adjust parameters of the trained network, and a test sample is selected to perform performance test on the trained network.
3. The method of claim 1, wherein the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation is decomposed by adopting average pooling along the W 'and the H' directions, and the obtained one-dimensional feature matrix along the W 'direction and the obtained one-dimensional feature matrix along the H' direction are respectively expressed as:
$$z^{H}(i) = \frac{1}{W'}\sum_{j=1}^{W'} y(i,j), \qquad z^{W}(j) = \frac{1}{H'}\sum_{i=1}^{H'} y(i,j)$$

where H′ and W′ represent the height and width of the image output by the global-context spatial attention submodule, z^H and z^W represent the one-dimensional feature matrices along the H′ direction and the W′ direction respectively, and i and j index the ith row along H′ and the jth column along W′.
4. The method of claim 1, wherein the cascading of the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction is performed by:
g=K(z w +z h )
where g represents the output of the cascade operation and K represents a 1 x 1 convolution.
5. The method of claim 1, wherein the output of the global embedded attention module is represented as:
z = X(i,j) × a_c + X(i,j) × b_c

with

$$a_c = \frac{e^{A\,g^{H}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}, \qquad b_c = \frac{e^{B\,g^{W}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}$$

where A and B represent random numbers whose initial values are A_c and B_c, a_c and b_c represent the weights corresponding to the feature matrices with different weights, e represents the base of the natural logarithm, and g^H and g^W are the two matrices obtained by passing the dimension-transformed feature matrix of the sixth subunit through the ReLU activation function and splitting it along the spatial dimension; the dimension-transformed feature matrix has dimension R^{C×(W+H)}, and after splitting the g^H and g^W feature matrices have dimensions R^{C×H} and R^{C×W} respectively.
6. The method according to claim 2, wherein the image data to be classified and the training, validation and test samples are preprocessed according to the following steps:
s201: performing horizontal and vertical overturning on the image data to be classified and the image data in the training sample, the verification sample and the test sample;
s202: rotating the flipped image data clockwise or anticlockwise;
s203: scaling the rotated image data;
s204: and carrying out average reduction processing on the zoomed image data.
7. The method according to claim 6, wherein in step S204, the process of reducing the mean value of the scaled image data is performed by:
$$Z = v - \frac{1}{n}\sum_{i=1}^{n} v_i$$

where Z is the image after the mean is subtracted, v is the pixel matrix of the current image, v_i is the pixel matrix of the ith of the n images, and n is the number of images, taken as the integer 100000.
CN202110487497.4A 2021-04-30 2021-04-30 Method for classifying images by constructing global embedded attention residual network Active CN113111970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110487497.4A CN113111970B (en) 2021-04-30 2021-04-30 Method for classifying images by constructing global embedded attention residual network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110487497.4A CN113111970B (en) 2021-04-30 2021-04-30 Method for classifying images by constructing global embedded attention residual network

Publications (2)

Publication Number Publication Date
CN113111970A CN113111970A (en) 2021-07-13
CN113111970B (en) 2023-12-26

Family

ID=76720844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110487497.4A Active CN113111970B (en) 2021-04-30 2021-04-30 Method for classifying images by constructing global embedded attention residual network

Country Status (1)

Country Link
CN (1) CN113111970B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034375B (en) * 2022-08-09 2023-06-27 北京灵汐科技有限公司 Data processing method and device, neural network model, equipment and medium
CN115203380B (en) * 2022-09-19 2022-12-20 山东鼹鼠人才知果数据科技有限公司 Text processing system and method based on multi-mode data fusion
CN116958711B (en) * 2023-09-19 2023-12-15 华东交通大学 Lead-zinc ore image classification model construction method, system, storage medium and equipment


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140633A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text topic extraction method, apparatus, electronic device, and storage medium
CN111199214A (en) * 2020-01-04 2020-05-26 西安电子科技大学 Residual error network multispectral image ground feature classification method
CN111259982A (en) * 2020-02-13 2020-06-09 苏州大学 Premature infant retina image classification method and device based on attention mechanism
CN112163601A (en) * 2020-09-14 2021-01-01 华南理工大学 Image classification method, system, computer device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hierarchical feature fusion attention network for image super-resolution reconstruction; Lei Pengcheng; Liu Cong; Tang Jiangang; Peng Dunlu; Journal of Image and Graphics (中国图象图形学报), No. 09; full text *

Also Published As

Publication number Publication date
CN113111970A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN113111970B (en) Method for classifying images by constructing global embedded attention residual network
US20230021497A1 (en) Generating images using neural networks
CN111626300B (en) Image segmentation method and modeling method of image semantic segmentation model based on context perception
CN109190695B (en) Fish image classification method based on deep convolutional neural network
CN108734661B (en) High-resolution image prediction method for constructing loss function based on image texture information
CN110706214B (en) Three-dimensional U-Net brain tumor segmentation method fusing condition randomness and residual error
CN111127472B (en) Multi-scale image segmentation method based on weight learning
CN112418261B (en) Human body image multi-attribute classification method based on prior prototype attention mechanism
CN113706544B (en) Medical image segmentation method based on complete attention convolutional neural network
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN115965789A (en) Scene perception attention-based remote sensing image semantic segmentation method
CN116168197A (en) Image segmentation method based on Transformer segmentation network and regularization training
CN116503399A (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN116091823A (en) Single-feature anchor-frame-free target detection method based on fast grouping residual error module
CN115171052A (en) Crowded crowd attitude estimation method based on high-resolution context network
CN113298931B (en) Reconstruction method and device of object model, terminal equipment and storage medium
CN114882278A (en) Tire pattern classification method and device based on attention mechanism and transfer learning
CN111724306B (en) Image reduction method and system based on convolutional neural network
CN117373064A (en) Human body posture estimation method based on self-adaptive cross-dimension weighting, computer equipment and storage medium
CN113436224A (en) Intelligent image clipping method and device based on explicit composition rule modeling
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN116385454A (en) Medical image segmentation method based on multi-stage aggregation
CN110930314A (en) Image banding noise suppression method and device, electronic device and storage medium
CN114897884A (en) No-reference screen content image quality evaluation method based on multi-scale edge feature fusion
CN114782779B (en) Small sample image feature learning method and device based on feature distribution migration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant