CN113469198A - Image classification method based on improved VGG convolutional neural network model - Google Patents

Image classification method based on improved VGG convolutional neural network model

Info

Publication number: CN113469198A
Authority: CN (China)
Prior art keywords: neural network, module, convolutional neural, attention mechanism, network model
Legal status: Pending
Application number: CN202110734218.XA
Other languages: Chinese (zh)
Inventors: 刘一柳, 王志胜, 马瑞
Current Assignee: Nanjing University of Aeronautics and Astronautics
Original Assignee: Nanjing University of Aeronautics and Astronautics
Priority date / Filing date: 2021-06-30
Publication date: 2021-10-01
Application filed by Nanjing University of Aeronautics and Astronautics

Classifications

    • G — Physics; G06 — Computing; calculating or counting
    • G06F — Electric digital data processing; G06F18/00 — Pattern recognition; G06F18/20 — Analysing; G06F18/24 — Classification techniques
    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N — Computing arrangements based on specific computational models; G06N3/00 — Computing arrangements based on biological models; G06N3/02 — Neural networks; G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/048 — Activation functions
    • G06N3/08 — Learning methods


Abstract

The invention discloses an image classification method based on an improved VGG convolutional neural network model, comprising the following steps. Step 1: establish an attention mechanism module. Step 2: add the attention mechanism to the VGG convolutional neural network model to obtain an attention-based VGG convolutional neural network model. Step 3: train the attention-based VGG convolutional neural network model with the preprocessed training set and test its classification results with the preprocessed test set; when the number of training iterations reaches the preset maximum or the model converges, stop training to obtain the finally trained attention-based VGG convolutional neural network model. Step 4: classify images with the trained attention-based VGG convolutional neural network model. The invention improves image classification accuracy.

Description

Image classification method based on improved VGG convolutional neural network model
Technical Field
The invention belongs to the field of image classification.
Background
Thanks to the rapid development of hardware, deep learning has attracted great attention in computer vision. As a branch of deep learning, convolutional neural networks exhibit extremely strong capability when processing images. In image classification, convolutional neural networks such as VGG and ResNet realize a supervised learning process from feature extraction to classification in an end-to-end manner. However, convolutional neural networks convert features from low-level to high-level semantics through a large number of convolutional layers, which inevitably produces a great amount of feature redundancy. Attention mechanisms aim to let a convolutional neural network learn useful information effectively and eliminate redundant information, i.e., to make the network focus on more discriminative features while suppressing redundant ones. However, the channel attention of SENet acquires the global relationship through global average pooling and thereby loses much spatial information, while the hybrid attention of BAM establishes attention in the spatial domain and the channel domain separately but uses convolution kernels with local receptive fields in the spatial domain, so a global dependency relationship remains difficult to acquire.
Disclosure of Invention
Purpose of the invention: to solve the problems in the prior art, the invention provides an image classification method based on an improved VGG convolutional neural network model.
The technical scheme is as follows: the invention provides an image classification method based on an improved VGG convolutional neural network model, which specifically comprises the following steps:
Step 1: establish an attention mechanism module;
Step 2: add the attention mechanism to the VGG convolutional neural network model to obtain an attention-based VGG convolutional neural network model;
Step 3: preset a training set and a test set and preprocess their images; train the attention-based VGG convolutional neural network model with the preprocessed training set and test its classification results with the preprocessed test set, thereby adjusting the parameters of the model; when the number of training iterations reaches the preset maximum or the model converges, stop training to obtain the finally trained attention-based VGG convolutional neural network model;
Step 4: classify images with the attention-based VGG convolutional neural network model trained in step 3.
Further, the attention mechanism module in step 1 comprises an average pooling layer, a first dimension permutation module, a first self-attention module, a second dimension permutation module, a second self-attention module, a normalization layer and a calibration module;
the average pooling layer spatially average-pools the feature $U \in \mathbb{R}^{C \times H \times W}$ input to the attention mechanism to obtain $X$, where $C$ is the number of channels of the input feature, $H$ denotes its spatial height and $W$ denotes its spatial width;
the first dimension permutation module permutes the dimensions of $X$, specifically: $X$ is divided spatially and equally into $Q$ feature groups with $P$ elements per group, where $P$ and $Q$ are both hyperparameters and $P \times Q = H \times W$; the $t$-th elements of the $Q$ feature groups form the $t$-th column vector $X^L_t \in \mathbb{R}^{Q \times 1}$;
the number of first self-attention modules is $P$; $X^L_t$ serves as the input of the $t$-th first self-attention module, which produces the output
$Z^L_t = \mathrm{Softmax}\left(X^L_t \otimes (X^L_t)^{\mathrm{T}}\right) \otimes X^L_t, \qquad Z^L = [Z^L_1, \ldots, Z^L_t, \ldots, Z^L_P],$
where $\otimes$ denotes the inner product, $\mathrm{Softmax}$ is the probability-distribution function and $(\cdot)^{\mathrm{T}}$ is the transposition symbol;
the second dimension permutation module permutes the dimensions of $Z^L$, specifically: $Z^L$ is divided spatially and equally into $P$ feature groups with $Q$ elements per group; the $k$-th elements of the $P$ feature groups form the $k$-th column vector $Z^S_k \in \mathbb{R}^{P \times 1}$;
the number of second self-attention modules is $Q$; $Z^S_k$ is input to the $k$-th second self-attention module, which produces the output
$Y^S_k = \mathrm{Softmax}\left(Z^S_k \otimes (Z^S_k)^{\mathrm{T}}\right) \otimes Z^S_k, \qquad Y^S = [Y^S_1, \ldots, Y^S_k, \ldots, Y^S_Q];$
the normalization layer normalizes $Y^S$ with a sigmoid function to obtain $\tilde{Y} = \sigma(Y^S)$;
the calibration module calibrates the input feature $U$ according to the following formula to finally obtain the output of the attention mechanism:
$\hat{U} = \tilde{Y} \odot U,$
where $\odot$ denotes multiplying $\tilde{Y}$ with the corresponding spatial positions in $U$, propagated along the channel direction.
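For illustration, the per-column self-attention operation defined above can be written as a minimal PyTorch sketch; the shapes and names are assumptions for illustration, not the patented implementation:

    import torch
    import torch.nn.functional as F

    def column_self_attention(x):
        """Z = Softmax(x x^T) x for one column vector x of shape (Q, 1)."""
        a = F.softmax(x @ x.t(), dim=-1)  # (Q, Q) relation matrix, rows sum to 1
        return a @ x                      # (Q, 1) re-weighted column

    x_t = torch.randn(16, 1)              # hypothetical column with Q = 16
    z_t = column_self_attention(x_t)      # same shape as the input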
Further, the attention-based VGG convolutional neural network model in step 2 comprises a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module, a fifth feature extraction module, a fully connected layer and a Softmax classifier connected in sequence; the first and second feature extraction modules have the same structure, each comprising a first convolution operation module, a second convolution operation module, a first attention mechanism and a first max pooling layer connected in sequence; the third feature extraction module comprises a third convolution operation module, a fourth convolution operation module, a fifth convolution operation module, a second attention mechanism and a second max pooling layer connected in sequence; the fourth and fifth feature extraction modules have the same structure, each comprising a sixth convolution operation module, a seventh convolution operation module, an eighth convolution operation module and a third max pooling layer connected in sequence; the first, second, third, fourth, fifth and sixth convolution operation modules have the same structure, each comprising a convolutional layer, a ReLU activation function and a batch normalization layer connected in sequence; the first attention mechanism and the second attention mechanism are both the attention mechanism of step 1.
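A minimal sketch of how such a model could be assembled in PyTorch is given below; the attention_cls argument stands for the attention mechanism of step 1 (a sketch of it appears later in the description), and all names and details are illustrative assumptions rather than the inventors' code:

    import torch.nn as nn

    def conv_block(c_in, c_out):
        # "convolution operation module": 3x3 conv -> ReLU -> batch norm, as described
        return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                             nn.ReLU(inplace=True),
                             nn.BatchNorm2d(c_out))

    class AttentionVGG(nn.Module):
        def __init__(self, attention_cls, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                # stages 1-2: conv, conv, attention, max-pool
                conv_block(3, 64), conv_block(64, 64), attention_cls(), nn.MaxPool2d(2),
                conv_block(64, 128), conv_block(128, 128), attention_cls(), nn.MaxPool2d(2),
                # stage 3: conv x3, attention, max-pool
                conv_block(128, 256), conv_block(256, 256), conv_block(256, 256),
                attention_cls(), nn.MaxPool2d(2),
                # stages 4-5: conv x3, max-pool (no attention)
                conv_block(256, 512), conv_block(512, 512), conv_block(512, 512),
                nn.MaxPool2d(2),
                conv_block(512, 1024), conv_block(1024, 1024), conv_block(1024, 1024),
                nn.MaxPool2d(2))
            self.classifier = nn.Linear(1024, num_classes)  # Softmax is applied in the loss

        def forward(self, x):
            x = self.features(x).flatten(1)  # (N, 1024) after the 1x1 spatial output
            return self.classifier(x)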
Further, the preprocessing in step 3 specifically comprises sequentially performing horizontal flipping, mirroring, cropping and standard normalization on each image in the training set and the test set.
Advantageous effects:
1. The invention introduces an attention mechanism into the VGG convolutional neural network, so that the network emphasizes useful information more effectively and eliminates redundant information, improving its ability to discriminate features.
2. The attention mechanism of the invention forms self-attention groupings twice, at long and short distances in the spatial domain, finally establishing a global dependency relationship and acquiring better context information.
3. Convolution in a convolutional neural network captures only local relationships, so its receptive field is limited; the attention mechanism of the invention acquires a global relationship by combining long-distance and short-distance relationships, extending the receptive field to the whole feature map. This overcomes the locality of convolution in the VGG convolutional neural network, lets the network acquire better context information, and improves its classification performance.
Drawings
FIG. 1 is a network model diagram of the attention mechanism of the present invention.
FIG. 2 is a diagram of a VGG convolutional neural network classification model based on an attention mechanism.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention.
The embodiment provides an image classification method based on an improved VGG convolutional neural network model, which comprises the following steps:
(1) and acquiring a natural color image with a label from the CIFAR-10 data set, preprocessing all images, and dividing the data set into a training set and a testing set. There are 10 different types of images in the training set for this embodiment.
(2) And combining the attention mechanism with the VGG convolutional neural network model to obtain the VGG convolutional neural network based on the attention mechanism.
(3) And inputting the preprocessed images of the training set into a convolutional neural network of VGG (convolutional neural network) based on an attention mechanism, and adjusting network parameters to finish network model training. Then, inputting the images of the test set into the trained network model, and evaluating the quality of the network model by using the index of classification accuracy; and finally, classifying the images by adopting a trained attention-based VGG convolutional neural network model.
The image preprocessing in step (1) consists of sequentially flipping each image horizontally, mirroring, cropping and standard-normalizing it. The purpose is data augmentation, which improves generalization ability.
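Assuming standard torchvision transforms are acceptable stand-ins for these operations, the preprocessing could look like the following sketch; the normalization statistics are the commonly used CIFAR-10 values, an assumption not stated in the patent:

    import torchvision.transforms as T
    from torchvision.datasets import CIFAR10

    # horizontal flip (mirroring), random crop, and standard normalization
    train_tf = T.Compose([
        T.RandomHorizontalFlip(),             # flip / mirror augmentation
        T.RandomCrop(32, padding=4),          # cropping augmentation
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), # assumed CIFAR-10 channel means
                    (0.2470, 0.2435, 0.2616)) # assumed channel standard deviations
    ])
    train_set = CIFAR10(root="./data", train=True, download=True, transform=train_tf)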
In step (2), the attention mechanism is combined with the VGG convolutional neural network model by embedding the attention mechanism into the model. The attention mechanism establishes a global relation model in the spatial dimension, which solves the problem that convolutional layers in a convolutional network have a small receptive field. As shown in fig. 1, the attention mechanism processes the input features in the following steps:
Step 2.1: regard the input feature $U \in \mathbb{R}^{C \times H \times W}$ as a set of feature vectors in the spatial dimension, $U = \{u_{1,1}, \ldots, u_{i,j}, \ldots, u_{H,W}\}$ with $u_{i,j} \in \mathbb{R}^{C}$, where $C$ is the number of feature channels, $H$ the spatial height and $W$ the spatial width; $(i, j)$ denotes a spatial position, $i$ the $i$-th row, $i \in \{1, 2, \ldots, H\}$, and $j$ the $j$-th column, $j \in \{1, 2, \ldots, W\}$. Compress the input feature along the channel dimension with the average pooling layer, taking the mean at each spatial position, to obtain $X = \{x_{1,1}, \ldots, x_{i,j}, \ldots, x_{H,W}\}$, where the element $x_{i,j} = \frac{1}{C} \sum_{c=1}^{C} u_{i,j}(c)$ is the average of the features at spatial position $(i, j)$;
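Step 2.1 is simply a channel-wise mean at every spatial position; a two-line sketch under assumed shapes:

    import torch

    u = torch.randn(1, 64, 16, 16)      # input feature U: (N, C, H, W)
    x = u.mean(dim=1, keepdim=True)     # X: (N, 1, H, W); x[0, 0, i, j] averages the C channels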
Step 2.2: establish the long-distance relation in the spatial dimension, specifically:
Step 2.2.1: using the first dimension permutation module, divide $X$ spatially and equally into $Q$ feature groups with $P$ elements per group (i.e., into $P \times Q$ parts), where $P$ and $Q$ are both hyperparameters and $P \times Q = H \times W$; in this embodiment $P = 4$ and, accordingly, $Q = (H \times W)/4$. Permute the dimensions of $X$ according to
$X^L = \mathrm{Permute}(X) = [X^L_1, \ldots, X^L_t, \ldots, X^L_P]^{\mathrm{T}}$
to obtain the long-distance relation between any element of $X^L_t$ and all elements of $X^L_t$; here $\mathrm{Permute}$ denotes dimension permutation, $(\cdot)^{\mathrm{T}}$ is the transposition symbol, and $X^L_t \in \mathbb{R}^{Q \times 1}$ is the $t$-th column vector (the $t$-th set), formed by the $t$-th element of each of the $Q$ feature groups into which the input $X$ was divided.
Step 2.2.2: adopt first self-attention modules, $P$ in number, one column vector per first self-attention module. The $t$-th first self-attention module first computes, for $X^L_t$,
$A_t = \mathrm{Softmax}\left(X^L_t \otimes (X^L_t)^{\mathrm{T}}\right),$
where $\otimes$ is the inner product and $\mathrm{Softmax}$ maps to a probability distribution; $A_t \in \mathbb{R}^{Q \times Q}$ is the long-distance relation matrix, which represents the similarity between two positions and whose elements are constrained between 0 and 1 by the Softmax function.
The $t$-th first self-attention module then multiplies the long-distance relation matrix $A_t$ with the self-attention input $X^L_t$ to obtain the relation between each position and all positions, specifically:
$Z^L_t = A_t \otimes X^L_t,$
where $Z^L_t$ denotes the self-attention output. All $P$ column vectors of the first self-attention modules are merged, i.e.
$Z^L = [Z^L_1, \ldots, Z^L_t, \ldots, Z^L_P].$
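A sketch of step 2.2 under the stated grouping; how the Q groups tile the H × W plane is not fixed by the text, so row-major grouping is an assumption here:

    import torch
    import torch.nn.functional as F

    def grouped_self_attention(cols):
        """cols: (num_vectors, length, 1); applies Z = Softmax(x x^T) x per column."""
        a = F.softmax(cols @ cols.transpose(-2, -1), dim=-1)  # relation matrices
        return a @ cols

    P = 4
    x = torch.randn(16, 16)                     # pooled map X for one image (H = W = 16)
    Q = x.numel() // P                          # Q = H*W/P = 64
    groups = x.reshape(Q, P)                    # Q groups of P elements (assumed layout)
    cols_long = groups.t().unsqueeze(-1)        # (P, Q, 1): the P long-distance columns
    z_long = grouped_self_attention(cols_long)  # Z^L, shape (P, Q, 1)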
Step 2.3: establishing a short-distance relation in a spatial dimension, which is similar to the established long-distance relation in the spatial dimension, specifically:
step 2.3.1: using a second dimension permutation module to permute ZLThe space is divided into P characteristic groups in average, one characteristic group has Q elements (also divided into Q multiplied by P parts in space), and the Z is calculated according to the following short distance relation formulaLAnd (3) processing:
Figure BDA00031409470400000515
Figure BDA0003140947040000061
indicates to input ZLDividing the vector into P feature groups and then forming a kth column vector by the kth element in each feature group;
step 2.3.2: with Q second self-attentive modules, one for each column vector, the method will be described
Figure BDA0003140947040000062
As input to the kth second self-attention mode, an output is obtained
Figure BDA0003140947040000063
Figure BDA0003140947040000064
Wherein,
Figure BDA0003140947040000065
Figure BDA0003140947040000066
for a short distance relationship matrix, representing the similarity between two positions, the matrix elements are constrained between 0-1 by the Softmax function.
All Q column vector combinations applying self-attention, i.e.
Figure BDA0003140947040000067
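Step 2.3 reuses the same operation with the opposite grouping; continuing the sketch from step 2.2:

    # regroup Z^L into P groups of Q elements; the k-th elements across groups
    # form the k-th short-distance column vector
    cols_short = z_long.squeeze(-1).t().unsqueeze(-1)  # (Q, P, 1): Q columns of length P
    y_short = grouped_self_attention(cols_short)       # Y^S, shape (Q, P, 1)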
Step 2.4: normalizing Y by using sigmoid functionSNormalizing, and then adopting a calibration module to recalibrate the original characteristic input U, specifically comprising the following steps:
Figure BDA0003140947040000068
wherein
Figure BDA0003140947040000069
Represents
Figure BDA00031409470400000610
Multiplied by the corresponding spatial position in U and propagated along the channel direction.
Figure BDA00031409470400000611
The output of the overall attention mechanism is shown.
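Putting steps 2.1 to 2.4 together, a parameter-free module could look like the following sketch; the grouping layout and all names are assumptions consistent with the earlier snippets:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionModule(nn.Module):
        """Spatial attention per steps 2.1-2.4: channel-mean pooling, long- and
        short-distance grouped self-attention, sigmoid gating, recalibration."""
        def __init__(self, p=4):
            super().__init__()
            self.p = p

        @staticmethod
        def _self_attn(cols):  # cols: (N, V, L, 1)
            a = F.softmax(cols @ cols.transpose(-2, -1), dim=-1)
            return a @ cols

        def forward(self, u):  # u: (N, C, H, W)
            n, c, h, w = u.shape
            q = h * w // self.p
            x = u.mean(dim=1).reshape(n, q, self.p)                # steps 2.1 + grouping
            z = self._self_attn(x.transpose(1, 2).unsqueeze(-1))   # step 2.2: (N, P, Q, 1)
            y = self._self_attn(z.squeeze(-1).transpose(1, 2).unsqueeze(-1))  # step 2.3
            gate = torch.sigmoid(y.squeeze(-1)).reshape(n, 1, h, w)  # step 2.4: sigmoid gate
            return u * gate                      # broadcast along the channel dimension

    # usage: u_hat = AttentionModule(p=4)(torch.randn(2, 64, 16, 16))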
In step (2), the structure of the attention-based VGG convolutional neural network is shown in fig. 2. The preprocessed training set is taken as the input of the attention-based VGG convolutional neural network and passes sequentially through a first, second, third, fourth, fifth and sixth processing stage, of which the first to fifth are feature extraction stages. The preprocessed images have a resolution of 32 × 32 and 3 channels (R, G, B).
First processing stage: after two 3 × 3 convolution operation modules (the convolution combination in fig. 2; each convolution operation module comprises a 3 × 3 convolutional layer, a ReLU activation function layer and a BN batch normalization layer connected in sequence) and the attention mechanism, the number of output feature channels is 64 and the output is (32, 32, 64); after a 2 × 2 max pooling layer the output is (16, 16, 64).
The second stage has the same processing flow as the first: after two 3 × 3 convolution operation modules and the attention mechanism, the number of output feature channels is 128 and the output is (16, 16, 128); after a 2 × 2 max pooling layer the output is (8, 8, 128).
Third stage: after three 3 × 3 convolution operation modules and the attention mechanism, the number of output feature channels is 256 and the output is (8, 8, 256); after a 2 × 2 max pooling layer the output is (4, 4, 256).
Fourth stage: after three 3 × 3 convolution operation modules, the number of output feature channels is 512 and the output is (4, 4, 512); after a 2 × 2 max pooling layer the output is (2, 2, 512).
The fifth stage has the same processing flow as the fourth: after three 3 × 3 convolution operation modules, the number of output feature channels is 1024 and the output is (2, 2, 1024); after a 2 × 2 max pooling layer the output is (1, 1, 1024).
Sixth stage: the fully connected layer outputs the features (1, 1, 10), which are classified by a Softmax classifier.
The CIFAR-10 data set used in this embodiment contains 60K color pictures in 10 classes, each with a resolution of 32 × 32; 50K pictures are used as the training set and 10K pictures as the test set.
in the embodiment, during training, a model of TOP-1 precision performance is selected from a test set. In training, parameters are adjusted by a random gradient descent method, wherein the momentum is 0.8, and the weight attenuation is 5 e-4. The batch size is 64. The initial learning rate is 0.1, and every 50 traversal times (epochs) is reduced to 0.1 of the original learning rate for a total of 150 traversal times (epochs). All experiments used a GeForce RTX 2080Ti GPU.
In this embodiment, the preprocessed pictures are input to the attention-based VGG convolutional neural network model for training. After each update of the model parameters, the classification accuracy is tested on the test set, until the model has been trained for 150 epochs; training is then complete and the model parameters are kept fixed. The test-set accuracy serves as the basis for evaluating the classification quality of the neural network (a measurement sketch follows table 1). Table 1 compares the classification accuracy of the VGG convolutional neural network with the embedded attention mechanism against the original VGG convolutional neural network:
TABLE 1

Classification network | Parameters (M) | Accuracy (%)
VGG neural network | 31.256 | 92.16
Attention-based VGG neural network | 31.257 | 92.53
As can be seen from table 1, the attention-based VGG convolutional neural network model of this embodiment improves classification accuracy while introducing almost no additional parameters.
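For completeness, the TOP-1 test accuracy used as the evaluation basis above could be measured with a sketch like this; the test loader is assumed analogous to the training one, and all names are assumptions:

    import torch

    @torch.no_grad()
    def top1_accuracy(net, loader, device="cuda"):
        net.eval()
        correct = total = 0
        for images, labels in loader:
            preds = net(images.to(device)).argmax(dim=1)   # TOP-1 class prediction
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
        return 100.0 * correct / total                     # percent, as in table 1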
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (4)

1. An image classification method based on an improved VGG convolutional neural network model is characterized by comprising the following steps:
Step 1: establish an attention mechanism module;
Step 2: add the attention mechanism to the VGG convolutional neural network model to obtain an attention-based VGG convolutional neural network model;
Step 3: preset a training set and a test set and preprocess their images; train the attention-based VGG convolutional neural network model with the preprocessed training set and test its classification results with the preprocessed test set, thereby adjusting the parameters of the model; when the number of training iterations reaches the preset maximum or the model converges, stop training to obtain the finally trained attention-based VGG convolutional neural network model;
Step 4: classify images with the attention-based VGG convolutional neural network model trained in step 3.
2. The image classification method based on the improved VGG convolutional neural network model as claimed in claim 1, characterized in that the attention mechanism module in step 1 comprises an average pooling layer, a first dimension permutation module, a first self-attention module, a second dimension permutation module, a second self-attention module, a normalization layer and a calibration module;
the average pooling layer spatially average-pools the feature $U \in \mathbb{R}^{C \times H \times W}$ input to the attention mechanism to obtain $X$, where $C$ is the number of channels of the input feature, $H$ denotes its spatial height and $W$ denotes its spatial width;
the first dimension permutation module permutes the dimensions of $X$, specifically: $X$ is divided spatially and equally into $Q$ feature groups with $P$ elements per group, where $P$ and $Q$ are both hyperparameters and $P \times Q = H \times W$; the $t$-th elements of the $Q$ feature groups form the $t$-th column vector $X^L_t \in \mathbb{R}^{Q \times 1}$;
the number of first self-attention modules is $P$; $X^L_t$ serves as the input of the $t$-th first self-attention module, which produces the output
$Z^L_t = \mathrm{Softmax}\left(X^L_t \otimes (X^L_t)^{\mathrm{T}}\right) \otimes X^L_t, \qquad Z^L = [Z^L_1, \ldots, Z^L_t, \ldots, Z^L_P],$
where $\otimes$ denotes the inner product, $\mathrm{Softmax}$ is the probability-distribution function and $(\cdot)^{\mathrm{T}}$ is the transposition symbol;
the second dimension permutation module permutes the dimensions of $Z^L$, specifically: $Z^L$ is divided spatially and equally into $P$ feature groups with $Q$ elements per group; the $k$-th elements of the $P$ feature groups form the $k$-th column vector $Z^S_k \in \mathbb{R}^{P \times 1}$;
the number of second self-attention modules is $Q$; $Z^S_k$ is input to the $k$-th second self-attention module, which produces the output
$Y^S_k = \mathrm{Softmax}\left(Z^S_k \otimes (Z^S_k)^{\mathrm{T}}\right) \otimes Z^S_k, \qquad Y^S = [Y^S_1, \ldots, Y^S_k, \ldots, Y^S_Q];$
the normalization layer normalizes $Y^S$ with a sigmoid function to obtain $\tilde{Y} = \sigma(Y^S)$;
the calibration module calibrates the input feature $U$ according to the following formula to finally obtain the output of the attention mechanism:
$\hat{U} = \tilde{Y} \odot U,$
where $\odot$ denotes multiplying $\tilde{Y}$ with the corresponding spatial positions in $U$, propagated along the channel direction.
3. The image classification method based on the improved VGG convolutional neural network model as claimed in claim 1, characterized in that the attention-based VGG convolutional neural network model in step 2 comprises a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module, a fifth feature extraction module, a fully connected layer and a Softmax classifier connected in sequence; the first and second feature extraction modules have the same structure, each comprising a first convolution operation module, a second convolution operation module, a first attention mechanism and a first max pooling layer connected in sequence; the third feature extraction module comprises a third convolution operation module, a fourth convolution operation module, a fifth convolution operation module, a second attention mechanism and a second max pooling layer connected in sequence; the fourth and fifth feature extraction modules have the same structure, each comprising a sixth convolution operation module, a seventh convolution operation module, an eighth convolution operation module and a third max pooling layer connected in sequence; the first, second, third, fourth, fifth and sixth convolution operation modules have the same structure, each comprising a convolutional layer, a ReLU activation function and a batch normalization layer connected in sequence; the first attention mechanism and the second attention mechanism are both the attention mechanism of step 1.
4. The image classification method based on the improved VGG convolutional neural network model as claimed in claim 1, wherein the preprocessing in step 3 is to sequentially perform horizontal flipping, mirroring, cropping and standard normalization on each image in the training set and the test set.
Application CN202110734218.XA — priority date 2021-06-30, filing date 2021-06-30 — Image classification method based on improved VGG convolutional neural network model — status: Pending — published as CN113469198A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110734218.XA | 2021-06-30 | 2021-06-30 | Image classification method based on improved VGG convolutional neural network model

Publications (1)

Publication Number | Publication Date
CN113469198A | 2021-10-01

Family ID: 77874250

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110734218.XA (Pending) | Image classification method based on improved VGG convolutional neural network model | 2021-06-30 | 2021-06-30

Country Status (1)

Country | Link
CN | CN113469198A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382676A (en) * 2020-02-25 2020-07-07 南京大学 Sand image classification method based on attention mechanism
CN111461190A (en) * 2020-03-24 2020-07-28 华南理工大学 Deep convolutional neural network-based non-equilibrium ship classification method
CN111695494A (en) * 2020-06-10 2020-09-22 上海理工大学 Three-dimensional point cloud data classification method based on multi-view convolution pooling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程起上 (Cheng Qishang), "Research on the design theory and methods of convolutional neural network substructures for image classification", China Doctoral Dissertations Full-text Database, No. 04, 15 April 2021 (2021-04-15), pages 51-53 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118142A (en) * 2021-11-05 2022-03-01 西安晟昕科技发展有限公司 Method for identifying radar intra-pulse modulation type
CN114092819A (en) * 2022-01-19 2022-02-25 成都四方伟业软件股份有限公司 Image classification method and device
CN116058852A (en) * 2023-03-09 2023-05-05 同心智医科技(北京)有限公司 Classification system, method, electronic device and storage medium for MI-EEG signals
CN116058852B (en) * 2023-03-09 2023-12-22 同心智医科技(北京)有限公司 Classification system, method, electronic device and storage medium for MI-EEG signals

Similar Documents

Publication Publication Date Title
CN113469198A (en) Image classification method based on improved VGG convolutional neural network model
CN112241766B (en) Liver CT image multi-lesion classification method based on sample generation and transfer learning
CN108537192B (en) Remote sensing image earth surface coverage classification method based on full convolution network
CN109191382B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN110659727A (en) Sketch-based image generation method
CN106951911A (en) A kind of quick multi-tag picture retrieval system and implementation method
CN112966667B (en) Method for identifying one-dimensional distance image noise reduction convolution neural network of sea surface target
Singh et al. Steganalysis of digital images using deep fractal network
CN113642445B (en) Hyperspectral image classification method based on full convolution neural network
KR102567128B1 (en) Enhanced adversarial attention networks system and image generation method using the same
CN116188836A (en) Remote sensing image classification method and device based on space and channel feature extraction
CN116363423A (en) Knowledge distillation method, device and storage medium for small sample learning
CN114373104A (en) Three-dimensional point cloud semantic segmentation method and system based on dynamic aggregation
CN115512368A (en) Cross-modal semantic image generation model and method
CN112183602A (en) Multi-layer feature fusion fine-grained image classification method with parallel rolling blocks
CN115272766A (en) Hyperspectral image classification method based on hybrid Fourier operator Transformer network
CN115222754A (en) Mirror image segmentation method based on knowledge distillation and antagonistic learning
CN113344146B (en) Image classification method and system based on double attention mechanism and electronic equipment
CN110851627A (en) Method for describing sun black subgroup in full-sun image
CN114723784A (en) Pedestrian motion trajectory prediction method based on domain adaptation technology
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN117372777A (en) Compact shelf channel foreign matter detection method based on DER incremental learning
CN112232129A (en) Electromagnetic information leakage signal simulation system and method based on generation countermeasure network
CN114998725B (en) Hyperspectral image classification method based on self-adaptive spatial spectrum attention kernel generation network
CN115984949A (en) Low-quality face image recognition method and device with attention mechanism

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination