CN110766063B - Image classification method based on compressed excitation and tightly connected convolutional neural network - Google Patents
Image classification method based on compressed excitation and tightly connected convolutional neural network
- Publication number
- CN110766063B CN110766063B CN201910987689.4A CN201910987689A CN110766063B CN 110766063 B CN110766063 B CN 110766063B CN 201910987689 A CN201910987689 A CN 201910987689A CN 110766063 B CN110766063 B CN 110766063B
- Authority
- CN
- China
- Prior art keywords
- convolutional neural
- neural network
- tensor
- picture
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000013527 convolutional neural network Methods 0.000 title claims abstract description 68
- 230000005284 excitation Effects 0.000 title claims abstract description 31
- 238000000034 method Methods 0.000 title claims abstract description 24
- 238000012549 training Methods 0.000 claims abstract description 28
- 238000012360 testing method Methods 0.000 claims abstract description 26
- 230000006835 compression Effects 0.000 claims abstract description 19
- 238000007906 compression Methods 0.000 claims abstract description 19
- 230000000694 effects Effects 0.000 claims abstract description 4
- 230000006870 function Effects 0.000 claims description 28
- 238000007781 pre-processing Methods 0.000 claims description 7
- 230000008569 process Effects 0.000 claims description 7
- 238000009826 distribution Methods 0.000 claims description 6
- 238000011478 gradient descent method Methods 0.000 claims description 6
- 230000004044 response Effects 0.000 claims description 5
- 230000007246 mechanism Effects 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 238000011176 pooling Methods 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 abstract description 3
- 238000000605 extraction Methods 0.000 description 5
- 238000011161 development Methods 0.000 description 3
- 238000013473 artificial intelligence Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000008859 change Effects 0.000 description 1
- 238000013145 classification model Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image classification method based on compression excitation and a tightly-connected convolutional neural network, which combines a lightweight tightly-connected convolutional neural network (DenseNet) with a high-performance squeeze-and-excitation (SE) module: the convolutional neural network is trained, a loss function is calculated, and the network is updated by gradient descent; the convolutional neural network is then tested and its classification accuracy calculated; these steps are repeated, and the highest accuracy together with the corresponding convolutional neural network model parameters is saved, yielding the convolutional neural network model with the best effect. The squeeze-and-excitation module explicitly models the interdependencies between channels at a small computational cost; compared with traditional convolutional neural network image classification methods, the method obtains high-accuracy image classification results with a small number of parameters and little computation.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an image classification method based on compressed excitation and a tightly connected convolutional neural network.
Background
In 2016, AlphaGo played a closely watched man-machine Go match against the world champion and professional nine-dan player Lee Sedol, winning four games to one. Terms such as artificial intelligence and deep learning have since entered the public's field of view, and we have entered an AI era for everyone. A simple photo of a face often contains a great deal of information, such as age, gender, race and appearance, which can be obtained from the picture and classified using related techniques in the field of artificial intelligence.
Image classification refers to the process of automatically assigning images to a set of predefined categories according to certain classification rules. The basic process of image classification is divided into two parts: training and testing. Training comprises three stages: (1) data preprocessing; (2) feature extraction and representation; (3) classifier design and learning. Testing is likewise divided into three stages, the first two being the same as in training: (1) data preprocessing; (2) feature extraction and representation; (3) classification decision. The performance of image classification is closely related to the feature extraction and classification methods used.
Image feature extraction is the basis of image classification. Traditional methods rely on manually designed features, and whether such features are reasonable brings great uncertainty to classification performance; the emergence of the convolutional neural network solved this problem. In 2012, Krizhevsky et al. proposed AlexNet, which took first place in the ImageNet competition with performance far exceeding the runner-up; from then on, convolutional neural networks and deep learning have received extensive attention and development. New models have emerged at a rapid pace, such as ZF-Net in 2013, GoogLeNet and VGG in 2014, ResNet in 2015 and DenseNet in 2016. Convolutional neural networks do not require manual feature extraction and can automatically learn complex and useful features at various levels directly from large image datasets; for example, the lowest-level features may be basic line and edge features, while the highest-level features describe contours and object parts. The convolutional neural network is also an end-to-end image classification method: an end-to-end model comprises a plurality of modules, each designed for a specific task with its own input and output; one end is the original image, the other end is the final output, and the modules together complete the final task.
Convolutional neural networks have remarkably improved image classification performance and greatly promoted the development of computer vision. However, as convolutional neural networks have developed, network depth has been continuously increased to improve model accuracy, and models have grown ever larger; this greatly increases the computational cost and requires more computing resources and image data to train the network. There is therefore a need for an image classification method that reduces computational cost while improving convolutional neural network performance.
Disclosure of Invention
The invention aims to: aiming at the defects of the prior art, the invention provides an image classification method based on compressed excitation and tightly connected convolutional neural network, which can improve the performance of the convolutional neural network.
The technical scheme is as follows: the invention discloses an image classification method based on a compression excitation module and a tightly connected convolutional neural network, which comprises the following steps:
(1) Preprocessing the collected pictures containing the category labels, converting the pictures into tensors, and forming a training set and a testing set;
(2) Training a convolutional neural network, prescribing training times, inputting a picture tensor of a training set into the tightly-connected convolutional neural network combined with a compression excitation module, inputting an output result into a softmax function, calculating the probability that the picture belongs to each category, and marking the probability as a prediction label;
(3) Comparing the prediction label obtained in the step (2) with a category label contained in the picture, calculating the deviation between the prediction label and an actual label through a loss function, calculating the gradient of the convolutional neural network parameter according to the loss function, and updating the network parameter by using a gradient descent method;
(4) Testing the convolutional neural network, inputting the picture tensor of the test set into the updated convolutional neural network to obtain a prediction label of each test picture, comparing the prediction label with the category label contained in the picture, calculating and recording the prediction accuracy of the convolutional neural network, and storing the model parameters of the convolutional neural network;
(5) Repeating the step (2), the step (3) and the step (4), obtaining the prediction accuracy of the updated convolutional neural network on the test set, comparing it with the previous prediction accuracy, and storing the higher accuracy and the corresponding convolutional neural network model parameters;
(6) After the specified training times are reached, stopping training and testing, outputting the highest accuracy and storing the corresponding convolutional neural network parameters, and obtaining the convolutional neural network model with the best effect.
Further, the preprocessing of the picture in the step (1) is realized by the following formulas:

x_0 = (X − μ)/σ

x_1 = (x_0 − min(x_0))/(max(x_0) − min(x_0))

wherein μ is the mean of the pictures, X represents the picture tensor, σ represents the standard deviation, max represents the maximum value of the picture tensor, min represents the minimum value of the picture tensor, x_0 represents the standardized picture tensor, and x_1 represents the picture tensor normalized to [0, 1].
Further, the ratio of the training set to the test set in the step (1) is 5:1.
Further, the step (2) includes the steps of:
(21) Each convolution layer contains a series of nonlinear transforms F_l(·) comprising normalization (BN), modified linear units (ReLU) and convolution operations (Conv), l representing the layer number:

y = Σ_{i=1}^{D} w_i * x_i

wherein x = [x_1, x_2, …, x_D] is a tensor input with D channels, w_i is the weight of the convolution kernel on the corresponding i-th channel, and * denotes convolution; the output tensor size of the convolution layer satisfies the following formula:

O = (I − K + 2P)/S + 1

wherein O is the size of the output tensor, I is the size of the input tensor, K is the size of the convolution kernel, P is the zero-filling number, S is the moving step length, and the number of channels of the output tensor is equal to the number of convolution kernels;
(22) Each layer of the DenseNet is directly connected with all previous layers; the input of the l-th layer is the concatenation of all previous outputs, and the output of the l-th layer can be expressed as:

y_l = F_l(x_l) = F_l([x_0, y_1, …, y_{l−1}])

wherein x_l = [x_0, y_1, …, y_{l−1}] is the input of the l-th layer and y_l is its output; the prerequisite for the concatenation operation is that the tensor sizes of x_0 and of y_1 through y_{l−1} are unchanged, which is ensured by a 3x3 convolution kernel with stride 1 and zero padding of size 1, i.e. K=3, P=1, S=1;
(23) Denote y_l = [y_1, y_2, …, y_C], where C is the number of channels of y_l and is equal to the number of convolution kernels; y_l is then input into the squeeze-and-excitation (SE) module. First the squeeze operation generates a channel descriptor z = [z_1, z_2, …, z_C] through global average pooling:

z_c = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} y_c(i, j)

wherein H·W is the spatial size of tensor y_l; the channel descriptor z obtained by squeezing the spatial features contains global spatial information. The excitation operation is then performed, using a gating mechanism containing a sigmoid function to fully capture the channel dependencies, as shown in the following formula:

s = σ(g(z, W)) = σ(W_2 δ(W_1 z))

wherein σ denotes the sigmoid function, δ denotes the ReLU function, and the two linear layers (FC) with parameters W_1 and W_2 form a bottleneck; each channel of the tensor is called a feature map, and s is multiplied element-wise with the feature maps along the channel dimension, giving each feature map a weight that represents its importance within the tensor in the global receptive field, according to the formula:

y_c := s_c · y_c

The excitation operation thus recalibrates the feature responses of the output y_l = [y_1, y_2, …, y_C] on each channel; the output y_l = [y_1, y_2, …, y_C] is transmitted to the next layer and the above process is repeated; finally, the output of the tightly connected network combined with the squeeze-and-excitation module is input into a softmax function, the probability that the picture belongs to each category is calculated, and the result is recorded as the prediction label Ŷ of the picture.
Further, the step (3) includes the steps of:
(31) Calculating the deviation between the predicted label and the actual label through a cross-entropy loss function; given two probability distributions p and q, the cross-entropy loss of p expressed through q is:

L(p, q) = −Σ_x p(x) log q(x)

wherein p represents the true label Y of the picture and q represents the predicted label Ŷ; the smaller the cross entropy, the closer the two probability distributions, namely the closer the predicted label is to the real label;
(32) Calculating the gradients of the convolutional neural network parameters θ_i from the cross-entropy loss function and updating the parameters of the network using the gradient descent method:

θ_i := θ_i − α ∂L(θ_i)/∂θ_i

wherein L(θ_i) represents the loss function with θ_i as parameter, and α represents the learning rate, which controls the speed of gradient descent.
The beneficial effects are that: compared with the prior art, the invention has the following beneficial effects: the tightly connected convolutional neural network takes the outputs of all layers before the current layer as input, realizing feature reuse and improving parameter efficiency, so that the model can obtain good performance with only a small number of parameters; the squeeze-and-excitation module explicitly models the interdependencies between channels and adaptively recalibrates the channel-wise feature responses, realizing feature selection that selectively emphasizes informative features and suppresses useless ones; their combination not only reduces the scale and parameter count of the model but also greatly improves the performance of the convolutional neural network.
Drawings
FIG. 1 is a structural flow diagram of a tightly-coupled convolutional neural network based on a combined compressive excitation module;
fig. 2 is a network architecture diagram based on a tightly-coupled convolutional neural network incorporating a compressed excitation module.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention discloses a lightweight image classification method based on a tightly-connected convolutional neural network combined with a squeeze-and-excitation module, aimed at the problem that current convolutional neural network image classification models have grown excessively large in pursuit of performance. The tightly connected convolutional neural network (DenseNet) takes the outputs of all layers before the current layer as input, realizing feature reuse and improving parameter efficiency, so that the model can obtain good performance with only a small number of parameters. The squeeze-and-excitation (SE) module explicitly models the interdependencies between channels and adaptively recalibrates channel-wise feature responses, realizing feature selection that selectively emphasizes informative features and suppresses useless ones. By combining the two, the performance of the convolutional neural network is greatly improved; this embodiment greatly reduces the parameter count of the model on the premise of ensuring high image classification accuracy, and the structure of the model is shown in fig. 2.
As shown in fig. 1, the present invention specifically includes the following steps:
1. image preprocessing to form training set and test set
(1) The dataset contains 60,000 pictures, each with a label Y. The dataset is first divided into a training set of 50,000 pictures and a test set of 10,000 pictures.
(2) All pictures are cropped to a fixed 32x32 shape, then randomly horizontally flipped to expand the training dataset, and finally converted into tensors X. The tensors are standardized using the channel mean and standard deviation, and the picture tensors are then normalized to between 0 and 1. The core formulas of this process are:

x_0 = (X − μ)/σ

x_1 = (x_0 − min(x_0))/(max(x_0) − min(x_0))

wherein μ is the mean of the pictures, X represents the picture tensor, σ represents the standard deviation, max represents the maximum value of the picture tensor, min represents the minimum value of the picture tensor, x_0 represents the standardized picture tensor, and x_1 represents the picture tensor normalized to [0, 1].
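The two-step normalization above can be sketched with NumPy as follows; the function name and the toy picture values are illustrative assumptions, not part of the patent.

```python
import numpy as np

def preprocess(X, mu, sigma):
    """Standardize a picture tensor with mean/std, then rescale to [0, 1],
    following the two-step formula described in the text."""
    x0 = (X - mu) / sigma                          # x0: standardized tensor
    x1 = (x0 - x0.min()) / (x0.max() - x0.min())   # x1: scaled to [0, 1]
    return x0, x1

# Toy 2x2 single-channel "picture" with pixel values in [0, 255]
X = np.array([[0.0, 64.0], [128.0, 255.0]])
x0, x1 = preprocess(X, mu=X.mean(), sigma=X.std())
```

After this step, x0 has zero mean and unit variance, and x1 lies entirely in [0, 1], ready to be fed to the network.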
2. Training convolutional neural networks
(1) The number of training iterations is specified, the picture tensors of the training set obtained in step 1 are input into the tightly-connected convolutional neural network combined with the squeeze-and-excitation module, and the input picture tensor enters the tightly connected block after passing through a convolution layer. Each convolution layer contains a series of nonlinear transforms F_l(·) comprising normalization (BN), modified linear units (ReLU) and convolution operations (Conv), l representing the layer number. The core formula of this process is:

y = Σ_{i=1}^{D} w_i * x_i

wherein x = [x_1, x_2, …, x_D] is a tensor input with D channels, w_i is the weight of the convolution kernel on the corresponding i-th channel, and * denotes convolution. The output tensor size of the convolution layer satisfies the following formula:

O = (I − K + 2P)/S + 1

wherein O is the size of the output tensor, I is the size of the input tensor, K is the size of the convolution kernel, P is the zero-padding number, S is the moving step length, and the number of channels of the output tensor is equal to the number of convolution kernels.
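The output-size relation can be checked with a small helper function (an illustrative sketch, not part of the patent):

```python
def conv_output_size(I, K, P, S):
    """O = (I - K + 2P) / S + 1: spatial size of a conv layer's output
    for input size I, kernel size K, zero padding P and stride S."""
    return (I - K + 2 * P) // S + 1

# The 3x3, stride-1, padding-1 kernels used in the dense blocks
# keep the spatial size of a 32x32 input unchanged:
print(conv_output_size(32, 3, 1, 1))  # 32
```

This size-preserving configuration (K=3, P=1, S=1) is exactly the prerequisite for the concatenation operation in the dense blocks below.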
(2) Each layer of the DenseNet is directly connected with all previous layers, so the input of the l-th layer is the concatenation of all previous outputs, and the output of the l-th layer can be expressed as:

y_l = F_l(x_l) = F_l([x_0, y_1, …, y_{l−1}])

wherein x_l = [x_0, y_1, …, y_{l−1}] is the input of the l-th layer and y_l is its output. The prerequisite for the concatenation operation is that the tensor sizes of x_0 and of y_1 through y_{l−1} are unchanged; we use a 3x3 convolution kernel with stride 1 and zero padding of size 1, i.e. K=3, P=1, S=1.
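The dense connectivity just described, where each layer consumes the concatenation of all earlier outputs, can be sketched with NumPy; the toy layer standing in for F_l (a constant map with growth rate k = 2) is an assumption for illustration only.

```python
import numpy as np

def dense_block(x0, layers):
    """Each layer receives the channel-wise concatenation of all
    previous outputs (DenseNet connectivity), here with toy F_l."""
    features = [x0]
    for F in layers:
        x_l = np.concatenate(features, axis=0)  # [x0, y1, ..., y_{l-1}]
        features.append(F(x_l))
    return features

# Toy F_l: maps any input to k = 2 new "feature maps" of the same
# spatial size, so the channel count grows by k per layer.
k = 2
toy_layer = lambda x: np.ones((k,) + x.shape[1:]) * x.mean()
feats = dense_block(np.zeros((3, 4, 4)), [toy_layer] * 3)
# Channel counts seen by successive layers: 3, 5, 7
```

Because spatial sizes never change inside the block, only the channel dimension grows, which is what makes the concatenation well defined.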
(3) Denote y_l = [y_1, y_2, …, y_C], where C is the number of channels of y_l and is equal to the number of convolution kernels; y_l is then input into the squeeze-and-excitation (SE) module. First the squeeze operation generates a channel descriptor z = [z_1, z_2, …, z_C] through global average pooling; the core formula is:

z_c = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} y_c(i, j)

wherein H·W is the spatial size of tensor y_l. The channel descriptor z obtained by squeezing the spatial features contains global spatial information. The excitation operation is then performed, using a gating mechanism containing a sigmoid function to fully capture the channel dependencies, as shown in the following formula:

s = σ(g(z, W)) = σ(W_2 δ(W_1 z))

wherein σ represents the sigmoid function, δ represents the ReLU function, and the two linear layers (FC) with parameters W_1 and W_2 form a bottleneck, which reduces the number of parameters and fits more complex nonlinear relationships. Each channel of the tensor is called a feature map; s is multiplied element-wise with the feature maps along the channel dimension, giving each feature map a weight that represents its importance within the tensor in the global receptive field, according to the formula:

y_c := s_c · y_c

The excitation operation thus recalibrates the feature responses of the output y_l = [y_1, y_2, …, y_C] on each channel. The output y_l = [y_1, y_2, …, y_C] is then transmitted to the next layer and the above process is repeated; finally, the output of the tightly connected network combined with the squeeze-and-excitation module is input into a softmax function, the probability that the picture belongs to each category is calculated, and the result is recorded as the prediction label Ŷ of the picture.
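A minimal NumPy sketch of the squeeze-and-excitation computation described above; the random weights W_1 and W_2 stand in for learned bottleneck parameters, and the reduction ratio r = 2 is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_module(y, W1, W2):
    """Squeeze-and-excitation on a (C, H, W) tensor: squeeze by global
    average pooling, excite with a two-layer bottleneck (ReLU then
    sigmoid), then rescale each feature map by its weight s_c."""
    z = y.mean(axis=(1, 2))                    # squeeze: channel descriptor z
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0))    # excitation weights s in (0, 1)
    return y * s[:, None, None]                # recalibrate: y_c := s_c * y_c

rng = np.random.default_rng(0)
C, r = 8, 2                                    # r: bottleneck reduction ratio
W1 = rng.normal(size=(C // r, C))
W2 = rng.normal(size=(C, C // r))
y = rng.normal(size=(C, 4, 4))
out = se_module(y, W1, W2)
```

Since every s_c lies in (0, 1), the module can only attenuate feature maps relative to one another, which is how it emphasizes informative channels and suppresses useless ones.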
3. Calculating a loss function, updating the network according to gradient descent
(1) The prediction label Ŷ from step 2 is compared with the class label Y carried by the picture, and the deviation between the predicted label and the actual label is calculated through a cross-entropy loss function. Given two probability distributions p and q, the cross-entropy loss of p expressed through q is:

L(p, q) = −Σ_x p(x) log q(x)

wherein p represents the true label Y of the picture and q represents the predicted label Ŷ; the smaller the cross entropy, the closer the two probability distributions, namely the closer the predicted label is to the real label. Suppose the pictures fall into three classes, the class label of a certain picture is Y = (1, 0, 0), and the model outputs the prediction label Ŷ = (0.5, 0.4, 0.1) after softmax regression; then the cross entropy is:

L((1,0,0),(0.5,0.4,0.1)) = −(1×log 0.5 + 0×log 0.4 + 0×log 0.1) ≈ 0.3
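The worked example can be reproduced directly; note the base-10 logarithm, which matches the ≈0.3 result in the text (with the natural logarithm the value would be ≈0.69).

```python
import math

def cross_entropy(p, q):
    """L(p, q) = -sum_x p(x) * log q(x), using the base-10 log implied
    by the worked example; terms with p(x) = 0 contribute nothing."""
    return -sum(pi * math.log10(qi) for pi, qi in zip(p, q) if pi > 0)

loss = cross_entropy((1, 0, 0), (0.5, 0.4, 0.1))
print(round(loss, 2))  # 0.3
```

As the predicted distribution concentrates more mass on the true class, this loss decreases toward zero.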
(2) The gradients of the convolutional neural network parameters θ_i are calculated from the cross-entropy loss function, and the parameters of the network are then updated using the gradient descent method, shown in the following formula:

θ_i := θ_i − α ∂L(θ_i)/∂θ_i

wherein L(θ_i) represents the loss function with θ_i as parameter, and α represents the learning rate, which controls the speed of gradient descent.
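The update rule can be sketched as a plain Python step; the quadratic toy loss and learning rate below are illustrative assumptions, not values from the patent.

```python
def gradient_descent_step(theta, grad, alpha):
    """theta_i := theta_i - alpha * dL/dtheta_i for each parameter."""
    return [t - alpha * g for t, g in zip(theta, grad)]

# One update on the toy loss L(theta) = theta^2 (gradient 2*theta),
# with learning rate alpha = 0.1:
theta = [1.0]
theta = gradient_descent_step(theta, [2 * theta[0]], alpha=0.1)
print(theta)  # [0.8]
```

Repeating this step drives the toy parameter toward the minimum at zero; in the actual method the gradients come from backpropagating the cross-entropy loss.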
4. Testing convolutional neural network, calculating classification accuracy
(1) The picture tensors of the test set are input into the updated convolutional neural network to obtain the probability that each test picture belongs to each category, and the category with the highest probability is recorded as the picture's prediction label. Suppose the model outputs the prediction p = (0.7, 0.2, 0.1) after softmax regression; the predicted label is then denoted Ŷ = (1, 0, 0).
(2) The prediction labels Ŷ are compared with the real labels Y of the pictures, and the number of matches over the test set is counted, from which the prediction accuracy of the convolutional neural network is calculated; the model accuracy and the model parameters of the convolutional neural network are recorded.
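The accuracy computation amounts to counting matching labels over the test set; a small sketch with made-up label lists:

```python
def accuracy(pred_labels, true_labels):
    """Fraction of test pictures whose predicted class index
    matches the true class index."""
    correct = sum(p == t for p, t in zip(pred_labels, true_labels))
    return correct / len(true_labels)

# Hypothetical class indices for four test pictures
acc = accuracy([0, 2, 1, 1], [0, 2, 2, 1])
print(acc)  # 0.75
```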
5. Steps 2, 3 and 4 are repeated: the parameters of the convolutional neural network are updated, the prediction accuracy of the convolutional neural network on the test set is calculated and compared with the previous prediction accuracy, and the higher accuracy together with the corresponding convolutional neural network model is saved.
6. After the specified number of training iterations is reached (here set to 300), training and testing are stopped, the highest accuracy is output, and the corresponding convolutional neural network parameters and model are saved, yielding the convolutional neural network model with the best effect.
Claims (3)
1. An image classification method based on compressed excitation and tightly-connected convolutional neural network, which is characterized by comprising the following steps:
(1) Preprocessing the collected pictures containing the category labels, converting the pictures into tensors, and forming a training set and a testing set;
(2) Training a convolutional neural network, prescribing training times, inputting a picture tensor of a training set into the tightly-connected convolutional neural network combined with a compression excitation module, inputting an output result into a softmax function, calculating the probability that the picture belongs to each category, and marking the probability as a prediction label;
(3) Comparing the prediction label obtained in the step (2) with a category label contained in the picture, calculating the deviation between the prediction label and an actual label through a loss function, calculating the gradient of the convolutional neural network parameter according to the loss function, and updating the network parameter by using a gradient descent method;
(4) Testing the convolutional neural network, inputting the picture tensor of the test set into the updated convolutional neural network to obtain a prediction label of each test picture, comparing the prediction label with the category label contained in the picture, calculating and recording the prediction accuracy of the convolutional neural network, and storing the model parameters of the convolutional neural network;
(5) Repeating the step (2), the step (3) and the step (4), obtaining the prediction accuracy of the updated convolutional neural network on the test set, comparing it with the previous prediction accuracy, and storing the higher accuracy and the corresponding convolutional neural network model parameters;
(6) Stopping training and testing after the specified training times are reached, outputting the highest accuracy and storing the corresponding convolutional neural network parameters, and obtaining the convolutional neural network model with the best effect;
the step (2) comprises the following steps:
(21) Each convolution layer contains a series of nonlinear transforms F_l comprising normalization (BN), modified linear units (ReLU) and convolution operations (Conv), l representing the number of layers:

y = Σ_{i=1}^{D} w_i * x_i

wherein x = [x_1, x_2, …, x_D] is tensor input with D channels, w_i is the weight on the corresponding i-th channel of the convolution kernel, and the output tensor size of the convolution layer satisfies the following formula:

O = (I − K + 2P)/S + 1

wherein O is the size of the output tensor, I is the size of the input tensor, K is the size of the convolution kernel, P is the zero-filling number, S is the moving step length, and the number of channels of the output tensor is equal to the number of convolution kernels;
(22) Each layer of the DenseNet is directly connected with all previous layers, the input of the l-th layer is the concatenation of all previous outputs, and the output of the l-th layer can be expressed as:

y_l = F_l(x_l) = F_l([x_0, y_1, …, y_{l−1}])

wherein x_l = [x_0, y_1, …, y_{l−1}] is the input of the l-th layer, y_l is the output of the l-th layer, and the prerequisite for the concatenation operation is that the tensor sizes of x_0 and of y_1 through y_{l−1} are unchanged, using a 3x3 convolution kernel with stride 1 and zero padding of size 1, i.e. K=3, P=1, S=1;
(23) Denote y_l = [y_1, y_2, …, y_C], C being the number of channels of y_l, equal to the number of convolution kernels; y_l is input into the compression excitation module, where first a compression operation is performed, generating a channel descriptor z = [z_1, z_2, …, z_C] through global average pooling:

z_c = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} y_c(i, j)

wherein c indexes the channels and H·W is the spatial size of tensor y_l; the channel descriptor z obtained by compressing the spatial features contains global spatial information; an excitation operation is then performed to fully capture the channel dependency relationship using a gating mechanism containing a sigmoid function, as shown in the following formula:

s = σ(g(z, W)) = σ(W_2 δ(W_1 z))

wherein σ represents a sigmoid function, δ represents a ReLU function, and the two linear layers with parameters W_1 and W_2 form a bottleneck layer; each channel of the tensor is called a feature map, and s is multiplied element-wise with the feature maps along the channel dimension to obtain the weight of each feature map, representing the importance of each feature map within the tensor in the global receptive field, according to the formula:

y_c := s_c · y_c

the excitation operation recalibrates the feature response of the output y_l = [y_1, y_2, …, y_C] on each channel; the output y_l = [y_1, y_2, …, y_C] is transmitted to the next layer and the above process is repeated; finally, the output of the tightly connected network combined with the compression excitation module is input into a softmax function, the probability that the picture belongs to each category is calculated and recorded as the prediction label Ŷ of the picture;
the step (3) comprises the following steps:
(31) Calculating the deviation between the predicted label and the actual label through a cross-entropy loss function; given two probability distributions p and q, the cross-entropy loss of p expressed through q is:

L(p, q) = −Σ_x p(x) log q(x)

wherein p represents the label Y of the picture, q represents the predicted value Ŷ, and the smaller the cross entropy, the closer the two probability distributions, namely the closer the predicted label is to the real label;
(32) Calculating the gradients of the convolutional neural network parameters θ_i from the cross-entropy loss function and updating the parameters of the network using a gradient descent method:

θ_i := θ_i − α ∂L(θ_i)/∂θ_i

wherein L(θ_i) represents the loss function with θ_i as parameter, and α represents a learning rate for controlling the gradient descent speed.
2. The image classification method based on compressed excitation and tightly-connected convolutional neural network according to claim 1, wherein the picture preprocessing of step (1) is implemented by the following formula:
K_1 = (X − min) / (max − min),  x_0 = (K_1 − μ) / σ

wherein μ is the mean of the picture, X represents the picture tensor, σ represents the standard deviation, max represents the maximum value of the picture tensor, min represents the minimum value of the picture tensor, K_1 represents the min-max normalized picture tensor, and x_0 represents the standardized picture tensor.
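A sketch of this preprocessing in NumPy; note that composing the two steps in this order (min-max scaling to produce K_1, then standardization to produce x_0) is an assumption, since the claim only names the two normalized tensors:

```python
import numpy as np

def preprocess(x):
    """Min-max scale to [0, 1], then standardize to zero mean / unit variance.

    Assumption: K1 (min-max normalized) feeds x0 (standardized); the claim
    defines both tensors but the composition order is inferred.
    """
    k1 = (x - x.min()) / (x.max() - x.min())   # K_1: min-max normalization
    mu, sigma = k1.mean(), k1.std()
    x0 = (k1 - mu) / sigma                     # x_0: zero mean, unit variance
    return k1, x0

img = np.array([[0.0, 64.0], [128.0, 255.0]])  # toy 2x2 grayscale picture
k1, x0 = preprocess(img)
print(k1.min(), k1.max())        # 0.0 1.0
print(abs(x0.mean()) < 1e-9)     # True
```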
3. The image classification method based on compressed excitation and tightly connected convolutional neural network according to claim 1, wherein the ratio of the training set to the test set in step (1) is 5:1.
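The 5:1 ratio of claim 3 can be realized with a simple shuffled index split; the shuffling and the fixed seed are illustrative assumptions, as the claim does not specify how samples are assigned:

```python
import numpy as np

def split_5_to_1(indices, seed=0):
    """Shuffle sample indices and split them into training/test sets at 5:1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(indices)
    cut = len(idx) * 5 // 6        # 5 parts training, 1 part test
    return idx[:cut], idx[cut:]

train, test = split_5_to_1(np.arange(600))
print(len(train), len(test))  # 500 100
```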
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910987689.4A CN110766063B (en) | 2019-10-17 | 2019-10-17 | Image classification method based on compressed excitation and tightly connected convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910987689.4A CN110766063B (en) | 2019-10-17 | 2019-10-17 | Image classification method based on compressed excitation and tightly connected convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110766063A CN110766063A (en) | 2020-02-07 |
CN110766063B true CN110766063B (en) | 2023-04-28 |
Family
ID=69332111
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910987689.4A Active CN110766063B (en) | 2019-10-17 | 2019-10-17 | Image classification method based on compressed excitation and tightly connected convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110766063B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111598126A (en) * | 2020-04-08 | 2020-08-28 | 天津大学 | Lightweight traditional Chinese medicinal material identification method |
CN111523483B (en) * | 2020-04-24 | 2023-10-03 | 北京邮电大学 | Chinese meal dish image recognition method and device |
CN111709446B (en) * | 2020-05-14 | 2022-07-26 | 天津大学 | X-ray chest radiography classification device based on improved dense connection network |
CN111783558A (en) * | 2020-06-11 | 2020-10-16 | 上海交通大学 | Satellite navigation interference signal type intelligent identification method and system |
CN111832577A (en) * | 2020-07-19 | 2020-10-27 | 武汉悟空游人工智能应用软件有限公司 | Sensitivity prediction method based on dense connection |
CN112183468A (en) * | 2020-10-27 | 2021-01-05 | 南京信息工程大学 | Pedestrian re-identification method based on multi-attention combined multi-level features |
CN112464732B (en) * | 2020-11-04 | 2022-05-03 | 北京理工大学重庆创新中心 | Optical remote sensing image ground feature classification method based on double-path sparse hierarchical network |
CN112488003A (en) * | 2020-12-03 | 2021-03-12 | 深圳市捷顺科技实业股份有限公司 | Face detection method, model creation method, device, equipment and medium |
CN113222124B (en) * | 2021-06-28 | 2023-04-18 | 重庆理工大学 | SAUNet + + network for image semantic segmentation and image semantic segmentation method |
CN113642231A (en) * | 2021-07-09 | 2021-11-12 | 西北大学 | CNN-GRU landslide displacement prediction method based on compression excitation network and application |
CN115240006B (en) * | 2022-07-29 | 2023-09-19 | 南京航空航天大学 | Convolutional neural network optimization method and device for target detection and network structure |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108647775A (en) * | 2018-04-25 | 2018-10-12 | 陕西师范大学 | Super-resolution image reconstruction method based on full convolutional neural networks single image |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108647775A (en) * | 2018-04-25 | 2018-10-12 | 陕西师范大学 | Super-resolution image reconstruction method based on full convolutional neural networks single image |
Non-Patent Citations (3)
Title |
---|
SESR: Single Image Super Resolution with Recursive Squeeze and Excitation Networks; Xi Cheng et al.; IEEE; 2018-12-31; full text *
Image classification method based on an improved convolutional neural network; Hu Maonan et al.; Communication Technology (通信技术); 2018-11-10 (No. 11); full text *
Research on a potato deformity detection method based on deep learning; Wang Chenglong et al.; Journal of Huizhou University (惠州学院学报); 2018-06-28 (No. 03); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110766063A (en) | 2020-02-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110766063B (en) | Image classification method based on compressed excitation and tightly connected convolutional neural network | |
CN108647742B (en) | Rapid target detection method based on lightweight neural network | |
CN109389037B (en) | Emotion classification method based on deep forest and transfer learning | |
CN110084281A (en) | Image generating method, the compression method of neural network and relevant apparatus, equipment | |
CN109544524A (en) | A kind of more attribute image aesthetic evaluation systems based on attention mechanism | |
CN106803069A (en) | Crowd's level of happiness recognition methods based on deep learning | |
CN111582397B (en) | CNN-RNN image emotion analysis method based on attention mechanism | |
CN110175628A (en) | A kind of compression algorithm based on automatic search with the neural networks pruning of knowledge distillation | |
Jiang et al. | Cascaded subpatch networks for effective CNNs | |
CN111696101A (en) | Light-weight solanaceae disease identification method based on SE-Inception | |
CN113516133B (en) | Multi-modal image classification method and system | |
CN111832546A (en) | Lightweight natural scene text recognition method | |
CN112101364B (en) | Semantic segmentation method based on parameter importance increment learning | |
CN115223082A (en) | Aerial video classification method based on space-time multi-scale transform | |
CN112861659B (en) | Image model training method and device, electronic equipment and storage medium | |
CN113420651B (en) | Light weight method, system and target detection method for deep convolutional neural network | |
CN113051399A (en) | Small sample fine-grained entity classification method based on relational graph convolutional network | |
CN113269224A (en) | Scene image classification method, system and storage medium | |
US20230222768A1 (en) | Multiscale point cloud classification method and system | |
CN114168795B (en) | Building three-dimensional model mapping and storing method and device, electronic equipment and medium | |
CN113378812A (en) | Digital dial plate identification method based on Mask R-CNN and CRNN | |
CN114492634B (en) | Fine granularity equipment picture classification and identification method and system | |
CN114170657A (en) | Facial emotion recognition method integrating attention mechanism and high-order feature representation | |
CN113408418A (en) | Calligraphy font and character content synchronous identification method and system | |
CN117033609A (en) | Text visual question-answering method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||