CN113610144A - Vehicle classification method based on multi-branch local attention network - Google Patents

Vehicle classification method based on multi-branch local attention network

Info

Publication number
CN113610144A
Authority
CN
China
Prior art keywords
vehicle
branch
attention
local attention
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110881344.8A
Other languages
Chinese (zh)
Inventor
周平
陈晨
闫如根
赵吉祥
胡昌隆
吕强
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zenmorn Hefei Technology Co ltd
Original Assignee
Zenmorn Hefei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zenmorn Hefei Technology Co ltd filed Critical Zenmorn Hefei Technology Co ltd
Priority to CN202110881344.8A priority Critical patent/CN113610144A/en
Publication of CN113610144A publication Critical patent/CN113610144A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vehicle classification method based on a multi-branch local attention network. The method comprises the following steps: obtaining vehicle pictures to be classified and dividing them into a training set and a test set; inputting the training set of vehicle pictures into a vehicle classification model constructed with the multi-branch local attention network for training, wherein the multi-branch local attention network comprises a channel-based convolution attention module and a space-based local attention module; and performing classification prediction on the test set of vehicle pictures with the trained vehicle classification model to obtain the classification result. According to the invention, a multi-branch local attention structure is added to the original ResNet-50 model; this structure acquires information from different neighborhoods in the feature map more accurately, thereby enhancing the expressiveness of key features and improving vehicle classification accuracy. Meanwhile, the multi-branch attention structure is portable and can be embedded into other network models.

Description

Vehicle classification method based on multi-branch local attention network
Technical Field
The invention relates to the technical field of deep learning and computer vision, in particular to a vehicle classification method based on a multi-branch local attention network.
Background
In recent years, Intelligent Transportation Systems (ITS) have developed rapidly, and advances in computer vision and deep learning provide opportunities to apply them more effectively. Computer vision uses computers to simulate the human visual ability: information is extracted from images of objective objects, processed and understood, and finally applied to real life, for example to vehicle classification.
Conventional vision-based vehicle classification approaches typically rely on hand-crafted features such as Color Histograms (CH), Texture Descriptors (TD) and GIST, or on representations generated by encoding local features, such as BoVW, IFK and SPM. These methods are time-consuming and labor-intensive, generalize poorly, and are susceptible to environmental changes and occlusion. With the rapid development of deep learning theory and practice, target detection and classification based on deep learning have entered a new stage. Unlike traditional feature extraction algorithms, a convolutional neural network has a certain invariance to geometric transformation, deformation and illumination, can cope with changes in vehicle appearance and with occlusion, and can adaptively describe features constructed under the drive of the training data, giving it higher flexibility and stronger overall capability.
Existing convolutional neural network methods still have shortcomings: key information and redundant information are not clearly distinguished, and the feature recognition capability of the model is still not strong enough. During multiple layers of convolution and pooling, a large amount of important information is lost, so the extracted features cannot represent the target well.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a vehicle classification method based on a multi-branch local attention network to solve the problems described in the background art.
A vehicle classification method based on a multi-branch local attention network comprises the following specific steps:
obtaining a picture of a vehicle to be classified and dividing the picture into a training set and a test set;
inputting a training set of vehicle pictures into a vehicle classification model constructed with a multi-branch local attention network for training, wherein the multi-branch local attention network comprises a channel-based convolution attention module and a space-based local attention module;
and carrying out classification prediction on the test set of the vehicle picture according to the trained vehicle classification model to obtain a classification result.
As a further aspect of the invention: the specific steps of obtaining the picture of the vehicle to be classified and dividing the picture into a training set and a test set comprise:
acquiring a vehicle scene image, and dividing the vehicle scene image into a training set and a test set, wherein the ratio of the training set to the test set is 4: 1;
the training set comprises M training pictures, X = {X_1, X_2, …, X_m, …, X_M}, and the test set comprises N test pictures, Y = {Y_1, Y_2, …, Y_n, …, Y_N}, where X_m denotes the m-th training picture and Y_n denotes the n-th test picture;
and labeling the training set and the test set so that each label corresponds one-to-one to its original picture.
As a further aspect of the invention: the specific steps of inputting the training set of vehicle pictures into the vehicle classification model constructed with the multi-branch local attention network for training, wherein the multi-branch local attention network comprises a channel-based convolution attention module and a space-based local attention module, comprise:
constructing a multi-branch local attention network structure based on a ResNet-50 model;
introducing a channel-based convolution attention module as a first branch;
introducing a space-based local attention module as a second branch;
the two branches are fused by a parallel method, and the fusion formula is as follows:
F′ = σ(M_C(F) × M_S(F)) × F;
where M_C(F) denotes the channel attention, M_S(F) denotes the spatial attention, and σ(·) is the sigmoid activation function;
and embedding a multi-branch local attention network structure into each bottleneck layer of the ResNet-50 to obtain a vehicle classification model based on the multi-branch local attention network.
As a further aspect of the invention: the specific steps of introducing the channel-based convolution attention module as the first branch include:
aggregating the spatial information of the feature map using global average pooling and global max pooling operations, i.e. obtaining the descriptors F^c_avg and F^c_max;
combining F^c_avg and F^c_max using a 2 × 1 convolution;
and finally adding a multi-layer perceptron to learn the final channel attention feature map M_C(F), with the formula:
M_C(F) = MLP(f^{2×1}([F^c_avg; F^c_max])) = W_1(W_0(f^{2×1}([F^c_avg; F^c_max])));
where W_0 ∈ R^{(C/r)×C} and W_1 ∈ R^{C×(C/r)} are the weights of the multi-layer perceptron MLP, r is the compression ratio, and f^{2×1} denotes a convolution operation with a filter size of 2 × 1.
As a further aspect of the invention: the specific steps of using the space-based local attention module as a second branch include:
applying local spatial max pooling and average pooling in parallel to all channels of the original feature map, with the kernel and stride both equal to ε;
generating two compressed spatial attention descriptors F_M and F_A by aggregating the features within all ε-neighborhoods of F:
F_A = AvgPool_{kernel, stride=ε}(F);
F_M = MaxPool_{kernel, stride=ε}(F);
compressing F_M and F_A along the channel direction using global max pooling and global average pooling to generate the descriptors F^s_max and F^s_avg, which are concatenated together;
then applying a 3 × 3 dilated (hole) convolution and a nearest-neighbor interpolation operation in turn to obtain the spatial attention feature map:
M_S(F) = σ(f_nearest(f([F_max(F_M); F_avg(F_A)])));
where MaxPool and AvgPool denote the local max pooling and average pooling operations in the spatial domain with kernel and stride ε, F_max(·) and F_avg(·) are the max pooling and average pooling operations along the channel direction, f(·) denotes the 3 × 3 dilated convolution, f_nearest(·) is the nearest-neighbor interpolation upsampling operator, and σ(·) is the sigmoid activation function.
As a further aspect of the invention: the specific steps of carrying out classification prediction on the test set of the vehicle picture according to the trained vehicle classification model to obtain a classification result comprise:
setting a loss function, adjusting the size of the vehicle image in the training set, inputting the vehicle image into a multi-branch local attention network for training to obtain a trained network model;
and using the fully connected layer of the trained network model with a Softmax function to perform classification prediction on the vehicle images in the test set to obtain the classification result.
Compared with the prior art, the invention has the following technical effects:
by adopting the technical scheme, the vehicle pictures are obtained, then the vehicle classification model of the multi-branch local attention network is constructed on the basis of the RestNet-50 model, wherein the vehicle classification model comprises a convolution attention module based on a channel and a local attention module based on a space, and then the vehicle pictures are predicted and classified by extracting the network full-connection layer and using a softmax function. The method can more accurately acquire the information of different neighborhoods in the feature map, enhance the expressive property of the features and improve the accuracy of vehicle classification. Meanwhile, the problem that the extracted features cannot well represent the target due to repeated pooling and convolution and loss of a large amount of important information is avoided.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a schematic step diagram of a vehicle classification method according to some embodiments disclosed herein;
FIG. 2 is a schematic network model flow diagram of some embodiments disclosed herein;
FIG. 3 is a schematic diagram of a network architecture of some embodiments disclosed herein;
FIG. 4 is a schematic diagram of a channel-based convolution attention module of some embodiments disclosed herein;
FIG. 5 is a schematic view of a spatial-based local attention module of some embodiments disclosed herein;
fig. 6 is a schematic diagram of a multi-branch attention structure integrated CNN unit of some embodiments disclosed herein.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 and fig. 2, in an embodiment of the present invention, a vehicle classification method based on a multi-branch local attention network includes:
s1, obtaining the picture of the vehicle to be classified and dividing the picture into a training set and a testing set, wherein the method comprises the following specific steps:
acquiring an image of a required vehicle in a specific scene, and dividing the image into a training set and a test set, wherein the ratio of the training set to the test set is 4: 1;
the training set comprises M training pictures, X = {X_1, X_2, …, X_m, …, X_M}, and the test set comprises N test pictures, Y = {Y_1, Y_2, …, Y_n, …, Y_N}, where X_m denotes the m-th training picture and Y_n denotes the n-th test picture;
and labeling the training set and the test set so that each label corresponds one-to-one to its original picture.
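To make this data-preparation step concrete, the following is a minimal PyTorch-style sketch of the 4:1 split and one-to-one labeling described above; the directory layout, normalization statistics and all other values are illustrative assumptions rather than details taken from the patent.

```python
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Resize to 224 x 224 and normalize, matching the preprocessing in step S31;
# the mean/std values are the usual ImageNet statistics (an assumption).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical directory layout: data/vehicles/<class_name>/*.jpg, so each
# picture is labeled one-to-one by its class sub-folder.
dataset = datasets.ImageFolder("data/vehicles", transform=transform)

# 4:1 split of the vehicle scene images into training and test sets.
n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
```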
S2, inputting a training set of vehicle pictures into a vehicle classification model constructed with a multi-branch local attention network for training, wherein the multi-branch local attention network comprises a channel-based convolution attention module and a space-based local attention module, specifically comprising the following steps:
s21, constructing a multi-branch local attention network structure based on the ResNet-50 model; specifically, as shown in fig. 3, a frame structure diagram of a vehicle image classification model is illustrated, a classic CNN model is combined with an attention mechanism, and an output dimension of the classification model is set as the total number of vehicle categories. And the convolutional neural network ResNet-50 is used as a backbone network of a classification model for extracting an original feature map from the vehicle image.
And S22, introducing a convolution attention module based on the channel as a first branch for calculating the weight of the feature map channel, and acquiring the feature map with updated channel attention based on the calculated weight of the feature map channel and the original feature map.
In an embodiment of step S22, as shown in fig. 4, the step of using the channel-based convolution attention module as the first branch includes:
the spatial information of the feature map is first aggregated using global average pooling and global max pooling operations, i.e. two 1 × 1 × C channel descriptors F^c_avg and F^c_max are obtained from average pooling and max pooling respectively.
Then F^c_avg and F^c_max are combined using a 2 × 1 convolution;
Finally, a multi-layer perceptron is added to learn the final channel attention feature map M_C(F). To reduce the parameter overhead, the size of the hidden layer is set to C/r, where r is the compression ratio. Experiments show that setting the compression ratio r to 16 gives better results. The formula is as follows:
M_C(F) = MLP(f^{2×1}([F^c_avg; F^c_max])) = W_1(W_0(f^{2×1}([F^c_avg; F^c_max])));
where W_0 ∈ R^{(C/r)×C} and W_1 ∈ R^{C×(C/r)} are the weights of the multi-layer perceptron MLP, and f^{2×1} denotes a convolution operation with a filter size of 2 × 1.
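As a concrete illustration of the channel branch, here is a hedged PyTorch sketch of the formula above: global average and max pooling produce two 1 × 1 × C descriptors, a 2 × 1 convolution merges them, and a shared MLP with hidden size C/r (r = 16) outputs M_C(F). The class name, the ReLU between the two MLP layers and the exact tensor shapes are my reading of the text, not code from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel branch: Mc(F) = MLP(f^{2x1}([F_avg; F_max])), a sketch."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        # 2x1 convolution that fuses the stacked avg/max descriptors
        self.fuse = nn.Conv2d(channels, channels, kernel_size=(2, 1))
        # shared MLP: C -> C/r -> C (ReLU in between is an assumed choice)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.avg_pool(x)                    # B x C x 1 x 1
        mx = self.max_pool(x)                     # B x C x 1 x 1
        stacked = torch.cat([avg, mx], dim=2)     # B x C x 2 x 1
        fused = self.fuse(stacked).view(b, c)     # B x C after the 2x1 conv
        return self.mlp(fused).view(b, c, 1, 1)   # Mc(F), B x C x 1 x 1
```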
And S23, introducing a local attention module based on space as a second branch for calculating the similarity of each feature in the feature map with other features to obtain the feature map after the spatial attention is updated.
In an embodiment of step S23, as shown in fig. 5, the step of using the space-based local attention module as the second branch includes:
spatial attention is achieved by using local similarity (spatial local pooling), and similarity calculation of non-adjacent image regions is achieved by calculating the similarity of each feature in the feature map to other features.
Local spatial max pooling and average pooling are applied in parallel to all channels of the original feature map, with the kernel and stride both equal to ε;
two compressed spatial attention descriptors F_M and F_A are generated by aggregating the features within all ε-neighborhoods of F:
F_A = AvgPool_{kernel, stride=ε}(F);
F_M = MaxPool_{kernel, stride=ε}(F);
F_M and F_A are then compressed along the channel direction using global max pooling and global average pooling, generating the descriptors F^s_max and F^s_avg, which represent the max-pooled and average-pooled features across the channels respectively. The feature descriptors are concatenated to reduce the computational cost;
to reduce the feature loss caused by the multi-step pooling operations, a 3 × 3 dilated (hole) convolution is added to further learn the features and to improve the nonlinear representation capability of the spatial attention descriptor. Finally, upsampling is performed by nearest-neighbor interpolation to obtain a spatial attention feature map M_S(F) of the same scale as the original input image:
M_S(F) = σ(f_nearest(f([F_max(F_M); F_avg(F_A)])));
where MaxPool and AvgPool denote the local max pooling and average pooling operations in the spatial domain with kernel and stride ε, F_max(·) and F_avg(·) are the max pooling and average pooling operations along the channel direction, f(·) denotes the 3 × 3 dilated convolution, f_nearest(·) is the nearest-neighbor interpolation upsampling operator, and σ(·) is the sigmoid activation function.
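The spatial branch can be sketched in the same hedged way: ε × ε local max/average pooling (kernel = stride = ε), channel-wise max/average squeezing, a 3 × 3 dilated (hole) convolution, and nearest-neighbor upsampling back to the input resolution, with the sigmoid of the M_S(F) formula at the end. The value of ε, the dilation rate and the padding are assumptions, not values stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialLocalAttention(nn.Module):
    """Spatial branch: Ms(F) = sigmoid(f_nearest(f([Fmax(FM); Favg(FA)]))), a sketch."""
    def __init__(self, epsilon: int = 2, dilation: int = 2):
        super().__init__()
        self.local_max = nn.MaxPool2d(kernel_size=epsilon, stride=epsilon)
        self.local_avg = nn.AvgPool2d(kernel_size=epsilon, stride=epsilon)
        # 3x3 dilated convolution over the 2 stacked channel descriptors
        self.conv = nn.Conv2d(2, 1, kernel_size=3,
                              dilation=dilation, padding=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        f_m = self.local_max(x)                               # F_M, B x C x H/e x W/e
        f_a = self.local_avg(x)                               # F_A, B x C x H/e x W/e
        # squeeze along the channel direction: max of F_M, average of F_A
        desc = torch.cat([f_m.max(dim=1, keepdim=True).values,
                          f_a.mean(dim=1, keepdim=True)], dim=1)   # B x 2 x H/e x W/e
        attn = self.conv(desc)                                # B x 1 x H/e x W/e
        attn = F.interpolate(attn, size=(h, w), mode="nearest")    # back to H x W
        return torch.sigmoid(attn)                            # Ms(F), B x 1 x H x W
```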
S24, fusing the two branches by a parallel method: as shown in fig. 6, the channel-attention-updated feature map and the spatial-attention-updated feature map output by the above steps are connected in parallel and embedded into each layer of the backbone network;
the fusion formula is:
F′ = σ(M_C(F) × M_S(F)) × F;
where M_C(F) denotes the channel attention, M_S(F) denotes the spatial attention, and σ(·) is the sigmoid activation function; compared with a sequential method, the parallel method uses fewer activation functions and therefore has a larger characterization range and a stronger feature extraction capability.
And embedding a multi-branch local attention network structure into each bottleneck layer of the ResNet-50 to obtain a vehicle classification model based on the multi-branch local attention network.
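Putting the two branches together, a sketch of the parallel fusion F′ = σ(M_C(F) × M_S(F)) × F and of embedding it into every bottleneck layer of ResNet-50 might look as follows. It reuses the ChannelAttention and SpatialLocalAttention modules sketched above; the wrapper class, the number of classes and the insertion loop are illustrative assumptions, since the patent only states that the structure is embedded in each bottleneck layer. Read literally, the patent's formulas apply a sigmoid both inside M_S(F) and in the fusion, and the sketch follows that reading.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiBranchLocalAttention(nn.Module):
    """Parallel fusion of the channel branch and the spatial branch."""
    def __init__(self, channels: int, reduction: int = 16, epsilon: int = 2):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)   # first branch, Mc(F)
        self.spatial = SpatialLocalAttention(epsilon)          # second branch, Ms(F)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F' = sigmoid(Mc(F) x Ms(F)) x F, broadcast over channels and space
        return torch.sigmoid(self.channel(x) * self.spatial(x)) * x

class AttentiveBottleneck(nn.Module):
    """Runs an original ResNet-50 bottleneck, then refines its output feature map."""
    def __init__(self, bottleneck: nn.Module, channels: int):
        super().__init__()
        self.bottleneck = bottleneck
        self.attention = MultiBranchLocalAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.attention(self.bottleneck(x))

num_classes = 10                                   # assumed number of vehicle categories
backbone = models.resnet50(pretrained=True)        # ImageNet pre-trained backbone
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Wrap every bottleneck block in layer1..layer4 with the attention structure.
for layer in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
    for i, block in enumerate(layer):
        layer[i] = AttentiveBottleneck(block, block.conv3.out_channels)
```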
S3, carrying out classification prediction on the test set of the vehicle picture according to the trained vehicle classification model to obtain a classification result, and the specific steps comprise:
s31, setting a loss function, adjusting the size of the vehicle image in the training set, inputting the vehicle image into a multi-branch local attention network for training, and obtaining a trained network model;
specifically, the size of the vehicle image is adjusted to 224 × 224 pixels, and then the pixels of the image are normalized (normalization) to be used as the input of the classification model. The loss function is a cross entropy loss function.
When training, the parameters of a CNN model pre-trained on ImageNet are loaded; rather than keeping the parameters of the classification model fixed (frozen), fine-tuning is performed on this basis, for example all parameters of the backbone network are fine-tuned by gradient descent.
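A minimal training-loop sketch for step S31 under the assumptions above (ImageNet-initialised backbone from the previous sketch, 224 × 224 inputs from the earlier data sketch, cross-entropy loss, plain SGD fine-tuning of all parameters); the optimiser settings, batch size and epoch count are illustrative, not values from the patent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = backbone.to(device)                        # ResNet-50 with the attention blocks
criterion = nn.CrossEntropyLoss()                  # cross-entropy loss, as in step S31
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # assumed settings

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model.train()
for epoch in range(30):                            # assumed number of epochs
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()                           # gradient-descent fine-tuning of all parameters
```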
And S32, performing classification prediction on the vehicle images in the test set through the fully connected layer of the trained network and a Softmax function to obtain the classification result.
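Continuing the previous sketches, a matching illustration of step S32: the trained model's fully connected layer produces class scores, and a Softmax turns them into per-class probabilities from which the predicted vehicle category is taken.

```python
import torch
from torch.utils.data import DataLoader

test_loader = DataLoader(test_set, batch_size=32, shuffle=False)

model.eval()
predictions = []
with torch.no_grad():
    for images, _ in test_loader:
        logits = model(images.to(device))          # scores from the fully connected layer
        probs = torch.softmax(logits, dim=1)       # Softmax over the vehicle categories
        predictions.extend(probs.argmax(dim=1).tolist())
```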
In some embodiments, the backbone network in step S2 is ResNet-50; other networks may be selected instead, such as AlexNet, VGGNet, GoogLeNet or DenseNet.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims (6)

1. A vehicle classification method based on a multi-branch local attention network is characterized by comprising the following specific steps:
obtaining a picture of a vehicle to be classified and dividing the picture into a training set and a test set;
inputting a training set of vehicle pictures into a vehicle classification model constructed with a multi-branch local attention network for training, wherein the multi-branch local attention network comprises a channel-based convolution attention module and a space-based local attention module;
and carrying out classification prediction on the test set of the vehicle picture according to the trained vehicle classification model to obtain a classification result.
2. The method for classifying vehicles based on the multi-branch local attention network according to claim 1, wherein the specific steps of obtaining the images of the vehicles to be classified and dividing the images into a training set and a test set comprise:
acquiring a vehicle scene image, and dividing the vehicle scene image into a training set and a test set, wherein the ratio of the training set to the test set is 4: 1;
the training set comprises M training pictures, X = {X_1, X_2, …, X_m, …, X_M}, and the test set comprises N test pictures, Y = {Y_1, Y_2, …, Y_n, …, Y_N}, where X_m denotes the m-th training picture and Y_n denotes the n-th test picture;
and labeling the training set and the test set so that each label corresponds one-to-one to its original picture.
3. The method for classifying vehicles based on the multi-branch local attention network as claimed in claim 1, wherein the specific steps of inputting the training set of vehicle pictures into the vehicle classification model constructed with the multi-branch local attention network for training, the multi-branch local attention network comprising a channel-based convolution attention module and a space-based local attention module, comprise:
constructing a multi-branch local attention network structure based on a ResNet-50 model;
introducing a channel-based convolution attention module as a first branch;
introducing a space-based local attention module as a second branch;
the two branches are fused by a parallel method, and the fusion formula is as follows:
F′ = σ(M_C(F) × M_S(F)) × F;
where M_C(F) denotes the channel attention, M_S(F) denotes the spatial attention, and σ(·) is the sigmoid activation function;
and embedding a multi-branch local attention network structure into each bottleneck layer of the ResNet-50 to obtain a vehicle classification model based on the multi-branch local attention network.
4. The method for classifying vehicles based on a multi-branch local attention network according to claim 3, wherein the step of introducing the first branch by using the channel-based convolution attention module comprises:
aggregating the spatial information of the feature map using global average pooling and global max pooling operations, i.e. obtaining the descriptors F^c_avg and F^c_max;
combining F^c_avg and F^c_max using a 2 × 1 convolution;
and finally adding a multi-layer perceptron to learn the final channel attention feature map M_C(F), with the formula:
M_C(F) = MLP(f^{2×1}([F^c_avg; F^c_max])) = W_1(W_0(f^{2×1}([F^c_avg; F^c_max])));
where W_0 ∈ R^{(C/r)×C} and W_1 ∈ R^{C×(C/r)} are the weights of the multi-layer perceptron MLP, r is the compression ratio, and f^{2×1} denotes a convolution operation with a filter size of 2 × 1.
5. The method for classifying vehicles based on the multi-branch local attention network as claimed in claim 3, wherein the step of introducing the local attention module based on space as the second branch comprises:
applying local spatial max pooling and average pooling in parallel to all channels of the original feature map, with the kernel and stride both equal to ε;
generating two compressed spatial attention descriptors F_M and F_A by aggregating the features within all ε-neighborhoods of F:
F_A = AvgPool_{kernel, stride=ε}(F);
F_M = MaxPool_{kernel, stride=ε}(F);
compressing F_M and F_A along the channel direction using global max pooling and global average pooling to generate the descriptors F^s_max and F^s_avg, which are concatenated together;
then applying a 3 × 3 dilated (hole) convolution and a nearest-neighbor interpolation operation in turn to obtain the spatial attention feature map:
M_S(F) = σ(f_nearest(f([F_max(F_M); F_avg(F_A)])));
where MaxPool and AvgPool denote the local max pooling and average pooling operations in the spatial domain with kernel and stride ε, F_max(·) and F_avg(·) are the max pooling and average pooling operations along the channel direction, f(·) denotes the 3 × 3 dilated convolution, f_nearest(·) is the nearest-neighbor interpolation upsampling operator, and σ(·) is the sigmoid activation function.
6. The method for classifying the vehicle based on the multi-branch local attention network according to claim 1, wherein the step of performing classification prediction on the test set of the vehicle image according to the trained vehicle classification model to obtain the classification result comprises:
setting a loss function, adjusting the size of the vehicle image in the training set, inputting the vehicle image into a multi-branch local attention network for training to obtain a trained network model;
and using the fully connected layer of the trained network model with a Softmax function to perform classification prediction on the vehicle images in the test set to obtain the classification result.
CN202110881344.8A 2021-08-02 2021-08-02 Vehicle classification method based on multi-branch local attention network Pending CN113610144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110881344.8A CN113610144A (en) 2021-08-02 2021-08-02 Vehicle classification method based on multi-branch local attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110881344.8A CN113610144A (en) 2021-08-02 2021-08-02 Vehicle classification method based on multi-branch local attention network

Publications (1)

Publication Number Publication Date
CN113610144A true CN113610144A (en) 2021-11-05

Family

ID=78339077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110881344.8A Pending CN113610144A (en) 2021-08-02 2021-08-02 Vehicle classification method based on multi-branch local attention network

Country Status (1)

Country Link
CN (1) CN113610144A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114005078A (en) * 2021-12-31 2022-02-01 山东交通学院 Vehicle weight identification method based on double-relation attention mechanism
CN114299542A (en) * 2021-12-29 2022-04-08 北京航空航天大学 Video pedestrian re-identification method based on multi-scale feature fusion
CN116058852A (en) * 2023-03-09 2023-05-05 同心智医科技(北京)有限公司 Classification system, method, electronic device and storage medium for MI-EEG signals
CN116563615A (en) * 2023-04-21 2023-08-08 南京讯思雅信息科技有限公司 Bad picture classification method based on improved multi-scale attention mechanism
CN117421984A (en) * 2023-11-02 2024-01-19 中国人民解放军海军航空大学 Short-term track prediction method based on multi-modal learning
CN117636057A (en) * 2023-12-13 2024-03-01 石家庄铁道大学 Train bearing damage classification and identification method based on multi-branch cross-space attention model
CN117710755A (en) * 2024-02-04 2024-03-15 江苏未来网络集团有限公司 Vehicle attribute identification system and method based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242127A (en) * 2020-01-15 2020-06-05 上海应用技术大学 Vehicle detection method with granularity level multi-scale characteristics based on asymmetric convolution
CN111639692A (en) * 2020-05-25 2020-09-08 南京邮电大学 Shadow detection method based on attention mechanism
CN112069868A (en) * 2020-06-28 2020-12-11 南京信息工程大学 Unmanned aerial vehicle real-time vehicle detection method based on convolutional neural network
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
CN112307982A (en) * 2020-11-02 2021-02-02 西安电子科技大学 Human behavior recognition method based on staggered attention-enhancing network
CN113192633A (en) * 2021-05-24 2021-07-30 山西大学 Stomach cancer fine-grained classification method based on attention mechanism

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242127A (en) * 2020-01-15 2020-06-05 上海应用技术大学 Vehicle detection method with granularity level multi-scale characteristics based on asymmetric convolution
CN111639692A (en) * 2020-05-25 2020-09-08 南京邮电大学 Shadow detection method based on attention mechanism
CN112069868A (en) * 2020-06-28 2020-12-11 南京信息工程大学 Unmanned aerial vehicle real-time vehicle detection method based on convolutional neural network
CN112307982A (en) * 2020-11-02 2021-02-02 西安电子科技大学 Human behavior recognition method based on staggered attention-enhancing network
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
CN113192633A (en) * 2021-05-24 2021-07-30 山西大学 Stomach cancer fine-grained classification method based on attention mechanism

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299542A (en) * 2021-12-29 2022-04-08 北京航空航天大学 Video pedestrian re-identification method based on multi-scale feature fusion
CN114005078A (en) * 2021-12-31 2022-02-01 山东交通学院 Vehicle weight identification method based on double-relation attention mechanism
CN114005078B (en) * 2021-12-31 2022-03-29 山东交通学院 Vehicle weight identification method based on double-relation attention mechanism
CN116058852B (en) * 2023-03-09 2023-12-22 同心智医科技(北京)有限公司 Classification system, method, electronic device and storage medium for MI-EEG signals
CN116058852A (en) * 2023-03-09 2023-05-05 同心智医科技(北京)有限公司 Classification system, method, electronic device and storage medium for MI-EEG signals
CN116563615A (en) * 2023-04-21 2023-08-08 南京讯思雅信息科技有限公司 Bad picture classification method based on improved multi-scale attention mechanism
CN116563615B (en) * 2023-04-21 2023-11-07 南京讯思雅信息科技有限公司 Bad picture classification method based on improved multi-scale attention mechanism
CN117421984A (en) * 2023-11-02 2024-01-19 中国人民解放军海军航空大学 Short-term track prediction method based on multi-modal learning
CN117421984B (en) * 2023-11-02 2024-06-25 中国人民解放军海军航空大学 Short-term track prediction method based on multi-modal learning
CN117636057A (en) * 2023-12-13 2024-03-01 石家庄铁道大学 Train bearing damage classification and identification method based on multi-branch cross-space attention model
CN117636057B (en) * 2023-12-13 2024-06-11 石家庄铁道大学 Train bearing damage classification and identification method based on multi-branch cross-space attention model
CN117710755A (en) * 2024-02-04 2024-03-15 江苏未来网络集团有限公司 Vehicle attribute identification system and method based on deep learning
CN117710755B (en) * 2024-02-04 2024-05-03 江苏未来网络集团有限公司 Vehicle attribute identification system and method based on deep learning

Similar Documents

Publication Publication Date Title
CN113610144A (en) Vehicle classification method based on multi-branch local attention network
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN108197326B (en) Vehicle retrieval method and device, electronic equipment and storage medium
CN108304795B (en) Human skeleton behavior identification method and device based on deep reinforcement learning
CN107609601B (en) Ship target identification method based on multilayer convolutional neural network
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN106778595B (en) Method for detecting abnormal behaviors in crowd based on Gaussian mixture model
CN109977757B (en) Multi-modal head posture estimation method based on mixed depth regression network
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN110717526A (en) Unsupervised transfer learning method based on graph convolution network
CN113269224A (en) Scene image classification method, system and storage medium
CN112749675A (en) Potato disease identification method based on convolutional neural network
CN106530330B (en) Video target tracking method based on low-rank sparse
CN115393690A (en) Light neural network air-to-ground observation multi-target identification method
CN114782798A (en) Underwater target detection method based on attention fusion
CN116704431A (en) On-line monitoring system and method for water pollution
CN115205196A (en) No-reference image quality evaluation method based on twin network and feature fusion
CN113870160A (en) Point cloud data processing method based on converter neural network
CN113850182B (en) DAMR _ DNet-based action recognition method
CN114743126A (en) Lane line sign segmentation method based on graph attention machine mechanism network
CN118552800A (en) Rail foreign matter semi-supervised anomaly detection method and system deployed at edge end
CN111242870A (en) Low-light image enhancement method based on deep learning knowledge distillation technology
CN113011506B (en) Texture image classification method based on deep fractal spectrum network
CN112418227B (en) Monitoring video truck segmentation method based on double self-attention mechanism
CN117115616A (en) Real-time low-illumination image target detection method based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination