CN113420742B - Global attention network model for vehicle re-identification - Google Patents

Global attention network model for vehicle re-identification

Info

Publication number
CN113420742B
CN113420742B (application CN202110977958.6A)
Authority
CN
China
Prior art keywords
global
channels
network model
channel
branches
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110977958.6A
Other languages
Chinese (zh)
Other versions
CN113420742A (en)
Inventor
庞希愚
田鑫
王成
姜刚武
郑艳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jiaotong University
Original Assignee
Shandong Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jiaotong University
Priority to CN202110977958.6A
Publication of CN113420742A
Application granted
Publication of CN113420742B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 — Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of vehicle identification, in particular to a global attention network model for vehicle re-identification, comprising a backbone network, a local branch that divides the feature map into two parts, and two global branches with global attention modules; the backbone network is split into 3 branches; the global attention network model extracts feature vectors using global average pooling to cover the information of the entire vehicle; the local branch divides the feature map into only two horizontal parts. The invention constructs a global attention network with three branches to extract a large amount of discriminative information. Two global attention modules, CGAM and SGAM, are constructed: they model the global relation of each node through the average pairwise relations among nodes and infer the node's importance, reducing computational complexity. On the local branch the feature map is divided horizontally into only two parts, which largely alleviates the problems of part misalignment and broken local consistency.

Description

Global attention network model for vehicle re-identification
Technical Field
The invention relates to the technical field of vehicle identification, and in particular to a global attention network model for vehicle re-identification.
Background
Vehicle re-identification refers to recognizing a target vehicle across different cameras. It plays an important role in intelligent transportation and smart cities and has many real-life applications. For example, in real traffic monitoring systems, vehicle re-identification can support locating, monitoring, and criminal investigation of a target vehicle. With the rise of deep neural networks and the introduction of large datasets, improving the accuracy of vehicle re-identification has become a research hotspot in computer vision and multimedia in recent years. However, owing to the different viewing angles of multiple cameras and the influence of illumination, occlusion, and other factors, intra-class feature distances become larger and inter-class feature distances become smaller, further increasing the difficulty of identification.
Pedestrian re-identification and vehicle re-identification are essentially the same problem: both belong to the image retrieval task. In recent years, methods based on Convolutional Neural Networks (CNNs) have made great progress in pedestrian re-identification, and CNN models designed for pedestrian re-identification also perform well in vehicle re-identification. Most advanced CNN-based pedestrian re-identification methods employ CNN models pre-trained on ImageNet and fine-tune them on the re-identification dataset under the supervision of different losses.
CNN-based vehicle and pedestrian re-identification often focuses on extracting global features of person or vehicle images. In this way, complete feature information can be obtained globally, but global features cannot adequately describe intra-class differences caused by factors such as viewing angle. To extract fine-grained local features, pedestrian re-identification network models with local branches, such as the PCB (Part-based Convolutional Baseline) and the MGN (Multiple Granularity Network), were designed. These networks divide the feature map into several stripes to extract local features; in addition, the latter combines local features with global features, further improving model performance. For vehicle re-identification, vehicles of the same model are substantially identical in global appearance, while small regions, such as inspection marks, decorations, and marks of use, may vary greatly. Therefore, local detail information of the vehicle is also important for the vehicle re-identification task.
However, these local-feature-based models share a common drawback: they require the body parts of the same person to be relatively aligned in order to learn meaningful local features. Although vehicle re-identification and pedestrian re-identification are both image retrieval problems in nature, the body-part boundaries of vehicles are not as sharp as those of pedestrians, and the body of the same vehicle differs greatly when viewed from different angles. On the other hand, strictly uniform partitioning of the feature map destroys local consistency, and the degree of destruction is generally proportional to the number of partitions; that is, the greater the number of partitions, the more easily local consistency is destroyed. This makes it difficult for deep neural networks to obtain meaningful fine-grained local information from the parts, thereby degrading performance. Therefore, simply applying the part-segmentation methods of pedestrian re-identification to vehicles is not feasible.
Attention mechanisms play an important role in the human perception system, helping people focus on useful distinctive features while eliminating noise and background interference. For network models, an attention mechanism can make the model focus on the target subject rather than the background, and it is widely used in re-identification tasks. Accordingly, many networks with attention modules have been proposed. However, they mainly build attention over nodes (channels or spatial positions) by direct convolution on each node's own information, or directly reconstruct nodes using the pairwise relations between nodes; they do not consider that the global relations between nodes play an important guiding role in building the attention (importance) of each node.
In the vehicle re-identification task, different camera positions produce illumination changes, viewpoint changes, and resolution differences, so the intra-class difference of the same vehicle under different viewing angles is large, while the inter-class difference of different vehicles of the same model is small. This greatly increases the difficulty of vehicle re-identification, the key to which is extracting discriminative vehicle features. To better extract such features from vehicle images and improve recognition accuracy, it is necessary to propose a global attention network model for vehicle re-identification.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a global attention network model for vehicle re-identification, which can extract fine local information simply, largely alleviating the problems of part misalignment and broken local consistency, and which can build reliable node attention from the global relations between nodes, thereby extracting more credible saliency information for vehicle re-identification.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a global attention network model for vehicle re-identification comprises a backbone network, a local branch that divides the feature map into two parts, and two global branches with global attention modules; the backbone network is split into 3 branches; on the final feature map output by each branch, the global attention network model extracts a feature vector using global average pooling (GAP) to cover the whole-body information of the vehicle image; the local branch divides the vehicle feature map into only two horizontal parts, which largely alleviates the problems of misalignment and broken local consistency.
The two global branches are respectively equipped with a Channel Global Attention Module (CGAM) and a Spatial Global Attention Module (SGAM) for extracting more reliable saliency information. The backbone network employs the ResNet50 network model.
To increase the resolution, the down-sampling stride of the res_conv5_1 block of the global Global-C branch and the local branch is changed from 2 to 1; a spatial global attention module and a channel global attention module are then added after the res_conv5 blocks of the two global branches, respectively, to extract reliable saliency information and enhance the feature discrimination capability. Here res_conv5 denotes the fifth layer of the ResNet50 network model, and res_conv5_1 denotes the first building block in that fifth layer.
After the feature vector is extracted with global average pooling (GAP) on each branch, a feature dimensionality-reduction module containing a 1 × 1 convolution, a BN layer, and a ReLU function reduces the feature dimension to 256, providing a compact feature representation. The network model is trained by applying a triplet loss and a cross-entropy loss on each branch: the triplet loss is applied directly to the 256-dimensional feature vector, and a fully connected layer is added after the 256-dimensional feature vector to apply the cross-entropy loss. In the testing stage, the features before the fully connected layers of the three branches are concatenated as the final output feature.
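As a concrete illustration of the per-branch head described above, the following PyTorch sketch shows GAP followed by the 1 × 1 convolution + BN + ReLU reduction to 256 dimensions, with the 256-d vector used for the triplet loss and a fully connected classifier for the cross-entropy loss; names such as BranchHead, in_channels, and num_classes are illustrative, not taken from the patent:

```python
import torch.nn as nn

class BranchHead(nn.Module):
    """Per-branch head (sketch): GAP -> 1x1 conv + BN + ReLU -> 256-d embedding.
    The 256-d vector feeds the triplet loss; the FC layer feeds the cross-entropy loss."""
    def __init__(self, in_channels=2048, reduced_dim=256, num_classes=576):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling (GAP)
        self.reduce = nn.Sequential(                # feature dimensionality reduction
            nn.Conv2d(in_channels, reduced_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced_dim),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(reduced_dim, num_classes)  # FC for cross-entropy

    def forward(self, feat_map):
        v = self.reduce(self.gap(feat_map)).flatten(1)  # 256-d feature (triplet loss)
        logits = self.classifier(v)                     # ID logits (cross-entropy loss)
        return v, logits
```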
The CGAM architecture: let the tensor $A \in \mathbb{R}^{C \times H \times W}$ be the feature map input to CGAM, where $C$ is the number of channels and $H$ and $W$ are the spatial height and width of the tensor. Tensors $B$ and $D$ are obtained from the functions $\theta(A)$ and $\phi(A)$; $B$ is reshaped to $\mathbb{R}^{C \times N}$ and $D$ is reshaped to $\mathbb{R}^{N \times C}$, where $N = H \times W$. $\theta$ and $\phi$ have the same architecture, each consisting of two 1 × 1 convolutions, two 3 × 3 group convolutions, two BatchNorm layers, and two ReLU activation functions. The $\theta$ architecture uses two 3 × 3 group convolutions to enlarge the receptive field while reducing the number of parameters. Matrix multiplication is then used to obtain the matrix $M \in \mathbb{R}^{C \times C}$, which represents the pairwise relations of all channels. $M$ is written as:

$$M = B\,D$$

Each row of the matrix $M$ represents the pairwise relations between one channel and all other channels. The average pairwise relation of each channel is modeled to obtain the channel's global relation, and the global-relation importance of a channel relative to the other channels is then used to obtain that channel's weight among all channels.

The process of obtaining a channel's weight among all channels from its global-relation importance is as follows: relational average pooling (RAP) is applied to the matrix $M$ to obtain a vector $r \in \mathbb{R}^{C}$, where $C$ is the number of channels. Each element of the vector $r$ now represents the global relation between one channel and all channels; the $i$-th element of $r$ is defined as

$$r_i = \frac{1}{C} \sum_{j=1}^{C} M_{ij}, \qquad i = 1, \dots, C$$

The softmax function converts all global relations into a weight for each channel:

$$w_i = \frac{\exp(r_i)}{\sum_{j=1}^{C} \exp(r_j)}$$

To obtain the attention map $W$, the vector $w$ is first reshaped to $\mathbb{R}^{C \times 1 \times 1}$ and then broadcast to $\mathbb{R}^{C \times H \times W}$, i.e., the attention map $W$. Finally, element-wise multiplication and element-wise summation with the original feature map are applied to obtain the final feature map $Y$, which can be expressed as:

$$Y = W \otimes A \oplus A$$
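A minimal PyTorch sketch of CGAM follows, under the simplifying assumption that θ and φ are each a single 1 × 1 convolution + BN + ReLU (the patent's θ/φ additionally use two 3 × 3 group convolutions, see FIG. 3); the RAP, softmax, broadcast, and residual steps mirror the formulas above:

```python
import torch
import torch.nn as nn

class CGAM(nn.Module):
    """Channel Global Attention Module (sketch with simplified theta/phi)."""
    def __init__(self, channels):
        super().__init__()
        def transform():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.theta, self.phi = transform(), transform()

    def forward(self, a):                                  # a: (n, C, H, W)
        n, c, h, w = a.shape
        b = self.theta(a).view(n, c, h * w)                # B: C x N
        d = self.phi(a).view(n, c, h * w).transpose(1, 2)  # D: N x C
        m = torch.bmm(b, d)                                # M = BD: pairwise channel relations (C x C)
        r = m.mean(dim=2)                                  # RAP: average pairwise relation per channel
        wgt = torch.softmax(r, dim=1)                      # softmax over channels -> weights
        attn = wgt.view(n, c, 1, 1)                        # reshape, then broadcast over H x W
        return attn * a + a                                # Y = W (x) A (+) A
```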
the SGAM architecture: spatial attention and channel attention, which work in a similar manner, use global relationships between locations and channels to determine the importance of each location and channel, respectively. But is in phase with CGAMIn contrast, SGAM has three differences. First, let the quantity of the image
Figure 971328DEST_PATH_IMAGE031
Is a characteristic diagram of the SGAM input,
Figure 720366DEST_PATH_IMAGE032
and
Figure 275850DEST_PATH_IMAGE033
the system has the same structure and comprises a 1 × 1 convolution, a BN layer and a ReLU function, and the number of channels is measured
Figure 262260DEST_PATH_IMAGE034
Is reduced to
Figure 883122DEST_PATH_IMAGE035
Figure 59020DEST_PATH_IMAGE036
For the reduction factor, set to 2 in the experiment; by a function
Figure 469010DEST_PATH_IMAGE037
And
Figure 298426DEST_PATH_IMAGE038
obtain the tensor
Figure 875425DEST_PATH_IMAGE039
And
Figure 87970DEST_PATH_IMAGE040
and will be
Figure 853931DEST_PATH_IMAGE041
Is deformed into
Figure 773957DEST_PATH_IMAGE042
Will be
Figure 586055DEST_PATH_IMAGE040
Is deformed into
Figure 477656DEST_PATH_IMAGE043
(ii) a Matrix multiplication is then employed to determine the pairwise relationship between locations and obtain a matrix
Figure 865169DEST_PATH_IMAGE044
Figure 895442DEST_PATH_IMAGE045
Second, to determine the importance of a location, the matrix is aligned
Figure 539044DEST_PATH_IMAGE046
Obtaining vectors using relational average pooled RAPs
Figure 93391DEST_PATH_IMAGE047
(ii) a Vector quantity
Figure 20895DEST_PATH_IMAGE048
To (1) a
Figure 972802DEST_PATH_IMAGE049
The individual elements may be represented as:
Figure 618547DEST_PATH_IMAGE050
thirdly, the invention firstly transforms the vector generated by the softmax function into the vector
Figure 323722DEST_PATH_IMAGE051
Then broadcast it as
Figure 682897DEST_PATH_IMAGE052
In CGAM and SGAM, the feature map to which attention has been applied is added to the original feature map to obtain the final output feature map. There are two reasons for using this addition. First, the normalization function used here is softmax, which maps the weights into the range 0 to 1 with all weights summing to 1; because there are many weights, the element values of the feature map output by the attention module may be very small, weakening the features of the original network, and training would be very difficult without adding the original feature map. Second, the addition also highlights the reliable saliency information in $W \otimes A$. Experiments likewise show that the model performs well with this residual structure: compared with the model without the addition, it improves mAP and Top-1 by 1.2% and 1.5%, respectively.
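A corresponding PyTorch sketch of SGAM, with the reduction factor s = 2 as stated above and the same hedged simplifications (the batch-wise implementation details are our assumptions); note the residual add at the end, as discussed:

```python
import torch
import torch.nn as nn

class SGAM(nn.Module):
    """Spatial Global Attention Module (sketch): positions play the role channels play in CGAM."""
    def __init__(self, channels, s=2):
        super().__init__()
        def transform():
            return nn.Sequential(
                nn.Conv2d(channels, channels // s, kernel_size=1, bias=False),  # reduce C -> C/s
                nn.BatchNorm2d(channels // s),
                nn.ReLU(inplace=True),
            )
        self.theta, self.phi = transform(), transform()

    def forward(self, a):                             # a: (n, C, H, W)
        n, c, h, w = a.shape
        b = self.theta(a).flatten(2).transpose(1, 2)  # B: N x C/s, N = H*W
        d = self.phi(a).flatten(2)                    # D: C/s x N
        m = torch.bmm(b, d)                           # M = BD: pairwise position relations (N x N)
        r = m.mean(dim=2)                             # RAP over positions
        wgt = torch.softmax(r, dim=1)                 # per-position weights
        attn = wgt.view(n, 1, h, w)                   # reshape to 1 x H x W, broadcast over channels
        return attn * a + a                           # residual add keeps training stable
```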
For the loss functions, the most common cross-entropy loss and triplet loss are used.
Cross entropy measures the difference between the true probability distribution and the predicted probability distribution. It can be expressed as:

$$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(z_{y_i})}{\sum_{k} \exp(z_k)}$$

where $N$ denotes the number of images in the mini-batch, $y_i$ denotes the true ID label, and $z_k$ denotes the predicted logit of the $k$-th ID class.
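The cross-entropy term can be computed directly from the logits; a short PyTorch sketch mirroring the formula above (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def id_loss(logits, labels):
    """Cross-entropy ID loss: -1/N * sum_i log softmax(z_i)[y_i]."""
    log_probs = F.log_softmax(logits, dim=1)                  # log of predicted distribution
    picked = log_probs[torch.arange(logits.size(0)), labels]  # log-prob of the true ID class
    return -picked.mean()                                     # average over the mini-batch
```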
The objective of the triplet loss is to pull samples with the same label as close together as possible in the embedding space while pushing samples with different labels as far apart as possible. The invention adopts the batch-hard triplet loss: each mini-batch is formed by randomly sampling $P$ identities and $K$ images per identity to meet the requirements of the batch-hard triplet loss. The loss can be defined as

$$L_{tri} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[\, m + \max_{p=1,\dots,K} \big\| f_a^i - f_p^i \big\|_2 \;-\; \min_{\substack{j \neq i \\ n=1,\dots,K}} \big\| f_a^i - f_n^j \big\|_2 \,\Big]_+$$

where $f_a^i$, $f_p^i$, and $f_n^j$ are the features extracted from the anchor, the positive sample, and the negative sample, respectively. The margin $m$ is set to 1.2, which helps reduce intra-class variation and widen inter-class variation to improve model performance.
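A hedged PyTorch sketch of the batch-hard triplet loss over a P × K mini-batch; the hardest-positive/hardest-negative mining follows the definition above, while averaging over anchors (rather than summing) is an implementation choice of ours:

```python
import torch

def batch_hard_triplet_loss(features, labels, margin=1.2):
    """For each anchor: farthest same-ID sample minus nearest different-ID sample, plus margin."""
    dist = torch.cdist(features, features, p=2)           # pairwise L2 distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)  # mask of positive pairs
    hardest_pos = dist.masked_fill(~same_id, 0.0).max(dim=1).values          # farthest positive
    hardest_neg = dist.masked_fill(same_id, float("inf")).min(dim=1).values  # nearest negative
    return torch.relu(margin + hardest_pos - hardest_neg).mean()             # hinge, averaged
```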
The total training loss is the sum of the cross-entropy loss and the triplet loss over all branches:

$$L = \sum_{i=1}^{B} \big( \lambda_1 L_{ce}^i + \lambda_2 L_{tri}^i \big)$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters balancing the two loss terms, both set to 1 in the experiments, and $B$ is the number of branches.
The technical effects of the invention:
Compared with the prior art, the global attention network model for vehicle re-identification has the following advantages. The invention constructs a global attention network with three branches to extract a large amount of discriminative information. Based on the global relations of nodes, the invention constructs two global attention modules, CGAM and SGAM: the global relation of a node is obtained by modeling the average pairwise relation between that node and all other nodes, from which the node's global importance is inferred. On the one hand, this reduces the difficulty of attention learning and the computational complexity; on the other hand, group evaluation yields a more reliable measure of node importance and thus more reliable saliency information. On the local branch, the vehicle image is divided horizontally into only two parts, which largely alleviates the problems of part misalignment and broken local consistency. The effectiveness of the algorithm is verified by experiments on two vehicle re-identification datasets, where its performance is superior to the SOTA methods.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture of the present invention;
FIG. 2 is a block diagram of the CGAM architecture of the present invention;
FIG. 3 is a diagram of the θ (and φ) architecture of the present invention;
fig. 4 is a diagram of the SGAM architecture of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the specification.
As shown in fig. 1, a global attention network model for vehicle re-identification includes a backbone network, a local branch that divides the feature map into two parts, and two global branches with global attention modules. The backbone uses ResNet50 as the basis for feature-map extraction, with the original fully connected layer removed for multi-loss training, and the ResNet50 backbone is split into 3 branches after the res_conv4_1 residual block. The global attention network model uses global average pooling (GAP) to cover the whole body of the vehicle image; the local branch divides the vehicle feature map into only two horizontal parts, which largely alleviates the problems of misalignment and broken local consistency.
To increase the resolution, the down-sampling stride of the res_conv5_1 block of the global Global-C branch and the local branch is changed from 2 to 1; a spatial global attention module and a channel global attention module are then added after the res_conv5 blocks of the two global branches, respectively, to extract reliable saliency information and enhance the feature discrimination capability.
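A minimal torchvision sketch of this stride change; in torchvision's ResNet50, layer4 corresponds to res_conv5 and its first block to res_conv5_1 (the weights argument assumes a recent torchvision version):

```python
import torchvision

def build_backbone():
    """ResNet50 with the res_conv5_1 down-sampling stride changed from 2 to 1,
    keeping a higher-resolution final feature map."""
    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    resnet.layer4[0].conv2.stride = (1, 1)          # main path of the first res_conv5 block
    resnet.layer4[0].downsample[0].stride = (1, 1)  # matching 1x1 shortcut convolution
    return resnet
```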
The feature dimensionality-reduction module contains a 1 × 1 convolution, a BN layer, and a ReLU function, and reduces the feature dimension to 256, providing a compact feature representation. The network model is trained by applying a triplet loss and a cross-entropy loss on each branch: the triplet loss is applied directly to the 256-dimensional feature vector, and a fully connected layer is added after the 256-dimensional feature vector to apply the cross-entropy loss. In the testing stage, the features before the fully connected layers of the three branches are concatenated as the final output feature.
The two global branches are respectively equipped with a Channel Global Attention Module (CGAM) and a Spatial Global Attention Module (SGAM) for extracting more reliable saliency information.
As shown in FIG. 2, which illustrates the CGAM architecture: let the tensor $A \in \mathbb{R}^{C \times H \times W}$ be the feature map input to CGAM, where $C$ is the number of channels and $H$ and $W$ are the spatial height and width of the tensor. Tensors $B$ and $D$ are obtained from the functions $\theta(A)$ and $\phi(A)$; $B$ is reshaped to $\mathbb{R}^{C \times N}$ and $D$ is reshaped to $\mathbb{R}^{N \times C}$, where $N = H \times W$. $\theta$ and $\phi$ have the same architecture, each consisting of two 1 × 1 convolutions, two 3 × 3 group convolutions, two BatchNorm layers, and two ReLU activation functions; the $\theta$ architecture uses the two 3 × 3 group convolutions to enlarge the receptive field while reducing the number of parameters. Matrix multiplication is then used to obtain the matrix $M \in \mathbb{R}^{C \times C}$, which represents the pairwise relations of all channels; $M$ is written as:

$$M = B\,D$$

Each row of the matrix $M$ represents the pairwise relations between one channel and all other channels. The average pairwise relation of each channel is modeled to obtain the channel's global relation, and the global-relation importance of a channel relative to the other channels is then used to obtain that channel's weight among all channels.
Specifically, as shown in fig. 3, the number of channels $C$ of the input tensor is first halved by a 1 × 1 convolution; the feature map is then divided into 32 groups by a 3 × 3 group convolution, each group convolved separately, with padding of one so the feature-map size is kept constant. This 3 × 3 convolution also keeps the number of channels unchanged. A BatchNorm (BN) layer is used for normalization, and the ReLU activation function adds nonlinearity. Then a 1 × 1 and a 3 × 3 convolution are applied again to make the number of channels consistent with the original input tensor.
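A sketch of this θ/φ transform under the stated structure; the group count of 32 and padding of 1 are as described, while the exact placement of the second BN/ReLU pair is our assumption:

```python
import torch.nn as nn

def make_theta(channels, groups=32):
    """theta/phi (sketch): 1x1 conv halves the channels; a 3x3 group conv keeps the
    spatial size and channel count; BN + ReLU; then a 1x1 + 3x3 group-conv pair
    restores the channel count to match the input tensor."""
    mid = channels // 2
    return nn.Sequential(
        nn.Conv2d(channels, mid, kernel_size=1, bias=False),                      # halve channels
        nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=groups, bias=False), # 32-group conv
        nn.BatchNorm2d(mid),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid, channels, kernel_size=1, bias=False),                      # restore channels
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )
```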
The process of obtaining a channel's weight among all channels from its global-relation importance is as follows: relational average pooling (RAP) is applied to the matrix $M$ to obtain a vector $r \in \mathbb{R}^{C}$, where $C$ is the number of channels. Each element of the vector $r$ now represents the global relation between one channel and all channels; the $i$-th element of $r$ is defined as

$$r_i = \frac{1}{C} \sum_{j=1}^{C} M_{ij}, \qquad i = 1, \dots, C$$

The softmax function converts all global relations into a weight for each channel:

$$w_i = \frac{\exp(r_i)}{\sum_{j=1}^{C} \exp(r_j)}$$

To obtain the attention map $W$, the vector $w$ is first reshaped to $\mathbb{R}^{C \times 1 \times 1}$ and then broadcast to $\mathbb{R}^{C \times H \times W}$. Finally, element-wise multiplication and element-wise summation with the original feature map are applied to obtain the final feature map $Y$, which can be expressed as:

$$Y = W \otimes A \oplus A$$
as shown in fig. 4, which illustrates the SGAM architecture, spatial attention and channel attention utilize global relationships between locations and between channels, respectively, to determine the importance of each location and channel, and they operate similarly. However, SGAM has three differences compared to CGAM. First, let the quantity of the image
Figure 104451DEST_PATH_IMAGE086
Is a characteristic diagram of the SGAM input,
Figure 27801DEST_PATH_IMAGE087
and
Figure 263610DEST_PATH_IMAGE088
the system has the same structure and comprises a 1 × 1 convolution, a BN layer and a ReLU function, and the number of channels is measured
Figure 946395DEST_PATH_IMAGE034
Is reduced to
Figure 225936DEST_PATH_IMAGE035
Figure 384384DEST_PATH_IMAGE089
For the reduction factor, set to 2 in the experiment; by a function
Figure 112300DEST_PATH_IMAGE090
And
Figure 873759DEST_PATH_IMAGE038
obtain the tensor
Figure 340512DEST_PATH_IMAGE039
And
Figure 736990DEST_PATH_IMAGE040
and will be
Figure 189968DEST_PATH_IMAGE041
Is deformed into
Figure 345880DEST_PATH_IMAGE042
Will be
Figure 858901DEST_PATH_IMAGE040
Is deformed into
Figure 991942DEST_PATH_IMAGE091
(ii) a Matrix multiplication is then employed to determine the pairwise relationship between locations and obtain a matrix
Figure 858398DEST_PATH_IMAGE092
Figure 760495DEST_PATH_IMAGE093
Second, to determine the importance of a location, the present invention applies to the matrix
Figure 555669DEST_PATH_IMAGE046
Obtaining vectors using relational average pooled RAPs
Figure 520214DEST_PATH_IMAGE047
. Vector quantity
Figure 439629DEST_PATH_IMAGE048
To (1) a
Figure 337177DEST_PATH_IMAGE049
The individual elements may be represented as:
Figure 300323DEST_PATH_IMAGE094
thirdly, the invention firstly transforms the vector generated by the softmax function into the vector
Figure 17743DEST_PATH_IMAGE095
Then broadcast it as
Figure 944111DEST_PATH_IMAGE096
In CGAM and SGAM, the feature map to which attention has been applied is added to the original feature map to obtain the final output feature map. There are two reasons for using this addition. First, the normalization function used here is softmax, which maps the weights into the range 0 to 1 with all weights summing to 1; because there are many weights, the element values of the feature map output by the attention module may be very small, weakening the features of the original network, and training would be very difficult without adding the original feature map. Second, the addition also highlights the reliable saliency information in $W \otimes A$. Experiments likewise show that the model performs well with this residual structure: compared with the model without the addition, it improves mAP and Top-1 by 1.2% and 1.5%, respectively.
For the loss functions, the most common cross-entropy loss and triplet loss are used.
Cross entropy measures the difference between the true probability distribution and the predicted probability distribution. It can be expressed as:

$$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(z_{y_i})}{\sum_{k} \exp(z_k)}$$

where $N$ denotes the number of images in the mini-batch, $y_i$ denotes the true ID label, and $z_k$ denotes the predicted logit of the $k$-th ID class.
The objective of the triplet loss is to pull samples with the same label as close together as possible in the embedding space while pushing samples with different labels as far apart as possible. The invention adopts the batch-hard triplet loss: each mini-batch is formed by randomly sampling $P$ identities and $K$ images per identity to meet the requirements of the batch-hard triplet loss. The loss can be defined as

$$L_{tri} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[\, m + \max_{p=1,\dots,K} \big\| f_a^i - f_p^i \big\|_2 \;-\; \min_{\substack{j \neq i \\ n=1,\dots,K}} \big\| f_a^i - f_n^j \big\|_2 \,\Big]_+$$

where $f_a^i$, $f_p^i$, and $f_n^j$ are the features extracted from the anchor, the positive sample, and the negative sample, respectively. The margin $m$ is set to 1.2, which helps reduce intra-class variation and widen inter-class variation to improve model performance.
The total training loss is the sum of the cross-entropy loss and the triplet loss over all branches:

$$L = \sum_{i=1}^{B} \big( \lambda_1 L_{ce}^i + \lambda_2 L_{tri}^i \big)$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters balancing the two loss terms, both set to 1 in the experiments, and $B$ is the number of branches.
Experiment:
data set: the model of the invention was evaluated on two common vehicle weight identification data sets, including VeRi776 and VehicleID.
VeRi776 consists of about 50,000 images of 776 vehicles, captured by 20 cameras at different locations and from different viewing angles. The training set contains 576 vehicles, and the test set contains the remaining 200 vehicles.
VehicleID comprises daytime data captured by multiple real surveillance cameras distributed across a small city in China. The entire dataset contains 26,267 vehicles (221,763 images). Small, medium, and large test sets are extracted according to test-set size. In the inference stage, one image of each vehicle is randomly selected to form the gallery set, and the other images serve as query images.
Evaluation metrics: for a comprehensive evaluation on each dataset, two metrics, CMC and mAP, are adopted for comparison with existing methods. CMC estimates the probability of finding a correct match within the top-K returned results. mAP is a comprehensive metric that considers both the precision and the recall of the query results.
Implementation details: ResNet50 is selected as the backbone network for generating features, and the same training strategy is applied to both datasets. The RGB channels of each pixel are normalized, and the image is resized to 256 × 256 before being input to the network. Each mini-batch is formed by randomly sampling $P$ identities and, for each identity, $K$ images, to meet the requirement of the triplet loss; fixed values of $P$ and $K$ are set in the experiments to train the proposed model. The margin parameter of the triplet loss is set to 1.2 in all experiments. Adam is used as the optimizer. For the learning-rate schedule, the initial learning rate is set to 2e-4, decays to 2e-5 after 120 epochs, and further drops to 2e-6 and 2e-7 at epochs 220 and 320 for faster convergence. The entire training process lasts 450 epochs, and each branch is trained with the cross-entropy loss and the batch-hard triplet loss.
In the testing phase, the VeRi776 dataset is evaluated in image-to-track form: the distance between the query image and all images in the gallery set is computed, and the minimum image-to-image distance is taken as the image-to-track distance. For the VehicleID dataset, the three test sets are evaluated separately. The features before the fully connected layers of the three branches are concatenated as the final output feature.
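The image-to-track distance used for VeRi776 reduces to a minimum over per-image distances; a small sketch (tensor shapes are illustrative):

```python
import torch

def image_to_track_distance(query_feat, track_feats):
    """Distance from a query image to a track = minimum L2 distance between the
    query feature (D,) and the features of all images in the track (num_images, D)."""
    d = torch.cdist(query_feat.unsqueeze(0), track_feats)  # (1, num_images)
    return d.min().item()
```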
Experimental results: the proposed model is compared with other state-of-the-art models on both datasets. Prior work designed the Local Maximal Occurrence (LOMO) representation to address viewpoint and illumination changes. To obtain better results, the GoogLeNet model was fine-tuned on the CompCars dataset, and SIFT, Color Name, and GoogLeNet features were then jointly adopted for vehicle identification. RAM first divides the image horizontally into three parts and then embeds detailed visual cues in these local regions. To improve the ability to identify subtle differences, PRN introduces a part-regularization (PR) constraint into the vehicle re-identification task. The Parsing-based View-aware Embedding Network (PVEN) avoids the mismatch of local features under different views. Generative Adversarial Networks (GANs) use a generative model and a discriminative model that learn from each other to produce good output; VAMI generates the features of different views with the help of a GAN. TAMR proposes a two-stage attention network that gradually focuses on fine but distinctive local details in the visual appearance of the vehicle, and learns structured deep feature embeddings with a multi-grain ranking loss.
The experimental results on VeRi776 and VehicleID are shown in Table 1 and Table 2, respectively. Among all vision-based methods, the TGRA method of the invention achieves the best results. From Table 1, first, TGRA improves mAP by 2.7% and CMC@1 by 0.1% compared with PVEN; second, the CMC@5 of the proposed method already exceeds 99.1%, a promising performance for real vehicle re-identification scenarios. Table 2 shows the comparison on three test sets of different scales: TGRA improves CMC@5 by more than 4.0% over PRN on the different test sets. It should be noted that some advanced network models require other auxiliary models, which increases algorithmic complexity: PVEN uses U-Net to parse a vehicle into four different views, PRN uses YOLO as a detection network for part localization, and TAMR uses an STN to automatically locate the windshield and the head of the vehicle. The model of the invention achieves better performance without using any auxiliary model.
The model of the invention reports 82.24% mAP, 95.77% CMC@1, and 99.11% CMC@5 on the VeRi776 test set. On the three test sets of VehicleID, CMC@1 is 81.51%, 95.54%, and 72.81%, and CMC@5 is 96.38%, 93.69%, and 91.01%. All results are obtained in single-query mode without re-ranking.

Table 1: comparison with state-of-the-art methods on VeRi776.

Table 2: comparison with state-of-the-art methods on the three VehicleID test sets.
ablation study: a number of experiments were performed on both data sets to verify the validity of the key module in the TGRA. The optimal structure of the model is determined by comparing the performances of different structures.
Effectiveness of CGAM and SGAM: CGAM and SGAM are the channel global attention module and the spatial global attention module, respectively. The results on the VeRi776 test set are shown in Table 3.

Table 3: effectiveness of CGAM and SGAM on the VeRi776 test set.
the validity of the local branch is verified on the Veni 776 as shown in Table 4. "w/o" means none; "local" refers to a local branch of TGRA; "PART-3" and "PART-4" refer to references that divide the feature map into three or four PARTs, respectively.
Table 4:
Figure 701850DEST_PATH_IMAGE112
the model of the invention consists of three branches, on two global branches, channel global attention and spatial global attention are used to extract reliable saliency information. The present invention verifies the effect of SGAM and CGAM on the model, respectively (table 3). As can be seen from Table 3, on the test set of VeRi776, "Baseline + SGAM" was improved by 0.6% and 0.6% at mAP and CMC @1, respectively, compared to Baseline. In addition, compared with Baseline, "Global-C (Branch)" is improved by 1.7% in mAP and 1.0% in CMC @ 1. Then, when both branches with CGAM and SGAM were trained simultaneously, the model yielded 5.0% and 1.6% improvement in mAP and CMC @1 compared to Baseline.
In addition, a qualitative analysis of the global attention module was conducted to show its effectiveness more intuitively. The experimental results show that the network with the global attention module can accurately find images of the same vehicle. Identifying the same vehicle is very difficult when the query image and the target image are taken from different viewpoints, yet the model of the invention still identifies it well. The global attention module of the invention therefore performs well at enhancing discriminative pixels and suppressing noisy pixels.
Local branch verification: "TGRA w/o local" denotes the TGRA model without the local branch. To fully verify the effectiveness of the proposed local branch, two further experiments were conducted, one dividing the feature map into three parts and the other into four. As Table 4 shows, first, among the four models, TGRA without the local branch performs worst, indicating that local detail information is crucial in the vehicle re-identification task. Second, "TGRA (ours)" improves mAP by 0.5% and CMC@1 by 0.6% over "TGRA (Part-3)" on the VeRi776 test set. Moreover, the larger the number of partitions, the worse the performance, owing to misalignment and broken local consistency. The local branch proposed by the invention largely alleviates these problems, and the ablation experiments confirm the effectiveness of the method.
The invention provides a global attention network with three branches for vehicle re-identification; the model can extract useful vehicle features from multiple perspectives. On the local branch, to largely alleviate the problems of misalignment and broken local consistency, the vehicle feature map is evenly divided into two parts. Finally, through the global attention modules, the network can focus on the parts most salient for the vehicle re-identification task and learn more discriminative and robust features. The features of the three branches are concatenated during the testing phase to obtain better performance. Experiments show that the model of the invention clearly outperforms the current best models on the VeRi776 and VehicleID datasets.

Claims (5)

1. A global attention network model for vehicle re-identification, characterized in that: it comprises a backbone network, a local branch that divides the feature map into two parts, and two global branches with global attention modules; the backbone network is split into 3 branches; the global attention network model obtains a feature vector from the final feature map output by each branch using global average pooling (GAP); the local branch divides the vehicle feature map into only two horizontal parts; the two global branches are respectively provided with a Channel Global Attention Module (CGAM) and a Spatial Global Attention Module (SGAM), and the backbone network adopts ResNet50;
the CGAM architecture: quantity of design
Figure 141185DEST_PATH_IMAGE001
Is a feature map of a CGAM input, in which
Figure 252360DEST_PATH_IMAGE002
The number of the channels is the number of the channels,
Figure 715572DEST_PATH_IMAGE003
and
Figure 621211DEST_PATH_IMAGE004
the spatial height and width, respectively, of the tensor; slave function
Figure 904293DEST_PATH_IMAGE005
And
Figure 553581DEST_PATH_IMAGE006
to obtain tensor
Figure 340140DEST_PATH_IMAGE007
And
Figure 682259DEST_PATH_IMAGE008
and will be
Figure 443849DEST_PATH_IMAGE009
Is deformed into
Figure 631248DEST_PATH_IMAGE010
Will be
Figure 803472DEST_PATH_IMAGE011
Is deformed into
Figure 50914DEST_PATH_IMAGE012
Figure 308589DEST_PATH_IMAGE013
And
Figure 299679DEST_PATH_IMAGE014
the system structure is the same, and each system consists of two 1 × 1 convolutions, two 3 × 3 packet convolutions, two BatchNormal layers and two Relu activation functions;
the SGAM architecture: quantity of design
Figure 326409DEST_PATH_IMAGE015
Is a characteristic diagram of the SGAM input,
Figure 744752DEST_PATH_IMAGE016
and
Figure 633598DEST_PATH_IMAGE017
the system has the same structure and comprises a 1 × 1 convolution, a BN layer and a ReLU function, and the number of channels is measured
Figure 677647DEST_PATH_IMAGE018
Is reduced to
Figure 309616DEST_PATH_IMAGE019
Figure 148128DEST_PATH_IMAGE020
For the reduction factor, set to 2 in the experiment; by a function
Figure 131128DEST_PATH_IMAGE021
And
Figure 713288DEST_PATH_IMAGE022
obtain the tensor
Figure 199764DEST_PATH_IMAGE023
And
Figure 477686DEST_PATH_IMAGE024
and will be
Figure 682402DEST_PATH_IMAGE023
Is deformed into
Figure 740357DEST_PATH_IMAGE025
Will be
Figure 940394DEST_PATH_IMAGE026
Is deformed into
Figure 386288DEST_PATH_IMAGE027
(ii) a Matrix multiplication is then employed to determine the pairwise relationship between locations and obtain a matrix
Figure 468513DEST_PATH_IMAGE028
Figure 408787DEST_PATH_IMAGE029
2. The global attention network model for vehicle re-identification according to claim 1, wherein: the down-sampling stride of the res_conv5_1 blocks of the global branch and the local branch is changed from 2 to 1, and a spatial global attention module and a channel global attention module are then added after the res_conv5 blocks of the two global branches, respectively, to extract reliable saliency information and enhance the feature discrimination capability, where res_conv5 denotes the fifth layer of the ResNet50 network model and res_conv5_1 denotes the first building block in the fifth layer of the ResNet50 network model.
3. The global attention network model for vehicle re-identification according to claim 1, wherein: the function $\theta$ uses two 3 × 3 group convolutions to enlarge the receptive field and reduce the number of parameters; matrix multiplication is then used to obtain the matrix $M \in \mathbb{R}^{C \times C}$, which represents the pairwise relations of all channels and is written as

$$M = B\,D$$

each row of the matrix $M$ represents the pairwise relations between one channel and all other channels; the average pairwise relation of each channel is modeled to obtain the channel's global relation, and the global-relation importance of a channel relative to the other channels is then used to obtain that channel's weight among all channels.
4. The global attention network model for vehicle re-identification according to claim 3, wherein the specific process of obtaining a channel's weight among all channels from its global-relation importance is as follows:

relational average pooling (RAP) is applied to the matrix $M$ to obtain a vector $r \in \mathbb{R}^{C}$, where $C$ is the number of channels; each element of the vector $r$ represents the global relation between one channel and all channels, and the $i$-th element of $r$ is defined as

$$r_i = \frac{1}{C} \sum_{j=1}^{C} M_{ij}, \qquad i = 1, \dots, C$$

the softmax function is adopted to convert all global relations into a weight for each channel:

$$w_i = \frac{\exp(r_i)}{\sum_{j=1}^{C} \exp(r_j)}$$

first, the vector $w$ is reshaped to $\mathbb{R}^{C \times 1 \times 1}$ and then broadcast to $\mathbb{R}^{C \times H \times W}$, i.e., the attention map $W$; finally, element-wise multiplication and element-wise summation are applied with the original feature map to obtain the final feature map $Y$:

$$Y = W \otimes A \oplus A$$
5. The global attention network model for vehicle re-identification according to claim 1, wherein: relational average pooling (RAP) is applied to the matrix $M \in \mathbb{R}^{N \times N}$ to obtain the vector $r \in \mathbb{R}^{N}$; the $i$-th element of the vector $r$ can be expressed as:

$$r_i = \frac{1}{N} \sum_{j=1}^{N} M_{ij}, \qquad i = 1, \dots, N$$
CN202110977958.6A 2021-08-25 2021-08-25 Global attention network model for vehicle re-identification Active CN113420742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110977958.6A CN113420742B (en) 2021-08-25 2021-08-25 Global attention network model for vehicle re-identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110977958.6A CN113420742B (en) 2021-08-25 2021-08-25 Global attention network model for vehicle re-identification

Publications (2)

Publication Number Publication Date
CN113420742A CN113420742A (en) 2021-09-21
CN113420742B (en) 2022-01-11

Family

ID=77719317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110977958.6A Active CN113420742B (en) Global attention network model for vehicle re-identification

Country Status (1)

Country Link
CN (1) CN113420742B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989836B (en) * 2021-10-20 2022-11-29 华南农业大学 Dairy cow face re-identification method, system, equipment and medium based on deep learning
CN113822246B (en) * 2021-11-22 2022-02-18 山东交通学院 Vehicle re-identification method based on global reference attention mechanism
CN114663861B (en) * 2022-05-17 2022-08-26 山东交通学院 Vehicle re-identification method based on dimension decoupling and non-local relation
CN116110076B (en) * 2023-02-09 2023-11-07 国网江苏省电力有限公司苏州供电分公司 Power transmission aerial work personnel identity re-identification method and system based on mixed granularity network
CN116052218B (en) * 2023-02-13 2023-07-18 中国矿业大学 Pedestrian re-identification method
CN116311105B (en) * 2023-05-15 2023-09-19 山东交通学院 Vehicle re-identification method based on inter-sample context guidance network
CN116704453B (en) * 2023-08-08 2023-11-28 山东交通学院 Method for vehicle re-identification by adopting self-adaptive division and priori reinforcement part learning network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084139B (en) * 2019-04-04 2021-02-26 长沙千视通智能科技有限公司 Vehicle re-identification method based on multi-branch deep learning
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian re-identification method based on attention mechanism with global and local features
CN111325111A (en) * 2020-01-23 2020-06-23 同济大学 Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision
CN111401177B (en) * 2020-03-09 2023-04-07 山东大学 End-to-end behavior recognition method and system based on adaptive space-time attention mechanism
CN111507217A (en) * 2020-04-08 2020-08-07 南京邮电大学 Pedestrian re-identification method based on local resolution feature fusion
CN111368815B (en) * 2020-05-28 2020-09-04 之江实验室 Pedestrian re-identification method based on multi-component self-attention mechanism
CN111931624B (en) * 2020-08-03 2023-02-07 重庆邮电大学 Attention-mechanism-based lightweight multi-branch pedestrian re-identification method and system

Also Published As

Publication number Publication date
CN113420742A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN113420742B (en) Global attention network model for vehicle re-identification
Chen et al. Partition and reunion: A two-branch neural network for vehicle re-identification.
CN108197326B (en) Vehicle retrieval method and device, electronic equipment and storage medium
CN111080629A (en) Method for detecting image splicing tampering
CN106557579B (en) Vehicle model retrieval system and method based on convolutional neural network
CN112966137B (en) Image retrieval method and system based on global and local feature rearrangement
CN111507217A (en) Pedestrian re-identification method based on local resolution feature fusion
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN108154133B (en) Face portrait-photo recognition method based on asymmetric joint learning
CN111582339B (en) Vehicle detection and recognition method based on deep learning
CN108764096B (en) Pedestrian re-identification system and method
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN113592007B (en) Knowledge distillation-based bad picture identification system and method, computer and storage medium
CN112785480B (en) Image splicing tampering detection method based on frequency domain transformation and residual error feedback module
Zang et al. Traffic lane detection using fully convolutional neural network
CN109325407B (en) Optical remote sensing video target detection method based on F-SSD network filtering
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN110826415A (en) Method and device for re-identifying vehicles in scene image
CN112861605A (en) Multi-person gait recognition method based on space-time mixed characteristics
CN105184299A (en) Vehicle body color identification method based on local restriction linearity coding
CN113269224A (en) Scene image classification method, system and storage medium
CN117197763A (en) Road crack detection method and system based on cross attention guide feature alignment network
Elkerdawy et al. Fine-grained vehicle classification with unsupervised parts co-occurrence learning
CN110458234B (en) Vehicle searching method with map based on deep learning
CN112418262A (en) Vehicle re-identification method, client and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant