CN113420742B - Global attention network model for vehicle re-identification - Google Patents

Global attention network model for vehicle re-identification

Info

Publication number
CN113420742B
CN113420742B (application CN202110977958.6A)
Authority
CN
China
Prior art keywords
global
channels
network model
channel
branches
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110977958.6A
Other languages
Chinese (zh)
Other versions
CN113420742A (en)
Inventor
庞希愚
田鑫
王成
姜刚武
郑艳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jiaotong University
Original Assignee
Shandong Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jiaotong University
Priority to CN202110977958.6A
Publication of CN113420742A
Application granted
Publication of CN113420742B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 — Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of vehicle identification, in particular to a global attention network model for vehicle re-identification, comprising a backbone network, a local branch that divides the feature map into two parts, and two global branches with global attention modules; the backbone network is split into 3 branches; the global attention network model extracts feature vectors using global average pooling to cover the information of the entire vehicle; the local branch divides the feature map into only two horizontal parts. The invention constructs a global attention network with three branches to extract a large amount of discriminative information. Two global attention modules, CGAM and SGAM, are constructed: they model the global relation of each node through the average pairwise relations among nodes and infer the node's importance, reducing computational complexity. On the local branch the feature map is divided horizontally into only two parts, which largely alleviates the problems of part misalignment and broken local consistency.

Description

Global attention network model for vehicle re-identification
Technical Field
The invention relates to the technical field of vehicle identification, and in particular to a global attention network model for vehicle re-identification.
Background
Vehicle re-identification refers to recognizing a target vehicle across different cameras. It plays an important role in intelligent transportation and smart cities and has many real-life applications. For example, in real traffic monitoring systems, vehicle re-identification can support locating, monitoring, and criminal investigation of a target vehicle. With the rise of deep neural networks and the introduction of large datasets, improving the accuracy of vehicle re-identification has become a research hotspot in computer vision and multimedia in recent years. However, owing to the different viewing angles of multiple cameras and the influence of illumination, occlusion, and other factors, intra-class feature distances become larger and inter-class feature distances become smaller, further increasing the difficulty of identification.
Pedestrian re-identification and vehicle re-identification are essentially the same problem: both belong to the image retrieval task. In recent years, methods based on Convolutional Neural Networks (CNNs) have made great progress in pedestrian re-identification, and CNN models designed for pedestrian re-identification also perform well in vehicle re-identification. Most advanced CNN-based pedestrian re-identification methods employ CNN models pre-trained on ImageNet and fine-tune them on the re-identification dataset under the supervision of different losses.
CNN-based vehicle and pedestrian re-identification often focuses on extracting global features of person or vehicle images. In this way, complete feature information can be obtained globally, but global features cannot adequately describe intra-class differences caused by factors such as viewing angle. To extract fine-grained local features, pedestrian re-identification network models with local branches, such as the PCB (Part-based Convolutional Baseline) and the MGN (Multiple Granularity Network), were designed. These networks divide the feature map into several stripes to extract local features; in addition, the latter combines local features with global features, further improving model performance. For vehicle re-identification, vehicles of the same model are substantially identical in global appearance, while small regions, such as inspection marks, decorations, and marks of use, may vary greatly. Therefore, local detail information of the vehicle is also important for the vehicle re-identification task.
However, these local-feature-based models share a common drawback: they require the body parts of the same person to be relatively aligned in order to learn meaningful local features. Although vehicle re-identification and pedestrian re-identification are both image retrieval problems in nature, the body-part boundaries of vehicles are not as sharp as those of pedestrians, and the body of the same vehicle differs greatly when viewed from different angles. On the other hand, strictly uniform partitioning of the feature map destroys local consistency, and the degree of destruction is generally proportional to the number of partitions; that is, the greater the number of partitions, the more easily local consistency is destroyed. This makes it difficult for deep neural networks to obtain meaningful fine-grained local information from the parts, thereby degrading performance. Therefore, simply applying the part-segmentation methods of pedestrian re-identification to vehicles is not feasible.
Attention mechanisms play an important role in the human perception system, helping people focus on useful distinctive features while eliminating noise and background interference. For network models, an attention mechanism can make the model focus on the target subject rather than the background, and it is widely used in re-identification tasks. Accordingly, many networks with attention modules have been proposed. However, they mainly build attention over nodes (channels or spatial positions) by direct convolution on each node's own information, or directly reconstruct nodes using the pairwise relations between nodes; they do not consider that the global relations between nodes play an important guiding role in building the attention (importance) of each node.
In the vehicle re-identification task, different camera positions produce illumination changes, viewpoint changes, and resolution differences, so the intra-class difference of the same vehicle under different viewing angles is large, while the inter-class difference of different vehicles of the same model is small. This greatly increases the difficulty of vehicle re-identification, the key to which is extracting discriminative vehicle features. To better extract such features from vehicle images and improve recognition accuracy, it is necessary to propose a global attention network model for vehicle re-identification.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a global attention network model for vehicle re-identification, which can extract fine local information simply, largely alleviating the problems of part misalignment and broken local consistency, and which can build reliable node attention from the global relations between nodes, thereby extracting more credible saliency information for vehicle re-identification.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a global attention network model for vehicle re-identification comprises a backbone network, a local branch that divides the feature map into two parts, and two global branches with global attention modules; the backbone network is split into 3 branches; on the final feature map output by each branch, the global attention network model extracts a feature vector using global average pooling (GAP) to cover the whole-body information of the vehicle image; the local branch divides the vehicle feature map into only two horizontal parts, which largely alleviates the problems of misalignment and broken local consistency.
The two global branches are respectively equipped with a Channel Global Attention Module (CGAM) and a Spatial Global Attention Module (SGAM) for extracting more reliable saliency information. The backbone network employs the ResNet50 network model.
To increase the resolution, the down-sampling stride of the res_conv5_1 block of the global Global-C branch and the local branch is changed from 2 to 1; a spatial global attention module and a channel global attention module are then added after the res_conv5 blocks of the two global branches, respectively, to extract reliable saliency information and enhance the feature discrimination capability. Here res_conv5 denotes the fifth layer of the ResNet50 network model, and res_conv5_1 denotes the first building block in that fifth layer.
After the feature vector is extracted with global average pooling (GAP) on each branch, a feature dimensionality-reduction module containing a 1 × 1 convolution, a BN layer, and a ReLU function reduces the feature dimension to 256, providing a compact feature representation. The network model is trained by applying a triplet loss and a cross-entropy loss on each branch: the triplet loss is applied directly to the 256-dimensional feature vector, and a fully connected layer is added after the 256-dimensional feature vector to apply the cross-entropy loss. In the testing stage, the features before the fully connected layers of the three branches are concatenated as the final output feature.
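As a concrete illustration of the per-branch head described above, the following PyTorch sketch shows GAP followed by the 1 × 1 convolution + BN + ReLU reduction to 256 dimensions, with the 256-d vector used for the triplet loss and a fully connected classifier for the cross-entropy loss; names such as BranchHead, in_channels, and num_classes are illustrative, not taken from the patent:

```python
import torch.nn as nn

class BranchHead(nn.Module):
    """Per-branch head (sketch): GAP -> 1x1 conv + BN + ReLU -> 256-d embedding.
    The 256-d vector feeds the triplet loss; the FC layer feeds the cross-entropy loss."""
    def __init__(self, in_channels=2048, reduced_dim=256, num_classes=576):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling (GAP)
        self.reduce = nn.Sequential(                # feature dimensionality reduction
            nn.Conv2d(in_channels, reduced_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced_dim),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(reduced_dim, num_classes)  # FC for cross-entropy

    def forward(self, feat_map):
        v = self.reduce(self.gap(feat_map)).flatten(1)  # 256-d feature (triplet loss)
        logits = self.classifier(v)                     # ID logits (cross-entropy loss)
        return v, logits
```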
The CGAM architecture: let the tensor $A \in \mathbb{R}^{C \times H \times W}$ be the feature map input to CGAM, where $C$ is the number of channels and $H$ and $W$ are the spatial height and width of the tensor. Tensors $B$ and $D$ are obtained from the functions $\theta(A)$ and $\phi(A)$; $B$ is reshaped to $\mathbb{R}^{C \times N}$ and $D$ is reshaped to $\mathbb{R}^{N \times C}$, where $N = H \times W$. $\theta$ and $\phi$ have the same architecture, each consisting of two 1 × 1 convolutions, two 3 × 3 group convolutions, two BatchNorm layers, and two ReLU activation functions. The $\theta$ architecture uses two 3 × 3 group convolutions to enlarge the receptive field while reducing the number of parameters. Matrix multiplication is then used to obtain the matrix $M \in \mathbb{R}^{C \times C}$, which represents the pairwise relations of all channels. $M$ is written as:

$$M = B\,D$$

Each row of the matrix $M$ represents the pairwise relations between one channel and all other channels. The average pairwise relation of each channel is modeled to obtain the channel's global relation, and the global-relation importance of a channel relative to the other channels is then used to obtain that channel's weight among all channels.

The process of obtaining a channel's weight among all channels from its global-relation importance is as follows: relational average pooling (RAP) is applied to the matrix $M$ to obtain a vector $r \in \mathbb{R}^{C}$, where $C$ is the number of channels. Each element of the vector $r$ now represents the global relation between one channel and all channels; the $i$-th element of $r$ is defined as

$$r_i = \frac{1}{C} \sum_{j=1}^{C} M_{ij}, \qquad i = 1, \dots, C$$

The softmax function converts all global relations into a weight for each channel:

$$w_i = \frac{\exp(r_i)}{\sum_{j=1}^{C} \exp(r_j)}$$

To obtain the attention map $W$, the vector $w$ is first reshaped to $\mathbb{R}^{C \times 1 \times 1}$ and then broadcast to $\mathbb{R}^{C \times H \times W}$, i.e., the attention map $W$. Finally, element-wise multiplication and element-wise summation with the original feature map are applied to obtain the final feature map $Y$, which can be expressed as:

$$Y = W \otimes A \oplus A$$
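A minimal PyTorch sketch of CGAM follows, under the simplifying assumption that θ and φ are each a single 1 × 1 convolution + BN + ReLU (the patent's θ/φ additionally use two 3 × 3 group convolutions, see FIG. 3); the RAP, softmax, broadcast, and residual steps mirror the formulas above:

```python
import torch
import torch.nn as nn

class CGAM(nn.Module):
    """Channel Global Attention Module (sketch with simplified theta/phi)."""
    def __init__(self, channels):
        super().__init__()
        def transform():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.theta, self.phi = transform(), transform()

    def forward(self, a):                                  # a: (n, C, H, W)
        n, c, h, w = a.shape
        b = self.theta(a).view(n, c, h * w)                # B: C x N
        d = self.phi(a).view(n, c, h * w).transpose(1, 2)  # D: N x C
        m = torch.bmm(b, d)                                # M = BD: pairwise channel relations (C x C)
        r = m.mean(dim=2)                                  # RAP: average pairwise relation per channel
        wgt = torch.softmax(r, dim=1)                      # softmax over channels -> weights
        attn = wgt.view(n, c, 1, 1)                        # reshape, then broadcast over H x W
        return attn * a + a                                # Y = W (x) A (+) A
```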
the SGAM architecture: spatial attention and channel attention, which work in a similar manner, use global relationships between locations and channels to determine the importance of each location and channel, respectively. But is in phase with CGAMIn contrast, SGAM has three differences. First, let the quantity of the image
Figure 971328DEST_PATH_IMAGE031
Is a characteristic diagram of the SGAM input,
Figure 720366DEST_PATH_IMAGE032
and
Figure 275850DEST_PATH_IMAGE033
the system has the same structure and comprises a 1 × 1 convolution, a BN layer and a ReLU function, and the number of channels is measured
Figure 262260DEST_PATH_IMAGE034
Is reduced to
Figure 883122DEST_PATH_IMAGE035
Figure 59020DEST_PATH_IMAGE036
For the reduction factor, set to 2 in the experiment; by a function
Figure 469010DEST_PATH_IMAGE037
And
Figure 298426DEST_PATH_IMAGE038
obtain the tensor
Figure 875425DEST_PATH_IMAGE039
And
Figure 87970DEST_PATH_IMAGE040
and will be
Figure 853931DEST_PATH_IMAGE041
Is deformed into
Figure 773957DEST_PATH_IMAGE042
Will be
Figure 586055DEST_PATH_IMAGE040
Is deformed into
Figure 477656DEST_PATH_IMAGE043
(ii) a Matrix multiplication is then employed to determine the pairwise relationship between locations and obtain a matrix
Figure 865169DEST_PATH_IMAGE044
Figure 895442DEST_PATH_IMAGE045
Second, to determine the importance of a location, the matrix is aligned
Figure 539044DEST_PATH_IMAGE046
Obtaining vectors using relational average pooled RAPs
Figure 93391DEST_PATH_IMAGE047
(ii) a Vector quantity
Figure 20895DEST_PATH_IMAGE048
To (1) a
Figure 972802DEST_PATH_IMAGE049
The individual elements may be represented as:
Figure 618547DEST_PATH_IMAGE050
thirdly, the invention firstly transforms the vector generated by the softmax function into the vector
Figure 323722DEST_PATH_IMAGE051
Then broadcast it as
Figure 682897DEST_PATH_IMAGE052
In CGAM and SGAM, the feature map to which attention has been applied is added to the original feature map to obtain the final output feature map. There are two reasons for using this addition. First, the normalization function used here is softmax, which maps the weights into the range 0 to 1 with all weights summing to 1; because there are many weights, the element values of the feature map output by the attention module may be very small, weakening the features of the original network, and training would be very difficult without adding the original feature map. Second, the addition also highlights the reliable saliency information in $W \otimes A$. Experiments likewise show that the model performs well with this residual structure: compared with the model without the addition, it improves mAP and Top-1 by 1.2% and 1.5%, respectively.
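A corresponding PyTorch sketch of SGAM, with the reduction factor s = 2 as stated above and the same hedged simplifications (the batch-wise implementation details are our assumptions); note the residual add at the end, as discussed:

```python
import torch
import torch.nn as nn

class SGAM(nn.Module):
    """Spatial Global Attention Module (sketch): positions play the role channels play in CGAM."""
    def __init__(self, channels, s=2):
        super().__init__()
        def transform():
            return nn.Sequential(
                nn.Conv2d(channels, channels // s, kernel_size=1, bias=False),  # reduce C -> C/s
                nn.BatchNorm2d(channels // s),
                nn.ReLU(inplace=True),
            )
        self.theta, self.phi = transform(), transform()

    def forward(self, a):                             # a: (n, C, H, W)
        n, c, h, w = a.shape
        b = self.theta(a).flatten(2).transpose(1, 2)  # B: N x C/s, N = H*W
        d = self.phi(a).flatten(2)                    # D: C/s x N
        m = torch.bmm(b, d)                           # M = BD: pairwise position relations (N x N)
        r = m.mean(dim=2)                             # RAP over positions
        wgt = torch.softmax(r, dim=1)                 # per-position weights
        attn = wgt.view(n, 1, h, w)                   # reshape to 1 x H x W, broadcast over channels
        return attn * a + a                           # residual add keeps training stable
```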
For the loss functions, the most common cross-entropy loss and triplet loss are used.
Cross entropy measures the difference between the true probability distribution and the predicted probability distribution. It can be expressed as:

$$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(z_{y_i})}{\sum_{k} \exp(z_k)}$$

where $N$ denotes the number of images in the mini-batch, $y_i$ denotes the true ID label, and $z_k$ denotes the predicted logit of the $k$-th ID class.
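The cross-entropy term can be computed directly from the logits; a short PyTorch sketch mirroring the formula above (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def id_loss(logits, labels):
    """Cross-entropy ID loss: -1/N * sum_i log softmax(z_i)[y_i]."""
    log_probs = F.log_softmax(logits, dim=1)                  # log of predicted distribution
    picked = log_probs[torch.arange(logits.size(0)), labels]  # log-prob of the true ID class
    return -picked.mean()                                     # average over the mini-batch
```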
The objective of the triplet loss is to pull samples with the same label as close together as possible in the embedding space while pushing samples with different labels as far apart as possible. The invention adopts the batch-hard triplet loss: each mini-batch is formed by randomly sampling $P$ identities and $K$ images per identity to meet the requirements of the batch-hard triplet loss. The loss can be defined as

$$L_{tri} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[\, m + \max_{p=1,\dots,K} \big\| f_a^i - f_p^i \big\|_2 \;-\; \min_{\substack{j \neq i \\ n=1,\dots,K}} \big\| f_a^i - f_n^j \big\|_2 \,\Big]_+$$

where $f_a^i$, $f_p^i$, and $f_n^j$ are the features extracted from the anchor, the positive sample, and the negative sample, respectively. The margin $m$ is set to 1.2, which helps reduce intra-class variation and widen inter-class variation to improve model performance.
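A hedged PyTorch sketch of the batch-hard triplet loss over a P × K mini-batch; the hardest-positive/hardest-negative mining follows the definition above, while averaging over anchors (rather than summing) is an implementation choice of ours:

```python
import torch

def batch_hard_triplet_loss(features, labels, margin=1.2):
    """For each anchor: farthest same-ID sample minus nearest different-ID sample, plus margin."""
    dist = torch.cdist(features, features, p=2)           # pairwise L2 distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)  # mask of positive pairs
    hardest_pos = dist.masked_fill(~same_id, 0.0).max(dim=1).values          # farthest positive
    hardest_neg = dist.masked_fill(same_id, float("inf")).min(dim=1).values  # nearest negative
    return torch.relu(margin + hardest_pos - hardest_neg).mean()             # hinge, averaged
```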
The total training loss is the sum of the cross-entropy loss and the triplet loss over all branches:

$$L = \sum_{i=1}^{B} \big( \lambda_1 L_{ce}^i + \lambda_2 L_{tri}^i \big)$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters balancing the two loss terms, both set to 1 in the experiments, and $B$ is the number of branches.
The technical effects of the invention:
Compared with the prior art, the global attention network model for vehicle re-identification has the following advantages. The invention constructs a global attention network with three branches to extract a large amount of discriminative information. Based on the global relations of nodes, the invention constructs two global attention modules, CGAM and SGAM: the global relation of a node is obtained by modeling the average pairwise relation between that node and all other nodes, from which the node's global importance is inferred. On the one hand, this reduces the difficulty of attention learning and the computational complexity; on the other hand, group evaluation yields a more reliable measure of node importance and thus more reliable saliency information. On the local branch, the vehicle image is divided horizontally into only two parts, which largely alleviates the problems of part misalignment and broken local consistency. The effectiveness of the algorithm is verified by experiments on two vehicle re-identification datasets, where its performance is superior to the SOTA methods.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture of the present invention;
FIG. 2 is a block diagram of the CGAM architecture of the present invention;
FIG. 3 is a diagram of the θ (and φ) architecture of the present invention;
fig. 4 is a diagram of the SGAM architecture of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the specification.
As shown in fig. 1, a global attention network model for vehicle re-identification includes a backbone network, a local branch that divides the feature map into two parts, and two global branches with global attention modules. The backbone uses ResNet50 as the basis for feature-map extraction, with the original fully connected layer removed for multi-loss training, and the ResNet50 backbone is split into 3 branches after the res_conv4_1 residual block. The global attention network model uses global average pooling (GAP) to cover the whole body of the vehicle image; the local branch divides the vehicle feature map into only two horizontal parts, which largely alleviates the problems of misalignment and broken local consistency.
To increase the resolution, the down-sampling stride of the res_conv5_1 block of the global Global-C branch and the local branch is changed from 2 to 1; a spatial global attention module and a channel global attention module are then added after the res_conv5 blocks of the two global branches, respectively, to extract reliable saliency information and enhance the feature discrimination capability.
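A minimal torchvision sketch of this stride change; in torchvision's ResNet50, layer4 corresponds to res_conv5 and its first block to res_conv5_1 (the weights argument assumes a recent torchvision version):

```python
import torchvision

def build_backbone():
    """ResNet50 with the res_conv5_1 down-sampling stride changed from 2 to 1,
    keeping a higher-resolution final feature map."""
    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    resnet.layer4[0].conv2.stride = (1, 1)          # main path of the first res_conv5 block
    resnet.layer4[0].downsample[0].stride = (1, 1)  # matching 1x1 shortcut convolution
    return resnet
```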
The feature dimensionality-reduction module contains a 1 × 1 convolution, a BN layer, and a ReLU function, and reduces the feature dimension to 256, providing a compact feature representation. The network model is trained by applying a triplet loss and a cross-entropy loss on each branch: the triplet loss is applied directly to the 256-dimensional feature vector, and a fully connected layer is added after the 256-dimensional feature vector to apply the cross-entropy loss. In the testing stage, the features before the fully connected layers of the three branches are concatenated as the final output feature.
The two global branches are respectively equipped with a Channel Global Attention Module (CGAM) and a Spatial Global Attention Module (SGAM) for extracting more reliable saliency information.
As shown in FIG. 2, which illustrates the CGAM architecture: let the tensor $A \in \mathbb{R}^{C \times H \times W}$ be the feature map input to CGAM, where $C$ is the number of channels and $H$ and $W$ are the spatial height and width of the tensor. Tensors $B$ and $D$ are obtained from the functions $\theta(A)$ and $\phi(A)$; $B$ is reshaped to $\mathbb{R}^{C \times N}$ and $D$ is reshaped to $\mathbb{R}^{N \times C}$, where $N = H \times W$. $\theta$ and $\phi$ have the same architecture, each consisting of two 1 × 1 convolutions, two 3 × 3 group convolutions, two BatchNorm layers, and two ReLU activation functions; the $\theta$ architecture uses the two 3 × 3 group convolutions to enlarge the receptive field while reducing the number of parameters. Matrix multiplication is then used to obtain the matrix $M \in \mathbb{R}^{C \times C}$, which represents the pairwise relations of all channels; $M$ is written as:

$$M = B\,D$$

Each row of the matrix $M$ represents the pairwise relations between one channel and all other channels. The average pairwise relation of each channel is modeled to obtain the channel's global relation, and the global-relation importance of a channel relative to the other channels is then used to obtain that channel's weight among all channels.
Specifically, as shown in fig. 3, the number of channels $C$ of the input tensor is first halved by a 1 × 1 convolution; the feature map is then divided into 32 groups by a 3 × 3 group convolution, each group convolved separately, with padding of one so the feature-map size is kept constant. This 3 × 3 convolution also keeps the number of channels unchanged. A BatchNorm (BN) layer is used for normalization, and the ReLU activation function adds nonlinearity. Then a 1 × 1 and a 3 × 3 convolution are applied again to make the number of channels consistent with the original input tensor.
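A sketch of this θ/φ transform under the stated structure; the group count of 32 and padding of 1 are as described, while the exact placement of the second BN/ReLU pair is our assumption:

```python
import torch.nn as nn

def make_theta(channels, groups=32):
    """theta/phi (sketch): 1x1 conv halves the channels; a 3x3 group conv keeps the
    spatial size and channel count; BN + ReLU; then a 1x1 + 3x3 group-conv pair
    restores the channel count to match the input tensor."""
    mid = channels // 2
    return nn.Sequential(
        nn.Conv2d(channels, mid, kernel_size=1, bias=False),                      # halve channels
        nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=groups, bias=False), # 32-group conv
        nn.BatchNorm2d(mid),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid, channels, kernel_size=1, bias=False),                      # restore channels
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )
```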
The process of obtaining a channel's weight among all channels from its global-relation importance is as follows: relational average pooling (RAP) is applied to the matrix $M$ to obtain a vector $r \in \mathbb{R}^{C}$, where $C$ is the number of channels. Each element of the vector $r$ now represents the global relation between one channel and all channels; the $i$-th element of $r$ is defined as

$$r_i = \frac{1}{C} \sum_{j=1}^{C} M_{ij}, \qquad i = 1, \dots, C$$

The softmax function converts all global relations into a weight for each channel:

$$w_i = \frac{\exp(r_i)}{\sum_{j=1}^{C} \exp(r_j)}$$

To obtain the attention map $W$, the vector $w$ is first reshaped to $\mathbb{R}^{C \times 1 \times 1}$ and then broadcast to $\mathbb{R}^{C \times H \times W}$. Finally, element-wise multiplication and element-wise summation with the original feature map are applied to obtain the final feature map $Y$, which can be expressed as:

$$Y = W \otimes A \oplus A$$
as shown in fig. 4, which illustrates the SGAM architecture, spatial attention and channel attention utilize global relationships between locations and between channels, respectively, to determine the importance of each location and channel, and they operate similarly. However, SGAM has three differences compared to CGAM. First, let the quantity of the image
Figure 104451DEST_PATH_IMAGE086
Is a characteristic diagram of the SGAM input,
Figure 27801DEST_PATH_IMAGE087
and
Figure 263610DEST_PATH_IMAGE088
the system has the same structure and comprises a 1 × 1 convolution, a BN layer and a ReLU function, and the number of channels is measured
Figure 946395DEST_PATH_IMAGE034
Is reduced to
Figure 225936DEST_PATH_IMAGE035
Figure 384384DEST_PATH_IMAGE089
For the reduction factor, set to 2 in the experiment; by a function
Figure 112300DEST_PATH_IMAGE090
And
Figure 873759DEST_PATH_IMAGE038
obtain the tensor
Figure 340512DEST_PATH_IMAGE039
And
Figure 736990DEST_PATH_IMAGE040
and will be
Figure 189968DEST_PATH_IMAGE041
Is deformed into
Figure 345880DEST_PATH_IMAGE042
Will be
Figure 858901DEST_PATH_IMAGE040
Is deformed into
Figure 991942DEST_PATH_IMAGE091
(ii) a Matrix multiplication is then employed to determine the pairwise relationship between locations and obtain a matrix
Figure 858398DEST_PATH_IMAGE092
Figure 760495DEST_PATH_IMAGE093
Second, to determine the importance of a location, the present invention applies to the matrix
Figure 555669DEST_PATH_IMAGE046
Obtaining vectors using relational average pooled RAPs
Figure 520214DEST_PATH_IMAGE047
. Vector quantity
Figure 439629DEST_PATH_IMAGE048
To (1) a
Figure 337177DEST_PATH_IMAGE049
The individual elements may be represented as:
Figure 300323DEST_PATH_IMAGE094
thirdly, the invention firstly transforms the vector generated by the softmax function into the vector
Figure 17743DEST_PATH_IMAGE095
Then broadcast it as
Figure 944111DEST_PATH_IMAGE096
In CGAM and SGAM, the feature map to which attention has been applied is added to the original feature map to obtain the final output feature map. There are two reasons for using this addition. First, the normalization function used here is softmax, which maps the weights into the range 0 to 1 with all weights summing to 1; because there are many weights, the element values of the feature map output by the attention module may be very small, weakening the features of the original network, and training would be very difficult without adding the original feature map. Second, the addition also highlights the reliable saliency information in $W \otimes A$. Experiments likewise show that the model performs well with this residual structure: compared with the model without the addition, it improves mAP and Top-1 by 1.2% and 1.5%, respectively.
For the loss functions, the most common cross-entropy loss and triplet loss are used.
Cross entropy measures the difference between the true probability distribution and the predicted probability distribution. It can be expressed as:

$$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(z_{y_i})}{\sum_{k} \exp(z_k)}$$

where $N$ denotes the number of images in the mini-batch, $y_i$ denotes the true ID label, and $z_k$ denotes the predicted logit of the $k$-th ID class.
The objective of the triplet loss is to pull samples with the same label as close together as possible in the embedding space while pushing samples with different labels as far apart as possible. The invention adopts the batch-hard triplet loss: each mini-batch is formed by randomly sampling $P$ identities and $K$ images per identity to meet the requirements of the batch-hard triplet loss. The loss can be defined as

$$L_{tri} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[\, m + \max_{p=1,\dots,K} \big\| f_a^i - f_p^i \big\|_2 \;-\; \min_{\substack{j \neq i \\ n=1,\dots,K}} \big\| f_a^i - f_n^j \big\|_2 \,\Big]_+$$

where $f_a^i$, $f_p^i$, and $f_n^j$ are the features extracted from the anchor, the positive sample, and the negative sample, respectively. The margin $m$ is set to 1.2, which helps reduce intra-class variation and widen inter-class variation to improve model performance.
The total training loss is the sum of the cross-entropy loss and the triplet loss over all branches:

$$L = \sum_{i=1}^{B} \big( \lambda_1 L_{ce}^i + \lambda_2 L_{tri}^i \big)$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters balancing the two loss terms, both set to 1 in the experiments, and $B$ is the number of branches.
Experiment:
data set: the model of the invention was evaluated on two common vehicle weight identification data sets, including VeRi776 and VehicleID.
VeRi776 consists of about 50,000 images of 776 vehicles, captured by 20 cameras at different locations and from different viewing angles. The training set contains 576 vehicles, and the test set contains the remaining 200 vehicles.
VehicleID comprises daytime data captured by multiple real surveillance cameras distributed across a small city in China. The entire dataset contains 26,267 vehicles (221,763 images). Small, medium, and large test sets are extracted according to test-set size. In the inference stage, one image of each vehicle is randomly selected to form the gallery set, and the other images serve as query images.
Evaluation metrics: for a comprehensive evaluation on each dataset, two metrics, CMC and mAP, are adopted for comparison with existing methods. CMC estimates the probability of finding a correct match within the top-K returned results. mAP is a comprehensive metric that considers both the precision and the recall of the query results.
Implementation details: ResNet50 is selected as the backbone network for generating features, and the same training strategy is applied to both datasets. The RGB channels of each pixel are normalized, and the image is resized to 256 × 256 before being input to the network. Each mini-batch is formed by randomly sampling $P$ identities and, for each identity, $K$ images, to meet the requirement of the triplet loss; fixed values of $P$ and $K$ are set in the experiments to train the proposed model. The margin parameter of the triplet loss is set to 1.2 in all experiments. Adam is used as the optimizer. For the learning-rate schedule, the initial learning rate is set to 2e-4, decays to 2e-5 after 120 epochs, and further drops to 2e-6 and 2e-7 at epochs 220 and 320 for faster convergence. The entire training process lasts 450 epochs, and each branch is trained with the cross-entropy loss and the batch-hard triplet loss.
In the testing phase, the VeRi776 dataset is evaluated in image-to-track form: the distance between the query image and all images in the gallery set is computed, and the minimum image-to-image distance is taken as the image-to-track distance. For the VehicleID dataset, the three test sets are evaluated separately. The features before the fully connected layers of the three branches are concatenated as the final output feature.
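The image-to-track distance used for VeRi776 reduces to a minimum over per-image distances; a small sketch (tensor shapes are illustrative):

```python
import torch

def image_to_track_distance(query_feat, track_feats):
    """Distance from a query image to a track = minimum L2 distance between the
    query feature (D,) and the features of all images in the track (num_images, D)."""
    d = torch.cdist(query_feat.unsqueeze(0), track_feats)  # (1, num_images)
    return d.min().item()
```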
Experimental results: the proposed model is compared with other state-of-the-art models on both datasets. Prior work designed the Local Maximal Occurrence (LOMO) representation to address viewpoint and illumination changes. To obtain better results, the GoogLeNet model was fine-tuned on the CompCars dataset, and SIFT, Color Name, and GoogLeNet features were then jointly adopted for vehicle identification. RAM first divides the image horizontally into three parts and then embeds detailed visual cues in these local regions. To improve the ability to identify subtle differences, PRN introduces a part-regularization (PR) constraint into the vehicle re-identification task. The Parsing-based View-aware Embedding Network (PVEN) avoids the mismatch of local features under different views. Generative Adversarial Networks (GANs) use a generative model and a discriminative model that learn from each other to produce good output; VAMI generates the features of different views with the help of a GAN. TAMR proposes a two-stage attention network that gradually focuses on fine but distinctive local details in the visual appearance of the vehicle, and learns structured deep feature embeddings with a multi-grain ranking loss.
The experimental results on VeRi776 and VehicleID are shown in Table 1 and Table 2, respectively. Among all vision-based methods, the TGRA method of the invention achieves the best results. From Table 1, first, TGRA improves mAP by 2.7% and CMC@1 by 0.1% compared with PVEN; second, the CMC@5 of the proposed method already exceeds 99.1%, a promising performance for real vehicle re-identification scenarios. Table 2 shows the comparison on three test sets of different scales: TGRA improves CMC@5 by more than 4.0% over PRN on the different test sets. It should be noted that some advanced network models require other auxiliary models, which increases algorithmic complexity: PVEN uses U-Net to parse a vehicle into four different views, PRN uses YOLO as a detection network for part localization, and TAMR uses an STN to automatically locate the windshield and the head of the vehicle. The model of the invention achieves better performance without using any auxiliary model.
The model of the invention reports 82.24% mAP, 95.77% CMC@1, and 99.11% CMC@5 on the VeRi776 test set. On the three test sets of VehicleID, CMC@1 is 81.51%, 95.54%, and 72.81%, and CMC@5 is 96.38%, 93.69%, and 91.01%. All results are obtained in single-query mode without re-ranking.

Table 1: comparison with state-of-the-art methods on VeRi776.

Table 2: comparison with state-of-the-art methods on the three VehicleID test sets.
ablation study: a number of experiments were performed on both data sets to verify the validity of the key module in the TGRA. The optimal structure of the model is determined by comparing the performances of different structures.
Effectiveness of CGAM and SGAM: CGAM and SGAM are the channel global attention module and the spatial global attention module, respectively. The results on the VeRi776 test set are shown in Table 3.

Table 3: effectiveness of CGAM and SGAM on the VeRi776 test set.
the validity of the local branch is verified on the Veni 776 as shown in Table 4. "w/o" means none; "local" refers to a local branch of TGRA; "PART-3" and "PART-4" refer to references that divide the feature map into three or four PARTs, respectively.
Table 4:
Figure 701850DEST_PATH_IMAGE112
the model of the invention consists of three branches, on two global branches, channel global attention and spatial global attention are used to extract reliable saliency information. The present invention verifies the effect of SGAM and CGAM on the model, respectively (table 3). As can be seen from Table 3, on the test set of VeRi776, "Baseline + SGAM" was improved by 0.6% and 0.6% at mAP and CMC @1, respectively, compared to Baseline. In addition, compared with Baseline, "Global-C (Branch)" is improved by 1.7% in mAP and 1.0% in CMC @ 1. Then, when both branches with CGAM and SGAM were trained simultaneously, the model yielded 5.0% and 1.6% improvement in mAP and CMC @1 compared to Baseline.
In addition, a qualitative analysis of the global attention module was conducted to show its effectiveness more intuitively. The experimental results show that the network with the global attention module can accurately find images of the same vehicle. Identifying the same vehicle is very difficult when the query image and the target image are taken from different viewpoints, yet the model of the invention still identifies it well. The global attention module of the invention therefore performs well at enhancing discriminative pixels and suppressing noisy pixels.
Local branch verification: "TGRA w/o local" denotes the TGRA model without the local branch. To fully verify the effectiveness of the proposed local branch, two further experiments were conducted, one dividing the feature map into three parts and the other into four. As Table 4 shows, first, among the four models, TGRA without the local branch performs worst, indicating that local detail information is crucial in the vehicle re-identification task. Second, "TGRA (ours)" improves mAP by 0.5% and CMC@1 by 0.6% over "TGRA (Part-3)" on the VeRi776 test set. Moreover, the larger the number of partitions, the worse the performance, owing to misalignment and broken local consistency. The local branch proposed by the invention largely alleviates these problems, and the ablation experiments confirm the effectiveness of the method.
The invention provides a global attention network with three branches for vehicle re-identification; the model can extract useful vehicle features from multiple perspectives. On the local branch, to largely alleviate the problems of misalignment and broken local consistency, the vehicle feature map is evenly divided into two parts. Finally, through the global attention modules, the network can focus on the parts most salient for the vehicle re-identification task and learn more discriminative and robust features. The features of the three branches are concatenated during the testing phase to obtain better performance. Experiments show that the model of the invention clearly outperforms the current best models on the VeRi776 and VehicleID datasets.

Claims (5)

1. A global attention network model for vehicle re-identification, characterized in that: it comprises a backbone network, a local branch that divides the feature map into two parts, and two global branches with global attention modules; the backbone network is split into 3 branches; the global attention network model obtains a feature vector from the final feature map output by each branch using global average pooling (GAP); the local branch divides the vehicle feature map into only two horizontal parts; the two global branches are respectively provided with a Channel Global Attention Module (CGAM) and a Spatial Global Attention Module (SGAM), and the backbone network adopts ResNet50;
the CGAM architecture: quantity of design
Figure 141185DEST_PATH_IMAGE001
Is a feature map of a CGAM input, in which
Figure 252360DEST_PATH_IMAGE002
The number of the channels is the number of the channels,
Figure 715572DEST_PATH_IMAGE003
and
Figure 621211DEST_PATH_IMAGE004
the spatial height and width, respectively, of the tensor; slave function
Figure 904293DEST_PATH_IMAGE005
And
Figure 553581DEST_PATH_IMAGE006
to obtain tensor
Figure 340140DEST_PATH_IMAGE007
And
Figure 682259DEST_PATH_IMAGE008
and will be
Figure 443849DEST_PATH_IMAGE009
Is deformed into
Figure 631248DEST_PATH_IMAGE010
Will be
Figure 803472DEST_PATH_IMAGE011
Is deformed into
Figure 50914DEST_PATH_IMAGE012
Figure 308589DEST_PATH_IMAGE013
And
Figure 299679DEST_PATH_IMAGE014
the system structure is the same, and each system consists of two 1 × 1 convolutions, two 3 × 3 packet convolutions, two BatchNormal layers and two Relu activation functions;
the SGAM architecture: quantity of design
Figure 326409DEST_PATH_IMAGE015
Is a characteristic diagram of the SGAM input,
Figure 744752DEST_PATH_IMAGE016
and
Figure 633598DEST_PATH_IMAGE017
the system has the same structure and comprises a 1 × 1 convolution, a BN layer and a ReLU function, and the number of channels is measured
Figure 677647DEST_PATH_IMAGE018
Is reduced to
Figure 309616DEST_PATH_IMAGE019
Figure 148128DEST_PATH_IMAGE020
For the reduction factor, set to 2 in the experiment; by a function
Figure 131128DEST_PATH_IMAGE021
And
Figure 713288DEST_PATH_IMAGE022
obtain the tensor
Figure 199764DEST_PATH_IMAGE023
And
Figure 477686DEST_PATH_IMAGE024
and will be
Figure 682402DEST_PATH_IMAGE023
Is deformed into
Figure 740357DEST_PATH_IMAGE025
Will be
Figure 940394DEST_PATH_IMAGE026
Is deformed into
Figure 386288DEST_PATH_IMAGE027
(ii) a Matrix multiplication is then employed to determine the pairwise relationship between locations and obtain a matrix
Figure 468513DEST_PATH_IMAGE028
Figure 408787DEST_PATH_IMAGE029
2. The global attention network model for vehicle re-identification according to claim 1, wherein: the down-sampling stride of the res_conv5_1 blocks of the global branch and the local branch is changed from 2 to 1, and a spatial global attention module and a channel global attention module are then added after the res_conv5 blocks of the two global branches, respectively, to extract reliable saliency information and enhance the feature discrimination capability, where res_conv5 denotes the fifth layer of the ResNet50 network model and res_conv5_1 denotes the first building block in the fifth layer of the ResNet50 network model.
3. The global attention network model for vehicle re-identification according to claim 1, wherein: the function $\theta$ uses two 3 × 3 group convolutions to enlarge the receptive field and reduce the number of parameters; matrix multiplication is then used to obtain the matrix $M \in \mathbb{R}^{C \times C}$, which represents the pairwise relations of all channels and is written as

$$M = B\,D$$

each row of the matrix $M$ represents the pairwise relations between one channel and all other channels; the average pairwise relation of each channel is modeled to obtain the channel's global relation, and the global-relation importance of a channel relative to the other channels is then used to obtain that channel's weight among all channels.
4. The global attention network model for vehicle re-identification according to claim 3, wherein the specific process of obtaining a channel's weight among all channels from its global-relation importance is as follows:

relational average pooling (RAP) is applied to the matrix $M$ to obtain a vector $r \in \mathbb{R}^{C}$, where $C$ is the number of channels; each element of the vector $r$ represents the global relation between one channel and all channels, and the $i$-th element of $r$ is defined as

$$r_i = \frac{1}{C} \sum_{j=1}^{C} M_{ij}, \qquad i = 1, \dots, C$$

the softmax function is adopted to convert all global relations into a weight for each channel:

$$w_i = \frac{\exp(r_i)}{\sum_{j=1}^{C} \exp(r_j)}$$

first, the vector $w$ is reshaped to $\mathbb{R}^{C \times 1 \times 1}$ and then broadcast to $\mathbb{R}^{C \times H \times W}$, i.e., the attention map $W$; finally, element-wise multiplication and element-wise summation are applied with the original feature map to obtain the final feature map $Y$:

$$Y = W \otimes A \oplus A$$
5. The global attention network model for vehicle re-identification according to claim 1, wherein: relational average pooling (RAP) is applied to the matrix $M \in \mathbb{R}^{N \times N}$ to obtain the vector $r \in \mathbb{R}^{N}$; the $i$-th element of the vector $r$ can be expressed as:

$$r_i = \frac{1}{N} \sum_{j=1}^{N} M_{ij}, \qquad i = 1, \dots, N$$
CN202110977958.6A 2021-08-25 2021-08-25 Global attention network model for vehicle re-identification Active CN113420742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110977958.6A CN113420742B (en) 2021-08-25 2021-08-25 Global attention network model for vehicle re-identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110977958.6A CN113420742B (en) 2021-08-25 2021-08-25 Global attention network model for vehicle re-identification

Publications (2)

Publication Number Publication Date
CN113420742A CN113420742A (en) 2021-09-21
CN113420742B (en) 2022-01-11

Family

ID=77719317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110977958.6A Active CN113420742B (en) Global attention network model for vehicle re-identification

Country Status (1)

Country Link
CN (1) CN113420742B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989836B (en) * 2021-10-20 2022-11-29 华南农业大学 Dairy cow face re-identification method, system, equipment and medium based on deep learning
CN113822246B (en) * 2021-11-22 2022-02-18 山东交通学院 Vehicle re-identification method based on global reference attention mechanism
CN114663861B (en) * 2022-05-17 2022-08-26 山东交通学院 Vehicle re-identification method based on dimension decoupling and non-local relation
CN116110076B (en) * 2023-02-09 2023-11-07 国网江苏省电力有限公司苏州供电分公司 Power transmission aerial work personnel identity re-identification method and system based on mixed granularity network
CN116052218B (en) * 2023-02-13 2023-07-18 中国矿业大学 Pedestrian re-identification method
CN116311105B (en) * 2023-05-15 2023-09-19 山东交通学院 Vehicle re-identification method based on inter-sample context guidance network
CN116704453B (en) * 2023-08-08 2023-11-28 山东交通学院 Method for vehicle re-identification by adopting self-adaptive division and priori reinforcement part learning network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084139B (en) * 2019-04-04 2021-02-26 长沙千视通智能科技有限公司 Vehicle re-identification method based on multi-branch deep learning
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian re-identification method based on attention mechanism with global and local features
CN111325111A (en) * 2020-01-23 2020-06-23 同济大学 Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision
CN111401177B (en) * 2020-03-09 2023-04-07 山东大学 End-to-end behavior recognition method and system based on adaptive space-time attention mechanism
CN111507217A (en) * 2020-04-08 2020-08-07 南京邮电大学 Pedestrian re-identification method based on local resolution feature fusion
CN111368815B (en) * 2020-05-28 2020-09-04 之江实验室 Pedestrian re-identification method based on multi-component self-attention mechanism
CN111931624B (en) * 2020-08-03 2023-02-07 重庆邮电大学 Attention-mechanism-based lightweight multi-branch pedestrian re-identification method and system

Also Published As

Publication number Publication date
CN113420742A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN113420742B (en) Global attention network model for vehicle re-identification
Chen et al. Partition and reunion: A two-branch neural network for vehicle re-identification.
CN108197326B (en) Vehicle retrieval method and device, electronic equipment and storage medium
CN111080629A (en) Method for detecting image splicing tampering
CN106557579B (en) Vehicle model retrieval system and method based on convolutional neural network
CN112966137B (en) Image retrieval method and system based on global and local feature rearrangement
CN111507217A (en) Pedestrian re-identification method based on local resolution feature fusion
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN108154133B (en) Face portrait-photo recognition method based on asymmetric joint learning
CN111582339B (en) Vehicle detection and recognition method based on deep learning
CN108764096B (en) Pedestrian re-identification system and method
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN113592007B (en) Knowledge distillation-based bad picture identification system and method, computer and storage medium
CN112785480B (en) Image splicing tampering detection method based on frequency domain transformation and residual error feedback module
Zang et al. Traffic lane detection using fully convolutional neural network
CN109325407B (en) Optical remote sensing video target detection method based on F-SSD network filtering
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN110826415A (en) Method and device for re-identifying vehicles in scene image
CN112861605A (en) Multi-person gait recognition method based on space-time mixed characteristics
CN105184299A (en) Vehicle body color identification method based on local restriction linearity coding
CN113269224A (en) Scene image classification method, system and storage medium
CN117197763A (en) Road crack detection method and system based on cross attention guide feature alignment network
Elkerdawy et al. Fine-grained vehicle classification with unsupervised parts co-occurrence learning
CN110458234B (en) Vehicle searching method with map based on deep learning
CN112418262A (en) Vehicle re-identification method, client and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant