CN116644788B - Local refinement and global reinforcement network for vehicle re-identification - Google Patents

Publication number
CN116644788B
CN116644788B
Authority
CN
China
Prior art keywords
global, matrix, vehicle, module, pixels
Prior art date
Legal status
Active
Application number
CN202310926540.1A
Other languages
Chinese (zh)
Other versions
CN116644788A (en)
Inventor
郑美凤
王成
张峰
孙珂
李曦
周厚仁
庞希愚
周晓颖
田佳琛
Current Assignee
Shandong Jiaotong University
Original Assignee
Shandong Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shandong Jiaotong University filed Critical Shandong Jiaotong University
Priority to CN202310926540.1A
Publication of CN116644788A
Application granted
Publication of CN116644788B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of vehicle re-identification, and in particular to a local refinement and global reinforcement network for vehicle re-identification. The network has three branches and learns the discriminative local and global features of a vehicle through a local refinement module and a global reinforcement module. The local refinement module learns a refined local representation, capturing the rich correlation information between neighboring pixels through the interaction of each target pixel with its nearest pixels. The global reinforcement module learns a strengthened global representation: it first disperses the attention of each target pixel into individual windows to emphasize the important long-range dependencies within each region, and then aggregates globally meaningful long-range connections by cross-window interaction. Working together, the local refinement module and the global reinforcement module effectively extract the discriminative local and overall information of the vehicle.

Description

Local refinement and global reinforcement network for vehicle re-identification
Technical Field
The invention relates to the technical field of vehicle re-identification, in particular to a local refinement and global reinforcement network for vehicle re-identification.
Background
Vehicle re-identification aims at retrieving images of the same vehicle as a query identity from an image library. At present, the task mainly faces two challenges: large intra-class differences and small inter-class differences. Learning the discriminative local and global features of a vehicle is critical to addressing both. The self-attention mechanism, which mainly takes two forms, full self-attention and local self-attention, has shown great potential in the field of computer vision. However, the long-range connections modeled by full self-attention over the global context are typically weak, which limits the learning of the overall information of the vehicle, while the window partitioning of local self-attention prevents adequate learning of the local detail information of the vehicle.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by designing a local refinement and global reinforcement network for vehicle re-identification.
The technical scheme adopted for solving the technical problems is as follows:
A local refinement and global reinforcement network for vehicle re-identification employs the residual blocks of ResNet-50 up to res_conv4_2 as the backbone for feature extraction; the part after the res_conv4_1 residual block is divided into three branches, GL Branch, GS Branch, and LR Branch, and the downsampling operation of the res_conv5_1 residual block is removed in all three branches to provide a larger spatial view;
GL Branch, which contains no attention module, learns the general overall information of the vehicle;
a global reinforcement module is added after the res_conv5 layer of GS Branch to learn a strengthened global representation of the vehicle;
a local refinement module is applied after the res_conv5 layer of LR Branch to learn a refined local representation of the vehicle;
The local refinement module aims at capturing the discriminative local information of the vehicle; its structure is as follows:

A feature map $x \in \mathbb{R}^{C \times H \times W}$ is the input to the module, where $C$, $H$, $W$ denote the number of channels, the height, and the width of the feature map. A 1×1 convolution with $3C$ output channels is applied to $x$ to obtain the query tensor $x_q$, the key tensor $x_k$, and the value tensor $x_v$, each of size $C \times H \times W$.

Let the query of the $i$-th pixel in $x$ be $q_i \in \mathbb{R}^{1 \times C}$, the feature vector of $x_q$ at position $i$. The set of keys in the $k \times k$ neighborhood of the $i$-th pixel is denoted $k_i \in \mathbb{R}^{k^2 \times C}$, the feature vectors of the $k^2$ positions in $x_k$ closest to position $i$.

To realize the interaction between the $i$-th pixel and its $k^2$ nearest pixels, the matrix product of $q_i$ and $k_i^{\top}$ is computed and $softmax$ normalization is applied to obtain the attention weight vector $A_i \in \mathbb{R}^{1 \times k^2}$:

$$A_i = softmax(q_i \otimes k_i^{\top})$$

where $\otimes$ denotes matrix multiplication; the $j$-th element of the attention weight vector represents the pairwise affinity between the $i$-th pixel and the $j$-th pixel in its $k \times k$ neighborhood. Then, the feature vectors in the $k \times k$ neighborhood of position $i$ are extracted from $x_v$ and denoted $v_i \in \mathbb{R}^{k^2 \times C}$, the values of the $k^2$ nearest neighbors of the $i$-th pixel. Finally, $v_i$ is aggregated according to the attention score $A_i$ to capture the local context of the $i$-th pixel and reconstruct its representation $y_i \in \mathbb{R}^{1 \times C}$:

$$y_i = A_i \otimes v_i$$
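The per-pixel computation described above can be illustrated with a minimal NumPy sketch (not part of the patent; the shapes $C=4$, $k=3$ and the random tensors are purely illustrative assumptions):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a single row vector.
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative shapes: C channels, a k*k neighborhood around pixel i.
C, k = 4, 3
rng = np.random.default_rng(0)

q_i = rng.standard_normal((1, C))        # query of pixel i (row of x_q)
k_i = rng.standard_normal((k * k, C))    # keys of its k^2 nearest pixels
v_i = rng.standard_normal((k * k, C))    # values of the same neighbors

A_i = softmax(q_i @ k_i.T)               # 1 x k^2 attention weight vector
y_i = A_i @ v_i                          # reconstructed 1 x C representation

assert A_i.shape == (1, k * k)
assert np.isclose(A_i.sum(), 1.0)        # softmax weights sum to 1
assert y_i.shape == (1, C)
```

Each element of `A_i` is the pairwise affinity between pixel $i$ and one neighbor, and `y_i` is the affinity-weighted aggregation of the neighborhood values.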
The global reinforcement module aims at capturing the discriminative overall information of the vehicle; its structure is as follows:

A feature map $x \in \mathbb{R}^{C \times H \times W}$ is the input to the global reinforcement module, where $C$, $H$, $W$ denote the number of channels, the height, and the width of the feature map. A reshaping operation and a fully connected layer are applied to obtain the query matrix $Q \in \mathbb{R}^{HW \times C}$ of $x$.

The $i$-th row $Q_i \in \mathbb{R}^{1 \times C}$ of the matrix represents the query vector of the $i$-th pixel. To disperse the attention scores of a target pixel into multiple windows, $x$ is evenly divided along the spatial dimensions into $M = \frac{H}{h} \times \frac{W}{w}$ windows, where $h$ and $w$ are the height and width of a window. A reshaping operation and a fully connected layer are applied to the feature map of each window to obtain the key matrices $K_1, K_2, \ldots, K_M$ of the $M$ windows, where the key matrix of the $j$-th window is $K_j \in \mathbb{R}^{N \times C}$, $N = h \times w$ is the size of a window, and the linear transformation operations of all windows share the same weights; each row of $K_j$ is a key vector of the $j$-th window.

$Q_i$ is multiplied with $K_j^{\top}$ to obtain the pairwise affinity vector $r_i^{\,j} \in \mathbb{R}^{1 \times N}$ between the target pixel $i$ and the pixels within the $j$-th window, i.e.

$$r_i^{\,j} = Q_i \otimes K_j^{\top}$$

where $\otimes$ denotes matrix multiplication. The pairwise affinity matrix $R_j \in \mathbb{R}^{HW \times N}$ of the $j$-th window with respect to all target pixels is obtained by multiplying $Q$ with $K_j^{\top}$:

$$R_j = Q \otimes K_j^{\top}$$

Each row of $R_j$ holds the pairwise affinities between one target pixel and the pixels within the $j$-th window. Then a $softmax$ normalization is performed along the last dimension of $R_j$ to obtain the attention scores of the pixels of that window at each target pixel:

$$A_j = softmax(R_j)$$

Each row of the attention matrix $A_j \in \mathbb{R}^{HW \times N}$ of the $j$-th window represents the dependency of one target pixel on all pixels in the $j$-th window.

By computing the attention scores of each window at each target pixel, the attention matrices $A_1, A_2, \ldots, A_M$ of the $M$ windows are obtained; these $M$ matrices are computed simultaneously as:

$$[A_1, A_2, \ldots, A_M] = softmax([R_1, R_2, \ldots, R_M])$$

where the $softmax$ operation is performed in the last dimension. To capture the globally meaningful long-range connections of each target pixel, the $M$ attention matrices are concatenated along the column axis into a matrix $A \in \mathbb{R}^{HW \times HW}$, and $L1\_norm$ normalization is executed on it to obtain the attention matrix $A'$ with strengthened long-range dependencies:

$$A' = L1\_norm([A_1, A_2, \ldots, A_M])$$

$L1\_norm$ aggregates the strengthened long-range connections over the global receptive field. Similar to the computation of the key matrices, a reshaping operation and a fully connected layer are applied to the feature map of each window of $x$ to obtain the value matrices $V_1, V_2, \ldots, V_M \in \mathbb{R}^{N \times C}$ of the $M$ windows, where the parameters of the linear transformation operation are shared by all windows. The value matrices of the $M$ windows are concatenated into a value matrix $V \in \mathbb{R}^{HW \times C}$; then, the matrix $A'$ is used to perform a weighted summation over the matrix $V$ to reconstruct the feature representation:

$$S = A' \otimes V$$

The global context captured by the reconstructed feature $S \in \mathbb{R}^{HW \times C}$ strengthens meaningful long-range dependencies that would otherwise receive low relevance.

Finally, the matrix $S$ is reshaped into a tensor of size $C \times H \times W$ and added to the input feature map to compute the output feature map $F'$ of the global reinforcement module:

$$F' = GELU(BN(x + reshape(S)))$$

where $GELU$ denotes the Gaussian Error Linear Unit and $BN$ denotes a batch normalization operation. The module disperses attention into the individual windows and constructs a strengthened global context representation through cross-window interaction, thereby improving the ability of the network to learn the overall information of the vehicle.
Further, in the local refinement module, the computation of the pairwise affinities between each pixel and its $k^2$ nearest pixels, and the reconstruction of all pixels, can be accomplished by $unfold$ operations and matrix multiplications of tensors. First, $x_q$ is reshaped to obtain the query tensor $Q \in \mathbb{R}^{HW \times 1 \times C}$; this tensor holds $HW$ queries, each of size $1 \times C$. Meanwhile, an $unfold$ operation with kernel size $k \times k$ and stride 1 is applied to $x_k$ to extract the $k^2$ keys around each pixel, which are reshaped into the key tensor $K \in \mathbb{R}^{HW \times k^2 \times C}$, where the keys corresponding to the nearest neighbors of each pixel are stored in a $k^2 \times C$ matrix. The attention weight tensor $A \in \mathbb{R}^{HW \times 1 \times k^2}$, in which each entry represents the pairwise affinity between a pixel and one of its $k^2$ nearest pixels, is obtained by the matrix multiplication of $Q$ and $K^{\top}$ followed by a $softmax$ normalization:

$$A = softmax(Q \otimes K^{\top})$$

where the pairwise affinities between a pixel and the pixels in its $k \times k$ neighborhood are represented by a $1 \times k^2$ vector. Next, an $unfold$ operation with kernel size $k \times k$ and stride 1 is applied to $x_v$ to extract the values corresponding to the $k^2$ nearest neighbors of each pixel, which are reshaped into the value tensor $V \in \mathbb{R}^{HW \times k^2 \times C}$, where the nearest-neighbor values of each pixel are stored in a $k^2 \times C$ matrix. Finally, the weight vector of each pixel is used to compute a weighted sum of the values of its $k^2$ surrounding pixels, yielding all reconstructed pixels $\tilde{x} \in \mathbb{R}^{HW \times 1 \times C}$:

$$\tilde{x} = A \otimes V$$

This computation realizes the interaction between each pixel and its nearest neighbors and captures rich detail information.

The tensor $\tilde{x}$ is reshaped into $\mathbb{R}^{C \times H \times W}$ and added to the original feature map; $BN$ and $GELU$ operations are then executed on the summed feature map to obtain the final output feature map $F'$:

$$F' = GELU(BN(x + reshape(\tilde{x})))$$

The local refinement module captures the context of each target pixel with respect to its nearest neighbors. Because its weights are generated through the interaction of the target pixel with its nearest neighbors, it can fully exploit the rich correlation information between pixels and adapt to the different visual patterns at different spatial positions.
Further, each of the three branches employs a global average pooling operation and a dimension reduction module to generate a feature representation of the input vehicle image.
Further, for any feature map output by a branch, a global average pooling operation is used to obtain a 2048-dimensional feature vector, and a dimension reduction module consisting of a 1×1 convolution, BN, and a ReLU activation function then compresses it to 256 dimensions.
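The pooling and dimension reduction steps can be sketched in NumPy (not the patent's implementation; note that a 1×1 convolution applied after global average pooling reduces to a matrix multiply over channels, and the batch-norm stand-in and random weights here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
C_in, C_out, H, W = 2048, 256, 16, 16

x = rng.standard_normal((C_in, H, W))   # feature map output by one branch

# Global average pooling over the spatial dimensions -> 2048-d vector
g = x.mean(axis=(1, 2))

# A 1x1 convolution on a pooled vector is a channel-wise matrix multiply
Wc = rng.standard_normal((C_out, C_in)) * 0.01
y = Wc @ g                              # compressed to 256 dimensions

# Batch-norm stand-in (per-vector standardization) followed by ReLU
y = (y - y.mean()) / (y.std() + 1e-5)
f = np.maximum(y, 0.0)                  # final 256-d embedding

assert g.shape == (2048,)
assert f.shape == (256,)
```

The resulting 256-dimensional vector `f` plays the role of the branch embedding used by the losses described next.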
Further, the 256-dimensional feature vector is used for the computation of the triplet loss, and is fed into a fully connected layer, whose number of output neurons equals the number of vehicle identities in the training set, for the computation of the cross-entropy loss.
Further, the cross-entropy loss is calculated as follows:

$$L_{ce} = -\sum_{i=1}^{N} \mathbb{1}[i = y] \log p_i = -\log p_y$$

where $N$ denotes the number of vehicle identities in the training set, $y$ denotes the real identity label of the image input to the network, and $p_i$ is the probability that the input image belongs to the $i$-th vehicle.
Further, the triplet loss is calculated as follows:

$$L_{tri} = \Big[\, \alpha + \max_i \big\| f_a - f_p^{(i)} \big\|_2 - \min_j \big\| f_a - f_n^{(j)} \big\|_2 \,\Big]_+$$

where $\alpha$ is a margin hyperparameter controlling the difference between the anchor-positive and anchor-negative distances, and $f_a$, $f_p^{(i)}$, $f_n^{(j)}$ are the features extracted from the anchor, the positive samples, and the negative samples, respectively.
Furthermore, the cross-entropy losses and triplet losses of the three branches are added to obtain the final loss; the total loss is calculated as:

$$L = \sum_{n=1}^{N} \big( L_{ce}^{\,n} + L_{tri}^{\,n} \big)$$

where $N$ denotes the number of branches.
The invention has the technical effects that:
Compared with the prior art, the local refinement and global reinforcement network for vehicle re-identification uses a local refinement module and a global reinforcement module to learn the discriminative local and global features of the vehicle, so as to cope with the challenges of vehicle re-identification. The local refinement module learns a refined local representation that captures the rich correlation information between neighboring pixels through the interaction of each target pixel with its nearest pixels. The global reinforcement module learns a strengthened global representation: it first disperses the attention of each target pixel into individual windows to emphasize the important long-range dependencies within each region, and then aggregates globally meaningful long-range connections by cross-window interaction.
Drawings
FIG. 1 is a diagram of a local refinement and global reinforcement network architecture for vehicle re-identification in accordance with the present invention;
FIG. 2 is a block diagram of a locally refined module of the present invention;
FIG. 3 is a block diagram of a global enhancement module of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings of the specification.
Example 1:
As shown in FIG. 1, the local refinement and global reinforcement network for vehicle re-identification of this embodiment employs the residual blocks of ResNet-50 up to res_conv4_2 as the backbone for feature extraction; the part after the res_conv4_1 residual block is divided into three branches, GL Branch, GS Branch, and LR Branch, and the downsampling operation of the res_conv5_1 residual block is removed in all three branches to provide a larger spatial view;
GL Branch, which contains no attention module, learns the general overall information of the vehicle;
a global reinforcement module is added after the res_conv5 layer of GS Branch to learn a strengthened global representation of the vehicle;
a local refinement module is applied after the res_conv5 layer of LR Branch to learn a refined local representation of the vehicle, and the output feature map of the module is divided into two parts along the horizontal direction to facilitate further learning;
The local refinement module aims at capturing the discriminative local information of the vehicle; it refines the local representation by exploiting the rich correlation information contained between adjacent pixels through the interaction of each target pixel with its nearest pixels. Its structure is shown in FIG. 2. A feature map $x \in \mathbb{R}^{C \times H \times W}$ is the input to the module, where $C$, $H$, $W$ denote the number of channels, the height, and the width of the feature map. A 1×1 convolution with $3C$ output channels is applied to $x$ to obtain the query tensor $x_q$, the key tensor $x_k$, and the value tensor $x_v$, each of size $C \times H \times W$.

Let the query of the $i$-th pixel in $x$ be $q_i \in \mathbb{R}^{1 \times C}$, the feature vector of $x_q$ at position $i$. The set of keys in the $k \times k$ neighborhood of the $i$-th pixel is denoted $k_i \in \mathbb{R}^{k^2 \times C}$, the feature vectors of the $k^2$ positions in $x_k$ closest to position $i$.

To realize the interaction between the $i$-th pixel and its $k^2$ nearest pixels, the matrix product of $q_i$ and $k_i^{\top}$ is computed and $softmax$ normalization is applied to obtain the attention weight vector $A_i \in \mathbb{R}^{1 \times k^2}$:

$$A_i = softmax(q_i \otimes k_i^{\top})$$

where $\otimes$ denotes matrix multiplication; the $j$-th element of the attention weight vector represents the pairwise affinity between the $i$-th pixel and the $j$-th pixel in its $k \times k$ neighborhood. Then, the feature vectors in the $k \times k$ neighborhood of position $i$ are extracted from $x_v$ and denoted $v_i \in \mathbb{R}^{k^2 \times C}$, the values of the $k^2$ nearest neighbors of the $i$-th pixel. Finally, $v_i$ is aggregated according to the attention score $A_i$ to capture the local context of the $i$-th pixel and reconstruct its representation $y_i \in \mathbb{R}^{1 \times C}$:

$$y_i = A_i \otimes v_i$$
the computation process gathers rich relevant information between the target pixel and its nearest neighbors, which captures a refined local context compared to local self-attention.
The computation of the pairwise affinities between each pixel and its $k^2$ nearest pixels, and the reconstruction of all pixels, can be accomplished by $unfold$ operations and matrix multiplications of tensors. First, $x_q$ is reshaped to obtain the query tensor $Q \in \mathbb{R}^{HW \times 1 \times C}$; this tensor holds $HW$ queries, each of size $1 \times C$. Meanwhile, an $unfold$ operation with kernel size $k \times k$ and stride 1 is applied to $x_k$ to extract the $k^2$ keys around each pixel, which are reshaped into the key tensor $K \in \mathbb{R}^{HW \times k^2 \times C}$, where the keys corresponding to the nearest neighbors of each pixel are stored in a $k^2 \times C$ matrix. The attention weight tensor $A \in \mathbb{R}^{HW \times 1 \times k^2}$, in which each entry represents the pairwise affinity between a pixel and one of its $k^2$ nearest pixels, is obtained by the matrix multiplication of $Q$ and $K^{\top}$ followed by a $softmax$ normalization:

$$A = softmax(Q \otimes K^{\top})$$

where the pairwise affinities between a pixel and the pixels in its $k \times k$ neighborhood are represented by a $1 \times k^2$ vector. Next, an $unfold$ operation with kernel size $k \times k$ and stride 1 is applied to $x_v$ to extract the values corresponding to the $k^2$ nearest neighbors of each pixel, which are reshaped into the value tensor $V \in \mathbb{R}^{HW \times k^2 \times C}$, where the nearest-neighbor values of each pixel are stored in a $k^2 \times C$ matrix. Finally, the weight vector of each pixel is used to compute a weighted sum of the values of its $k^2$ surrounding pixels, yielding all reconstructed pixels $\tilde{x} \in \mathbb{R}^{HW \times 1 \times C}$:

$$\tilde{x} = A \otimes V$$
the calculation process realizes the interaction between each pixel and the nearest neighbor pixel, and captures rich detail information.
The tensor $\tilde{x}$ is reshaped into $\mathbb{R}^{C \times H \times W}$ and added to the original feature map; $BN$ and $GELU$ operations are then executed on the summed feature map to obtain the final output feature map $F'$:

$$F' = GELU(BN(x + reshape(\tilde{x})))$$
the local refinement modules are similar to a normal convolution in that they both capture the context of the target pixel with respect to its nearest neighbors. But the weights of the normal convolution are static and lack adaptability. The weight of the local refinement module is generated through interaction between the target pixel and the nearest neighbor of the target pixel, so that rich correlation information among pixels can be fully utilized, and the local refinement module is dynamic and can adapt to different visual modes of different spatial positions;
the global enhancement module aims at capturing the discriminative overall information of the vehicle, and the structure is as shown in fig. 3, wherein important remote dependence in a window is emphasized by window segmentation of a key vector and a value vector, and then global significant remote connection is obtained by cross-window interaction, so that global representation is enhanced.
With characteristic diagramsIs an input to the global enhancement module, wherein,C、H、Wthe number, the height and the width of the channels respectively represent the characteristic diagram; obtained by a deforming operation and a fully-connected layer (FC)xQuery matrix->
The first of the matrixiRow of linesRepresent the firstiA query vector of individual pixels; in order to disperse the attention score at a target pixel into multiple windows, the present invention distributes the attention score along the spatial dimensionxEvenly divided into->A window, wherein,handwrespectively are provided withIs the height and width of a window. Applying a deforming operation and a full connection layer to the feature map of each window to obtainMKey matrix of individual windows->
Wherein, the firstjThe key matrix of each window isN=h*wFor the size of the window, the linear transformation operations of all windows share the same weight;K j each column of (a) is a firstjA key vector in a window;
will beQ i And (3) withK T j Matrix multiplication is performed to obtain a target pixeliAnd the firstjPaired affinity vectors between pixels within a windowI.e.
Wherein, representing a matrix multiplication; first, thejPaired affinity matrix for each window with respect to all target pixelsBy means ofQAnd (3) withK T j Matrix multiplication is performed to obtain:
wherein, R j each line of the image is a target imageElement and the firstjPair-wise affinities between pixels within the windows; then, the invention is thatR j Is performed in the column direction of (a)softmaxNormalization operates to obtain the attention score of the pixels of the window at each target pixel, which is formulated as:
first, thejAttention matrix of individual windowsEach row of (a) represents a target pixel and the first rowjDependency of all pixels in the window; the independent calculation of the attention score for the pixels within each window can emphasize significant distance dependence compared to full self-attention.
By calculation ofMThe attention score of each window at each target pixel results inMAttention matrix of individual windowsThe method comprises the steps of carrying out a first treatment on the surface of the This isMThe matrices are simultaneously calculated as:
wherein, softmaxthe operation is performed in the last dimension; to capture a globally significant remote connection of a target pixel, one wouldMThe attention matrixes are spliced into a matrix along a column axisAnd execute thereonL1_normNormalization to obtain a distance dependent intensified attention matrix +.>The calculation formula is as follows:
L1_normaggregating the enhanced remote connections from the global receptive field; similar to the calculation of key matrix, the invention is applied toxIs obtained by applying a deforming operation and a full connection layer to the feature map of each windowMValue matrix of individual windows
Wherein, the parameters of the linear transformation operation of all windows are shared; at the futureMThe value matrices of the windows are spliced together to form a value matrixThen, use matrixA '' Pair matrixVWeighted summation is performed to reconstruct a representation of the features:
reconstructed featuresSThe captured global context strengthens some meaningful remote dependencies with low relevance;
finally, the invention will matrixDeformation into tensor->And adds it to the input feature map to calculate the output feature map of the global augmentation moduleF The calculation process is as follows:
wherein, GELUthe units of the gaussian error line are indicated,BNrepresenting a batch normalization operation; the module distracts the individual windows and builds an enhanced global aspect using cross-window interactionsThe following shows that the capability of the network to learn the overall information of the vehicle is improved.
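The window-wise softmax followed by cross-window $L1$ normalization can be sketched in NumPy (an illustration only; the identity query projection, tiny shapes $C=4$, $H=4$, $W=6$, $h=2$, $w=3$, and random inputs are assumptions, and the learned FC layers are omitted):

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C, H, W, h, w = 4, 4, 6, 2, 3
M, N = (H // h) * (W // w), h * w             # M windows of N pixels each

x = rng.standard_normal((H * W, C))           # flattened feature map
Q = x.copy()                                  # identity FC for the sketch

# Partition the HW pixel indices into M non-overlapping h*w windows
win_idx = (np.arange(H * W).reshape(H, W)
             .reshape(H // h, h, W // w, w)
             .transpose(0, 2, 1, 3).reshape(M, N))

# Per-window affinities and window-wise softmax over each window's N pixels
R = np.stack([Q @ x[idx].T for idx in win_idx])   # (M, HW, N)
A_w = softmax(R, axis=-1)

# Cross-window interaction: concatenate along columns, then L1-normalize rows
A = A_w.transpose(1, 0, 2).reshape(H * W, M * N)  # (HW, HW)
A1 = A / np.abs(A).sum(axis=-1, keepdims=True)

S = A1 @ x                                        # reconstructed (HW, C) features

assert A.shape == (H * W, H * W)
assert np.allclose(A.sum(axis=-1), M)   # each window's softmax sums to 1
assert np.allclose(A1.sum(axis=-1), 1.0)
```

Because each row of `A` contains $M$ independent softmax distributions, the $L1$ normalization is what turns them into a single global distribution over all $HW$ pixels.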
Each of the three branches employs a global average pooling operation and a dimension reduction module to generate a feature representation of the input vehicle image. For any feature map (or sub-feature map) output by a branch, a global average pooling operation is used to obtain a 2048-dimensional feature vector, which a dimension reduction module consisting of a 1×1 convolution, BN, and a ReLU activation function then compresses to 256 dimensions. The 256-dimensional feature vector is used for the computation of the triplet loss, and is fed into a fully connected layer, whose number of output neurons equals the number of vehicle identities in the training set, for the computation of the cross-entropy loss. In the test stage, the 256-dimensional feature vectors output by all branches are concatenated together as the feature embedding of the input image.
To prevent model overfitting and improve the recognition capability of the network, the invention uses the cross-entropy loss and the triplet loss, which are widely used in re-identification tasks, as its loss functions: the cross-entropy loss is used for classification, and the triplet loss is used for metric learning in the training stage.
The cross-entropy loss is commonly used for classification problems; it measures the difference between the true probability distribution and the predicted probability distribution, and training reduces its value as far as possible to improve the predictions of the model. Cross entropy is typically combined with $softmax$: $softmax$ maps the network output into a probability distribution over the classes, so that the predicted probabilities sum to 1, and the cross entropy is then used to compute the loss:

$$L_{ce} = -\sum_{i=1}^{N} \mathbb{1}[i = y] \log p_i = -\log p_y$$

where $N$ denotes the number of vehicle identities in the training set, $y$ denotes the real identity label of the image input to the network, and $p_i$ is the probability that the input image belongs to the $i$-th vehicle.
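As a small check of the formula, a NumPy sketch (illustrative logits only; not the patent's training code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, y):
    """-log p_y, where p = softmax(logits) and y is the true identity index."""
    p = softmax(logits)
    return -np.log(p[y])

logits = np.array([2.0, 0.5, -1.0])   # scores over N = 3 identities

# The loss is small when the true class has the largest score
assert cross_entropy(logits, 0) < cross_entropy(logits, 2)

# Uniform logits over two classes give exactly log(2)
assert np.isclose(cross_entropy(np.array([0.0, 0.0]), 0), np.log(2.0))
```

Minimizing this quantity pushes the predicted probability of the true identity toward 1.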
The triplet loss operates on three inputs: an anchor, a positive sample, and a negative sample. Its goal is to minimize the distance between the anchor and positive samples sharing the same identity, and to maximize the distance between the anchor and negative samples with different identities. When two inputs are very similar, it can learn a better representation of the small differences between the two input vectors and thus distinguish fine details. Through continuous learning, vehicles with the same ID are eventually gathered together in the feature space, completing the task of vehicle re-identification. The triplet loss is calculated as:

$$L_{tri} = \Big[\, \alpha + \max_i \big\| f_a - f_p^{(i)} \big\|_2 - \min_j \big\| f_a - f_n^{(j)} \big\|_2 \,\Big]_+$$

where $\alpha$ is a margin hyperparameter controlling the difference between the anchor-positive and anchor-negative distances, and $f_a$, $f_p^{(i)}$, $f_n^{(j)}$ are the features extracted from the anchor, the positive samples, and the negative samples, respectively. The $\max$ and $\min$ functions select the hardest positive and negative pairs, i.e. the farthest positive pair and the closest negative pair.
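The batch-hard selection can be sketched as follows (a minimal NumPy illustration with made-up 2-D embeddings; not the patent's training code):

```python
import numpy as np

def triplet_loss(anchor, positives, negatives, alpha=0.3):
    """Batch-hard triplet loss: hardest (farthest) positive vs.
    hardest (closest) negative, clamped at zero with margin alpha."""
    d_pos = max(np.linalg.norm(anchor - p) for p in positives)
    d_neg = min(np.linalg.norm(anchor - n) for n in negatives)
    return max(d_pos - d_neg + alpha, 0.0)

a = np.array([0.0, 0.0])
pos = [np.array([0.1, 0.0]), np.array([0.2, 0.0])]   # same identity
neg = [np.array([1.0, 0.0]), np.array([2.0, 0.0])]   # different identities

loss = triplet_loss(a, pos, neg, alpha=0.3)
assert np.isclose(loss, 0.0)   # 0.2 - 1.0 + 0.3 < 0, clamped to 0
```

When the closest negative moves inside the margin of the farthest positive, the loss becomes positive and gradients push the embeddings apart.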
The invention adds the cross entropy losses and the triplet losses of the three branches to obtain the final loss; the total loss formula is written as follows:
L = ∑_{i=1}^{N} (L_id + L_triplet)

wherein N represents the number of branches.
The invention provides a local refinement and global reinforcement network for vehicle re-identification, which learns discriminative local and global features of a vehicle through a local refinement module and a global reinforcement module. The local refinement module captures the rich related information contained between adjacent pixels through the interaction of each target pixel with its nearest neighbors, thereby refining the local representation. The global reinforcement module first distributes the attention scores of each target pixel into the individual windows to emphasize the important long-range dependencies within each region, and then aggregates globally meaningful long-range connections through cross-window interaction, thereby learning a strengthened global representation. The local refinement module and the global reinforcement module cooperate with each other, so that the discriminative local information and the discriminative global information of the vehicle can be effectively extracted.
The above embodiments are merely examples of the present invention, and the scope of the present invention is not limited to the above embodiments, and any suitable changes or modifications made by those skilled in the art, which are consistent with the claims of the present invention, shall fall within the scope of the present invention.

Claims (8)

1. A local refinement and global reinforcement network for vehicle re-identification, characterized by: a vehicle image is taken as input, the residual blocks before res_conv4_2 of ResNet-50 are used as the backbone for feature extraction, and the part subsequent to the res_conv4_1 residual block is divided into three branches: GL Branch, GS Branch and LR Branch; the downsampling operation of the res_conv5_1 residual block is removed in all three branches;
GL Branch, which contains no attention module, is used to learn the general information of the vehicle as a whole;
a global reinforcement module is added after the res_conv5 layer of GS Branch to learn a strengthened global representation of the vehicle;
a local refinement module is applied after the res_conv5 layer of LR Branch to learn a refined local representation of the vehicle;
the structure of the local refinement module is as follows:
a feature map x ∈ R^(C×H×W) is the input to the module, wherein C, H and W respectively denote the number of channels, the height and the width of the feature map; a 1*1 convolution with 3C output channels is applied to x to obtain a query tensor x_q ∈ R^(C×H×W), a key tensor x_k ∈ R^(C×H×W) and a value tensor x_v ∈ R^(C×H×W);
let the query of the i-th pixel in x be q_i ∈ R^(1×C), i.e. the feature vector of x_q at position i; the set of keys in the k×k neighborhood of the i-th pixel is denoted k_i ∈ R^(k²×C), i.e. the k² feature vectors of x_k at the positions closest to position i;
the matrix product of q_i and the transpose of k_i is calculated and softmax normalization is performed to obtain the attention weight vector A_i ∈ R^(1×k²), with the formula:

A_i = softmax(q_i ⊗ k_i^T)

wherein ⊗ represents matrix multiplication; the j-th element of the attention weight vector represents the pairwise affinity between the i-th pixel and the j-th pixel in its k×k neighborhood; then, the feature vectors of x_v in the k×k neighborhood of position i are extracted and denoted v_i ∈ R^(k²×C), representing the values of the k² nearest neighbors of the i-th pixel; finally, v_i is aggregated according to the attention score A_i to capture the local context of the i-th pixel and reconstruct its characterization, yielding x'_i ∈ R^(1×C); the calculation process is expressed as follows:

x'_i = A_i ⊗ v_i
obtaining a local refinement module output feature map of the vehicle image;
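A minimal NumPy sketch of this per-pixel neighborhood attention is given below; the random matrices stand in for the learned 1*1 convolution, zero padding handles border pixels, and the BN/GELU output stage of the module is omitted (all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_refinement(x, k=3):
    # x: (C, H, W) feature map; a single 1*1 conv with 3C output channels
    # is equivalent to three C->C linear maps applied per pixel.
    C, H, W = x.shape
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    flat = x.reshape(C, -1).T                      # (H*W, C), one row per pixel
    q, kk, v = flat @ Wq, flat @ Wk, flat @ Wv
    qm = q.T.reshape(C, H, W)
    km = kk.T.reshape(C, H, W)
    vm = v.T.reshape(C, H, W)
    pad = k // 2
    kp = np.pad(km, ((0, 0), (pad, pad), (pad, pad)))
    vp = np.pad(vm, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(qm)
    for i in range(H):
        for j in range(W):
            # Keys/values of the k*k nearest neighbours of pixel (i, j): (k^2, C).
            ki = kp[:, i:i + k, j:j + k].reshape(C, -1).T
            vi = vp[:, i:i + k, j:j + k].reshape(C, -1).T
            a = qm[:, i, j] @ ki.T                 # (k^2,) pairwise affinities
            a = np.exp(a - a.max())
            a /= a.sum()                           # softmax -> attention weights A_i
            out[:, i, j] = a @ vi                  # aggregate values: x'_i = A_i v_i
    return x + out                                 # residual connection (BN/GELU omitted)
```

The double loop makes the per-pixel computation explicit; the unfold-based formulation of claim 2 performs the same computation with batched tensor operations.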
the structure of the global reinforcement module is as follows:
a feature map x ∈ R^(C×H×W) is the input to the global reinforcement module, wherein C, H and W respectively denote the number of channels, the height and the width of the feature map; a reshaping operation and a fully-connected layer are applied to obtain the query matrix Q ∈ R^(HW×C) of x;
the i-th row Q_i ∈ R^(1×C) of the matrix represents the query vector of the i-th pixel; x is evenly divided along the spatial dimension into M = (H/h)·(W/w) windows, wherein h and w are respectively the height and the width of a window; a reshaping operation and a fully-connected layer are applied to the feature map of each window to obtain the key matrices K_1, …, K_M of the M windows;
wherein the key matrix of the j-th window is K_j ∈ R^(C×N), N = h*w being the size of the window, and the linear transformation operations of all windows share the same weights; each column of K_j is a key vector in the j-th window;
Q_i is matrix-multiplied with K_j to obtain the pairwise affinity vector R_(i,j) ∈ R^(1×N) between the target pixel i and the pixels within the j-th window, i.e.

R_(i,j) = Q_i ⊗ K_j

wherein ⊗ represents matrix multiplication; the pairwise affinity matrix R_j ∈ R^(HW×N) of the j-th window for all target pixels is obtained by the matrix multiplication of Q and K_j:

R_j = Q ⊗ K_j
wherein each row of R_j is the pairwise affinities between one target pixel and the pixels within the j-th window; then, a softmax normalization operation is performed on R_j along its last dimension to obtain the attention scores of the pixels of the window at each target pixel, formulated as:

A_j = softmax(R_j)

each row of the attention matrix A_j ∈ R^(HW×N) of the j-th window represents the dependency of one target pixel on all pixels in the window;
by calculation ofMThe attention score of each window at each target pixel results inMAttention matrix of individual windowsThe method comprises the steps of carrying out a first treatment on the surface of the This isMThe matrices are simultaneously calculated as:
wherein, softmaxthe operation is performed in the last dimension; will beMThe attention matrixes are spliced into a matrix along a column axisAnd execute thereonL1_normNormalization is carried out to obtain the attention moment of remote dependence strengtheningMatrix->The calculation formula is as follows:
for a pair ofxIs obtained by applying a deforming operation and a full connection layer to the feature map of each windowMValue matrix of individual windows
Wherein, the parameters of the linear transformation operation of all windows are shared; at the futureMThe value matrices of the windows are spliced together to form a value matrixThen, use matrixA '' Pair matrixVWeighted summation is performed to reconstruct a representation of the features:
finally, matrix is formedDeformation into tensor->And adds it to the input feature map to calculate the output feature map of the global augmentation moduleF The calculation process is as follows:
wherein, GELUthe units of the gaussian error line are indicated,BNrepresenting a batch normalization operation; and obtaining a global strengthening module output characteristic diagram of the vehicle image.
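The window-attention computation of claim 1 can be sketched in NumPy as follows; random matrices stand in for the learned fully-connected layers, and the final BN/GELU stage is omitted (the function name and default window size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def global_strengthen(x, h=2, w=2):
    # x: (C, H, W); H and W must be divisible by the window size h, w.
    C, H, W = x.shape
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    flat = x.reshape(C, H * W).T                   # (H*W, C)
    Q = flat @ Wq                                  # one query per target pixel
    # Partition the spatial grid into M = (H/h)*(W/w) non-overlapping windows.
    wins = (x.reshape(C, H // h, h, W // w, w)
              .transpose(1, 3, 2, 4, 0)
              .reshape(-1, h * w, C))              # (M, N, C) with N = h*w
    A_blocks, V_blocks = [], []
    for win in wins:                               # shared weights across windows
        Kj = win @ Wk                              # (N, C) keys of window j
        Rj = Q @ Kj.T                              # (H*W, N) pairwise affinities
        e = np.exp(Rj - Rj.max(axis=-1, keepdims=True))
        A_blocks.append(e / e.sum(axis=-1, keepdims=True))  # per-window softmax
        V_blocks.append(win @ Wv)
    A = np.concatenate(A_blocks, axis=1)           # (H*W, H*W): spliced attention
    A = A / A.sum(axis=1, keepdims=True)           # L1 normalisation across windows
    V = np.concatenate(V_blocks, axis=0)           # (H*W, C) spliced values
    out = (A @ V).T.reshape(C, H, W)               # reconstruct and reshape
    return x + out                                 # residual (BN/GELU omitted)
```

Each per-window softmax emphasizes the dominant dependencies inside that region, and the L1 normalization across the concatenated windows then balances those regional scores into one global attention distribution per target pixel.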
2. The local refinement and global reinforcement network for vehicle re-identification of claim 1, wherein: in the local refinement module, the computation of the pairwise affinities between each pixel and its k² nearest pixels and the reconstruction of all pixels are carried out through unfold operations and matrix multiplications of tensors; first, x_q is reshaped to obtain a query tensor Q ∈ R^(HW×1×C); this tensor contains HW queries, each of size 1×C; at the same time, an unfold operation with kernel size k*k and stride 1 is performed on x_k to extract the k² keys around each pixel, which are reshaped to obtain the key tensor K ∈ R^(HW×k²×C), wherein the keys corresponding to the nearest neighbors of each pixel are stored in one k²×C matrix; the attention weight tensor A ∈ R^(HW×1×k²), representing the pairwise affinities between each pixel and its k² nearest pixels, is obtained by the matrix multiplication of Q and K^T followed by a softmax normalization operation:

A = softmax(Q ⊗ K^T)

wherein the pairwise affinities between a certain pixel and the pixels in its k×k neighborhood are represented as a 1×k² vector; next, an unfold operation with kernel size k*k and stride 1 is performed on x_v to extract the k² values corresponding to the nearest neighbors of each pixel, which are reshaped to obtain the value tensor V ∈ R^(HW×k²×C), wherein the values of the nearest neighbors of each pixel are stored in one k²×C matrix; finally, the weight vector of each pixel is used to perform a weighted summation over the values corresponding to its surrounding k² pixels, obtaining all the reconstructed pixels X′ ∈ R^(HW×1×C); the calculation process is expressed as follows:

X′ = A ⊗ V

the tensor X′ is reshaped into x′ ∈ R^(C×H×W) and added to the original feature map; BN and GELU operations are performed on the summed feature map to obtain the final output feature map F′, namely:

F′ = GELU(BN(x + x′))
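The unfold operation referenced in claim 2 can be emulated in NumPy as below: a k*k, stride-1 sliding-window extraction with zero padding (this helper is an illustrative stand-in for a framework's built-in unfold, not part of the claim):

```python
import numpy as np

def unfold(x, k=3):
    # Mimics a k*k, stride-1 unfold with zero padding: for every pixel it
    # extracts the k^2 feature vectors of its nearest neighbours.
    C, H, W = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    patches = np.empty((H * W, k * k, C))
    for i in range(H):
        for j in range(W):
            # One k^2 x C matrix per pixel, as described in the claim.
            patches[i * W + j] = xp[:, i:i + k, j:j + k].reshape(C, -1).T
    return patches  # (H*W, k^2, C)
```

Applied to x_k and x_v, this produces exactly the key tensor K and value tensor V of the claim, after which the attention and reconstruction reduce to batched matrix multiplications.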
3. The local refinement and global reinforcement network for vehicle re-identification according to claim 1 or 2, characterized in that: the three branches each employ a global average pooling operation and a dimension reduction module to generate the feature representation of the input vehicle image.
4. A local refinement and global reinforcement network for vehicle re-identification as claimed in claim 3, wherein: for any feature map output by a branch, a global average pooling operation is used to obtain a 2048-dimensional feature vector, which is then further compressed to 256 dimensions by a dimension reduction module consisting of a 1*1 convolution, BN and a ReLU activation function.
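A NumPy sketch of the pooling and dimension reduction described in claims 3 and 4; the random projection stands in for the learned 1*1 convolution, and a simple standardization stands in for BN (both are illustrative assumptions, not the trained module):

```python
import numpy as np

rng = np.random.default_rng(0)

def gap_and_reduce(feat, out_dim=256):
    # feat: (C, H, W) feature map output by a branch (C = 2048 here).
    C = feat.shape[0]
    v = feat.mean(axis=(1, 2))                     # global average pooling -> (C,)
    Wr = rng.standard_normal((C, out_dim)) * 0.02  # a 1*1 conv on a pooled vector == linear map
    z = v @ Wr                                     # compress 2048 -> 256 dimensions
    z = (z - z.mean()) / (z.std() + 1e-5)          # stand-in for BN (learned affine at inference)
    return np.maximum(z, 0.0)                      # ReLU activation

vec = gap_and_reduce(rng.standard_normal((2048, 4, 4)))
```

The resulting 256-dimensional vector is the per-branch feature representation that the losses of claims 5-8 operate on.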
5. The local refinement and global reinforcement network for vehicle re-identification of claim 4, wherein: the 256-dimensional feature vector is used for the computation of the triplet loss, and is passed through a fully-connected layer, whose number of output neurons equals the number of vehicles in the training set, for the computation of the cross entropy loss.
6. The local refinement and global reinforcement network for vehicle re-identification of claim 5, wherein: the cross entropy loss calculation formula is as follows:

L_id = -∑_{i=1}^{N} q_i · log(p_i), where q_i = 1 if i = y and q_i = 0 otherwise

wherein N denotes the number of vehicles in the training set, y denotes the real identity label of the image input to the network, and p_i is the probability that the input image belongs to the i-th vehicle.
7. The local refinement and global reinforcement network for vehicle re-identification of claim 5, wherein: the triplet loss calculation formula is as follows:

L_triplet = max(0, α + max ‖f_a(i) − f_p(i)‖_2 − min ‖f_a(i) − f_n(j)‖_2)

wherein α is a hyperparameter controlling the margin between the anchor-positive distance and the anchor-negative distance, and f_a(i), f_p(i) and f_n(j) are the features extracted from the anchor, the positive sample and the negative sample, respectively.
8. The local refinement and global reinforcement network for vehicle re-identification of claim 5, wherein: the cross entropy losses and the triplet losses of the three branches are added to obtain the final loss, and the total loss is calculated as follows:

L = ∑_{i=1}^{N} (L_id + L_triplet)

wherein N represents the number of branches, L_id represents the cross entropy loss, and L_triplet represents the triplet loss.
CN202310926540.1A 2023-07-27 2023-07-27 Local refinement and global reinforcement network for vehicle re-identification Active CN116644788B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310926540.1A CN116644788B (en) 2023-07-27 2023-07-27 Local refinement and global reinforcement network for vehicle re-identification


Publications (2)

Publication Number Publication Date
CN116644788A CN116644788A (en) 2023-08-25
CN116644788B true CN116644788B (en) 2023-10-03

Family

ID=87640396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310926540.1A Active CN116644788B (en) 2023-07-27 2023-07-27 Local refinement and global reinforcement network for vehicle re-identification

Country Status (1)

Country Link
CN (1) CN116644788B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018050259A (en) * 2016-09-23 2018-03-29 富士通株式会社 Device, method and program for noise reduction
CN111460914A (en) * 2020-03-13 2020-07-28 华南理工大学 Pedestrian re-identification method based on global and local fine-grained features
WO2020257812A2 (en) * 2020-09-16 2020-12-24 Google Llc Modeling dependencies with global self-attention neural networks
CN112766353A (en) * 2021-01-13 2021-05-07 南京信息工程大学 Double-branch vehicle re-identification method for enhancing local attention
CN113408492A (en) * 2021-07-23 2021-09-17 四川大学 Pedestrian re-identification method based on global-local feature dynamic alignment
CN114119975A (en) * 2021-11-25 2022-03-01 中国人民公安大学 Language-guided cross-modal instance segmentation method
CN114821249A (en) * 2022-07-04 2022-07-29 山东交通学院 Vehicle weight recognition method based on grouping aggregation attention and local relation
CA3166088A1 (en) * 2021-06-29 2022-12-29 10353744 Canada Ltd. Training method and pedestrian re-identification method of multi-task classification network
DE102022128465A1 (en) * 2021-11-05 2023-05-11 Nvidia Corporation NOVEL PROCEDURE FOR TRAINING A NEURAL NETWORK

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10290085B2 (en) * 2016-12-14 2019-05-14 Adobe Inc. Image hole filling that accounts for global structure and local texture
US20220415027A1 (en) * 2021-06-29 2022-12-29 Shandong Jianzhu University Method for re-recognizing object image based on multi-feature information capture and correlation analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Vehicle Re-Identification Based on Global Relational Attention and Multi-Granularity Feature Learning; Xin Tian; IEEE Access; full text *
Research Progress of Residual Neural Network Optimization Algorithms for Disease Diagnosis in Medical Imaging; Zhou Tao; Huo Bingqiang; Lu Huiling; Shi Hongbin; Journal of Image and Graphics (No. 10), 2020; full text *

Also Published As

Publication number Publication date
CN116644788A (en) 2023-08-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant