CN112766353B - Double-branch vehicle re-identification method for strengthening local attention - Google Patents


Info

Publication number: CN112766353B
Authority: CN (China)
Prior art keywords: vehicle, image, branch
Prior art date: 2021-01-13
Legal status: Active
Application number: CN202110040859.5A
Other languages: Chinese (zh)
Other versions: CN112766353A
Inventors: 张小瑞, 陈旋, 孙伟, 宋爱国
Current Assignee: Nanjing University of Information Science and Technology
Original Assignee: Nanjing University of Information Science and Technology
Priority date: 2021-01-13
Filing date: 2021-01-13
Publication date: 2023-07-21
Application filed by Nanjing University of Information Science and Technology on 2021-01-13
Priority to CN202110040859.5A
Publication of CN112766353A: 2021-05-07
Application granted; publication of CN112766353B: 2023-07-21


Classifications

    • G06F18/22: Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F16/583: Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253: Fusion techniques of extracted features
    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06V2201/08: Indexing scheme relating to image or video recognition or understanding; detecting or categorising vehicles
    • Y02T10/40: Climate change mitigation technologies related to transportation; engine management systems

Abstract

The invention discloses a double-branch vehicle re-identification method for strengthening local attention, which comprises the following steps: (1) pre-training the ResNet50 network and setting the stride of its last downsampling layer to 1; (2) constructing the upper branch from Layer3 and Layer4 of ResNet50; (3) uniformly dividing the features extracted by Layer4 into three parts along the vertical direction, performing a random discarding operation on the pixels of each part, and constructing the lower branch; (4) training the double-branch model with triplet loss and focal loss; (5) extracting the image features of the query image and the gallery with the trained network model; (6) calculating the similarity between the query vehicle image and the gallery images, and returning the most similar vehicle images in the gallery. The invention provides a double-branch vehicle re-identification method in which the upper branch extracts global features of the vehicle and the lower branch strengthens attention to local features, increasing the distinctiveness and recognizability of vehicle image features; the method is suitable for cross-camera vehicle re-identification in complex traffic scenes and improves vehicle re-identification efficiency.

Description

Double-branch vehicle re-identification method for strengthening local attention
Technical Field
The present invention relates to a vehicle re-identification method, and more particularly to a double-branch vehicle re-identification method for strengthening local attention.
Background
In recent years, vehicle re-identification has received more and more attention. The technology can be widely applied in fields such as video surveillance and intelligent transportation; in particular, when license plates are occluded, removed, or even forged, vehicle re-identification becomes the only way for traffic departments to find fleeing vehicles. Vehicle re-identification can be understood as a sub-problem of image retrieval whose purpose is to detect and track a target vehicle across camera devices, i.e., given a vehicle image, to retrieve images of the monitored vehicle across devices.
Vehicle re-identification is a challenging computer vision task. Under uncontrolled illumination, viewing angles, low resolution, and complex backgrounds, the same vehicle exhibits large variations in visual appearance across different camera viewpoints, while different vehicles of the same model share the same color and similar model characteristics and therefore exhibit obvious inter-class similarity. To address these challenges, most current work adopts deep learning methods to automatically extract vehicle image features. Many methods extract global features of the vehicle through a convolutional network, but global features easily cause misjudgment when the pose changes greatly or the vehicle is occluded. Some studies have therefore begun to extract local features by labeling key points, local regions, etc., but these methods require a tremendous labeling effort.
Disclosure of Invention
The invention aims to: provide a double-branch vehicle re-identification method for strengthening local attention, which not only extracts global features but also focuses on local features through a simple and effective method, avoids the need for advance labeling and a huge amount of computation, and realizes efficient and discriminative vehicle re-identification.
The technical scheme is as follows: the invention provides a double-branch vehicle re-identification method for strengthening local attention, comprising the following steps:
(1) Pre-training the ResNet50 network and setting the stride of its last downsampling layer to 1;
(2) Constructing the upper branch from Layer3 and Layer4 of ResNet50;
(3) Uniformly dividing the features extracted by Layer4 into three parts along the vertical direction, randomly discarding a region of each part, and constructing the lower branch;
(4) Training the double-branch model with triplet loss and focal loss;
(5) Extracting the vehicle image features of the query image and the gallery with the trained network model;
(6) Calculating the similarity between the query vehicle image and the gallery images, and returning the most similar vehicle images in the gallery.
Further, in step (2), the features obtained from Layer3 and Layer4 are denoted X3 and X4, respectively; global average pooling and global max pooling are applied to X3 to obtain X3-avg and X3-max, respectively, which are added together and fed into fully connected layers; likewise, global average pooling and global max pooling are applied to X4 to obtain X4-avg and X4-max, respectively, which are added together and fed into fully connected layers.
Further, in step (3), the features extracted by Layer4 are uniformly divided into three parts, top, middle, and bottom, along the vertical direction; each part is multiplied by a Mask matrix of the same size so that a region of each part is discarded, the height and width of the discarded region being adjustable according to the training effect; global max pooling is then applied to the discarded top, middle, and bottom parts before they are fed into fully connected layers.
Further, in step (4), the focal loss L_foc is calculated as:
L_foc = -∑_{i=1}^{M} y_i (1 - q_i)^γ log(q_i)
where M represents the number of vehicle categories in the gallery, q_i represents the predicted probability that an input picture belongs to vehicle category i (i ∈ {1, 2, 3, ..., M}) after its features are obtained through the network, γ is a hyperparameter greater than 0, and y_i represents the actual label of the input sample vehicle.
Further, in step (4), three paired images are input each time, comprising a fixed (anchor) image a, a positive sample p belonging to the same vehicle as a, and a negative sample n belonging to a different vehicle from a; the triplet loss L_tri is calculated as:
L_tri = max(d_{a,p} - d_{a,n} + margin, 0)
where d_{a,p} is the Euclidean distance between the feature vectors of a and p obtained through the network, d_{a,n} is the Euclidean distance between the feature vectors of a and n obtained through the network, and margin is a training threshold parameter set according to actual requirements.
Further, in step (4), the joint loss L of the triplet loss and the focal loss is calculated as:
L = α(L_foc1 + L_foc2 + L_foc3 + L_foc4 + L_foc5) + β(L_tri1 + L_tri2 + L_tri3 + L_tri4)
where L_foc1, L_foc2, L_foc3, L_foc4, and L_foc5 are the focal losses calculated from the corresponding features, L_tri1, L_tri2, L_tri3, and L_tri4 are the triplet losses calculated from the corresponding features, and α and β are the weighting coefficients of the focal loss and the triplet loss.
Further, in step (6), the similarity between the query image and the gallery vehicle images is calculated with the cosine distance; the cosine distance c is calculated as:
c = (feature_1 · feature_2) / (||feature_1||_2 ||feature_2||_2)
where feature_1 is the query image feature, feature_2 is a gallery vehicle image feature, · denotes the dot product with elements multiplied one by one and summed, and || ||_2 denotes the two-norm;
the gallery vehicle images are ranked according to similarity, and the most similar vehicle images are returned.
The beneficial effects are that: compared with the prior art, the invention has the following remarkable advantages:
(1) Multiple features of the vehicle image are extracted with the double-branch model: representative global features are extracted while the model is also made to focus on local features;
(2) A region of each part of each batch of images is randomly discarded, so neither advance labeling nor a huge amount of computation is needed, and the model's attention to local features is strengthened;
(3) A joint loss of hard triplet loss and focal loss is adopted to optimize the network, so the model pays more attention to hard samples during training and its ability to discriminate hard samples is enhanced.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the double-branch network structure of the present invention;
FIG. 3 is a schematic diagram of the random discarding model of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
The invention adopts a double-branch network: the upper branch extracts the global features of the vehicle image, capturing the overall structural information of the vehicle, while the lower branch divides the features into blocks and applies a random discarding strategy, paying more attention to local features of the vehicle, so that the double-branch model has stronger discrimination ability. A joint loss of triplet loss and focal loss is adopted, so the model pays more attention to hard samples during training and its ability to identify them is enhanced.
As shown in fig. 1, which is a flowchart of the present invention, the detailed steps are as follows:
(1) The ResNet50 network is pre-trained and the stride of its last downsampling layer is set to 1.
The ResNet50 network is trained on the ImageNet dataset so that it starts from initial parameters, and the stride of its last downsampling layer is set to 1 so that the final feature maps retain a higher spatial resolution.
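As an illustration, the last-stride modification takes only a few lines; the following is a minimal sketch assuming PyTorch/torchvision (the patent does not name a framework), where the attribute names follow torchvision's ResNet50 Bottleneck layout:

```python
# Minimal sketch (assumption: torchvision >= 0.13): load an ImageNet-pretrained
# ResNet50 and set the stride of its last downsampling stage to 1, so Layer4
# keeps a larger spatial resolution (e.g. 24x20 for a 384x320 input).
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")

# The first block of layer4 downsamples with stride 2 in both its 3x3 conv
# and its 1x1 shortcut conv; set both strides to 1.
backbone.layer4[0].conv2.stride = (1, 1)
backbone.layer4[0].downsample[0].stride = (1, 1)
```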
(2) The upper branch is built from Layer3 and Layer4 of ResNet50, as shown in the upper half of the box in FIG. 2.
The training batch size is P×K, where P is the number of vehicle identities in each batch and K is the number of pictures of each vehicle. P×K vehicle images are input at a time, resized to 384×320 pixels. The features obtained from Layer3 and Layer4 of ResNet50 are denoted X3 and X4, respectively. Global average pooling (GAP) and global max pooling (GMP) are applied to X3 to obtain X3-avg and X3-max, which are added together to obtain the feature f_0 of size [P×K, 1024]; two fully connected layers then follow, yielding in turn the features f_1 and f_2 of sizes [P×K, 512] and [P×K, M], where M represents the number of vehicle categories in the gallery. Similarly, the same operations applied to X4 yield X4-avg and X4-max, whose addition gives the feature f_3 of size [P×K, 2048], which is then fed into fully connected layers to obtain the features f_4 and f_5 of sizes [P×K, 512] and [P×K, M]. During training, X3-avg, X3-max, X4-avg, and X4-max are used to calculate the triplet losses L_tri1, L_tri2, L_tri3, and L_tri4, while f_2 and f_5 are used to calculate the focal losses L_foc1 and L_foc2. By using the features of both Layer3 and Layer4, the upper branch obtains richer vehicle representations, and by fusing the multi-scale pooled features of GAP and GMP it strengthens the global vehicle representation, extracting global features of the whole vehicle image.
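The upper-branch head described above can be sketched as follows. This is a hedged illustration, not the patent's own code: the module name UpperBranchHead is a placeholder, and the class count of 576 is an assumption (the number of training identities in VeRi-776), not a value given in the patent.

```python
# Sketch of one upper-branch head: GAP and GMP on a Layer3/Layer4 feature map,
# element-wise addition of the two pooled vectors, then two fully connected
# layers down to the M identity logits.
import torch
import torch.nn as nn

class UpperBranchHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gmp = nn.AdaptiveMaxPool2d(1)
        self.fc1 = nn.Linear(in_channels, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):                      # x: [P*K, C, H, W]
        x_avg = self.gap(x).flatten(1)         # e.g. X3-avg, [P*K, C]
        x_max = self.gmp(x).flatten(1)         # e.g. X3-max, [P*K, C]
        f0 = x_avg + x_max                     # addition ("superposition")
        f1 = self.fc1(f0)                      # [P*K, 512]
        f2 = self.fc2(f1)                      # [P*K, M], used for focal loss
        return x_avg, x_max, f2                # pooled features feed L_tri

# One head per stage: Layer3 (C=1024) and Layer4 (C=2048).
head3 = UpperBranchHead(1024, num_classes=576)
head4 = UpperBranchHead(2048, num_classes=576)
```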
(3) The features extracted by Layer4 are uniformly divided into three parts along the vertical direction, a region of each part is randomly discarded, and the lower branch is built as shown in the lower half of the box in FIG. 2.
Considering that in most vehicle images the top part mostly shows the roof, the middle part the vehicle body, and the bottom part more of the chassis and tires, the features extracted by Layer4 are uniformly divided along the vertical direction into three parts, top, middle, and bottom, each of size [P×K, 2048, 8, 20]. For each part there is a Mask of the same size [P×K, 2048, 8, 20]. The Mask is equivalent to a random matrix: after multiplication, every image of the same batch discards the same region of each part, and the height and width of the discarded region can differ between tasks. Specifically, one region of the Mask is set to 0 and the rest is set to 1. top, middle, bottom and their corresponding Masks thus randomly discard the same region from each of the three parts of the training images of the same batch; as shown in FIG. 3, two feature maps from the same batch are uniformly divided into three parts and the same region is discarded from each part, the crossed areas in the figure indicating the discarded regions. Global max pooling (GMP) is then applied to each part, yielding the features f_6, f_7, f_8 of size [P×K, 2048]; these are reduced in dimension to obtain the features f_9, f_10, f_11 of size [P×K, 1024]; finally, fully connected layers are added to obtain the features f_12, f_13, f_14 of size [P×K, M], which are used to calculate the focal losses L_foc3, L_foc4, L_foc5, respectively. By dividing the features into three parts and discarding the same region of each part across the batch, the lower branch strengthens the learning of regional features and in turn the attention paid to each part.
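A minimal sketch of this batch-level random discarding, under the feature sizes stated above ([P×K, 2048, 24, 20] split into three [P×K, 2048, 8, 20] parts); the drop height and width below are illustrative hyperparameters, since the patent says only that they are adjusted according to the training effect:

```python
# Split the Layer4 map into top/middle/bottom thirds along the height and
# zero out one randomly chosen region that is shared by the whole batch.
import torch

def batch_random_discard(x4, drop_h=4, drop_w=10):
    # x4: [P*K, 2048, 24, 20] -> three parts of [P*K, 2048, 8, 20]
    parts = torch.chunk(x4, 3, dim=2)
    dropped = []
    for part in parts:
        _, _, h, w = part.shape
        mask = part.new_ones(1, 1, h, w)       # broadcast over batch/channels
        top = torch.randint(0, h - drop_h + 1, (1,)).item()
        left = torch.randint(0, w - drop_w + 1, (1,)).item()
        mask[:, :, top:top + drop_h, left:left + drop_w] = 0
        dropped.append(part * mask)            # same region dropped batch-wide
    return dropped                             # [top, middle, bottom]
```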
(4) The double-branch model is trained with hard triplet loss and focal loss.
The double-branch network is optimized by the joint loss L of the focal loss L_foc and the triplet loss L_tri.
The focal loss L_foc is calculated as:
L_foc = -∑_{i=1}^{M} y_i (1 - q_i)^γ log(q_i)
where M represents the number of vehicle categories in the gallery, q_i represents the predicted probability that an input picture belongs to vehicle category i (i ∈ {1, 2, 3, ..., M}) after its features are obtained through the network, γ is a hyperparameter greater than 0, and y_i represents the actual label of the input sample vehicle.
The (1 - q_i)^γ term amplifies the weight of hard samples in the total loss; a hard sample is one that is difficult to classify, and an easy sample is one that is easy to classify. For an easy sample, i.e. one with a large q_i, the modulation factor (1 - q_i)^γ is small; conversely, for a hard sample, i.e. one with a small q_i, the modulation factor is large. In this way the loss of hard samples is amplified during training, so the model pays more attention to them; this alleviates the problem in vehicle re-identification that a large number of easy samples dilute the overall loss, and improves the model's ability to discriminate hard samples.
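For concreteness, the focal loss above can be written as follows for integer labels; γ = 2.0 is a common choice in the focal-loss literature, not a value fixed by the patent:

```python
# Sketch of L_foc = -sum_i y_i * (1 - q_i)^gamma * log(q_i) with one-hot y.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: [P*K, M] identity scores; targets: [P*K] integer labels
    log_q = F.log_softmax(logits, dim=1)
    q = log_q.exp()
    log_q_true = log_q.gather(1, targets.unsqueeze(1)).squeeze(1)  # log q_i at y_i
    q_true = q.gather(1, targets.unsqueeze(1)).squeeze(1)          # q_i at y_i
    return (-(1.0 - q_true) ** gamma * log_q_true).mean()
```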
Three paired images are input at a time: each triplet comprises a fixed image (anchor) a, a positive sample (positive) p belonging to the same vehicle as a, and a negative sample (negative) n belonging to a different vehicle from a. The triplet loss L_tri is calculated as:
L_tri = max(d_{a,p} - d_{a,n} + margin, 0)
where d_{a,p} is the Euclidean distance between the feature vectors of a and p obtained through the network, d_{a,n} is the Euclidean distance between the feature vectors of a and n obtained through the network, and margin is a training threshold parameter set according to actual requirements.
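A sketch of this loss on a batch of pooled features follows. The batch-hard mining of the hardest positive and negative matches the "hard triplet" wording used elsewhere in the description, though the patent states only the max(d_{a,p} - d_{a,n} + margin, 0) form; margin = 0.3 is an illustrative value.

```python
# Batch-hard triplet loss: for each anchor, take the farthest positive and
# the nearest negative within the P*K batch.
import torch

def batch_hard_triplet_loss(features, labels, margin=0.3):
    # features: [P*K, D] pooled features; labels: [P*K] identity labels
    dist = torch.cdist(features, features)             # pairwise Euclidean
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # [P*K, P*K] bool
    d_ap = dist.masked_fill(~same, float("-inf")).max(dim=1).values  # hardest positive
    d_an = dist.masked_fill(same, float("inf")).min(dim=1).values    # hardest negative
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```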
The joint loss L is calculated as:
L = α(L_foc1 + L_foc2 + L_foc3 + L_foc4 + L_foc5) + β(L_tri1 + L_tri2 + L_tri3 + L_tri4)
where L_foc1, L_foc2, L_foc3, L_foc4, and L_foc5 are the focal losses calculated from the features f_2, f_5, f_12, f_13, and f_14, respectively, L_tri1, L_tri2, L_tri3, and L_tri4 are the triplet losses calculated from the features X3-avg, X3-max, X4-avg, and X4-max, and α and β are the weighting coefficients of the focal loss and the triplet loss.
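Combining the pieces, a hedged sketch of the joint objective, reusing the focal_loss and batch_hard_triplet_loss sketches above; α = β = 1.0 are placeholder weights, since the patent leaves the coefficients as tunable:

```python
# Five focal-loss terms on the classification features and four triplet-loss
# terms on the pooled global features, weighted by alpha and beta.
def joint_loss(foc_logits, tri_feats, labels, alpha=1.0, beta=1.0):
    # foc_logits: [f2, f5, f12, f13, f14]; tri_feats: [X3-avg, X3-max, X4-avg, X4-max]
    l_foc = sum(focal_loss(z, labels) for z in foc_logits)
    l_tri = sum(batch_hard_triplet_loss(f, labels) for f in tri_feats)
    return alpha * l_foc + beta * l_tri
```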
(5) The vehicle image features of the query image and the gallery are extracted with the trained network model.
The trained double-branch network model is used to obtain the query image feature feature_1 and the gallery image features feature_2, i.e., the features of the image to be queried and of the gallery images.
(6) The similarity between the query image and the gallery images is calculated with the cosine distance, and the gallery images are ranked by similarity.
The cosine distance c is calculated as:
c = (feature_1 · feature_2) / (||feature_1||_2 ||feature_2||_2)
where feature_1 is the query image feature, feature_2 is a gallery vehicle image feature, · denotes the dot product with elements multiplied one by one and summed, and || ||_2 denotes the two-norm.
The gallery images are ranked according to similarity, and the most similar vehicle images are returned.
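A short sketch of this retrieval step; top_k = 10 is an illustrative cutoff, not a value given in the patent:

```python
# Cosine similarity between the query feature and every gallery feature,
# then a ranking of gallery indices from most to least similar.
import torch
import torch.nn.functional as F

def rank_gallery(query_feat, gallery_feats, top_k=10):
    # query_feat: [D]; gallery_feats: [N, D]
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)
    return torch.topk(sims, k=min(top_k, gallery_feats.size(0))).indices
```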
In summary, the vehicle re-identification method disclosed by the invention uses a double-branch model: the upper branch extracts global features of the whole vehicle, and the lower branch improves the model's attention to local features with a simple and efficient random discarding method. A joint loss of triplet loss and focal loss optimizes the network, so the model pays more attention to hard samples during training and its ability to discriminate them is enhanced.
Aspects of the embodiments of the present invention that are not described in detail are known in the art and may be implemented with reference to the prior art.

Claims (6)

1. A double-branch vehicle re-identification method for strengthening local attention, comprising the following steps:
(1) pre-training the ResNet50 network and setting the stride of its last downsampling layer to 1;
(2) constructing the upper branch from Layer3 and Layer4 of ResNet50;
(3) uniformly dividing the features extracted by Layer4 into three parts along the vertical direction, randomly discarding a region of each part, and constructing the lower branch;
(4) training the double-branch model with triplet loss and focal loss;
(5) extracting the vehicle image features of the query image and the gallery with the trained network model;
(6) calculating the similarity between the query vehicle image and the gallery vehicle images, and returning the most similar vehicle images in the gallery;
wherein in step (2), the features obtained from Layer3 and Layer4 are denoted X3 and X4, respectively; global average pooling and global max pooling are applied to X3 to obtain X3-avg and X3-max, respectively, which are added together to obtain the feature f_0 and then fed into two fully connected layers to obtain the features f_1 and f_2; similarly, global average pooling and global max pooling are applied to X4 to obtain X4-avg and X4-max, respectively, which are added together to obtain the feature f_3 and then fed into two fully connected layers to obtain the features f_4 and f_5.
2. The double-branch vehicle re-identification method for strengthening local attention according to claim 1, wherein in step (3), the features extracted by Layer4 are uniformly divided into three parts, top, middle, and bottom, along the vertical direction; each part is multiplied by a Mask matrix of the same size so that a region of each part is discarded, the height and width of the discarded region being adjusted according to the training effect; global max pooling is applied to the discarded top, middle, and bottom parts to obtain the features f_6, f_7, f_8, which are reduced in dimension to obtain the features f_9, f_10, f_11; finally, fully connected layers are added to obtain the features f_12, f_13, f_14.
3. The double-branch vehicle re-identification method for strengthening local attention according to claim 1, wherein in step (4), the focal loss L_foc is calculated as:
L_foc = -∑_{i=1}^{M} y_i (1 - q_i)^γ log(q_i)
where M represents the number of vehicle categories in the gallery, q_i represents the predicted probability that an input picture belongs to vehicle category i (i ∈ {1, 2, 3, ..., M}) after its features are obtained through the network, γ is a hyperparameter greater than 0, and y_i represents the actual label of the input sample vehicle.
4. The double-branch vehicle re-identification method for strengthening local attention according to claim 1, wherein in step (4), three paired images are input each time, comprising a fixed (anchor) image a, a positive sample p belonging to the same vehicle as a, and a negative sample n belonging to a different vehicle from a; the triplet loss L_tri is calculated as:
L_tri = max(d_{a,p} - d_{a,n} + margin, 0)
where d_{a,p} is the Euclidean distance between the feature vectors of a and p obtained through the network, d_{a,n} is the Euclidean distance between the feature vectors of a and n obtained through the network, and margin is a training threshold parameter set according to actual requirements.
5. The double-branch vehicle re-identification method for strengthening local attention according to claim 2, wherein in step (4), the joint loss L of the triplet loss and the focal loss is calculated as:
L = α(L_foc1 + L_foc2 + L_foc3 + L_foc4 + L_foc5) + β(L_tri1 + L_tri2 + L_tri3 + L_tri4)
where L_foc1, L_foc2, L_foc3, L_foc4, and L_foc5 are the focal losses calculated from the features f_2, f_5, f_12, f_13, and f_14, respectively, L_tri1, L_tri2, L_tri3, and L_tri4 are the triplet losses calculated from the features X3-avg, X3-max, X4-avg, and X4-max, and α and β are the weighting coefficients of the focal loss and the triplet loss.
6. The double-branch vehicle re-identification method for strengthening local attention according to claim 1, wherein in step (6), the similarity between the query image and the gallery vehicle images is calculated with the cosine distance; the cosine distance c is calculated as:
c = (feature_1 · feature_2) / (||feature_1||_2 ||feature_2||_2)
where feature_1 is the query image feature, feature_2 is a gallery vehicle image feature, · denotes the dot product with elements multiplied one by one and summed, and || ||_2 denotes the two-norm;
the gallery vehicle images are ranked according to similarity, and the most similar vehicle images are returned.
CN202110040859.5A (priority date 2021-01-13, filing date 2021-01-13) Double-branch vehicle re-identification method for strengthening local attention. Status: Active. Granted as CN112766353B (en).

Priority Applications (1)

Application number: CN202110040859.5A · Priority date: 2021-01-13 · Filing date: 2021-01-13 · Title: Double-branch vehicle re-identification method for strengthening local attention


Publications (2)

Publication Number / Publication Date:
CN112766353A (en) 2021-05-07
CN112766353B (en) 2023-07-21

Family

ID=75699991

Family Applications (1)

CN202110040859.5A (priority date 2021-01-13, filing date 2021-01-13), Active, granted as CN112766353B (en): Double-branch vehicle re-identification method for strengthening local attention

Country Status (1)

Country Link
CN (1) CN112766353B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119977B (en) * 2021-12-01 2022-12-30 昆明理工大学 Graph convolution-based Transformer gastric cancer canceration region image segmentation method
CN116644788B (en) * 2023-07-27 2023-10-03 山东交通学院 Local refinement and global reinforcement network for vehicle re-identification


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020186914A1 (en) * 2019-03-20 2020-09-24 北京沃东天骏信息技术有限公司 Person re-identification method and apparatus, and storage medium
CN111339849A (en) * 2020-02-14 2020-06-26 北京工业大学 Pedestrian re-identification method integrating pedestrian attributes
CN111582178A (en) * 2020-05-09 2020-08-25 山东建筑大学 Vehicle weight recognition method and system based on multi-azimuth information and multi-branch neural network
CN111709311A (en) * 2020-05-27 2020-09-25 西安理工大学 Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111709364A (en) * 2020-06-16 2020-09-25 兰州理工大学 Pedestrian re-identification method based on visual angle information and batch characteristic erasing
CN111738143A (en) * 2020-06-19 2020-10-02 重庆邮电大学 Pedestrian re-identification method based on expectation maximization
CN111898431A (en) * 2020-06-24 2020-11-06 南京邮电大学 Pedestrian re-identification method based on attention mechanism part shielding
CN112199983A (en) * 2020-07-08 2021-01-08 北京航空航天大学 Multi-level screening long-time large-range pedestrian re-identification method
CN111860678A (en) * 2020-07-29 2020-10-30 中国矿业大学 Unsupervised cross-domain pedestrian re-identification method based on clustering
CN111985367A (en) * 2020-08-07 2020-11-24 湖南大学 Pedestrian re-recognition feature extraction method based on multi-scale feature fusion
CN112163498A (en) * 2020-09-23 2021-01-01 华中科技大学 Foreground guiding and texture focusing pedestrian re-identification model establishing method and application thereof

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Batch DropBlock Network for Person Re-identification and Beyond; Zuozhuo Dai et al.; Proceedings of the IEEE/CVF International Conference on Computer Vision; 3691-3701 *
Going Beyond Real Data: A Robust Visual Representation for Vehicle Re-identification; Zhedong Zheng et al.; 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; 1-9 *
Random Erasing Data Augmentation; Zhun Zhong et al.; Proceedings of the AAAI Conference on Artificial Intelligence; vol. 34, no. 07; 13001-13008 *
Semantic-Guided Shared Feature Alignment for Occluded Person Re-IDentification; Xuena Ren et al.; Proceedings of Machine Learning Research; 1-32 *
Cross-dataset person re-identification method based on multi-pooling fusion and background elimination network; 李艳凤, 张斌, 孙嘉, 陈后金, 朱锦雷; Journal on Communications (通信学报); vol. 41, no. 10; 70-79 *
Research on vehicle re-identification based on multi-feature joint learning; 陈旋; China Masters' Theses Full-text Database, Engineering Science and Technology II (中国优秀硕士学位论文全文数据库工程科技II辑); C034-855 *

Also Published As

Publication number Publication date
CN112766353A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
Dong et al. A lightweight vehicles detection network model based on YOLOv5
Zhao et al. Semantic segmentation with attention mechanism for remote sensing images
CN104616032B (en) Multi-camera system target matching method based on depth convolutional neural networks
CN107358257B (en) Under a kind of big data scene can incremental learning image classification training method
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
Yin et al. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution
Derpanis et al. Classification of traffic video based on a spatiotemporal orientation analysis
US20180165552A1 (en) All-weather thermal-image pedestrian detection method
CN112766353B (en) Double-branch vehicle re-identification method for strengthening local attention
CN112215074A (en) Real-time target identification and detection tracking system and method based on unmanned aerial vehicle vision
CN113065645A (en) Twin attention network, image processing method and device
CN107767416A (en) The recognition methods of pedestrian's direction in a kind of low-resolution image
CN109948643A (en) A kind of type of vehicle classification method based on deep layer network integration model
CN111611956B (en) Rail detection method and system for subway visual image
CN113128476A (en) Low-power consumption real-time helmet detection method based on computer vision target detection
CN114220126A (en) Target detection system and acquisition method
CN116188999A (en) Small target detection method based on visible light and infrared image data fusion
CN112613434A (en) Road target detection method, device and storage medium
Gu et al. Embedded and real-time vehicle detection system for challenging on-road scenes
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN114639067A (en) Multi-scale full-scene monitoring target detection method based on attention mechanism
Shahbaz et al. Deep atrous spatial features-based supervised foreground detection algorithm for industrial surveillance systems
Cao et al. Multimodal object detection by channel switching and spatial attention
Ren et al. MPSA: A multi-level pixel spatial attention network for thermal image segmentation based on Deeplabv3+ architecture
CN109255052A (en) A kind of three stage vehicle retrieval methods based on multiple features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant