CN114882529A - Double-branch cross-domain pedestrian re-identification method based on attention calibration - Google Patents

Double-branch cross-domain pedestrian re-identification method based on attention calibration

Info

Publication number
CN114882529A
CN114882529A (application CN202210477742.8A)
Authority
CN
China
Prior art keywords
domain
attention
calibration
features
fcbn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210477742.8A
Other languages
Chinese (zh)
Inventor
黄盼
朱松豪
梁志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202210477742.8A
Publication of CN114882529A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods


Abstract

The invention discloses a double-branch cross-domain pedestrian re-identification method based on attention calibration. An attention calibration module is used to improve a ResNet-50 reference network; a domain-invariant feature supplementary branch and a domain-specific feature global branch are introduced; the domain-invariant and domain-specific feature information is linearly weighted to obtain a fused feature, which is normalized to obtain the final feature of the model. The reference network fully extracts the channel and spatial information of the features and addresses the problem of feature misalignment. Introducing the domain-specific feature global branch reduces the number of model parameters, and combined with the reference network it corrects training bias while reducing model complexity, so that the model can fully extract domain-specific features. Introducing the domain-invariant feature supplementary branch enriches the source-domain complementary information obtained from the domain-specific calibrated features; combined with the domain-specific features extracted by the global branch, this gives the model stronger generalization capability.

Description

Double-branch cross-domain pedestrian re-identification method based on attention calibration
Technical Field
The invention relates to a pedestrian re-identification method, in particular to a double-branch cross-domain pedestrian re-identification method based on attention calibration.
Background
As one of the most attractive research tasks in the computer vision community, pedestrian re-identification aims to determine whether views from non-overlapping cameras contain the same pedestrian. Pedestrian re-identification models play an important role in various surveillance applications, such as pedestrian retrieval and public safety event detection, and have therefore attracted much scholarly attention in recent years. However, after training on a source domain, the performance of a pedestrian re-identification model often drops markedly when it is applied to a new target-domain dataset. Most existing research focuses on fully supervised pedestrian re-identification and has made encouraging progress; these methods, however, assume independent and identically distributed datasets. To mitigate this problem, researchers introduced unsupervised domain adaptation, which learns models from labeled source-domain and unlabeled target-domain data through image translation, feature alignment, or multi-task learning, but such methods require a large amount of target-domain data to work well. Beyond the unsupervised setting, researchers have also proposed real-world domain generalization techniques that train the model on multiple source domains and perform well on unseen target domains without model updates. A search of the prior-art literature shows that Choi et al. proposed the MetaBIN algorithm, which meta-trains a batch-instance normalization network by combining Batch Normalization (BN) and Instance Normalization (IN), but this approach requires very careful handling of the relationship between batch normalization and instance normalization. Song et al. learn domain-invariant features using feature mapping and meta-learning to fully extract domain-invariant and domain-specific features.
He et al. use knowledge distillation to train on multi-source-domain data with student and teacher models, migrating the knowledge learned by the teacher model to a student model with relatively weak learning capability so as to enhance the student's generalization; such schemes are often overly complex and eliminate domain differences only within the given source domains, and therefore lack the ability to sufficiently attenuate the new styles of unseen domains. Wang et al. propose a large-scale synthetic pedestrian re-identification dataset named RandPerson; the knowledge learned from its synthetic data transfers well to real-world datasets, which alleviates the scarcity of pedestrian re-identification data to some extent.
In summary, for cross-domain pedestrian re-identification, how to better improve the generalization capability of the model across datasets is an urgent problem for those skilled in the art.
Disclosure of Invention
In order to solve the problems, the invention provides a double-branch cross-domain pedestrian re-identification method based on attention calibration.
In order to achieve the purpose, the invention is realized by the following technical scheme:
The invention relates to a double-branch cross-domain pedestrian re-identification method based on attention calibration, which comprises the following steps:
s1: constructing an end-to-end pedestrian re-identification network based on an attention calibration module;
s2: generating a feature map with the pedestrian re-identification network, and creating a domain-invariant feature supplementary branch and a domain-specific feature global branch based on the feature map to obtain a model framework for identifying and retrieving cross-domain pedestrian objects;
s3: using the domain-invariant feature supplementary branch to learn the domain-invariant feature information of the source-domain dataset, and using the domain-specific feature global branch to learn the domain-specific feature information of the source-domain dataset;
s4: linearly weighting the domain-invariant feature information and the domain-specific feature information to obtain the fused feature, and normalizing the fused feature to obtain the final feature of the model;
s5: calculating the similarity between features by metric learning based on the final features of the model, ranking by similarity, and selecting the top-ranked results as the model retrieval result.
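For illustration only (not part of the claimed method), the retrieval in step s5 reduces to a nearest-neighbour search once features are L2-normalized. A minimal NumPy sketch; the function name, feature dimensions, and data are hypothetical:

```python
import numpy as np

def retrieve(query, gallery, top_k=10):
    """Rank gallery features by cosine similarity to a query feature
    and return the indices of the top_k most similar entries."""
    q = query / np.linalg.norm(query)                            # L2-normalize query
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)  # L2-normalize gallery
    sims = g @ q                                                  # cosine similarity
    order = np.argsort(-sims)                                     # descending similarity
    return order[:top_k]

rng = np.random.default_rng(0)
gallery = rng.standard_normal((100, 256))
query = gallery[42] + 0.01 * rng.standard_normal(256)  # near-duplicate of entry 42
ranks = retrieve(query, gallery)
```

On normalized features, cosine similarity is just a dot product, so the whole gallery can be scored with one matrix-vector multiplication.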
The invention is further improved in that: the pedestrian re-identification network in step S1 includes three convolution blocks and two attention calibration modules, and the original input features pass in sequence through the first two convolution blocks, the first attention calibration module, the third convolution block, and the second attention calibration module to obtain the output features.
The invention is further improved in that: the attention calibration module in S1 comprises two convolution layers, a feature-calibration FCBN layer, and a Sigmoid function. The input feature f_i passes in sequence through the first convolution layer, the feature-calibration FCBN layer, the second convolution layer, and the Sigmoid layer to obtain an attention mask M for attention calibration, defined as:
M = Sigmoid(Conv_2(FCBN(Conv_1(f_i))))
where the feature-calibration FCBN layer is expressed as:
FCBN(x_(n,c,h,w)) = x_(n,c,h,w) + ω·G
where the input feature x_(n,c,h,w) of the FCBN layer is the output feature of the first convolution layer, n is the batch size, c the number of channels, h and w the height and width of the input picture, ω is a learnable weight vector, and G is statistical information of the input feature;
the feature map of the attention mask M is multiplied with the input feature f_i and added to it to emphasize discriminative features, giving the attention-calibrated output feature F_O:
F_O = f_i · M + f_i
the invention is further improved in that: the domain-invariant feature supplementary branch and the domain-specific feature global branch in S2 are derived by a first attention calibration module and a second attention calibration module, respectively.
The invention is further improved in that: the output feature X_(DIFSB) of the domain-invariant feature supplementary branch in S3 is expressed as:
X_(DIFSB) = Conv_a2(IN(ReLU(FCBN_a2(Conv_a1(FCBN_a1(x_s))))))
where x_s is the feature output by the first attention calibration module, IN is a style normalization layer, Conv_a1 and Conv_a2 are the first and second convolution layers of the branch, and FCBN_a1 and FCBN_a2 are its first and second feature-calibration FCBN layers.
The invention is further improved in that: the output feature X_(DSFGB) of the domain-specific feature global branch in S3 is expressed as:
X_(DSFGB) = Conv_b(FCBN_b2(Bottleneck(FCBN_b1(x_t))))
where x_t is the feature output by the second attention calibration module, Conv_b is the convolution layer of the branch used to reduce the number of feature channels c, and FCBN_b1 and FCBN_b2 are its first and second feature-calibration FCBN layers.
The invention is further improved in that: S4 normalizes the fused feature to obtain the output feature X:
X = Normalize(a × X_(DSFGB) + b × X_(DIFSB))
where a and b are weight coefficients and Normalize denotes the L2 norm.
The invention has the beneficial effects that: 1. The pedestrian re-identification network is based on the ResNet-50 neural network framework and improves the ResNet-50 reference network with an attention calibration module. The reference network fully extracts the channel and spatial information of the features, addresses the problem of feature misalignment, locates distinguishable local features, and reduces noisy features, so as to better extract feature information beneficial to pedestrian re-identification.
2. The invention introduces a domain-specific feature global branch comprising two feature-calibration FCBN layers and a Bottleneck layer. The feature-calibration FCBN layer alleviates the training bias in a standard BN layer and improves generalization; the Bottleneck layer reduces the number of model parameters while still training on the data and extracting features effectively. Combined with the reference network, the branch corrects training bias, removes part of the high-frequency noise in the learned features, and reduces model complexity, so that the model can fully extract domain-specific features.
3. The invention introduces a domain-invariant feature supplementary branch. Combined with the reference network, it obtains source-domain complementary information from the domain-specific calibrated features and filters out domain-specific information with a style normalization IN layer to obtain domain-invariant features. Combined with the domain-specific features extracted by the global branch, this gives the model strong generalization capability.
Drawings
Fig. 1 is an overall frame diagram of the present invention.
FIG. 2 is a schematic diagram of the experimental results of the method of the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, the present invention is a double-branch cross-domain pedestrian re-identification method based on attention calibration; the specific steps are as follows:
s1: constructing an end-to-end pedestrian re-identification network based on an attention calibration module. Specifically: the first three convolution blocks of ResNet-50 are selected, and an attention calibration module is added after the second and the third convolution block to form the reference network. The attention calibration module comprises two convolution layers (a first and a second convolution layer), a feature-calibration FCBN layer, and a Sigmoid function. The input feature f_i passes in sequence through the first convolution layer, the feature-calibration FCBN layer, the second convolution layer, and the Sigmoid activation function layer, which improves the nonlinear expression capability of the model, giving the attention mask M for attention calibration:
M=Sigmoid(Conv 2 (FCBN(Conv 1 (f i ))))
where both Conv are 1 × 1 convolutions: Conv_1, the first convolution layer of the attention calibration module, reduces the feature dimension, and Conv_2, the second convolution layer of the module, raises the feature dimension again.
The feature-calibration FCBN layer is an improvement on the standard BN layer; its formula is:
FCBN(x_(n,c,h,w)) = x_(n,c,h,w) + ω·G
where x_(n,c,h,w) denotes the input feature (the output feature of the first convolution layer), n the batch size, c the number of channels, and h and w the height and width of the input picture; ω ∈ R^(1×C×1×1) is a learnable weight vector, where R denotes the real numbers; G is statistical information of the feature x and can take various shapes, e.g. G ∈ R^(N×1×H×W) or G ∈ R^(N×C×1×1). In this embodiment, G defaults to the mean μ computed over the mini-batch (μ ∈ R^(N×C×1×1)). The two factors are broadcast to the same shape and combined by element-wise (dot-product) multiplication, giving the FCBN output feature X_1.
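A minimal NumPy sketch of the FCBN computation with the default choice G = μ (the per-channel mean broadcast over the spatial dimensions); the shapes follow the text, while the function name and sample data are illustrative:

```python
import numpy as np

def fcbn(x, omega):
    """FCBN(x) = x + omega * G, with G taken as the mean mu computed
    per (batch, channel) and broadcast over H and W (the default in the text)."""
    # x: (N, C, H, W); omega: learnable weight of shape (1, C, 1, 1)
    G = x.mean(axis=(2, 3), keepdims=True)  # mu in R^(N x C x 1 x 1)
    return x + omega * G                    # broadcast, multiply, add

x = np.random.default_rng(1).standard_normal((4, 8, 16, 16))
omega = np.zeros((1, 8, 1, 1))
assert np.allclose(fcbn(x, omega), x)  # with zero weights, FCBN is the identity
```

The residual form means an untrained FCBN layer (ω = 0) passes features through unchanged, which is a common safe initialization for calibration layers.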
The feature map of the attention mask M is multiplied with the input feature f_i and added to it to emphasize discriminative features, giving the attention-calibrated output feature F_O:
F_O = f_i · M + f_i
where the operator · denotes element-wise (dot-product) multiplication.
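The whole module can be sketched as follows (NumPy, for illustration only): the 1 × 1 convolutions are written as per-pixel channel-mixing matrix products, and the reduced channel count and random weights are hypothetical choices, not taken from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    # 1x1 convolution = per-pixel channel mixing; x: (N,C,H,W), w: (C_out, C_in)
    return np.einsum("oc,nchw->nohw", w, x)

def fcbn(x, omega):
    # FCBN with G = per-(batch, channel) mean, as in the text
    return x + omega * x.mean(axis=(2, 3), keepdims=True)

def attention_calibration(f_i, w1, w2, omega):
    """M = Sigmoid(Conv2(FCBN(Conv1(f_i)))); F_O = f_i * M + f_i."""
    m = sigmoid(conv1x1(fcbn(conv1x1(f_i, w1), omega), w2))
    return f_i * m + f_i

rng = np.random.default_rng(2)
f = rng.standard_normal((2, 8, 4, 4))
w1 = 0.1 * rng.standard_normal((4, 8))  # Conv1: reduce channels 8 -> 4
w2 = 0.1 * rng.standard_normal((8, 4))  # Conv2: restore channels 4 -> 8
omega = np.zeros((1, 4, 1, 1))
out = attention_calibration(f, w1, w2, omega)
```

Because the mask values lie in (0, 1) and the input is added back, every output element lies between the input and twice the input in magnitude; the module re-weights features without ever suppressing them to zero.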
S2: a domain-invariant feature supplementary branch (DIFSB) is drawn from the first attention calibration module of the constructed reference network, and a domain-specific feature global branch (DSFGB) is drawn from the second attention calibration module.
The domain-invariant feature supplementary branch includes two feature-calibration FCBN layers, two convolution layers, a ReLU function, and a style normalization IN layer. The input feature of the branch passes through the first feature-calibration FCBN layer into a convolution layer, then through the second feature-calibration FCBN layer, the ReLU activation function, the style normalization IN layer, and a final convolution layer, which outputs the domain-invariant features.
The domain-specific feature global branch comprises two feature-calibration FCBN layers, a Bottleneck layer, and a convolution layer; the input feature of the branch passes through the first feature-calibration FCBN layer into the Bottleneck layer, then through the second feature-calibration FCBN layer, and finally through the convolution layer, which outputs the domain-specific features.
The output feature X_(DIFSB) of the domain-invariant feature supplementary branch (DIFSB) is given by:
X_(DIFSB) = Conv_a2(IN(ReLU(FCBN_a2(Conv_a1(FCBN_a1(x_s))))))
where x_s is the feature output by the first attention calibration module, IN is the style normalization layer, Conv_a1 is the first convolution layer of the branch, used to reduce the picture height h and width w, Conv_a2 is its second convolution layer, used to reduce the number of feature channels c, and FCBN_a1 and FCBN_a2 are its first and second feature-calibration FCBN layers.
The output feature X_(DSFGB) of the domain-specific feature global branch is given by:
X_(DSFGB) = Conv_b(FCBN_b2(Bottleneck(FCBN_b1(x_t))))
where x_t is the feature output by the second attention calibration module, the convolution layer Conv_b of the branch is used to reduce the number of feature channels c, and FCBN_b1 and FCBN_b2 are its first and second feature-calibration FCBN layers.
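Both branch formulas can be sketched in the same NumPy style. Several pieces are assumptions, flagged in the comments: the Bottleneck layer is approximated by a channel-reducing 1 × 1 convolution plus ReLU (the patent does not give its internals), Conv_a1 is approximated by average pooling to reduce h and w, and all shapes and weights are illustrative:

```python
import numpy as np

def conv1x1(x, w):                  # x: (N,C,H,W), w: (C_out, C_in)
    return np.einsum("oc,nchw->nohw", w, x)

def fcbn(x, omega):                 # FCBN with G = per-(batch, channel) mean
    return x + omega * x.mean(axis=(2, 3), keepdims=True)

def relu(x):
    return np.maximum(x, 0.0)

def instance_norm(x, eps=1e-5):     # style normalization IN layer (no affine, a sketch)
    mu = x.mean(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)

def pool2x2(x):                     # stand-in for Conv_a1 reducing h and w (assumption)
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def difsb(x_s, om_a1, om_a2, w_a2):
    """X_DIFSB = Conv_a2(IN(ReLU(FCBN_a2(Conv_a1(FCBN_a1(x_s))))))."""
    return conv1x1(instance_norm(relu(fcbn(pool2x2(fcbn(x_s, om_a1)), om_a2))), w_a2)

def dsfgb(x_t, om_b1, om_b2, w_bot, w_b):
    """X_DSFGB = Conv_b(FCBN_b2(Bottleneck(FCBN_b1(x_t))))."""
    bottleneck = relu(conv1x1(fcbn(x_t, om_b1), w_bot))  # Bottleneck approximation
    return conv1x1(fcbn(bottleneck, om_b2), w_b)

rng = np.random.default_rng(3)
x_s = rng.standard_normal((2, 8, 8, 8))
x_t = rng.standard_normal((2, 8, 8, 8))
om8 = np.zeros((1, 8, 1, 1))
om4 = np.zeros((1, 4, 1, 1))
w_a2 = 0.1 * rng.standard_normal((4, 8))   # reduce channels 8 -> 4
w_bot = 0.1 * rng.standard_normal((4, 8))  # bottleneck: reduce channels 8 -> 4
w_b = 0.1 * rng.standard_normal((4, 4))
x_dif = difsb(x_s, om8, om8, w_a2)
x_dsf = dsfgb(x_t, om8, om4, w_bot, w_b)
```

The sketch shows the structural asymmetry of the two branches: DIFSB shrinks both spatial and channel dimensions and passes through IN to strip style information, while DSFGB keeps the spatial resolution and only reduces channels.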
The style normalization IN layer normalizes the features as:
IN(x_in) = γ · (x_in − μ) / √(δ² + ε) + β
where x_in denotes the input feature, coming from the output of the ReLU activation function and about to enter the IN layer, and IN(x_in) denotes its output. μ ∈ R^c and δ ∈ R^c are the mean and standard deviation of the feature, γ ∈ R^c and β ∈ R^c are affine parameters, and the constant ε > 0 avoids division by zero. In training, μ and δ are estimated by a moving-average operation.
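A direct sketch of the IN formula above (NumPy): statistics computed per sample and per channel over the spatial dimensions, with per-channel affine parameters γ and β; the sample data is illustrative:

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """IN(x) = gamma * (x - mu) / sqrt(delta^2 + eps) + beta,
    with mu and delta computed per sample, per channel over (H, W)."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    delta = x.std(axis=(2, 3), keepdims=True)
    return gamma * (x - mu) / np.sqrt(delta**2 + eps) + beta

x = np.random.default_rng(4).standard_normal((2, 3, 8, 8))
gamma = np.ones((1, 3, 1, 1))
beta = np.zeros((1, 3, 1, 1))
y = instance_norm(x, gamma, beta)
```

With γ = 1 and β = 0, each channel of each sample comes out with (approximately) zero mean and unit standard deviation, which is exactly what removes instance-specific style statistics.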
S4: the feature information learned by the domain-invariant feature supplementary branch (DIFSB) and the domain-specific feature global branch (DSFGB) is linearly weighted to obtain the fused feature, which is normalized with the L2 norm to obtain the final feature X of the model:
X = Normalize(a × X_(DSFGB) + b × X_(DIFSB))
where X_(DSFGB) denotes the output feature of the domain-specific feature global branch, X_(DIFSB) the output feature of the domain-invariant feature supplementary branch, a and b are weight coefficients (set to 0.9 and 0.1 respectively in this embodiment), and Normalize denotes the L2 norm;
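The fusion is a two-term weighted sum followed by L2 normalization; a one-function NumPy sketch with the embodiment's a = 0.9, b = 0.1 (feature vectors and dimensions are illustrative):

```python
import numpy as np

def fuse(x_dsfgb, x_difsb, a=0.9, b=0.1):
    """X = Normalize(a * X_DSFGB + b * X_DIFSB), Normalize = L2 norm."""
    fused = a * x_dsfgb + b * x_difsb
    return fused / np.linalg.norm(fused, axis=-1, keepdims=True)

rng = np.random.default_rng(5)
x_final = fuse(rng.standard_normal((4, 256)), rng.standard_normal((4, 256)))
```

After normalization every final feature has unit L2 norm, so the similarity computation in step s5 can use a plain dot product.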
s5: and calculating the similarity between the features by using metric learning for the final features, sequencing according to the similarity, and taking the top ten sequenced pictures as the result of model retrieval. The calculation can be carried out by adopting a contrast loss function and a triple loss function.
In this implementation, the similarity between features is learned with a batch OHEM (online hard example mining) triplet loss. The OHEM triplet loss L_tri over a mini-batch is:
L_tri = Σ_{i,a} [ m + max_p D(f_Δ(x_a^i), f_Δ(x_p^i)) − min_{n,j} D(f_Δ(x_a^i), f_Δ(x_n^j)) ]_+
where the samples come from the mini-batch and x_j^i denotes the jth picture of the ith person. The max runs over the hardest positive (a sample that looks different but is the same person) and the min over the hardest negative (a sample that looks very similar but is a different person). Δ denotes the network parameters, f_Δ is the feature extractor, m is the margin, and D(·,·) denotes the distance between features.
In this example, the training batch size is set to 64 and the test batch size to 128. The classic SGD optimizer is used to train the model, and the learning rate of the backbone is 0.0005. As shown in fig. 2, the first and fourth columns are the pictures input to the network, the second and fifth columns are the visualized activation features after the first attention calibration module, and the third and sixth columns are the visualized activation features output by the second attention calibration module. As can be seen, the method of the invention focuses well on the identifiable characteristics of pedestrians, such as hairstyles, bags, and the corner profiles of clothes.

Claims (7)

1. A double-branch cross-domain pedestrian re-identification method based on attention calibration, characterized by comprising the following steps:
s1: constructing an end-to-end pedestrian re-identification network based on an attention calibration module;
s2: generating a feature map with the pedestrian re-identification network, and creating a domain-invariant feature supplementary branch and a domain-specific feature global branch based on the feature map to obtain a model framework for identifying and retrieving cross-domain pedestrian objects;
s3: using the domain-invariant feature supplementary branch to learn the domain-invariant feature information of the source-domain dataset, and using the domain-specific feature global branch to learn the domain-specific feature information of the source-domain dataset;
s4: linearly weighting the domain-invariant feature information and the domain-specific feature information to obtain the fused feature, and normalizing the fused feature to obtain the final feature of the model;
s5: calculating the similarity between features by metric learning based on the final features of the model, and ranking by similarity.
2. The double-branch cross-domain pedestrian re-identification method based on attention calibration according to claim 1, characterized in that: the pedestrian re-identification network in step S1 includes three convolution blocks and two attention calibration modules, and the original input features pass in sequence through the first two convolution blocks, the first attention calibration module, the third convolution block, and the second attention calibration module to obtain the output features.
3. The double-branch cross-domain pedestrian re-identification method based on attention calibration according to claim 2, characterized in that: the attention calibration module in S1 comprises two convolution layers, a feature-calibration FCBN layer, and a Sigmoid function; the input feature f_i passes in sequence through the first convolution layer, the feature-calibration FCBN layer, the second convolution layer, and the Sigmoid layer to obtain an attention mask M for attention calibration, defined as:
M = Sigmoid(Conv_2(FCBN(Conv_1(f_i))))
where the feature-calibration FCBN layer is expressed as:
FCBN(x_(n,c,h,w)) = x_(n,c,h,w) + ω·G
where the input feature x_(n,c,h,w) of the FCBN layer is the output feature of the first convolution layer, n is the batch size, c the number of channels, h and w the height and width of the input picture, ω is a learnable weight vector, and G is statistical information of the input feature;
the feature map of the attention mask M is multiplied with the input feature f_i and added to it to emphasize discriminative features, giving the attention-calibrated output feature F_O:
F_O = f_i · M + f_i
4. The double-branch cross-domain pedestrian re-identification method based on attention calibration according to claim 1, characterized in that: the domain-invariant feature supplementary branch and the domain-specific feature global branch in S2 are drawn from the first and second attention calibration modules, respectively.
5. The double-branch cross-domain pedestrian re-identification method based on attention calibration according to claim 1, characterized in that: the output feature X_(DIFSB) of the domain-invariant feature supplementary branch in S3 is expressed as:
X_(DIFSB) = Conv_a2(IN(ReLU(FCBN_a2(Conv_a1(FCBN_a1(x_s))))))
where x_s is the feature output by the first attention calibration module, IN is a style normalization layer, Conv_a1 and Conv_a2 are the first and second convolution layers of the branch, and FCBN_a1 and FCBN_a2 are its first and second feature-calibration FCBN layers.
6. The double-branch cross-domain pedestrian re-identification method based on attention calibration according to claim 1, characterized in that: the output feature X_(DSFGB) of the domain-specific feature global branch in S3 is expressed as:
X_(DSFGB) = Conv_b(FCBN_b2(Bottleneck(FCBN_b1(x_t))))
where x_t is the feature output by the second attention calibration module, Conv_b is the convolution layer of the branch used to reduce the number of feature channels c, and FCBN_b1 and FCBN_b2 are its first and second feature-calibration FCBN layers.
7. The double-branch cross-domain pedestrian re-identification method based on attention calibration according to claim 1, characterized in that: S4 normalizes the fused feature to obtain the output feature X:
X = Normalize(a × X_(DSFGB) + b × X_(DIFSB))
where a and b are weight coefficients and Normalize denotes the L2 norm.
CN202210477742.8A 2022-04-29 2022-04-29 Double-branch cross-domain pedestrian re-identification method based on attention calibration Pending CN114882529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210477742.8A CN114882529A (en) 2022-04-29 2022-04-29 Double-branch cross-domain pedestrian re-identification method based on attention calibration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210477742.8A CN114882529A (en) 2022-04-29 2022-04-29 Double-branch cross-domain pedestrian re-identification method based on attention calibration

Publications (1)

Publication Number Publication Date
CN114882529A true CN114882529A (en) 2022-08-09

Family

ID=82673065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210477742.8A Pending CN114882529A (en) 2022-04-29 2022-04-29 Double-branch cross-domain pedestrian re-identification method based on attention calibration

Country Status (1)

Country Link
CN (1) CN114882529A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095447A (en) * 2023-10-18 2023-11-21 杭州宇泛智能科技有限公司 Cross-domain face recognition method and device, computer equipment and storage medium
CN117095447B (en) * 2023-10-18 2024-01-12 杭州宇泛智能科技有限公司 Cross-domain face recognition method and device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination