CN113392786B

CN113392786B - Cross-domain pedestrian re-identification method based on normalization and feature enhancement

Info

Publication number: CN113392786B
Application number: CN202110689585.2A
Authority: CN
Inventors: 殷光强; 贾召钱; 王文超; 曾宇昊; 王春雨
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2021-06-21
Filing date: 2021-06-21
Publication date: 2022-04-12
Anticipated expiration: 2041-06-21
Also published as: CN113392786A

Abstract

The invention belongs to the technical field of pedestrian re-identification, and particularly relates to a cross-domain pedestrian re-identification method based on normalization and feature enhancement. According to the technical scheme, on the basis of not using target domain data, domain gaps can be effectively restrained, pedestrian distinguishing characteristics are enhanced, and further the generalization capability of the recognition network model is enhanced; by means of the residual error connection idea, the example normalization can inhibit style difference and prevent information loss, so that the extracted features have domain invariance and the discrimination is kept; spatial information is fused into the channels through the attention unit CAB, and the characteristic weight of each channel is self-adaptively adjusted through constructing the dependency relationship among the channels, so that the pedestrian characteristic is effectively enhanced.

Description

Cross-domain pedestrian re-identification method based on normalization and feature enhancement

Technical Field

The invention belongs to the technical field of pedestrian re-identification, and particularly relates to a cross-domain pedestrian re-identification method based on normalization and feature enhancement.

Background

Cross-domain pedestrian re-identification refers to retrieving a target pedestrian from large-scale image or video data over different domains using computer vision techniques. The ideal cross-domain pedestrian re-recognition model can be trained once and tested at will, namely, the model is trained only by using the collected source domain data, and then the trained model can obtain good re-recognition effect on any other target domain. However, huge domain gaps often exist among data sets, which seriously hinder the generalization of the model from a source domain to a target domain and is also a main reason that the cross-domain pedestrian re-identification performance is difficult to improve.

Because of the inevitable domain differences between different data domains, many advanced re-recognition algorithms perform well when tested on a single data set, but have poor ability to generalize to another data domain. In order to improve the generalization capability of the model as much as possible, a plurality of cross-domain pedestrian re-identification methods appear in recent years, and the model is required to be better adapted to a target domain. The general method is to collect data of a part of target domains, cluster the extracted features by using a certain clustering algorithm to generate pseudo labels, train a model by using the generated pseudo labels, update model parameters, and iterate the steps until convergence. Although many cross-domain pedestrian re-identification methods do effectively improve the generalization capability of the model, the collection of the target domain data is also time-consuming and labor-consuming, and in practical application, the data of the target domain cannot be collected at all.

Specifically, in the cross-domain pedestrian re-identification model, the domain gap of the data set is mainly introduced in the data collection process, such as: differences in collection time can cause differences in image brightness, and differences in collection locations can cause differences in image background. These different stylistic differences make the data distribution in different domains different, which in turn leads to complications in the re-recognition task. Currently, transfer learning is one of the mainstream means for solving the problem of model generalization, specifically, knowledge or patterns learned in a certain field or task are applied to different but related fields or problems, and image style conversion (style transfer) is a transfer learning method in the image field, so that the problem of model generalization caused by image style difference can be effectively solved, and researchers widely apply the transfer learning method to the task of cross-domain pedestrian re-identification. The method of style migration based on generation of a countermeasure network (GAN) requires the use of target domain data during model training, adding additional collection and training costs, and Instance Normalization (IN) is used for style normalization, which performs a form of style normalization to adjust the feature statistics of the network. However, the IN can dilute the information carried by the global statistics of the feature response, obviously, the IN introduced IN the row re-identification task can normalize the image style to inhibit the inter-domain difference, but the process of the IN can lose some judgment information.

To improve the generalization ability of the model, it is also an effective means to enhance pedestrian features using an attention mechanism that enables the model to focus on the region of interest, which is generally divided into spatial attention and channel attention. The spatial attention utilizes the spatial relationship among the features to generate a spatial attention weight so as to position the concerned pedestrian information on the spatial dimension; channel attention is to improve the representational capacity of the network by modeling the dependencies of each channel. Different attention is usually paid to different tasks, researchers need to match the tasks in specific tasks, but simple overlapping use causes certain redundancy and wastes computing resources.

Disclosure of Invention

According to the problems in the prior art, the invention provides a cross-domain pedestrian re-identification method based on normalization and feature enhancement, which can effectively inhibit domain gaps and enhance pedestrian distinguishing features on the basis of not using target domain data, thereby enhancing the generalization capability of a model.

The method is realized by the following technical scheme:

the cross-domain pedestrian re-identification method based on normalization and feature enhancement is characterized by comprising the steps of establishing an identification network model, image feature normalization, image feature recovery and image feature output;

establishing a recognition network model, wherein the establishing of the recognition network model comprises establishing a normalization enhancement module NEM with an Instance normalization unit IN (instant normalization 1), a residual weight training unit CMS and an attention unit CAB, and inserting the normalization enhancement module NEM into a ResNet50 model by taking a ResNet50 model as a backbone network to form a recognition network model;

the image feature normalization comprises the following steps:

s11, extracting pedestrian image features x ∈ R based on ResNet50 model^c×h×w(Note: x, x in this embodiment)₁、x₂Are all shown as figuresImage features), where x is the input feature of the normalized enhancement module NEM, c is the number of channels of the image features, h is the height of the image features, w is the width of the image features, R^c×h×wRepresenting a real number domain space of dimensions c x h x w, x ∈ R^c×h×wRepresenting a vector in a real number domain space with input features x of dimensions c x h x w;

s12, obtaining input characteristic x e R by using example normalization unit IN^c×h×wAnd a variance σ (x) in each channel, and calculates a normalized feature x based on the obtained mean μ (x) and variance σ (x)₁The calculation formula is as follows:

wherein, gamma and beta are learnable parameter vectors, and gamma belongs to Rc and beta belongs to Rc, which means that gamma and beta are vectors in a real number domain space with c dimension; the initial values of gamma and beta elements are respectively set to be 1 and 0, and then are automatically updated in the training process;

the image feature recovery comprises the following steps:

s21, the residual error weight training unit CMS is facilitated according to the normalized feature x₁Learning a residual weight W_rNamely, the following steps are provided:

W_r＝sigmoid(mean(conv(x₁)))

wherein conv (-) represents convolution, mean (-) represents global mean, sigmoid (-) represents activation function;

s22, based on the residual weight W_rFusing input features x and normalized features x₁And recovering discrimination information lost by the image characteristics due to style normalization, wherein a fusion formula is as follows:

x₂＝W_r×x₁+(1-W_r)×x

wherein x is₂For the restored image feature, named restored feature, and x₂∈R^c×h×wExpressed is a recovered feature x₂Is one of real number domain spaces of dimension c x h x wVector quantity;

the image feature output comprises the steps of:

s31, exploring the restored features x using the attention cell CAB₂The relevance between different channels in the channel and the self-adaptive extraction of the attention weight W of the channel_cNamely, the following steps are provided:

W_c＝ca(x₂)

where ca (-) is the attention unit CAB, the channel attention weight W_cMeasures the recovered feature x₂The importance of the information of each channel;

s32, attention weighting W by channel_cFor the recovered feature x₂Filtering is carried out to enhance the characterization capability of the pedestrian characteristics, namely:

f＝(W_c+1)×x₂

wherein f is the output characteristic of the normalization and enhancement module NEM.

Further, the ResNet50 model comprises a Res1 unit, a Res2 unit, a Res3 unit, a Res4 unit, a Res5 unit and a Head unit which are sequentially connected in a communication mode, and normalization enhancement modules NEM are inserted into the output ends of the Res2 unit, the Res3 unit, the Res4 unit and the Res5 unit respectively.

Further, the method also includes introducing a NEM loss function into the normalized enhancement module NEM at the output end of the Res5 unit, that is, the image feature output of the normalized enhancement module NEM further includes the following steps:

s33, respectively calculating the central loss C of the input feature x_xAnd output characteristic f center loss C_fIn order to measure the in-class dispersion of the input feature x and the output feature f in the feature space, the calculation formula is as follows:

wherein, c^x _j∈R^dRepresenting the class center of the jth pedestrian in the input feature x; c. C^f _j∈R^dRepresenting the class center of the jth pedestrian in the output characteristic f; n represents the total number of pedestrians in the data set, m represents the total number of features of the jth pedestrian, and x_jiI-th feature representing j-th pedestrian in input feature x, d representing the dimension of each feature, R^dA real number domain space representing d dimensions, i.e. c^x _jAnd c^f _jA vector in a real number domain space that is both d-dimensional;

s34, based on the center loss c_fAnd c_xEstablishing an NEM loss function, wherein the NEM loss function is as follows:

wherein L is_NEMThe loss values calculated for input feature x and output feature f of NEM 5.

Further, in the step S11, x ∈ R^c×h×wThe carried characteristic information comprises a style and a shape; the style comprises an imaging style of the image and a clothing style of the pedestrian, and the shape is the contour shape of the pedestrian in the image.

Further, in the step S31, a channel attention weight W is obtained_cThe method comprises the following steps:

s311, along the recovered feature x₂Performing maximum pooling and average pooling on the channel dimension to obtain two 1 × h × w two-dimensional matrixes, and recovering the characteristic x₂Respectively carrying out element-by-element multiplication with the two 1 xhxw two-dimensional matrixes so as to respectively introduce the spatial information respectively corresponding to the two 1 xhxw two-dimensional matrixes into the recovered characteristic x₂In the channel of (a);

s312, respectively performing maximum pooling and average pooling on the features corresponding to the introduced spatial information along the spatial dimension to generate two spatial aggregation masks F₁And F₂And F is₁∈R^c×1×1，F₂∈R^c×1×1(ii) a Wherein R is^c×1×1Representing a real number domain space of dimensions c × 1 × 1, F₁∈R^c×1×1And F₂∈R^c×1×1Is represented by F₁And F₂Vectors in real number domain space, all of dimensions c × 1 × 1;

s313, concat operation is carried out on the two space aggregation masks, and the result obtained through the concat operation is sequentially subjected to convolution and sigmoid operation and is fused to obtain the final channel attention weight W_c。

Further, the spatial information includes global information and saliency information on a space corresponding to the 1 × h × w two-dimensional matrix.

The beneficial effect that this technical scheme brought:

1) according to the technical scheme, on the basis of not using target domain data, domain gaps can be effectively restrained, pedestrian distinguishing characteristics are enhanced, and further the generalization capability of the recognition network model is enhanced; by means of the residual error connection idea, the example normalization can inhibit style difference and prevent information loss, so that the extracted features have domain invariance and the discrimination is kept; spatial information is fused into the channels through the attention unit CAB, and the characteristic weight of each channel is self-adaptively adjusted through constructing the dependency relationship among the channels, so that the pedestrian characteristic is effectively enhanced.

2) According to the technical scheme, NEM loss function constraint is introduced to identify the invariant features of the network model learning domain, so that the distance in the feature class is reduced, and the feature distribution is optimized.

Drawings

The foregoing and following detailed description of the invention will be apparent when read in conjunction with the following drawings, in which:

FIG. 1 is a block diagram of the overall structure of a pedestrian re-identification model as described herein;

fig. 2 is a block diagram of the structure of the normalization enhancement module NEM;

FIG. 3 is a block diagram of the attention unit CAB;

fig. 4 is a comparison graph of the effect of different combinations of insertions of the normalized enhancement module NEM in the ResNet 50;

Detailed Description

The technical solutions for achieving the objects of the present invention are further illustrated by the following specific examples, and it should be noted that the technical solutions claimed in the present invention include, but are not limited to, the following examples.

Example 1

The embodiment discloses a cross-domain pedestrian re-identification method based on normalization and feature enhancement, and as a basic implementation scheme of the invention, the method comprises the steps of establishing an identification network model, normalizing image features, recovering image features and outputting the image features.

Establishing a recognition network model, including establishing a normalized enhancement module NEM with an example normalization unit IN (namely IN IN FIG. 2), a residual weight training unit CMS (namely CMS IN FIG. 2) and an attention unit CAB (namely CA IN FIG. 2) as shown IN FIG. 2, and taking a ResNet50 model as a backbone network, and inserting the normalized enhancement module NEM into a ResNet50 model to form the recognition network model.

Normalizing the image features, namely normalizing the style of the features by calculating the mean and variance in each channel of the image features, so that the style difference between different domains can be inhibited, and the method specifically comprises the following steps:

s11, extracting pedestrian image features x ∈ R based on ResNet50 model^c×h×wThe carried characteristic information, wherein x is the input characteristic of the normalization enhancement module NEM, c is the channel number of the image characteristic, h is the channel height of the image characteristic, w is the channel width of the image characteristic, R^c×h×wRepresenting a real number domain space of dimensions c x h x w, x ∈ R^c×h×wIt is shown that the input feature x is a vector in a real number domain space of dimensions c × h × w, and x ∈ R^c×h×wThe carried characteristic information comprises a style and a shape; the style comprises the imaging style of the image and the clothing style of the pedestrian, and the shape is the contour shape of the pedestrian in the image;

wherein μ (x) and σ (x) represent an average value and a variance value calculated over a spatial dimension (h × w) of the image feature, respectively; both γ and β are learnable parameter vectors, and γ ∈ R^c、β∈R^cIndicating that both γ and β are vectors in a real number domain space that is c-dimensional; the initial values of gamma and beta elements are respectively set to be 1 and 0, and then the values are automatically updated in the training process, specifically, the gamma initializes the vector with 1 and the beta initializes the vector with 0, the values of the gamma and the beta automatically change according to the gradient of back propagation in the training process, the function of the gamma and the beta is to ensure that the original learned characteristics are kept after each data is normalized, and simultaneously, the normalization operation and the accelerated training can be completed.

Although image feature normalization helps reduce style variation resulting in inter-domain gaps, if the style itself contains pedestrian re-recognition discrimination information, it may also result in significant information loss while eliminating the style variation. For example, clothing of pedestrians is important re-identification discrimination information, the texture of clothing fabric obviously belongs to one of styles, and when the style is inhibited, the discrimination of the feature is weakened, so that the image feature normalization can inhibit style difference and prevent information loss by means of a residual error connection idea, and meanwhile, the extracted feature has domain invariance and maintains discrimination. The image feature recovery method is specifically realized by image feature recovery, and comprises the following steps:

W_r＝sigmoid(mean(conv(x₁)))

where conv (-) represents convolution, mean (-) represents global mean, sigmoid (-) represents activation function, i.e., feature x is first normalized₁Passing through a convolution layer with convolution kernel size of 3 × 3 × c, step length of 2 and output channel of 1, normalizing feature x₁The contained information is compressed in the dimensions of space and channels, then the average value is calculated in each channel, and the space is further compressedInformation, finally obtaining residual weight W between 0 and 1 after sigmoid mapping_rI.e. residual weights W_r∈R¹；

x₂＝W_r×x₁+(1-W_r)×x

wherein x is₂For the restored image feature, named restored feature, and x₂∈R^c×h×wExpressed is a recovered feature x₂Is a vector in a real number domain space of dimensions c x h x w.

Since spatial information is gradually compressed and pedestrian-related information is gradually shifted to channel dimensions in the feature extraction process (referring to the overall feature extraction process, more than one link) by the ResNet50 module, it is necessary to enhance pedestrian features by means of channel attention, that is, the image feature output includes the following steps:

s31, exploring the restored features x using the attention cell CAB₂The relevance between different channels enables the attention to be focused on the most meaningful part of the pedestrian image, and the attention weight W of the channel is extracted in a self-adaptive manner_cNamely, the following steps are provided:

W_c＝ca(x₂)

where ca (-) is the attention unit CAB, the channel attention weight W_cMeasure x₂The importance of the information of each channel;

f＝(W_c+1)×x₂

According to the technical scheme, on the basis of not using target domain data, domain gaps can be effectively restrained, pedestrian distinguishing characteristics are enhanced, and further the generalization capability of the recognition network model is enhanced; by means of the residual error connection idea, the example normalization can inhibit style difference and prevent information loss, so that the extracted features have domain invariance and the discrimination is kept; spatial information is fused into the channels through the attention unit CAB, and the characteristic weight of each channel is self-adaptively adjusted through constructing the dependency relationship among the channels, so that the pedestrian characteristic is effectively enhanced.

Example 2

The embodiment discloses a cross-domain pedestrian re-identification method based on normalization and feature enhancement, which is a preferred implementation scheme of the invention, namely in the embodiment 1, a ResNet50 model comprises a Resl unit, a Res2 unit, a Res3 unit, a Res4 unit, a Res5 unit and a Head unit which are sequentially connected in a communication mode, a normalization enhancement module NEM is inserted after each Res unit or part of Res units of a ResNet50 model, and the normalization enhancement module NEM can respectively enhance features in relevant stages, so that the overall effect is good. In the ResNet50 model, the features obtained by the operation of the Res1 unit are too shallow and basically do not contain semantic information such as styles, and the normalized enhancement module NEM is inserted after the Res1 unit to play a role in feature enhancement, so that the complexity of the model is further increased, and therefore, the normalized enhancement module NEM is not inserted after the Res1 unit in the process of designing the identified network model.

Specifically, the effect of the NEM is best after the Res units are inserted into the NEM, and verification can be carried out according to experiments. As shown in fig. 4: NEM23 indicates the insertion of normalized-enhancement-module NEM at the output of Res2 and Res3 cells, NEM234 indicates the insertion of normalized-enhancement-module NEM at the output of Res2, Res3 and Res4 cells, NEM2345 indicates the insertion of normalized-enhancement-module NEM at the output of Res2, Res3, Res4 and Res5 cells, NEM345 indicates the insertion of normalized-enhancement-module NEM at the output of Res3, Res4 and Res5 cells, NEM45 indicates the insertion of normalized-enhancement-module NEM at the output of Res4 and Res5 cells, respectively; in addition, M, D, MS in the abscissa represents three pedestrian re-identification common data sets of Market1501, DukeMTMC-reiD and MSMT17 respectively; M-D represents training a model on a Market1501, then carrying out pedestrian re-identification test on the trained model on a DukeMTMC-reiD, and so on, wherein D-M, MS-M and MS-D are the same principle; the ordinate represents the mAP accuracy. As can be seen from fig. 4, NEM2345 has the best effect and can effectively enhance the cross-domain pedestrian re-recognition performance of the model, so that normalization enhancing modules NEM, such as NEM2, NEM3, NEM4 and NEM5 shown in fig. 1, are inserted into the output ends of Res2 unit, Res3 unit, Res4 unit and Res5 unit, respectively. Thus, the network model identification method in the technical scheme comprises the following working procedures: the Res2 unit of the ResNet50 model extracts the image characteristics of the original image, and the image characteristics are normalized, restored and output through NEM2, and then the image characteristics are sent to: and extracting deeper features of pedestrians from Res3 units of the ResNet50 model, and continuing image feature normalization, image feature recovery and image feature output processing on the features of Res3 units by NEM3, and so on, wherein the Res4 units and NEM4 and Res5 units and NEM5 are the same principle until the Head unit of the ResNet50 model is finally output.

Example 3

This example discloses a cross-domain pedestrian re-identification method based on normalization and feature enhancement, which is a preferred implementation of the present invention, that is, in example 2, in order to promote better clustering characteristics of features, the method further includes introducing a NEM loss function into the normalization and enhancement module NEM at the output end of the Res5 unit, and constraining the normalization and enhancement module NEM (i.e., NEM5), where it is expected that features extracted by NEM5 have better domain invariance and discriminability, and therefore, the image feature output of the normalization and enhancement module NEM further includes the following steps:

wherein, c^x _j∈R^dRepresenting the class center of the jth pedestrian in the input feature x; c. C^f _j∈R^dRepresenting the class center of the jth pedestrian in the output characteristic f; n represents the total number of pedestrians in the data set, m represents the total number of features of the jth pedestrian, and x_jiI-th feature representing j-th pedestrian in input feature x, d representing the dimension of each feature, R^dA real number domain space representing d dimensions, i.e. c^x _jAnd c^f _jVectors in real number domain space, both d-dimensional;

According to the technical scheme, NEM loss function constraint is introduced to identify the invariant features of the network model learning domain, so that the distance in the feature class is reduced, and the feature distribution is optimized.

Example 4

This example discloses a cross-domain pedestrian re-identification method based on normalization and feature enhancement, which is a preferred embodiment of the present invention, that is, in step S31 of example 1, as shown in fig. 3, a channel attention weight W is obtained_cThe method comprises the following steps:

s312, in order to effectively calculate the channel attention, the relative phase is requiredCompressing the spatial dimension of the features, generally using average pooling for aggregation of spatial information to focus more on global information, however, maximum pooling can also obtain unique features about pedestrians to infer more detailed information on the channel, and thus, performing maximum pooling and average pooling on the features corresponding to the introduced spatial information along the spatial dimension to generate two spatial aggregation masks F₁And F₂And F is₁∈R^c×1×1，F₂∈R^c ^×1×1The masks respectively focus on global information and unique information about pedestrians in the feature map; wherein R is^c×h×1Representing a real number domain space of dimensions c × 1 × 1, F₁∈R^c×1×1And F₂∈R^c×1×1Is represented by F₁And F₂Vectors in real number domain space, all of dimensions c × 1 × 1;

s313, concat (vector splicing) operation is carried out on the two space aggregation masks, and the result obtained through the concat operation is sequentially subjected to convolution and sigmoid operation and is fused to obtain the final channel attention weight W_c。

Claims

1. The cross-domain pedestrian re-identification method based on normalization and feature enhancement is characterized by comprising the steps of establishing an identification network model, image feature normalization, image feature recovery and image feature output;

the identification network model building method comprises the steps of building a normalization enhancement module NEM with an instance normalization unit IN, a residual weight training unit CMS and an attention unit CAB, using a ResNet50 model as a main network, and inserting the normalization enhancement module NEM into a ResNet50 model to form an identification network model; the ResNet50 model comprises a Res1 unit, a Res2 unit, a Res3 unit, a Res4 unit, a Res5 unit and a Head unit which are sequentially connected in a communication mode, and normalization enhancement modules NEM are inserted into the output ends of the Res2 unit, the Res3 unit, the Res4 unit and the Res5 unit respectively;

the image feature normalization comprises the following steps:

s11, extracting pedestrian image features x ∈ R based on ResNet50 model^c×h×wCarry aboutWherein x is the input feature of the normalized enhancement module NEM, c is the number of channels of the image feature, h is the height of the image feature, w is the width of the image feature, R^c×h×wRepresenting a real number domain space of dimensions c x h x w, x ∈ R^c×h×wRepresenting a vector in a real number domain space with input features x of dimensions c x h x w;

where γ and β are both learnable parameter vectors, and γ ∈ R^c、β∈R^cIndicating that both γ and β are vectors in a real number domain space that is c-dimensional; the initial values of gamma and beta elements are respectively set to be 1 and 0, and then are automatically updated in the training process;

the image feature recovery comprises the following steps:

W_r＝sigmoid(mean(conv(x₁)))

x₂＝W_r×x₁+(1-W_r)×x

wherein x is₂For the restored image feature, named restored feature, and x₂∈R^c×h×wExpressed is a recovered feature x₂Is one in a real number domain space of dimension c x h x wA vector number;

the image feature output comprises the steps of:

W_c＝ca(x₂)

f＝(W_c+1)×x₂

wherein f is the output characteristic of the normalization enhancement module NEM;

2. The cross-domain pedestrian re-identification method based on normalization and feature enhancement as claimed in claim 1, wherein: in the step S11, x ∈ R^c×h×wThe carried characteristic information comprises a style and a shape; the style comprises an imaging style of the image and a clothing style of the pedestrian, and the shape is the contour shape of the pedestrian in the image.

3. The cross-domain pedestrian re-identification method based on normalization and feature enhancement as claimed in claim 1, wherein in the step S31, the channel attention weight W is obtained_cThe method comprises the following steps:

s313, polymerizing the two spacesPerforming concat operation on the mask, sequentially performing convolution and sigmoid operation on the result obtained by the concat operation, and fusing to obtain the final channel attention weight W_c。

4. The cross-domain pedestrian re-identification method based on normalization and feature enhancement as claimed in claim 3, wherein the spatial information includes global information and saliency information on a corresponding space of a1 x h x w two-dimensional matrix.