CN115050048A

CN115050048A - Cross-modal pedestrian re-identification method based on local detail features

Info

Publication number: CN115050048A
Application number: CN202210604338.2A
Authority: CN
Inventors: 产思贤; 朱锦校; 吴周检; 林沛
Original assignee: Hangzhou Pixel Technology Co ltd
Current assignee: Hangzhou Pixel Technology Co ltd
Priority date: 2022-05-25
Filing date: 2022-05-25
Publication date: 2022-09-13
Anticipated expiration: 2042-05-25
Also published as: CN115050048B

Abstract

The invention discloses a cross-modal pedestrian re-identification method based on local detail features, which can guide a network to fully mine the detail information of pedestrians. The proposed APMG can generate the weight of each heatmap according to the pedestrian posture, and the heatmap is fused to obtain the mask to extract the pedestrian detail features. Because the APMG lacks the characteristics of the lower half part of the body, the MC module provided fuses the APMG and the PCB to jointly extract the local characteristic representation of the pedestrian. Further, the proposed WIPA module can interact context information between local features, and suppress the context information in slice features using position information contained in the mask. The two local feature extraction modes complement each other to make up for the deficiency. According to the pedestrian cross-mode re-identification method and device, global and local detail features are combined to be used as the representation of the pedestrian, and good effects are achieved in the cross-mode re-identification task of the pedestrian.

Description

Cross-modal pedestrian re-identification method based on local detail features

Technical Field

The invention relates to the technical field of image processing, in particular to a cross-modal pedestrian re-identification method based on local detail features.

Background

Given a query image and an image dataset, gallery, from different modalities, respectively, the purpose of cross-modality pedestrian re-identification is to match images in gallery that are identical to the identity of the query. Due to its importance in the field of public safety, cross-modal pedestrian re-identification has become a popular problem in the field of re-identification. Due to variations in spectrum, pedestrian pose, and photographic viewing angle, it is a challenging task to fully mine distinctive pedestrian identity.

In order to fully capture pedestrian information, a pedestrian representation method combining local features becomes a common setting in the field of cross-modal pedestrian re-identification. There are three main types of local feature extraction schemes including slicing, pose estimation, and mask filtering. The common slicing method PCB is to uniformly slice the finally output characteristic diagram of the backbone network into a plurality of strips along the vertical direction. From top to bottom, each bar characterizes a different part of the human body. These local features are then constrained using a loss function to focus the network on locally discriminative information. Although slicing can guarantee coverage of all parts of a pedestrian, slicing inevitably introduces extraneous background information and cannot guarantee alignment of local features.

In order to solve the above problem, there is a method such as PGII that positions a pedestrian portion using attitude estimation. A heatmap is generated using the pre-trained pose estimator and local features are extracted using the heatmap as a mask. Attitude estimation can help a network to position joint positions, the problem of feature misalignment is solved, and background information is filtered to a certain extent. However, the estimation effect of pose estimation on pedestrian re-recognition data sets may be inaccurate, which may introduce some background features. In addition, these methods do not distinguish heatmap when extracting local features in the early stage, and lack robustness to pedestrian variations, and may still introduce background information. In order to enhance the robustness of the local features to different pedestrians, the method MPANet extracts the local features by utilizing a deep network model learning mask. But the generated mask hardly focuses on a certain part of the body stably and is lack of the label of the mask, so that the extracted local features are not aligned.

In conclusion, the invention designs a cross-modal pedestrian re-identification method based on local detail features.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention aims to provide a cross-mode pedestrian re-identification method based on local detail characteristics, and three modules are introduced into the prior art, wherein the three modules comprise adaptive body part mask generation module (APMG), Mask Compensation (MC) scheme and weighted intra-part association module (WIPA). The problems of misalignment of background information and characteristics and the like existing in the current local characteristic method are solved.

In order to achieve the purpose, the invention is realized by the following technical scheme: a cross-modal pedestrian re-identification method based on local detail features comprises the following steps:

step 1, reading a SYSU-MM01 data set which comprises images of pedestrians in two modes (normal light and infrared rays), and performing data enhancement on the data set. The training set was divided into uniform batches, each containing 8 images from 8 identities, four for each modality. Inputting a pair of cross-modal images with the same identity into a backbone network Resnet-50, and extracting a global feature map

Step 2, sending the image pair into a posture estimation network GCM to obtain 16 image pairs

Based on the estimated mass of each heatmap on SYSU-MM01, 9 high-quality heatmaps were screened to generate human body part masks.

Step 3, selecting 9 heatmaps and a global feature map F _g Sending into APMG (adaptive body part mask generation module) module. APMG learning F _g The contribution of the upper body parts adaptively generates the weight of each heatmap. The selected heatmaps were divided into two groups, top part and mid part. Downsampling heatmap to F using maximum pooling _g The size of the space of (a). Then, the heatmap is accumulated according to the weight to obtain the weight of each part

Mask is used to divide F _g Extraction of local features of top part and midpart

Step 4, in order to compensate the lower body information lacking in the APMG, an mc (mask compensation) module is proposed. The MC divides the global feature into three pieces in the vertical direction by extracting the local feature from the PCB, and takes the last piece

Characterized as the following characteristics. Then, the local feature vector f is obtained by combining the local feature vector f with the local feature map extracted by the mask to carry out global pooling _local ∈R ^3×C 。

Step 5, f _local And (3) mining the context relationship between intra-part by a weighted intra-part attribute module, and simultaneously inhibiting background information in the low part characteristic. And finally, WIPA, the contribution degree of each part is measured to generate a weight weighting characteristic. Pooled global feature vector f _g And f _local Connected along the pathway as a representation of a pedestrian.

And 6, in order to train the network to accurately capture the pedestrian mode invariant identity feature, the network is trained by using three Loss functions, namely ID Loss, Center Cluster Loss and modification Learning Loss.

And 7, respectively extracting the characteristics of the person in the query set and the galery set, and calculating the similarity between the image in the query set and each image in the galery set. And adopting Euclidean distance between the feature vectors as similarity measurement. And finally, sorting the images in the galery set according to the similarity to obtain a re-identification result.

Preferably, the step 2 specifically comprises: and sending a pair of pedestrian images into an attitude estimation network GCM to obtain heatmaps of 16 different joints of the human body. Pedestrian images are randomly extracted and observed for quality corresponding to heatmap. Finally, 9 of the 9 joints (chest, upper neck, crown, left and right shoulders, left hip, left and right elbows, left wrist) were selected from the 16 heatmaps

Preferably, the APMG in step 3 can adaptively generate a mask to extract refined local features. The input of APMG is selected

And a global feature map F output by the backbone network _g And (4) forming. Specifically, the screened heatmaps were divided into two groups of P representing pedestrians, respectively _top (chest, upper neck, crown of head, left and right shoulders), with P _mid (left hip, left elbow, left wrist). Then, f is mixed _g The incoming weight generation network Gw (-) generates the weight W for each heatmap _{h_map} ∈R ^1×9 . The calculation formula is as follows:

W _{h_map} ＝σ(G _w (GAP(F _g )))

σ (-) denotes sigmoid function, G _w (. cndot.) consists of a convolution with a convolution kernel size of 1. G _w The purpose of (-) is to learn the degree of contribution of heatmap to human body parts based on global features and generate corresponding weights. With generated weights W _{h_map} To P _top And P _mid Weighted summation of heatmaps of the two groups obtains the mask of the corresponding part _top ，

The formula is as follows:

mask _top ＝W _{h_map} [P _top ] ^T Heatmap[P _top ]

mask _mid ＝W _{h_map} [P _mid ] ^T Heatmap[P _mid ]

[ P ] represents an element in the collection and corresponding position to P.

Partitioning global feature graph F with mask _g Obtaining the characteristics F of top part and mid part of pedestrians _{l_top} ，

The division formula of the local features is as follows:

F _{l_top} ＝mask _top ⊙F _g

F _{l_mid} ＝mask _mid ⊙F _g

preferably, the MC is F to compensate for the lack of lower body information of the mask _g Dividing into three parts along vertical direction, and taking the last part of feature diagram F _{l_low} As a representation of the lower body of the pedestrian. A local feature map F of three parts of the pedestrian _{l_low} ，F _{l_top} ，F _{l_mid} Obtaining local characteristics f after global pooling _local ∈R ^3×C 。

Preferably, the WIPA module in step 5 is defined by f _local ∈R ^3×C As an input. A self-attention calculation between local features is first performed. Sending it into three 1 × 1 convolutional layers Q (-), K (-), and V (-), to obtain query and key features with dimension ck, and c _v Value feature of (1). Features obtained from attention calculation are

The self-attention calculation formula is as follows:

h is the head number, by passing F _{l_top} ，F _{l_mid} Including body part information to help suppress slice feature F _{l_low} The background information of (1). Because different parts of the pedestrians have different contribution degrees to the weight recognition task, the network can learn the weight of each local feature to enhance the useful information. Specifically, i set two fully-connected layers and one ReLU layer to learn local feature weights, and the calculation formula is as follows:

will f is mixed _g And

the connections along the pathway result in a final pedestrian characterization.

Preferably, the ID Loss in step 6 is used to train the feature F _{l_low} Ensuring that it contains pedestrian lower body information. Three loss functions joint training f _g And

final characteristics after connection.

The invention has the following beneficial effects: the invention combines the global and local detail characteristics as the representation of the pedestrian, and obtains good effect in the cross-mode pedestrian re-identification task.

Drawings

The invention is described in detail below with reference to the drawings and the detailed description;

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further described with the specific embodiments.

Referring to fig. 1, the following technical solutions are adopted in the present embodiment: a cross-modal pedestrian re-identification method based on local detail features mainly comprises an adaptive human body part mask generation module (APMG), a mask compensation Module (MC) and an intra-part interactive attention module (WIPA) with the right, and comprises the following steps:

step 1, reading a SYSU-MM01 data set which contains pedestrian images in two modes (normal light and infrared rays), and performing data enhancement on the data set; the training set is divided into a plurality of batches, where each batch includes 8 identities, each identity including 4 infrared images and 4 RGB images. Inputting a pair of cross-modal images with the same identity into a backbone network Resnet-50, and extracting a global feature map

Step 2, sending the pair of pedestrian images into an attitude estimation network GCM to obtain heatmaps of 16 different joints of the human body; pedestrian images are randomly extracted and observed for quality corresponding to heatmap. Finally, 9 joints (chest, upper neck, top of head, left and right shoulders, left hip, left and right elbows, left wrist) were selected from 16 heatmaps

Step 3, screening the obtained product

And a global feature map F output by the backbone network _g And sending the image data into an APMG to obtain masks of different parts of the pedestrian. Specifically, the screened heatmaps are divided into two groups of P which respectively represent pedestrians _top (chest, upper neck, crown of head, left and right shoulders), with P _mid (left hip, left elbow, left wrist). Then, f is mixed _g Incoming weight generating network G _w (. to) generate a weight W for each heatmap _{h_map} ∈R ^1×9 . The calculation formula is as follows:

W _{h_map} ＝σ(G _w (GAP(F _g )))

The formula is as follows:

mask _top ＝W _{h_map} [P _top ] ^T Heatmap[P _top ]

mask _mid ＝W _{h_map} [P _mid ] ^T Heatmap[P _mid ]

[ P ] represents an element in the collection and corresponding position to P.

Partitioning Global feature map F with mask _g Obtaining the characteristics F of top part and mid part of pedestrians _{l_top} ，

The division formula of the local features is as follows:

F _{l_top} ＝mask _top ⊙F _g

F _{l_mid} ＝mask _mid ⊙F _g

and 4, selecting the posture estimation model GCM to identify the structure of the lower half of the pedestrian, wherein the posture estimation model GCM only selects the upper half of the pedestrian to generate masks. To compensate for the lack of mask lower body information, the MC will F _g Dividing into three parts along vertical direction, and taking the last part of feature diagram F _{l_low} As a representation of the lower body of the pedestrian. A local feature map F of three parts of the pedestrian _{l_low} ，F _{l_top} ，F _{l_mid} Obtaining local characteristics f after global pooling _local ∈R ^3×C 。

Step 5, in order to mine local featuresAnd suppress slice feature F _{l_low} The background information included introduces a WIPA module. WIPA with f _local ∈R ^3×C As an input. A self-attention calculation between local features is first performed. Feeding it into three 1X 1 convolutional layers Q (-), K (-), and V (-), respectively obtaining the dimension c _k Query and key features of (a), and dimension of c _v Value feature of (1). Features obtained from attention calculation are

The self-attention calculation formula is:

h is the head number, by passing F _{l_top} ，F _{l_mid} Including body part information to help suppress slice feature F _{l_low} The background information of (1). Because different parts of the pedestrian have different contribution degrees to the re-recognition task, the network can learn the weight of each local feature by self to enhance useful information. Specifically, i set two fully-connected layers and one ReLU layer to learn local feature weights, and the calculation formula is as follows:

will f is mixed _g And

And 6, co-training the network by using three Loss functions of ID Loss, Center Cluster Loss and modification Learning Loss. To ensure local features F _{l_low} The identity information of the lower body of the pedestrian is contained, and the identity information is restrained by IDLoss. Three loss functions are jointly trained for guiding network learning mode-independent features _g And

final characteristics after connection.

And 7, respectively extracting the characteristics of pedestrians in the query and the galery set in a sequential mode, and sequentially carrying out similarity calculation on the query image characteristics and the image characteristics in the galery. And sorting the images in the gallery according to the similarity to obtain a re-identification result.

The specific implementation mode can guide the network to fully excavate the detailed information of the pedestrian. The proposed APMG can generate the weight of each heatmap according to the pedestrian posture, and the heatmap is fused to obtain the mask to extract the pedestrian detail features. Because the APMG lacks the characteristics of the lower half part of the body, the MC module provided fuses the APMG and the PCB to jointly extract the local characteristic representation of the pedestrian. Further, the proposed WIPA module can interact context information between local features, and suppress the context information in slice features using position information contained in the mask. The two local feature extraction modes complement each other to make up for the deficiency. According to the pedestrian cross-mode re-identification method and device, global and local detail features are combined to be used as the representation of the pedestrian, and good effects are achieved in the cross-mode re-identification task of the pedestrian.

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A cross-modal pedestrian re-identification method based on local detail features is characterized by comprising the following steps:

reading a SYSU-MM01 data set which contains pedestrian images in two modes of normal light and infrared rays, and performing data enhancement on the data set; dividing the training set into uniform batches, wherein each batch comprises 8 images from 8 identities, and four images in each mode; inputting a pair of cross-modal images with the same identity into a backbone network Resnet-50, and extracting a global feature map

Step (2), sending the image pair into a posture estimation network GCM to obtain 16 image pairs

Screening 9 high-quality heatmaps to generate a human body part mask according to the estimated quality of each heatmap on SYSU-MM 01;

step (3), selecting 9 heatmaps and a global feature map F _g Sending into an APMG (adaptive body part masking module) module; APMG learning F _g Generating the weight of each heatmap in a self-adaptive manner according to the contribution degree of the upper body part; dividing the selected heatmap into two groups of top part and mid part; downsampling heatmap to F using maximum pooling _g The size of the space of (a); then, the heatmap is added according to the weight to obtain the value of each part

Step (4), in order to compensate the lower body information lacking in APMG, a MC (mask compensation) module is provided; MC in PCBThe global feature is divided into three pieces along the vertical direction by the way of extracting the local feature, and the last piece is taken

As a characterization of the following itself; then, the local feature vector f is obtained by performing global pooling on the local feature vector f and the local feature map extracted by the mask in a combined manner _local ∈R ^3×C ；

Step (5), f _local A weighted intra-part attribute module (WIPA) is sent to mine the context relation between intra-part, and background information in low part characteristics is inhibited; finally, WIPA (wireless fidelity) measures the contribution degree of each part to generate a weight weighting characteristic; pooled global feature vector f _g And f _local Connecting along the pathway as a representation of a pedestrian;

step (6), in order to train the network to accurately capture the pedestrian mode invariant identity feature, the network is trained by three Loss functions, namely ID Loss, Center Cluster Loss and modification Learning Loss;

step (7), respectively extracting the characteristics of the query and the galery set, and calculating the similarity between the images in the query set and each image in the galery set; the Euclidean distance between the characteristic vectors is used as similarity measurement; and finally, sorting the images in the galery set according to the similarity to obtain a re-recognition result.

2. The cross-modal pedestrian re-identification method based on the local detail features as claimed in claim 1, wherein the step (2) is specifically as follows: sending a pair of pedestrian images into an attitude estimation network GCM to obtain heatmaps of 16 different joints of the human body; randomly extracting pedestrian images and observing the quality of the corresponding heatmap; finally, 9 of the 9 joints (chest, upper neck, crown, left and right shoulders, left hip, left and right elbows, left wrist) were selected from the 16 heatmaps

3. According to claimThe cross-modal pedestrian re-identification method based on the local detail features of claim 1, wherein the APMG in the step (3) can adaptively generate a mask to extract refined local features; the input of APMG is selected

And a global feature map F output by the backbone network _g Composition is carried out; specifically, the screened heatmaps were divided into two groups of P representing pedestrians, respectively _top (chest, upper neck, crown of head, left and right shoulders), with P _mid (left hip, left elbow, left wrist); then, f is mixed _g Incoming weight generating network G _w (. to) generate a weight for each heatmap

The calculation formula is as follows:

W _{h_map} ＝σ(G _w (GAP(F _g )))

σ (-) denotes sigmoid function, G _w (. consists of a convolution with a convolution kernel size of 1; g _w The purpose of the (-) is to learn the contribution degree of heatmap corresponding to human body parts according to the global features and generate corresponding weights; with generated weights W _{h_map} To P _top And P _mid Weighted summation of heatmaps of the two groups to obtain corresponding parts

The formula is as follows:

mask _top ＝W _{h_map} [P _top ] ^T Heatmap[P _top ]

mask _mid ＝W _{h_ma} p[P _mid ] ^T Heatmap[P _mid ]

[ P ] represents an element in the collection and corresponding position to P;

partitioning global feature graph F with mask _g Obtaining the characteristics of top part and mid part of the pedestrian

The division formula of the local features is as follows:

F _{l_top} ＝mask _top ⊙F _g

F _{l_mid} ＝mask _mid ⊙F _g 。

4. the method as claimed in claim 1, wherein the MC is used for compensating for the lack of lower body information of the mask and converting the MC into F _g Dividing into three parts along vertical direction, and taking the last part of feature diagram F _{l_low} As a representation of the lower body of the pedestrian; a local feature map F of three parts of the pedestrian _{l_low} ，F _{l_top} ，F _{l_mid} Obtaining a local characteristic f after global pooling _local ∈R ^3×C 。

5. The method according to claim 1, wherein the WIPA module in step (5) uses f to re-identify the pedestrian in cross-modal mode based on the local detail feature _local ∈R ^3×C As an input; firstly, self-attention calculation among local features is carried out; feeding it into three 1X 1 convolutional layers Q (-), K (-), and V (-), respectively obtaining the dimension c _k Query and key features of, and dimension c _v Value feature of (1); features obtained from attention calculation are

The self-attention calculation formula is:

h is the head number, by passing F _{l_top} ，F _l _ _mid Including body part information to help suppressSection characteristics F _{l_low} The background information of (1); because different parts of the pedestrian have different contribution degrees to the re-recognition task, the network can learn the weight of each local feature by self to enhance useful information; specifically, i set two fully-connected layers and one ReLU layer to learn local feature weights, and the calculation formula is as follows:

will f is _g And

6. The method according to claim 1, wherein the ID Loss in the step (6) is used to train the feature F _{l_low} Ensuring that it contains pedestrian lower body information; three loss functions joint training f _g And

final characteristics after connection.