CN112818931A - Multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion
Classifications
- G06V40/103 — Static body considered as a whole, e.g. static pedestrian or occupant recognition
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
Abstract
The invention discloses a multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion, which comprises the steps of selecting a pedestrian re-identification data set and preprocessing the training set in the data set; selecting a residual network as the basic skeleton, the residual network comprising a global coarse-grained fusion learning branch, a local coarse-grained fusion learning branch and a local attention fine-grained fusion learning branch; adopting Softmax loss and triplet loss as supervision to train the pedestrian re-identification network model; and fusing the network features of the different branches as the final descriptor of the pedestrian, taking the image of the pedestrian to be queried as the input of the pedestrian re-identification network model to obtain the pedestrian re-identification result. The invention effectively relieves the pressure that a complex background or posture changes place on the re-identification task and improves the recognition precision.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a pedestrian re-identification method.
Background
Pedestrian re-identification is a technology that uses computer vision algorithms to search for a given pedestrian across cameras according to clothing, posture, hair style and other information. Once a specific pedestrian has been identified in one monitoring device, the method can retrieve that pedestrian from other non-overlapping camera devices and carry out tracking identification. In recent years, combined with video pedestrian tracking and detection technologies, it has been widely applied to safety monitoring in public places.
Traditional research methods are mostly based on hand-designed features, but with the rapid development of deep learning they have gradually been replaced. The convolutional neural network, one of the typical feature-extraction methods in deep learning, can automatically learn features from data samples and improves the performance of pedestrian re-identification systems, so it is widely applied in this field. However, in real scenes, deployed pedestrian re-identification systems still suffer from many factors: for example, images acquired from monitoring equipment are blurred, pedestrian postures change, camera angles differ, and occlusion interferes, resulting in a low identification rate.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, the invention provides a multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
the multi-scale pedestrian re-identification method based on the multi-granularity depth feature fusion comprises the following steps:
(1) selecting a pedestrian re-identification data set, and preprocessing a training set in the data set;
(2) selecting a residual error network as a basic framework, wherein the residual error network comprises a global coarse-grained fusion learning branch, a local coarse-grained fusion learning branch and a local attention fine-grained fusion learning branch;
(3) learning multi-level coarse-grained characteristic information of the pedestrian by adopting a global coarse-grained fusion learning branch;
(4) extracting pedestrian local features from the local area by adopting a local coarse-grained fusion learning branch;
(5) adopting a local attention fine-grained fusion learning branch, introducing an attention mechanism to eliminate background interference, and extracting pedestrian fine-grained local features;
(6) adopting Softmax loss and triplet loss as supervision to train the pedestrian re-identification network model;
(7) fusing the network features of the different branches as the final descriptor of the pedestrian, taking the image of the pedestrian to be queried as the input of the pedestrian re-identification network model, retrieving from a candidate gallery, calculating the feature distances between the queried pedestrian image and all images in the candidate gallery, and sorting by feature distance, wherein the candidate gallery image with the closest feature distance and the queried pedestrian image are data of the same pedestrian.
Further, in step (1), the training set is preprocessed as follows:
(1a) fixing the size of the pedestrian images in the training set;
(1b) carrying out horizontal flipping, rotation, random cropping and normalization on the samples of each identity in the training set, thereby increasing the number of training samples.
Further, in the step (2), the steps of selecting the residual network as the basic skeleton are as follows:
(2a) the parameters of the 4th residual stage of the ResNet50 backbone network are refined;
(2b) for the global coarse-grained fusion learning branch, following ResNet50 stages 1, 2 and 3, stages 4 and 5 are kept, and global average pooling is applied to the resulting feature map to obtain the 2048-dimensional global feature vector f_g_2048;
(2c) for the local coarse-grained fusion learning branch, the feature map of ResNet50 stage 3 is divided horizontally into two equal parts, and the down-sampling stride in ResNet50 stage 5 is set to 1; a global average pooling layer then yields the local features f_p_2048_1 and f_p_2048_2 of each partition;
(2d) for the local attention fine-grained fusion learning branch, a convolutional attention module is introduced on the basis of the local coarse-grained fusion learning branch to obtain the local features f_pab_2048_1 and f_pab_2048_2 of each partition.
Further, in step (3), the method for learning the multi-level coarse-grained feature information of the pedestrian with the global coarse-grained fusion learning branch is as follows: first, f_g_2048 is used as the final pedestrian feature descriptor and trained with the triplet loss function; next, f_g_2048 is processed with Conv1×1 to obtain the 512-dimensional feature vector f_g_512, which is trained with the Softmax loss function.
Further, in step (4), the method for extracting pedestrian local features from the local regions with the local coarse-grained fusion learning branch is as follows: first, f_p_2048_1 is used as the final pedestrian feature descriptor and trained directly with the triplet hard negative sample loss function; next, a batch normalization layer, the nonlinear activation function ReLU, a Dropout layer and a second batch normalization layer are applied in order to f_p_2048_2; finally, Conv1×1 yields the 512-dimensional f_p_512_1 and f_p_512_2, which are trained with the Softmax loss function.
Further, in the step (5), the method for extracting pedestrian fine-grained local features with the local attention fine-grained fusion learning branch is as follows: the attention mechanism comprises a channel attention mechanism and a spatial attention mechanism; first, a pedestrian picture is input and passed through the ResNet50 basic skeleton to obtain the input feature map F; second, F enters the channel attention module, where average pooling and maximum pooling produce two different one-dimensional feature vectors that are passed to a shared multilayer perceptron, yielding the feature matrix F′; then, the feature F″ is obtained through spatial attention;
for the local features f_pab_2048_1 and f_pab_2048_2 of each partition obtained through the attention mechanism: first, f_pab_2048_1 is used as the final pedestrian feature descriptor and trained directly with the triplet hard negative sample loss function; next, a batch normalization layer, the nonlinear activation function ReLU, a Dropout layer and a second batch normalization layer are applied in order to f_pab_2048_2; finally, Conv1×1 yields the 512-dimensional f_pab_512_1 and f_pab_512_2, which are trained with the Softmax loss function.
Further, the feature matrix F′ is obtained as follows:

F′ = M_C(F) × F

M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))

In the above formulas, σ is the activation function, MLP denotes the multilayer perceptron, W_0 and W_1 are the weights of the MLP, AvgPool and MaxPool denote average pooling and maximum pooling respectively, and F_avg^c and F_max^c are the two different descriptors obtained after average pooling and maximum pooling respectively.
Further, the spatial attention mechanism first generates two 2-dimensional feature maps F_avg^s and F_max^s using average pooling and maximum pooling, and concatenates them to generate an effective feature descriptor, which is then passed through the convolutional layer f^{7×7} to generate a spatial attention map that is compressed to obtain F″:

F″ = M_S(F′) × F′

M_S(F′) = σ(f^{7×7}([AvgPool(F′); MaxPool(F′)]))

In the above formulas, σ is the activation function, and AvgPool and MaxPool denote average pooling and maximum pooling respectively.
Further, in step (6), a joint loss function of the Softmax loss and the triplet loss is constructed:

Loss_total = (1 − w)·Loss_softmax + w·Loss_triplet

Loss_softmax = −(1/N) Σ_{i=1}^{N} log( exp(W_{y_i}^T f_i + b_{y_i}) / Σ_{j=1}^{C} exp(W_j^T f_i + b_j) )

Loss_triplet = Σ [ ‖f_a − f_p‖_2 − ‖f_a − f_n‖_2 + α ]_+

where Loss_total is the joint loss function, Loss_softmax is the Softmax loss function, Loss_triplet is the triplet loss function, w is the balance coefficient with w ∈ (0,1), N is the training batch size, C is the number of training classes, f_i is a given input feature vector and y_i its corresponding label, W_i and b_i are the weight vector and bias of class i, W_{y_i} and b_{y_i} are those of class y_i, the superscript T denotes transposition, α is the margin between positive and negative sample pairs, f_a, f_p and f_n are the features extracted by the network from the anchor picture, the positive sample picture and the negative sample picture respectively, and [X]_+ = max(X, 0); for each anchor sample in a batch, the most dissimilar positive sample and the most similar negative sample in the batch are selected to form a triplet.
The above technical scheme brings the following beneficial effects:
The invention designs a multi-branch network structure comprising a global coarse-grained fusion learning branch, a local coarse-grained fusion learning branch and a local attention fine-grained fusion learning branch, which relieves the pressure that a complex background or pedestrian posture changes place on the re-identification task and learns both global information and effective local discriminative features of pedestrians. The invention adopts a network trained jointly with the Softmax loss and the triplet loss function, so that the feature distances between samples of the same pedestrian become closer and those between samples of different pedestrians become farther, improving pedestrian re-identification performance.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram of a Market-1501 data set in the embodiment;
FIG. 3 is a schematic diagram of an embodiment of a network framework;
FIG. 4 is a CBAM framework diagram of the embodiment;
FIG. 5 is a pedestrian retrieval result on the Market-1501 data set of the embodiment;
FIG. 6 is a schematic diagram of the performance indices for the w parameter on the Market-1501 data set of the embodiment.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
The invention designs a multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion; as shown in figure 1, the steps are as follows:
Step 1: selecting a pedestrian re-identification data set and preprocessing the training set in the data set;
Step 2: selecting a residual network as the basic skeleton, the residual network comprising a global coarse-grained fusion learning branch, a local coarse-grained fusion learning branch and a local attention fine-grained fusion learning branch;
Step 3: learning multi-level coarse-grained feature information of the pedestrian with the global coarse-grained fusion learning branch;
Step 4: extracting pedestrian local features from the local regions with the local coarse-grained fusion learning branch;
Step 5: adopting the local attention fine-grained fusion learning branch, introducing an attention mechanism to eliminate background interference, and extracting pedestrian fine-grained local features;
Step 6: adopting Softmax loss and triplet loss as supervision to train the pedestrian re-identification network model;
Step 7: fusing the network features of the different branches as the final descriptor of the pedestrian, taking the image of the pedestrian to be queried as the input of the pedestrian re-identification network model, retrieving from a candidate gallery, calculating the feature distances between the queried pedestrian image and all images in the candidate gallery, and sorting by feature distance, wherein the candidate gallery image with the closest feature distance and the queried pedestrian image are data of the same pedestrian.
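The retrieval in Step 7 can be sketched as follows with random stand-in descriptors (the 2048-dimensional size matches f_g_2048). Euclidean distance is used here for simplicity, while the embodiment ranks by cosine similarity; the near-duplicate planted at gallery index 2 is purely illustrative:

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Sort gallery images by Euclidean feature distance to the query (closest first)."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)  # distance to each gallery image
    order = np.argsort(dists)                                   # ascending: rank-1 is the best match
    return order, dists

# Toy example: 4 gallery descriptors, one of them close to the query.
rng = np.random.default_rng(0)
query = rng.normal(size=2048)
gallery = rng.normal(size=(4, 2048))
gallery[2] = query + 0.01 * rng.normal(size=2048)  # plant a near-duplicate at index 2

order, dists = rank_gallery(query, gallery)
print(order[0])  # 2 (the planted near-duplicate is ranked first)
```

The same ranking loop applies unchanged to real descriptors produced by the fused branches.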
Referring to fig. 2, this embodiment selects the Market-1501 data set, a public data set in the pedestrian re-identification field consisting of images of 1501 pedestrians captured by five high-resolution cameras and one low-resolution camera on the Tsinghua University campus. The figure shows samples of part of the Market-1501 data; the image quality is uneven, pedestrian postures vary greatly, and the backgrounds are noisy. The training set is preprocessed as follows:
s1.1: setting the image size of any sample data in the training set to be 256 multiplied by 128 in a uniform size;
s1.2: horizontally turning and rotating any sample data in the training set, then cutting the random length-width ratio to be 0.75, and then scaling the image obtained after cutting to be 256 multiplied by 128; the pixel values are then normalized.
In this embodiment, the Market-1501 data set is selected as the research content. As shown in fig. 3, the network framework of this embodiment is developed based on ResNet50; the specific steps are as follows:
S2.1: the parameters of the 4th residual stage of the ResNet50 backbone network are refined;
S2.2: for the global coarse-grained fusion learning branch, following ResNet50 stages 1, 2 and 3, stages 4 and 5 are kept, and global average pooling is applied to the resulting feature map to obtain the 2048-dimensional global feature vector f_g_2048;
S2.3: for the local coarse-grained fusion learning branch, the feature map of ResNet50 stage 3 is divided horizontally into two equal parts, and the down-sampling stride in ResNet50 stage 5 is set to 1; a global average pooling layer then yields the local features f_p_2048_1 and f_p_2048_2 of each partition;
S2.4: for the local attention fine-grained fusion learning branch, a convolutional attention module is introduced on the basis of the local coarse-grained fusion learning branch to obtain the local features f_pab_2048_1 and f_pab_2048_2 of each partition.
In this embodiment, the method for learning the multi-level coarse-grained feature information of the pedestrian with the global coarse-grained fusion learning branch is as follows: first, f_g_2048 is used as the final pedestrian feature descriptor and trained with the triplet loss function; next, f_g_2048 is processed with Conv1×1 to obtain the 512-dimensional feature vector f_g_512, which is trained with the Softmax loss function.
In this embodiment, the method for extracting pedestrian local features from the local regions with the local coarse-grained fusion learning branch is as follows: first, f_p_2048_1 is used as the final pedestrian feature descriptor and trained directly with the triplet hard negative sample loss function; next, a batch normalization layer, the nonlinear activation function ReLU, a Dropout layer and a second batch normalization layer are applied in order to f_p_2048_2; finally, Conv1×1 yields the 512-dimensional f_p_512_1 and f_p_512_2, which are trained with the Softmax loss function.
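The reduction head described above (batch normalization, ReLU, Dropout, a second batch normalization, then Conv1×1 from 2048 to 512 dimensions) can be sketched as follows; the Dropout rate is an assumption, since the patent does not state it:

```python
import torch
import torch.nn as nn

class ReductionHead(nn.Module):
    """BN -> ReLU -> Dropout -> BN, then Conv1x1 mapping 2048 -> 512."""
    def __init__(self, in_dim=2048, out_dim=512, p_drop=0.5):
        super().__init__()
        self.process = nn.Sequential(
            nn.BatchNorm1d(in_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),                # rate 0.5 is an assumption
            nn.BatchNorm1d(in_dim),
        )
        self.conv1x1 = nn.Conv2d(in_dim, out_dim, kernel_size=1)  # Conv1x1: 2048 -> 512

    def forward(self, f):                      # f: (N, 2048) pooled local feature
        f = self.process(f)
        f = self.conv1x1(f[:, :, None, None])  # treat the vector as a 1x1 feature map
        return f.flatten(1)                    # (N, 512), fed to the Softmax loss

head = ReductionHead()
head.eval()                                    # use running BatchNorm statistics here
f_512 = head(torch.randn(2, 2048))
print(f_512.shape)                             # torch.Size([2, 512])
```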
In this embodiment, the method for extracting pedestrian fine-grained local features with the local attention fine-grained fusion learning branch is as follows: as shown in fig. 4, the convolutional block attention module (CBAM) comprises a channel attention mechanism and a spatial attention mechanism. First, a pedestrian picture is input and passed through the ResNet50 basic skeleton to obtain the input feature map F; second, F enters the channel attention module, where average pooling and maximum pooling produce two different one-dimensional feature vectors that are passed to a shared multilayer perceptron, yielding the feature matrix F′; then, the feature F″ is obtained through a further spatial attention step.
The feature matrix F′ is obtained as follows:

F′ = M_C(F) × F

M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))

In the above formulas, σ is the activation function, MLP denotes the multilayer perceptron, W_0 and W_1 are the weights of the MLP, AvgPool and MaxPool denote average pooling and maximum pooling respectively, and F_avg^c and F_max^c are the two different descriptors obtained after average pooling and maximum pooling respectively.
The spatial attention mechanism first generates two 2-dimensional feature maps F_avg^s and F_max^s using average pooling and maximum pooling, and concatenates them to generate an effective feature descriptor, which is then passed through the convolutional layer f^{7×7} to generate a spatial attention map that is compressed to obtain F″:

F″ = M_S(F′) × F′

M_S(F′) = σ(f^{7×7}([AvgPool(F′); MaxPool(F′)]))
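A minimal PyTorch sketch of the channel-then-spatial attention (CBAM) described by the formulas above; the reduction ratio r = 16 inside the shared MLP is an assumption taken from the usual CBAM design, not stated in the patent:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """F' = Mc(F) x F (channel attention), F'' = Ms(F') x F' (spatial attention)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(              # shared MLP: W1(W0(.))
            nn.Conv2d(channels, channels // r, 1, bias=False),  # W0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),  # W1
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)  # f^{7x7}
        self.sigmoid = nn.Sigmoid()            # activation sigma

    def forward(self, F):
        # Channel attention: Mc(F) = sigma(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(F.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(F.amax(dim=(2, 3), keepdim=True))
        F1 = self.sigmoid(avg + mx) * F
        # Spatial attention: Ms(F') = sigma(f7x7([AvgPool(F'); MaxPool(F')]))
        s = torch.cat([F1.mean(dim=1, keepdim=True),
                       F1.amax(dim=1, keepdim=True)], dim=1)
        return self.sigmoid(self.spatial(s)) * F1

F = torch.randn(2, 64, 16, 8)                  # stand-in feature map
out = CBAM(64)(F)
print(out.shape)                               # torch.Size([2, 64, 16, 8])
```

The module preserves the input shape, so it can be dropped into the local branch without changing the partitioning logic.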
For the local features f_pab_2048_1 and f_pab_2048_2 of each partition obtained through the attention mechanism: first, f_pab_2048_1 is used as the final pedestrian feature descriptor and trained directly with the triplet hard negative sample loss function; next, a batch normalization layer, the nonlinear activation function ReLU, a Dropout layer and a second batch normalization layer are applied in order to f_pab_2048_2; finally, Conv1×1 yields the 512-dimensional f_pab_512_1 and f_pab_512_2, which are trained with the Softmax loss function.
In this example, a joint loss function of the Softmax loss and the triplet loss is constructed:

Loss_total = (1 − w)·Loss_softmax + w·Loss_triplet

Loss_softmax = −(1/N) Σ_{i=1}^{N} log( exp(W_{y_i}^T f_i + b_{y_i}) / Σ_{j=1}^{C} exp(W_j^T f_i + b_j) )

Loss_triplet = Σ [ ‖f_a − f_p‖_2 − ‖f_a − f_n‖_2 + α ]_+

where Loss_total is the joint loss function, Loss_softmax is the Softmax loss function, Loss_triplet is the triplet loss function, w is the balance coefficient with w ∈ (0,1), N is the training batch size, C is the number of training classes, f_i is a given input feature vector and y_i its corresponding label, W_i and b_i are the weight vector and bias of class i, W_{y_i} and b_{y_i} are those of class y_i, the superscript T denotes transposition, α is the margin between positive and negative sample pairs, f_a, f_p and f_n are the features extracted by the network from the anchor picture, the positive sample picture and the negative sample picture respectively, and [X]_+ = max(X, 0); for each anchor sample in a batch, the most dissimilar positive sample and the most similar negative sample in the batch are selected to form a triplet.
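The joint objective above can be sketched with standard PyTorch losses; the margin α = 0.3, the batch shapes and the random logits/features are illustrative assumptions (751 is the number of training identities in Market-1501):

```python
import torch
import torch.nn as nn

# Loss_total = (1 - w) * Loss_softmax + w * Loss_triplet.
# nn.CrossEntropyLoss implements the Softmax loss over class logits;
# nn.TripletMarginLoss implements [||f_a - f_p|| - ||f_a - f_n|| + alpha]_+.
w = 0.6                                        # balance coefficient from the embodiment
softmax_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)  # alpha = 0.3 is an assumption

logits = torch.randn(8, 751)                   # stand-in classifier outputs (C = 751)
labels = torch.randint(0, 751, (8,))
f_a, f_p, f_n = (torch.randn(8, 2048) for _ in range(3))  # anchor/positive/negative features

loss_total = (1 - w) * softmax_loss(logits, labels) + w * triplet_loss(f_a, f_p, f_n)
print(loss_total.item() >= 0)                  # True: both terms are non-negative
```

In a real training loop the hard positives/negatives would be mined within each P × K batch rather than sampled at random.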
In this embodiment, the multi-scale network with multi-granularity depth feature fusion is initialized with ResNet50 weights pre-trained on ImageNet, and the weights of the different branches are shared. Each mini-batch randomly samples P identities and K images per identity from the training set to meet the triplet requirement; in the experiments P = 16 and K = 4. An SGD optimizer is selected, with weight decay set to 5e-4 and momentum 0.9. The total number of training epochs is set to 240, the initial learning rate to 0.1, and the learning rate is divided by 10 every 40 epochs until it reaches 0.001 and remains unchanged; the balance coefficient w is set to 0.6.
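A minimal sketch of this training configuration; the stand-in model and the MultiStepLR milestones at epochs 40 and 80 are assumptions that reproduce the 0.1 → 0.01 → 0.001 schedule described above:

```python
import torch
import torch.nn as nn

# SGD with momentum 0.9 and weight decay 5e-4; lr starts at 0.1, is divided by 10
# every 40 epochs, and stays at 0.001 once reached.
model = nn.Linear(2048, 751)                   # stand-in for the re-ID network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[40, 80], gamma=0.1)

lrs = []
for epoch in range(240):                       # 240 total training epochs
    # ... one epoch of P x K sampled mini-batches (P=16 identities, K=4 images) ...
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()

print(round(lrs[0], 4), round(lrs[40], 4), round(lrs[239], 4))  # 0.1 0.01 0.001
```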
Fig. 5 shows person search results on the Market-1501 data set in this embodiment; the images in the first column are the query images, and the pictures retrieved from the gallery are ranked 1 to 10 by cosine similarity. The ranking shows that most of the retrieved images are selected correctly; a few erroneous images, marked with red numbers and detection boxes, remain, probably because insufficient image information is collected from a single-view camera.
FIG. 6 shows the effect of the w parameter in this embodiment. Rank-1 denotes the probability that the retrieved picture and the picture to be queried share the same identity, and mAP denotes the corresponding mean average precision. When w = 0, only the Softmax loss supervises the training network and the convolutional descriptor serves as the sole pedestrian feature descriptor, so pedestrian feature information at different levels is not fully utilized; moreover, the Softmax loss learns only separable features, so the learned features lack discrimination. When w > 0, the method of jointly supervised training with Softmax loss and triplet loss improves significantly; the effect is best when w = 0.6, verifying the effectiveness of the proposed method: combining Softmax loss and triplet loss supervision compensates for their respective shortcomings and learns multi-level, finer-grained features. But when w = 1, because the local fusion branches directly use f_pab_2048_1 and f_pab_2048_2 as the final descriptors, supervised training with the triplet loss alone is not as effective as the joint training.
The embodiments are only intended to illustrate the technical idea of the present invention and do not limit it; any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the scope of the present invention.
Claims (9)
1. The multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion is characterized by comprising the following steps of:
(1) selecting a pedestrian re-identification data set, and preprocessing a training set in the data set;
(2) selecting a residual error network as a basic framework, wherein the residual error network comprises a global coarse-grained fusion learning branch, a local coarse-grained fusion learning branch and a local attention fine-grained fusion learning branch;
(3) learning multi-level coarse-grained characteristic information of the pedestrian by adopting a global coarse-grained fusion learning branch;
(4) extracting pedestrian local features from the local area by adopting a local coarse-grained fusion learning branch;
(5) adopting a local attention fine-grained fusion learning branch, introducing an attention mechanism to eliminate background interference, and extracting pedestrian fine-grained local features;
(6) adopting Softmax loss and triplet loss as supervision to train the pedestrian re-identification network model;
(7) fusing the network features of the different branches as the final descriptor of the pedestrian, taking the image of the pedestrian to be queried as the input of the pedestrian re-identification network model, retrieving from a candidate gallery, calculating the feature distances between the queried pedestrian image and all images in the candidate gallery, and sorting by feature distance, wherein the candidate gallery image with the closest feature distance and the queried pedestrian image are data of the same pedestrian.
2. The multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion according to claim 1, wherein in the step (1), the training set is preprocessed as follows:
(1a) fixing the size of the pedestrian images in the training set;
(1b) carrying out horizontal flipping, rotation, random cropping and normalization on the samples of each identity in the training set, thereby increasing the number of training samples.
3. The multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion according to claim 1, wherein in the step (2), the steps of selecting the residual network as the basic skeleton are as follows:
(2a) the parameters of the 4th residual stage of the ResNet50 backbone network are refined;
(2b) for the global coarse-grained fusion learning branch, following ResNet50 stages 1, 2 and 3, stages 4 and 5 are kept, and global average pooling is applied to the resulting feature map to obtain the 2048-dimensional global feature vector f_g_2048;
(2c) for the local coarse-grained fusion learning branch, the feature map of ResNet50 stage 3 is divided horizontally into two equal parts, and the down-sampling stride in ResNet50 stage 5 is set to 1; a global average pooling layer then yields the local features f_p_2048_1 and f_p_2048_2 of each partition;
(2d) for the local attention fine-grained fusion learning branch, a convolutional attention module is introduced on the basis of the local coarse-grained fusion learning branch to obtain the local features f_pab_2048_1 and f_pab_2048_2 of each partition.
4. The multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion according to claim 3, wherein in step (3), the method for learning the multi-level coarse-grained feature information of the pedestrian with the global coarse-grained fusion learning branch is as follows: first, f_g_2048 is used as the final pedestrian feature descriptor and trained with the triplet loss function; next, f_g_2048 is processed with Conv1×1 to obtain the 512-dimensional feature vector f_g_512, which is trained with the Softmax loss function.
5. The multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion according to claim 3, wherein in step (4), the local features of the pedestrian are extracted from the local regions with the local coarse-grained fusion learning branch as follows: first, f_p_2048_1 is used as the final pedestrian feature descriptor and trained directly with the triplet hard negative sample loss function; next, a batch normalization layer, the nonlinear activation function ReLU, a Dropout layer and a second batch normalization layer are applied in order to f_p_2048_2; finally, Conv1×1 yields the 512-dimensional f_p_512_1 and f_p_512_2, which are trained with the Softmax loss function.
6. The multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion is characterized in that in step (5), the local attention fine-grained fusion learning branch extracts fine-grained pedestrian local features through a channel attention mechanism and a spatial attention mechanism; first, a pedestrian picture is input and passed through the ResNet50 backbone to obtain an input feature map F; next, F enters the channel attention module, where average pooling and maximum pooling produce two different one-dimensional feature vectors, which are fed into a shared multilayer perceptron to obtain the feature matrix F′; then, the feature F″ is obtained through spatial attention;
the local features f_pab_2048_1 and f_pab_2048_2 of the two partitions are obtained through the attention mechanism; first, f_pab_2048_1 is taken as the final pedestrian feature descriptor and trained directly with the triplet hard negative sample loss function; next, f_pab_2048_2 is processed in order by a batch normalization layer, the nonlinear activation function ReLU, a Dropout layer, and a second batch normalization layer; finally, 512-dimensional f_pab_512_1 and f_pab_512_2 are obtained through a Conv1×1 and trained with the Softmax loss function.
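The BN → ReLU → Dropout → BN → Conv1×1 reduction head described in claims 5 and 6 can be sketched as follows. This is an illustrative NumPy approximation under stated assumptions: inference-style batch normalization without learned affine parameters, and a Conv1×1 on pooled (spatially 1×1) features, which reduces to a linear projection. The function names and weight initialization are not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, eps=1e-5):
    """Per-dimension normalization over the batch (inference-style, no affine)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def reduction_head(x, w, drop_rate=0.5, training=False):
    """BN -> ReLU -> Dropout -> BN -> Conv1x1, mapping 2048-dim descriptors
    to 512-dim descriptors; on 1x1 spatial maps a Conv1x1 is a matrix product."""
    x = batch_norm(x)
    x = np.maximum(x, 0.0)                # ReLU
    if training:                          # inverted dropout (skipped at inference)
        mask = rng.random(x.shape) >= drop_rate
        x = x * mask / (1.0 - drop_rate)
    x = batch_norm(x)
    return x @ w                          # Conv1x1 as a linear projection

f_p_2048 = rng.random((32, 2048))            # a batch of 2048-dim partition features
w = rng.standard_normal((2048, 512)) * 0.01  # illustrative 1x1-conv weights
f_p_512 = reduction_head(f_p_2048, w)        # 512-dim descriptors for Softmax loss
```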
7. The multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion according to claim 6, characterized in that the feature matrix F′ is obtained as follows:
F′ = M_C(F) × F
M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))
In the above formulas, σ is the activation function, MLP denotes the shared multilayer perceptron, W_0 and W_1 are the weights of the MLP, AvgPool and MaxPool denote average pooling and maximum pooling respectively, and F_avg^c and F_max^c are the two different channel descriptors obtained after average pooling and maximum pooling, respectively.
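The channel attention step can be sketched in NumPy as below. This is a minimal illustration of the formula, assuming a sigmoid for σ and a ReLU between the two shared MLP layers; the channel count, reduction ratio, and weight values are toy assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """F' = M_C(F) * F with M_C(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    avg = F.mean(axis=(1, 2))                       # F_avg^c, shape (C,)
    mx = F.max(axis=(1, 2))                         # F_max^c, shape (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)    # shared two-layer MLP
    m_c = sigmoid(mlp(avg) + mlp(mx))               # channel weights in (0, 1)
    return m_c[:, None, None] * F                   # reweight every channel of F

C, r = 64, 16                                       # toy channel count, reduction ratio
rng = np.random.default_rng(1)
F = rng.random((C, 8, 4))
W0 = rng.standard_normal((C // r, C)) * 0.1         # reduction weights
W1 = rng.standard_normal((C, C // r)) * 0.1         # expansion weights
F_prime = channel_attention(F, W0, W1)
```

Because the attention weights lie in (0, 1), the output keeps the shape of F and never amplifies a channel, only suppresses less informative ones.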
8. The method according to claim 6, wherein the spatial attention mechanism first generates two 2-dimensional feature maps F_avg^s and F_max^s by average pooling and maximum pooling, concatenates them to form a valid feature descriptor, and then passes the result through a convolutional layer f^7×7 to generate a spatial attention map, which is compressed to obtain F″:
F″ = M_S(F′) × F′
M_S(F′) = σ(f^7×7([AvgPool(F′); MaxPool(F′)])) = σ(f^7×7([F_avg^s; F_max^s]))
In the above formulas, σ is the activation function, and AvgPool and MaxPool denote average pooling and maximum pooling, respectively.
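The spatial attention step can likewise be sketched in NumPy. This is an illustrative toy version, assuming a sigmoid for σ and a naive same-padded 7×7 convolution; the kernel values and feature sizes are assumptions, not the patented parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, k):
    """Naive 'same'-padded 2D convolution of a (2, H, W) input with a
    (2, kh, kw) kernel, producing a single (H, W) output map."""
    _, kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    h, w = x.shape[1], x.shape[2]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i : i + kh, j : j + kw] * k)
    return out

def spatial_attention(F, k):
    """F'' = M_S(F') * F' with M_S = sigmoid(f7x7([AvgPool(F'); MaxPool(F')]))."""
    avg = F.mean(axis=0)                  # F_avg^s, shape (H, W)
    mx = F.max(axis=0)                    # F_max^s, shape (H, W)
    stacked = np.stack([avg, mx])         # channel-wise concatenation, (2, H, W)
    m_s = sigmoid(conv2d_same(stacked, k))
    return m_s[None, :, :] * F            # reweight every spatial position of F'

rng = np.random.default_rng(2)
F_prime = rng.random((64, 8, 4))          # toy channel-attended feature map
k = rng.standard_normal((2, 7, 7)) * 0.1  # toy 7x7 convolution kernel
F_dprime = spatial_attention(F_prime, k)
```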
9. The multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion according to claim 1, characterized in that in step (6), a joint loss function combining the Softmax loss and the triplet loss is constructed:
Loss_total = (1 - w)·Loss_softmax + w·Loss_triplet
where Loss_total is the joint loss function, Loss_softmax is the Softmax loss function, and Loss_triplet is the triplet loss function:
Loss_softmax = -(1/N) Σ_{i=1}^{N} log( exp(W_{y_i}^T f_i + b_{y_i}) / Σ_{j=1}^{C} exp(W_j^T f_i + b_j) )
Loss_triplet = Σ [α + ||f_a - f_p||_2 - ||f_a - f_n||_2]_+
Here w is a balance coefficient with w ∈ (0,1); N is the number of training samples in a batch; C is the number of training sample classes; f_i is the input feature vector and y_i its corresponding label; W_i and b_i are the weight vector and bias of sample i, and W_{y_i} and b_{y_i} those of class y_i; the superscript T denotes transposition; α is the margin between positive and negative sample pairs; f_a, f_p and f_n are the features extracted by the network from the anchor picture, the positive sample picture and the negative sample picture, respectively; and [X]_+ = max(X, 0). For each anchor sample in the batch, the most dissimilar positive sample and the most similar negative sample are selected to form the triplet.
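The joint loss of claim 9 can be sketched in NumPy. This is a minimal batch-hard illustration under stated assumptions (Euclidean distance, toy feature dimensions and class count); the function names are not from the patent.

```python
import numpy as np

def softmax_loss(feats, labels, W, b):
    """Cross-entropy over class scores W^T f_i + b (the claim's Softmax loss)."""
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def batch_hard_triplet_loss(feats, labels, alpha=0.3):
    """For each anchor, pick the most dissimilar positive and the most similar
    negative, then apply [alpha + d(a,p) - d(a,n)]_+ averaged over the batch."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    loss = 0.0
    for i in range(len(labels)):
        d_p = d[i][same[i]].max()                   # hardest (farthest) positive
        d_n = d[i][~same[i]].min()                  # hardest (nearest) negative
        loss += max(alpha + d_p - d_n, 0.0)
    return loss / len(labels)

def joint_loss(feats, labels, W, b, w=0.5, alpha=0.3):
    """Loss_total = (1 - w) * Loss_softmax + w * Loss_triplet."""
    return (1 - w) * softmax_loss(feats, labels, W, b) + w * batch_hard_triplet_loss(feats, labels, alpha)

rng = np.random.default_rng(3)
feats = rng.random((8, 16))                         # toy batch of feature vectors
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])         # two samples per identity
W, b = rng.standard_normal((16, 4)) * 0.1, np.zeros(4)
total = joint_loss(feats, labels, W, b, w=0.4)
```

Note that batch-hard mining requires every identity in the batch to contribute at least two samples, which is why re-identification training batches are sampled per identity.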
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110218857.0A CN112818931A (en) | 2021-02-26 | 2021-02-26 | Multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112818931A true CN112818931A (en) | 2021-05-18 |
Family
ID=75864137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110218857.0A Pending CN112818931A (en) | 2021-02-26 | 2021-02-26 | Multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112818931A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113177518A (en) * | 2021-05-24 | 2021-07-27 | 西安建筑科技大学 | Vehicle weight identification method recommended by weak supervision area |
CN113239784A (en) * | 2021-05-11 | 2021-08-10 | 广西科学院 | Pedestrian re-identification system and method based on space sequence feature learning |
CN113255615A (en) * | 2021-07-06 | 2021-08-13 | 南京视察者智能科技有限公司 | Pedestrian retrieval method and device for self-supervision learning |
CN113255597A (en) * | 2021-06-29 | 2021-08-13 | 南京视察者智能科技有限公司 | Transformer-based behavior analysis method and device and terminal equipment thereof |
CN113283507A (en) * | 2021-05-27 | 2021-08-20 | 大连海事大学 | Multi-view-based feature fusion vehicle re-identification method |
CN113361464A (en) * | 2021-06-30 | 2021-09-07 | 重庆交通大学 | Vehicle weight recognition method based on multi-granularity feature segmentation |
CN113537032A (en) * | 2021-07-12 | 2021-10-22 | 南京邮电大学 | Diversity multi-branch pedestrian re-identification method based on picture block discarding |
CN113743497A (en) * | 2021-09-02 | 2021-12-03 | 南京理工大学 | Fine granularity identification method and system based on attention mechanism and multi-scale features |
CN113792686A (en) * | 2021-09-17 | 2021-12-14 | 中南大学 | Vehicle weight identification method based on cross-sensor invariance of visual representation |
CN114140700A (en) * | 2021-12-01 | 2022-03-04 | 西安电子科技大学 | Step-by-step heterogeneous image template matching method based on cascade network |
CN114187606A (en) * | 2021-10-21 | 2022-03-15 | 江阴市智行工控科技有限公司 | Garage pedestrian detection method and system adopting branch fusion network for light weight |
CN115050044A (en) * | 2022-04-02 | 2022-09-13 | 广西科学院 | Cross-modal pedestrian re-identification method based on MLP-Mixer |
CN115050048A (en) * | 2022-05-25 | 2022-09-13 | 杭州像素元科技有限公司 | Cross-modal pedestrian re-identification method based on local detail features |
CN115240121A (en) * | 2022-09-22 | 2022-10-25 | 之江实验室 | Joint modeling method and device for enhancing local features of pedestrians |
CN115294601A (en) * | 2022-07-22 | 2022-11-04 | 苏州大学 | Pedestrian re-identification method based on multi-scale feature dynamic fusion |
CN115841683A (en) * | 2022-12-27 | 2023-03-24 | 石家庄铁道大学 | Light-weight pedestrian re-identification method combining multi-level features |
CN115909455A (en) * | 2022-11-16 | 2023-04-04 | 航天恒星科技有限公司 | Expression recognition method integrating multi-scale feature extraction and attention mechanism |
CN116052218A (en) * | 2023-02-13 | 2023-05-02 | 中国矿业大学 | Pedestrian re-identification method |
CN117612266A (en) * | 2024-01-24 | 2024-02-27 | 南京信息工程大学 | Cross-resolution pedestrian re-identification method based on multi-scale image and feature layer alignment |
CN117994822A (en) * | 2024-04-07 | 2024-05-07 | 南京信息工程大学 | Cross-mode pedestrian re-identification method based on auxiliary mode enhancement and multi-scale feature fusion |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019025601A1 (en) * | 2017-08-03 | 2019-02-07 | Koninklijke Philips N.V. | Hierarchical neural networks with granularized attention |
CN111178432A (en) * | 2019-12-30 | 2020-05-19 | 武汉科技大学 | Weak supervision fine-grained image classification method of multi-branch neural network model |
CN111460914A (en) * | 2020-03-13 | 2020-07-28 | 华南理工大学 | Pedestrian re-identification method based on global and local fine-grained features |
CN111461038A (en) * | 2020-04-07 | 2020-07-28 | 中北大学 | Pedestrian re-identification method based on layered multi-mode attention mechanism |
US20200242153A1 (en) * | 2019-01-29 | 2020-07-30 | Samsung Electronics Co., Ltd. | Method, apparatus, electronic device and computer readable storage medium for image searching |
CN111539370A (en) * | 2020-04-30 | 2020-08-14 | 华中科技大学 | Image pedestrian re-identification method and system based on multi-attention joint learning |
CN111666851A (en) * | 2020-05-28 | 2020-09-15 | 大连理工大学 | Cross domain self-adaptive pedestrian re-identification method based on multi-granularity label |
CN111709331A (en) * | 2020-06-03 | 2020-09-25 | 江南大学 | Pedestrian re-identification method based on multi-granularity information interaction model |
Non-Patent Citations (1)
Title |
---|
LU, JIAN et al.: "A survey of deep learning based person re-identification research", Laser & Optoelectronics Progress, vol. 57, no. 16, pages 1 - 18 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112818931A (en) | Multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion | |
CN111126360B (en) | Cross-domain pedestrian re-identification method based on unsupervised combined multi-loss model | |
CN108960140B (en) | Pedestrian re-identification method based on multi-region feature extraction and fusion | |
CN111259786B (en) | Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video | |
Wang et al. | Large-scale isolated gesture recognition using convolutional neural networks | |
CN111709311B (en) | Pedestrian re-identification method based on multi-scale convolution feature fusion | |
CN111325111A (en) | Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision | |
CN105718889B (en) | Face identification method based on the GB(2D)2PCANet deep convolution model | |
CN110503076B (en) | Video classification method, device, equipment and medium based on artificial intelligence | |
CN110598543B (en) | Model training method based on attribute mining and reasoning and pedestrian re-identification method | |
CN111401145B (en) | Visible light iris recognition method based on deep learning and DS evidence theory | |
CN109784288B (en) | Pedestrian re-identification method based on discrimination perception fusion | |
CN113158815A (en) | Unsupervised pedestrian re-identification method, system and computer readable medium | |
CN113361549A (en) | Model updating method and related device | |
CN115909407A (en) | Cross-modal pedestrian re-identification method based on character attribute assistance | |
CN112084895A (en) | Pedestrian re-identification method based on deep learning | |
CN111985332A (en) | Gait recognition method for improving loss function based on deep learning | |
CN116704611A (en) | Cross-visual-angle gait recognition method based on motion feature mixing and fine-granularity multi-stage feature extraction | |
CN116091946A (en) | Yolov 5-based unmanned aerial vehicle aerial image target detection method | |
CN113111797A (en) | Cross-view gait recognition method combining self-encoder and view transformation model | |
CN115984765A (en) | Pedestrian re-identification method based on double-current block network, electronic equipment and medium | |
CN113627380A (en) | Cross-vision-field pedestrian re-identification method and system for intelligent security and early warning | |
CN113537032A (en) | Diversity multi-branch pedestrian re-identification method based on picture block discarding | |
Cheng et al. | Automatic Data Cleaning System for Large-Scale Location Image Databases Using a Multilevel Extractor and Multiresolution Dissimilarity Calculation | |
CN110580503A (en) | AI-based double-spectrum target automatic identification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||