CN113792669B - Pedestrian re-recognition baseline method based on hierarchical self-attention network - Google Patents

Pedestrian re-recognition baseline method based on hierarchical self-attention network

Info

Publication number
CN113792669B
CN113792669B (application CN202111087471.7A)
Authority
CN
China
Prior art keywords
image
pedestrian
swin
block
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111087471.7A
Other languages
Chinese (zh)
Other versions
CN113792669A (en)
Inventor
陈炳才
张繁盛
聂冰洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202111087471.7A priority Critical patent/CN113792669B/en
Publication of CN113792669A publication Critical patent/CN113792669A/en
Application granted granted Critical
Publication of CN113792669B publication Critical patent/CN113792669B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computational Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a pedestrian re-identification baseline method based on a hierarchical self-attention network and belongs to the field of computer vision. The invention introduces the Swin Transformer into the pedestrian re-identification field as the backbone network and uses the weighted sum of the ID loss and the Circle loss as the loss function; through effective data preprocessing and reasonable parameter tuning, the feature extraction capability is improved while the structure remains simple. Compared with the traditional ResNet-based baseline method, the pedestrian re-identification method provided by the invention significantly improves re-identification performance.

Description

Pedestrian re-recognition baseline method based on hierarchical self-attention network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a pedestrian re-identification baseline method based on a hierarchical self-attention network.
Background
Pedestrian re-identification uses computer vision technology to identify a specific pedestrian in a cross-camera environment: given a surveillance image of a pedestrian, images of the same pedestrian captured by other devices are retrieved. Identifying a specific pedestrian is of great significance for violation judgment, criminal investigation, danger early warning and the like.
A good baseline method should achieve strong performance while keeping the number of parameters low. Existing pedestrian re-identification baseline methods are based on ResNet; limited by the shortcomings of convolutional neural networks in feature extraction, these ResNet-based baselines cannot achieve ideal results.
As research has progressed, the Transformer has been increasingly applied in the field of computer vision. However, existing Transformer-based pedestrian re-identification methods suffer from problems such as excessive computational cost and a single-scale feature receptive field.
Disclosure of Invention
The invention provides a pedestrian re-identification baseline method based on a hierarchical self-attention network, aiming to solve the problems described in the background art and to achieve good performance with a simple structure.
The technical scheme of the invention is as follows:
a pedestrian re-identification baseline method based on a hierarchical self-attention network comprises the following specific steps:
Step one, data preprocessing;
Suppose there are N different pedestrians in total, where the i-th pedestrian has M_i images, M_i > 1, M_i denotes the number of images in the class of the i-th pedestrian, and i denotes the ID number of each pedestrian; for the i-th pedestrian, M_i - 1 images are used as the training set, 1 image is used as the verification set, and i is used as the label indicating that the image corresponds to the i-th pedestrian;
1.1) Using a bicubic interpolation algorithm, scale the image to (H, W, C) as the input image, where H denotes the height of the image, W the width, and C the number of channels, with C = 3; the method comprises the following steps:
1.1.1) Construct the Bicubic function:
Where a is a coefficient parameter used to control the shape of the Bicubic curve;
1.1.2) The interpolation formula is as follows:
Where (x, y) denotes the pixel to be interpolated; for each pixel, the 4×4 neighbouring pixels are used to perform the bicubic interpolation operation.
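For illustration, the following minimal Python sketch reproduces the standard bicubic convolution kernel (the form commonly written with shape parameter a, which equation (1) is understood to denote) and the 4×4-neighbourhood weighting of step 1.1.2). The helper names and the boundary handling are assumptions made for this sketch, not part of the claimed method.

```python
import numpy as np

def bicubic_kernel(x, a=-0.5):
    """Standard bicubic convolution kernel W(x); a controls the curve shape."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    elif x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def bicubic_resize(img, out_h, out_w, a=-0.5):
    """Resize an (H, W, C) image: each output pixel is a weighted sum of the
    4x4 nearest input pixels, weighted by the separable bicubic kernel."""
    in_h, in_w, c = img.shape
    out = np.zeros((out_h, out_w, c), dtype=np.float64)
    for oy in range(out_h):
        for ox in range(out_w):
            # Map the output pixel back to (possibly fractional) input coordinates.
            sy = (oy + 0.5) * in_h / out_h - 0.5
            sx = (ox + 0.5) * in_w / out_w - 0.5
            iy, ix = int(np.floor(sy)), int(np.floor(sx))
            acc, norm = np.zeros(c), 0.0
            for dy in range(-1, 3):          # 4 rows of the neighbourhood
                for dx in range(-1, 3):      # 4 columns of the neighbourhood
                    py = min(max(iy + dy, 0), in_h - 1)   # clamp at image borders
                    px = min(max(ix + dx, 0), in_w - 1)
                    w = bicubic_kernel(sy - (iy + dy), a) * bicubic_kernel(sx - (ix + dx), a)
                    acc += w * img[py, px]
                    norm += w
            out[oy, ox] = acc / norm if norm != 0 else acc
    return out
```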
1.2) Data enhancement using a random erasing algorithm;
1.2.1) Set a threshold probability p and generate a random number p1 in [0, 1]; when p1 > p, the input image is left unchanged, otherwise erasing is performed:
p1=Rand(0,1) (3)
1.2.2) Determine the erasing region;
H_e = Rand(H/8, H/4) (4)
W_e = Rand(W/8, W/4) (5)
S_e = H_e × W_e (6)
Where H denotes the height of the input image and W its width; H_e denotes the height of the erased region, W_e its width, and S_e its area;
1.2.3) Determine the erasing coordinates;
x_e = Rand(0, H - H_e) (7)
y_e = Rand(0, W - W_e) (8)
Where x_e and y_e denote the x- and y-coordinates of the upper-left corner of the erased region.
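As a concrete reading of steps 1.2.1)-1.2.3), the following sketch applies random erasing to an array of shape (H, W, C); the choice of fill value (0) and of NumPy's random generator are illustrative assumptions.

```python
import numpy as np

def random_erase(img, p=0.5, rng=None):
    """Random-erasing augmentation following steps 1.2.1)-1.2.3).
    img: array of shape (H, W, C); returns the (possibly erased) image."""
    rng = np.random.default_rng() if rng is None else rng
    H, W, _ = img.shape
    p1 = rng.uniform(0.0, 1.0)                   # equation (3): p1 = Rand(0, 1)
    if p1 > p:                                   # keep the image unchanged
        return img
    He = int(rng.integers(H // 8, H // 4 + 1))   # equation (4): He = Rand(H/8, H/4)
    We = int(rng.integers(W // 8, W // 4 + 1))   # equation (5): We = Rand(W/8, W/4)
    xe = int(rng.integers(0, H - He + 1))        # equation (7): xe = Rand(0, H - He)
    ye = int(rng.integers(0, W - We + 1))        # equation (8): ye = Rand(0, W - We)
    out = img.copy()
    out[xe:xe + He, ye:ye + We, :] = 0           # erase the Se = He x We region (6)
    return out
```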
Step two, input the preprocessed image into the hierarchical self-attention network, i.e., the Swin Transformer neural network, and perform forward propagation;
The backbone network comprises 4 processing stages, where stages 2-4 share an identical network structure; the specific steps are as follows:
2.1) Stage 1;
2.1.1) Block segmentation: starting from the upper-left corner of the image, the input image is divided into a set of non-overlapping image blocks, each of size 4 × 4, so the image is divided into image blocks of size (4, 4, 3), where the number of image blocks N_patch is:
N_patch = (H/4) × (W/4) (9)
2.1.2) Linear embedding: each image block is flattened into a vector of dimension C through a fully connected layer, and the image blocks are fed into two consecutive Swin blocks;
2.1.3) Swin block feature extraction;
The Swin blocks comprise Swin block 1 and Swin block 2; the main structure of Swin block 1 is a window-based multi-head self-attention module followed by a multi-layer perceptron, with layer normalization applied before each of the two modules and a residual connection added after each; the main structure of Swin block 2 is a shifted-window multi-head self-attention module followed by a multi-layer perceptron, again with layer normalization before each module and a residual connection after each;
After Swin block feature extraction, key feature information such as the pedestrian's head, hands and actions is obtained, and a feature set of size (H/4, W/4, C) is output;
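The block structure described above (layer normalization before each module, window or shifted-window multi-head self-attention, multi-layer perceptron, residual connection after each module) can be summarized in PyTorch as below. The `window_attn` module is assumed to implement (shifted-)window multi-head self-attention as in the published Swin Transformer; this is a structural sketch, not the reference implementation of the invention.

```python
import torch.nn as nn

class SwinBlock(nn.Module):
    """One Swin block: LN -> (shifted-)window MSA -> residual, then LN -> MLP -> residual.
    `window_attn` is an externally supplied module assumed to implement window MSA
    (Swin block 1) or shifted-window MSA (Swin block 2); it is not defined here."""
    def __init__(self, dim, window_attn, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)          # layer normalization before attention
        self.attn = window_attn                 # W-MSA (block 1) or SW-MSA (block 2)
        self.norm2 = nn.LayerNorm(dim)          # layer normalization before the MLP
        self.mlp = nn.Sequential(               # multi-layer perceptron
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                       # x: (num_tokens, dim) per image
        x = x + self.attn(self.norm1(x))        # residual connection after attention
        x = x + self.mlp(self.norm2(x))         # residual connection after the MLP
        return x
```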
2.2) Stage 2;
2.2.1) Block fusion: the input features are merged pairwise, a fully connected layer adjusts the feature dimension to twice the original, and a feature set of size (H/8, W/8, 2C) is output;
2.2.2) Swin block feature extraction: the structure is identical to that of the Swin blocks in 2.1.3); after Swin block processing, a key feature set of size (H/8, W/8, 2C) is output;
2.3) Stages 3-4;
The network structures of stage 3 and stage 4 are identical to that of stage 2; after processing, feature sets of sizes (H/16, W/16, 4C) and (H/32, W/32, 8C) are output, respectively;
2.4) Global average pooling layer and fully connected layer: global average pooling is applied to the feature set output by stage 4 to obtain a vector of length 8C, and a fully connected layer maps the feature to N classes, where N is the number of pedestrian classes in the data set of step one.
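To make the tensor shapes of steps 2.1)-2.4) explicit, the following sketch traces an (H, W, 3) input through the four stages, global average pooling and the final fully connected layer; only the shapes are computed, and the Swin computations themselves are abstracted away. The embedding dimension C is a free parameter.

```python
def backbone_shapes(H, W, C, num_classes):
    """Trace the feature-map sizes produced by the hierarchical backbone of step two
    (shapes only, no actual computation)."""
    shapes = {}
    shapes["stage1"] = (H // 4,  W // 4,  C)      # patch split + linear embedding + 2 Swin blocks
    shapes["stage2"] = (H // 8,  W // 8,  2 * C)  # block fusion doubles the channel dimension
    shapes["stage3"] = (H // 16, W // 16, 4 * C)
    shapes["stage4"] = (H // 32, W // 32, 8 * C)
    shapes["pooled"] = (8 * C,)                   # global average pooling over the spatial grid
    shapes["logits"] = (num_classes,)             # fully connected layer maps to N pedestrian classes
    return shapes

# For the embodiment described later (224 x 224 input, C = 128, 751 identities) this gives
# stage1 (56, 56, 128), stage2 (28, 28, 256), stage3 (14, 14, 512),
# stage4 (7, 7, 1024), pooled (1024,), logits (751,).
print(backbone_shapes(224, 224, 128, 751))
```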
Step three, compute the loss function, back-propagate, and update the network parameters;
3.1) The loss function consists of two parts, the ID loss and the Circle loss, formulated as follows:
L_reid = w_1 L_id + w_2 L_circle (10)
Where w_1 and w_2 denote the weights of the ID loss and the Circle loss, respectively; L_reid denotes the total loss function, L_id the ID loss, and L_circle the Circle loss;
3.2) The ID loss formula is as follows:
Where n denotes the number of samples in each training batch and p(y_i|x_i) denotes the conditional probability that the input image x_i is assigned the label y_i;
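Formula (11) is present only as an image in the original document; under the common assumption that the ID loss is the cross-entropy of the identity classifier's softmax output, it can be sketched as:

```python
import torch
import torch.nn.functional as F

def id_loss(logits, labels):
    """ID loss, assumed here to be the cross-entropy
    L_id = -(1/n) * sum_i log p(y_i | x_i), with p the softmax of the logits."""
    log_probs = F.log_softmax(logits, dim=1)                  # log p(. | x_i) per sample
    return -log_probs.gather(1, labels.unsqueeze(1)).mean()   # average over the n batch samples
```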
3.3) The Circle loss formula is as follows:
Δ_n = m (13)
Δ_m = 1 - m (14)
Where N denotes the number of distinct pedestrian classes and M_i the number of images in the class of the i-th pedestrian; γ is a scale parameter; m controls the strictness of the optimization; S_n is the inter-class similarity score matrix and S_p is the intra-class similarity score matrix; a_n and a_p are non-negative matrices, the weight matrices of S_n and S_p respectively, formulated as follows:
Where S_n is the inter-class similarity score matrix and S_p is the intra-class similarity score matrix;
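Formulas (12), (15) and (16) are likewise only present as images; assuming the standard Circle loss formulation from the literature (weights a_p = [1 + m - s_p]_+ and a_n = [s_n + m]_+, margins Δ_n = m and Δ_p = 1 - m), the Circle loss and the weighted total loss of equation (10) can be sketched as follows, reusing `id_loss` from the sketch above:

```python
import torch
import torch.nn.functional as F

def circle_loss(sp, sn, m=0.25, gamma=32.0):
    """Circle loss over intra-class similarities sp and inter-class similarities sn
    (1-D tensors), in the standard formulation assumed here."""
    ap = torch.clamp_min(1.0 + m - sp, 0.0)   # assumed weight a_p (formula (15) not reproduced)
    an = torch.clamp_min(sn + m, 0.0)         # assumed weight a_n (formula (16) not reproduced)
    delta_p, delta_n = 1.0 - m, m             # margins, equations (14) and (13)
    logit_p = -gamma * ap * (sp - delta_p)
    logit_n = gamma * an * (sn - delta_n)
    # log(1 + sum_n exp(.) * sum_p exp(.)) written with softplus/logsumexp for numerical stability
    return F.softplus(torch.logsumexp(logit_n, dim=0) + torch.logsumexp(logit_p, dim=0))

def total_loss(logits, labels, sp, sn, w1=0.4, w2=0.6):
    """Equation (10): weighted sum of the two losses; w1 = 0.4, w2 = 0.6 follow the embodiment."""
    return w1 * id_loss(logits, labels) + w2 * circle_loss(sp, sn)
```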
3.4) Set the hyperparameters and train the network. A warm-up learning rate is adopted: the learning rate is initialized to r and is gradually increased to ten times r over the first 10 training steps; the optimizer is an optimized stochastic gradient descent algorithm with weight decay d_1 and momentum d_2; using the configured optimizer and learning rate, and combining the loss values computed in 3.1) to 3.3), back-propagation is performed and the network parameters are updated.
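A sketch of the optimizer and warm-up schedule of 3.4) in PyTorch; the values of r, d_1 and d_2 are left as placeholders (the embodiment's Table 1 is not reproduced here), and the linear ramp to ten times the initial rate over the first 10 steps is one straightforward reading of the text.

```python
import torch

def build_optimizer(model, r, d1, d2):
    """Optimized SGD as described in 3.4): learning rate r, weight decay d1, momentum d2
    (concrete values are placeholders; see the embodiment's Table 1)."""
    return torch.optim.SGD(model.parameters(), lr=r, weight_decay=d1, momentum=d2)

def warmup_lr(step, r, warmup_steps=10, factor=10.0):
    """Warm-up: grow the learning rate linearly from r to factor * r over the first
    `warmup_steps` training steps; any later decay schedule is omitted here."""
    if step < warmup_steps:
        return r * (1.0 + (factor - 1.0) * step / warmup_steps)
    return factor * r

# Usage sketch: before each training step,
#   for group in optimizer.param_groups:
#       group["lr"] = warmup_lr(global_step, r)
```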
Step four, pedestrian re-identification matching is carried out;
The pedestrian image to be identified is scaled and input into the Swin Transformer neural network of step two to obtain the output; softmax is then applied to obtain N probability values, corresponding to the probabilities that the pedestrian belongs to the different classes, and the class with the largest probability is taken as the pedestrian's identity.
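Step four can be read as the following inference sketch: scale the query image, run the backbone of step two, apply softmax, and take the class with the largest probability as the pedestrian's identity. The `bicubic_resize` helper and the model interface are the assumed ones from the earlier sketches, and no preprocessing beyond scaling is shown.

```python
import numpy as np
import torch

def identify(model, image, size=(224, 224)):
    """Return the predicted identity index and its probability for one query image.
    `image` is an (H, W, 3) array; `model` maps a (1, 3, H, W) tensor to (1, N) logits."""
    img = bicubic_resize(image.astype(np.float64), *size)             # scale as in step one
    x = torch.from_numpy(img).float().permute(2, 0, 1).unsqueeze(0)   # (1, 3, H, W)
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]                     # N class probabilities
    idx = int(torch.argmax(probs))                                    # class with largest probability
    return idx, float(probs[idx])
```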
The invention has the beneficial effects that: the invention provides a pedestrian re-identification baseline method based on a hierarchical self-attention network, introduces the Swin Transformer into the pedestrian re-identification field as the backbone network, takes the weighted sum of the ID loss and the Circle loss as the loss function, and, through effective data preprocessing and reasonable parameter tuning, greatly improves the training effect while keeping the structure simple.
Drawings
FIG. 1 is a diagram of the overall improved concept of the present invention;
FIG. 2 is a model diagram of a pedestrian re-recognition baseline method based on a hierarchical self-attention network of the present invention;
Fig. 3 is a schematic diagram of the structure of the Swin block.
Detailed Description
The following describes an embodiment of the present invention in detail with reference to the accompanying drawings; the embodiment is implemented on the premise of the technical solution of the invention and provides a detailed implementation and specific operating procedure. The data set used in the experiment is the Market-1501 data set collected at a university; the training set contains 751 identities with 12936 images, and the test set contains 750 identities with 19732 images.
Fig. 1 is an overall improved idea diagram of the present invention, and fig. 2 is a model diagram of a pedestrian re-recognition baseline method based on a hierarchical self-attention network, where specific steps in this embodiment are as follows:
Step one, data preprocessing;
The training set contains 751 pedestrians in total, and each pedestrian has M_i images, where M_i > 1, M_i denotes the number of images in the class of the i-th pedestrian, and i denotes the ID number of each pedestrian; for the i-th pedestrian, M_i - 1 images are used as the training set, 1 image is used as the verification set, and i is used as the label indicating that the image corresponds to the i-th pedestrian;
1.1) Using a bicubic interpolation algorithm, scale the image to (224, 224, 3), where H denotes the height of the image, W the width, and C the number of channels, as follows:
1.1.1) Construct the Bicubic function:
Where a = -0.5 is the coefficient parameter controlling the shape of the Bicubic curve;
1.1.2) The interpolation formula is as follows:
Where (x, y) denotes the pixel to be interpolated; for each pixel, the 4×4 neighbouring pixels are used to perform the bicubic interpolation operation.
1.2) Data enhancement using a random erasing algorithm;
1.2.1) Set the threshold probability p = 0.5 and generate a random number p1 in [0, 1]; when p1 > p, the image is left unchanged, otherwise erasing is performed:
p1=Rand(0,1) (3)
1.2.2) Determine the erasing region;
H_e = Rand(H/8, H/4) (4)
W_e = Rand(W/8, W/4) (5)
S_e = H_e × W_e (6)
Where H denotes the height of the input image and W its width; H_e denotes the height of the erased region, W_e its width, and S_e its area;
1.2.3) Determine the erasing coordinates;
x_e = Rand(0, H - H_e) (7)
y_e = Rand(0, W - W_e) (8)
Where x_e and y_e denote the x- and y-coordinates of the upper-left corner of the erased region.
Step two, input the preprocessed image into the hierarchical self-attention network, i.e., the Swin Transformer neural network, and perform forward propagation;
The backbone network comprises 4 processing stages, where stages 2-4 share an identical network structure; the specific steps are as follows:
2.1) Stage 1;
2.1.1) Block segmentation: starting from the upper-left corner of the image, the input image is divided into a set of non-overlapping image blocks, each of size 4 × 4, so the image is divided into image blocks of size (4, 4, 3), where the number of image blocks N_patch is:
N_patch = (H/4) × (W/4) (9)
Where H and W denote the height and width of the input image, respectively; here N_patch = 56 × 56;
2.1.2) Linear embedding: each image block is flattened into a vector of dimension 128 through a fully connected layer and fed into two consecutive Swin blocks;
2.1.3) Swin block feature extraction;
As shown in Fig. 3, the Swin blocks comprise Swin block 1 and Swin block 2; the main structure of Swin block 1 is a window-based multi-head self-attention module followed by a multi-layer perceptron, with layer normalization applied before each of the two modules and a residual connection added after each; the main structure of Swin block 2 is a shifted-window multi-head self-attention module followed by a multi-layer perceptron, again with layer normalization before each module and a residual connection after each;
After Swin block feature extraction, key feature information such as the pedestrian's head, hands and actions is obtained; a feature set of size (56, 56, 128) is output and passed to the next module;
2.2) Stage 2;
2.2.1) Block fusion: the input features are merged pairwise, a fully connected layer adjusts the feature dimension to twice the original, and a feature set of size (28, 28, 256) is output;
2.2.2) Swin block feature extraction: the structure is identical to that of 2.1.3); after Swin block processing, a feature set of size (28, 28, 256) is output;
2.3) Stages 3-4;
The structures of stage 3 and stage 4 are identical to that of stage 2; after processing, feature sets of sizes (14, 14, 512) and (7, 7, 1024) are output, respectively;
2.4) Global average pooling layer and fully connected layer: global average pooling is applied to the feature set output by stage 4 to obtain a vector of length 1024, and a fully connected layer maps the feature to 751 classes, where 751 is the number of pedestrian classes in the data set used in this embodiment.
Step three, compute the loss function, back-propagate, and update the network parameters;
3.1) The loss function consists of two parts, the ID loss and the Circle loss, formulated as follows:
L_reid = w_1 L_id + w_2 L_circle (10)
Where w_1 and w_2 denote the weights of the ID loss and the Circle loss, respectively, with w_1 = 0.4 and w_2 = 0.6; L_reid denotes the total loss function, L_id the ID loss, and L_circle the Circle loss;
3.2) The ID loss formula is as follows:
Where n denotes the number of samples in each training batch, set to 16 in this embodiment, and p(y_i|x_i) denotes the conditional probability that the input image x_i is assigned the label y_i;
3.3) The Circle loss formula is as follows:
Δ_n = m (13)
Δ_m = 1 - m (14)
Where N denotes the number of distinct pedestrian classes, set to 751 in this embodiment; M_i denotes the number of images in the class of the i-th pedestrian; γ is the scale parameter, set to 32 in this embodiment; m controls the strictness of the optimization, set to 0.25 in this embodiment; S_n is the inter-class similarity score matrix and S_p is the intra-class similarity score matrix; a_n and a_p are non-negative matrices, the weight matrices of S_n and S_p respectively, formulated as follows:
Where S_n is the inter-class similarity score matrix and S_p is the intra-class similarity score matrix;
3.4) The hyperparameter settings used when training the neural network are shown in Table 1; using the configured optimizer and learning rate, and combining the loss values computed in 3.1) to 3.3), back-propagation is performed and the network parameters are updated.
Table 1. Hyperparameter settings for training the network
Step four, pedestrian re-identification matching is carried out;
The pedestrian image to be identified is scaled and input into the Swin Transformer neural network of step two to obtain the output; softmax is then applied to obtain 751 probability values, corresponding to the probabilities that the pedestrian belongs to the different classes, and the class with the largest probability is taken as the pedestrian's identity.
In this embodiment, the pedestrian re-identification performance is tested on the Market-1501 dataset and compared with existing pedestrian re-identification baseline models based on global features, as shown in Table 2:
Table 2. Comparison of results with existing baseline models
The comparison of experimental results shows that the baseline model provided by the invention effectively improves the Rank-1 and mAP metrics of pedestrian re-identification, which demonstrates the effectiveness of the method and is of great significance for the practical application of pedestrian re-identification; in addition, the network structure is simple and highly extensible, providing a valuable reference for the design of future pedestrian re-identification methods.

Claims (1)

1. A hierarchical self-attention network-based pedestrian re-recognition baseline method, characterized in that the method comprises the following steps:
Step one, data preprocessing;
Suppose there are N different pedestrians in total, where the i-th pedestrian has M_i images, M_i > 1, M_i denotes the number of images in the class of the i-th pedestrian, and i denotes the ID number of each pedestrian; for the i-th pedestrian, M_i - 1 images are used as the training set, 1 image is used as the verification set, and i is used as the label indicating that the image corresponds to the i-th pedestrian;
1.1) Using a bicubic interpolation algorithm, scale the image to (H, W, C) as the input image, where H denotes the height of the image, W the width, and C the number of channels, with C = 3; the method comprises the following steps:
1.1.1) Construct the Bicubic function:
Where a is a coefficient parameter used to control the shape of the Bicubic curve;
1.1.2) The interpolation formula is as follows:
Where (x, y) denotes the pixel to be interpolated; for each pixel, the 4×4 neighbouring pixels are used to perform the bicubic interpolation operation;
1.2) Data enhancement using a random erasing algorithm;
1.2.1) Set a threshold probability p and generate a random number p1 in [0, 1]; when p1 > p, the input image is left unchanged, otherwise erasing is performed:
p1=Rand(0,1) (3)
1.2.2) Determine the erasing region;
H_e = Rand(H/8, H/4) (4)
W_e = Rand(W/8, W/4) (5)
S_e = H_e × W_e (6)
Where H denotes the height of the input image and W its width; H_e denotes the height of the erased region, W_e its width, and S_e its area;
1.2.3) Determine the erasing coordinates;
x_e = Rand(0, H - H_e) (7)
y_e = Rand(0, W - W_e) (8)
Where x_e and y_e denote the x- and y-coordinates of the upper-left corner of the erased region;
Step two, input the preprocessed image into the hierarchical self-attention network, i.e., the Swin Transformer neural network, and perform forward propagation;
The backbone network comprises 4 processing stages, where stages 2-4 share an identical network structure; the specific steps are as follows:
2.1) Stage 1;
2.1.1) Block segmentation: starting from the upper-left corner of the image, the input image is divided into a set of non-overlapping image blocks, each of size 4 × 4, so the image is divided into image blocks of size (4, 4, 3), where the number of image blocks N_patch is:
N_patch = (H/4) × (W/4) (9)
2.1.2) Linear embedding: each image block is flattened into a vector of dimension C through a fully connected layer, and the image blocks are fed into two consecutive Swin blocks;
2.1.3) Swin block feature extraction;
The Swin blocks comprise Swin block 1 and Swin block 2; the main structure of Swin block 1 is a window-based multi-head self-attention module followed by a multi-layer perceptron, with layer normalization applied before each of the two modules and a residual connection added after each; the main structure of Swin block 2 is a shifted-window multi-head self-attention module followed by a multi-layer perceptron, again with layer normalization before each module and a residual connection after each;
After Swin block feature extraction, key feature information of the pedestrian's head, hands and actions is obtained, and a feature set of size (H/4, W/4, C) is output;
2.2) Stage 2;
2.2.1) Block fusion: the input features are merged pairwise, a fully connected layer adjusts the feature dimension to twice the original, and a feature set of size (H/8, W/8, 2C) is output;
2.2.2) Swin block feature extraction: the structure is identical to that of the Swin blocks in 2.1.3); after Swin block processing, a key feature set of size (H/8, W/8, 2C) is output;
2.3) Stages 3-4;
The network structures of stage 3 and stage 4 are identical to that of stage 2; after processing, feature sets of sizes (H/16, W/16, 4C) and (H/32, W/32, 8C) are output, respectively;
2.4) Global average pooling layer and fully connected layer: global average pooling is applied to the feature set output by stage 4 to obtain a vector of length 8C, and a fully connected layer maps the feature to N classes, where N is the number of pedestrian classes in the data set of step one;
Step three, compute the loss function, back-propagate, and update the network parameters;
3.1) The loss function consists of two parts, the ID loss and the Circle loss, formulated as follows:
L_reid = w_1 L_id + w_2 L_circle (10)
Where w_1 and w_2 denote the weights of the ID loss and the Circle loss, respectively; L_reid denotes the total loss function, L_id the ID loss, and L_circle the Circle loss;
3.2) The ID loss formula is as follows:
Where n denotes the number of samples in each training batch and p(y_i|x_i) denotes the conditional probability that the input image x_i is assigned the label y_i;
3.3) The Circle loss formula is as follows:
Δ_n = m (13)
Δ_m = 1 - m (14)
Where N denotes the number of distinct pedestrian classes and M_i the number of images in the class of the i-th pedestrian; γ is a scale parameter; m controls the strictness of the optimization; S_n is the inter-class similarity score matrix and S_p is the intra-class similarity score matrix; a_n and a_p are non-negative matrices, the weight matrices of S_n and S_p respectively, formulated as follows:
Where S_n is the inter-class similarity score matrix and S_p is the intra-class similarity score matrix;
3.4) Set the hyperparameters and train the network. A warm-up learning rate is adopted: the learning rate is initialized to r and is gradually increased to ten times r over the first 10 training steps; the optimizer is an optimized stochastic gradient descent algorithm with weight decay d_1 and momentum d_2; using the configured optimizer and learning rate, and combining the loss values computed in 3.1) to 3.3), back-propagation is performed and the network parameters are updated;
step four, pedestrian re-identification matching is carried out;
The pedestrian image to be identified is scaled and input into the Swin Transformer neural network of step two to obtain the output; softmax is then applied to obtain N probability values, corresponding to the probabilities that the pedestrian belongs to the different classes, and the class with the largest probability is taken as the pedestrian's identity.
CN202111087471.7A 2021-09-16 2021-09-16 Pedestrian re-recognition baseline method based on hierarchical self-attention network Active CN113792669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111087471.7A CN113792669B (en) 2021-09-16 2021-09-16 Pedestrian re-recognition baseline method based on hierarchical self-attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111087471.7A CN113792669B (en) 2021-09-16 2021-09-16 Pedestrian re-recognition baseline method based on hierarchical self-attention network

Publications (2)

Publication Number Publication Date
CN113792669A CN113792669A (en) 2021-12-14
CN113792669B true CN113792669B (en) 2024-06-14

Family

ID=78878614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111087471.7A Active CN113792669B (en) 2021-09-16 2021-09-16 Pedestrian re-recognition baseline method based on hierarchical self-attention network

Country Status (1)

Country Link
CN (1) CN113792669B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842085B (en) * 2022-07-05 2022-09-16 松立控股集团股份有限公司 Full-scene vehicle attitude estimation method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710831A (en) * 2018-04-24 2018-10-26 华南理工大学 A kind of small data set face recognition algorithms based on machine vision

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160297B (en) * 2019-12-31 2022-05-13 武汉大学 Pedestrian re-identification method and device based on residual attention mechanism space-time combined model
CN112183468A (en) * 2020-10-27 2021-01-05 南京信息工程大学 Pedestrian re-identification method based on multi-attention combined multi-level features
CN112818790A (en) * 2021-01-25 2021-05-18 浙江理工大学 Pedestrian re-identification method based on attention mechanism and space geometric constraint

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710831A (en) * 2018-04-24 2018-10-26 华南理工大学 A kind of small data set face recognition algorithms based on machine vision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pedestrian re-identification feature extraction method based on attention mechanism; Liu Ziyan; Wan Peipei; Journal of Computer Applications; 2020-12-31 (No. 03); full text *

Also Published As

Publication number Publication date
CN113792669A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN107563385B (en) License plate character recognition method based on depth convolution production confrontation network
WO2019228317A1 (en) Face recognition method and device, and computer readable medium
CN109492529A (en) A kind of Multi resolution feature extraction and the facial expression recognizing method of global characteristics fusion
CN111967470A (en) Text recognition method and system based on decoupling attention mechanism
CN111259940B (en) Target detection method based on space attention map
CN107437100A (en) A kind of picture position Forecasting Methodology based on the association study of cross-module state
CN109325440B (en) Human body action recognition method and system
CN109815814B (en) Face detection method based on convolutional neural network
Kaluri et al. A framework for sign gesture recognition using improved genetic algorithm and adaptive filter
CN112198966B (en) Stroke identification method and system based on FMCW radar system
CN113591978B (en) Confidence penalty regularization-based self-knowledge distillation image classification method, device and storage medium
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
CN111950340B (en) Face convolutional neural network characteristic expression learning and extracting method suitable for wearing mask
CN114742224A (en) Pedestrian re-identification method and device, computer equipment and storage medium
CN113792669B (en) Pedestrian re-recognition baseline method based on hierarchical self-attention network
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN113496260B (en) Grain depot personnel non-standard operation detection method based on improved YOLOv3 algorithm
CN114821736A (en) Multi-modal face recognition method, device, equipment and medium based on contrast learning
CN113139618B (en) Robustness-enhanced classification method and device based on integrated defense
CN116758621B (en) Self-attention mechanism-based face expression depth convolution identification method for shielding people
CN116229584A (en) Text segmentation recognition method, system, equipment and medium in artificial intelligence field
CN110688880A (en) License plate identification method based on simplified ResNet residual error network
CN111626298B (en) Real-time image semantic segmentation device and segmentation method
CN114913339A (en) Training method and device of feature map extraction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant