CN115731576A

CN115731576A - Unsupervised pedestrian re-identification method based on occluding key regions

Info

Publication number: CN115731576A
Application number: CN202211474511.8A
Authority: CN
Inventors: 谢将凤; 林菲; 张聪
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2022-11-23
Filing date: 2022-11-23
Publication date: 2023-03-03

Abstract

The invention provides an unsupervised pedestrian re-identification method based on occluding key regions, comprising the following steps: preprocessing an unlabeled picture data set and inputting the preprocessed pictures into a network model; constructing a deep learning model that uses a spatial attention module to locate the key region of a picture and occludes that region; clustering the feature encodings of the pictures to obtain pseudo labels; constructing a loss function based on a hard-sample update strategy and cluster updating; obtaining a trained network model according to the change in the loss function; and inputting the pictures and videos of the pedestrians to be identified into the trained network model and outputting the re-identification result. The method prevents the network model from attending excessively to either the local or the global features of an image, and effectively improves the generalization and robustness of the model.

Description

Unsupervised pedestrian re-identification method based on occluding key regions

Technical Field

The invention relates to the technical field of pedestrian re-identification, and in particular to an unsupervised pedestrian re-identification method based on occluding key regions.

Background

Pedestrian re-identification has been one of the popular research directions in computer vision in recent years. It is an image retrieval problem: a technique that uses computer vision to judge whether a specific pedestrian is present in an image or a video. That is, given a picture of a pedestrian, pictures of the same pedestrian are retrieved across monitoring devices.

With the development of science and technology, pedestrian re-identification has been widely applied in fields such as intelligent security and video surveillance. At present, pedestrian re-identification has achieved major breakthroughs in the supervised setting, which relies on labeled data, and shows superior performance. However, labeling data sets incurs high labor costs, whereas unlabeled data is inexpensive and available at large scale. How to train a deep learning model with a large unlabeled data set has therefore attracted increasing attention, and unsupervised pedestrian re-identification has emerged in response. Because the image data lack labels, however, methods based on unsupervised learning often struggle to reach the retrieval accuracy of supervised methods.

Unsupervised pedestrian re-identification learns feature representations of pedestrians from an unlabeled data set. The pedestrian features extracted by deep learning can be divided into two types: global features and local features. Global features typically contain the most intuitive information in a pedestrian picture, such as the color of the pedestrian's clothing; local features are details in parts of the image, such as hats and backpacks.

At present, most popular fully unsupervised pedestrian re-identification methods use a clustering algorithm to generate pseudo labels for unlabeled samples, so that the model can be trained in a supervised manner. This approach often causes the network model to attend too much to either the local or the global features of the images, which leads to problems such as poor generalization and reduced performance.

Disclosure of Invention

The invention addresses two problems: how to keep the network from attending excessively to either the local or the global features of an image, thereby improving the generalization of the model; and how to optimize the loss function, thereby further improving the robustness of the network. To solve these problems, the invention provides an unsupervised pedestrian re-identification method based on occluding key regions, comprising the following steps:

S1: obtaining an unlabeled pedestrian picture data set

$$X = \{x_1, x_2, \ldots, x_N\}$$

where N represents the number of pictures in the data set and $x_i$ represents the i-th pedestrian picture in the data set; resizing all pictures to the same height and width, and preprocessing them;

S2: constructing a deep learning model, inputting the preprocessed training data into the network, and extracting the features of the picture samples;

S3: clustering the extracted features to obtain pseudo labels;

S4: updating the cluster centers and the hard sample set according to the clustering result, calculating the total loss, back-propagating the gradient to update the network parameters, and saving the optimal parameters of the network;

S5: performing pedestrian re-identification with the trained network model: inputting the picture and the video of the pedestrian to be queried, and outputting the pedestrian re-identification information.

In the method, the feature vectors computed by the convolutional neural network are clustered, so that samples of the same pedestrian from different cameras are effectively gathered into the same class and every sample obtains a pseudo label. Unsupervised pedestrian re-identification commonly uses a contrastive loss function based on a hard-sample strategy. A picture whose key region is occluded effectively becomes a hard sample, and feeding such samples into the convolutional neural network for training has two benefits: on the one hand, it makes good use of the hard-sample-based contrastive loss function and promotes the iterative updating of the neural network model; on the other hand, it prevents the model from attending too much to local features and improves the robustness of the network model.

Further, the preprocessing in S1 resizes all pictures to the same size and applies enhancement processing to them in three ways: horizontal flipping, rotation by a certain angle, and standardization.

In S2, the deep learning model comprises two modules: an attention module and a ResNet50 model. The attention module adopts the spatial attention module of the CBAM model and outputs a matrix of the same size as the picture, which reflects the image region that the current network model attends to. The occlusion function g determines whether the current image is occluded, and the occluded image is given by the function f(x). The image processed by g(f(x)) is input to the next module, ResNet50.

Further, the occlusion function g decides whether to apply occlusion by means of a set threshold, and the function f occludes the region with the highest output weight of the spatial attention module.

By occluding the region of the original image that receives the highest attention, the network model is forced to attend more to the image as a whole; by selecting a certain proportion of pictures for occlusion, the network model learns both the local and the global features of the pictures, which improves the generalization of the model.

In S3, the steps include:

S31: calculating the Jaccard distance between every pair of pictures from the output features of all pictures to obtain an N×N distance matrix.

S32: clustering with the DBSCAN algorithm on the obtained distance matrix, assigning the same pseudo label to all samples in one cluster, and assigning to each outlier produced by the clustering the pseudo label of its nearest cluster.
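The text does not say how a Jaccard distance is obtained from real-valued feature vectors. A minimal sketch, assuming the distance is taken between neighbour sets: each picture is represented by the set of its k nearest neighbours under cosine distance, and the Jaccard distance between two pictures is one minus the overlap ratio of their sets. The function names and the value of k are illustrative assumptions, not taken from the patent.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def knn_sets(features, k):
    """Set of the k nearest neighbours (self included) of each sample."""
    n = len(features)
    sets = []
    for i in range(n):
        order = sorted(range(n), key=lambda j: cosine_distance(features[i], features[j]))
        sets.append(set(order[:k]))
    return sets

def jaccard_distance_matrix(features, k=2):
    """N x N matrix with entries 1 - |A & B| / |A | B| over neighbour sets."""
    sets = knn_sets(features, k)
    n = len(features)
    return [[1.0 - len(sets[i] & sets[j]) / len(sets[i] | sets[j])
             for j in range(n)] for i in range(n)]

# Four toy feature vectors: two near the x-axis, two near the y-axis.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
dist = jaccard_distance_matrix(features, k=2)
```

In this toy case, samples 0 and 1 share the same neighbour set and so have distance 0, while samples 0 and 2 share no neighbours and have distance 1. The resulting matrix is the kind of N×N input that S32 passes to DBSCAN.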

In S4, the steps include:

S41: in order to better distinguish different pedestrians, a loss function based on hard samples, $l_{ins}$, and a cluster-based loss function, $l_{cls}$, are integrated, i.e.

$$l_{ReId} = \mu\, l_{ins} + (1-\mu)\, l_{cls}$$

where μ is a balance factor.

S42: a memory module is used to store the hard samples in memory and to update the hard sample set as $k[j] = \mu\,k[j] - (1-\mu)\,f^{hard}$, i.e. the hard positive sample of class j is updated, where $f^{hard}$ denotes the hard sample of the current batch with respect to $k[j]$, i.e. the sample whose feature vector has the largest cosine distance from $k[j]$.

S43: the loss function based on hard samples is

$$l_{ins} = -\log \frac{\exp(\langle q, k^{+}\rangle/\tau)}{\sum_{i=1}^{C} \exp(\langle q, k_{i}\rangle/\tau)}$$

where C is the number of classes after the pedestrian features are clustered, q is the feature vector of the current sample, $k^{+}$ is the feature vector of the hard sample of the class to which sample q belongs, $k_{i}$ is the feature vector of the hard sample of the i-th class, τ is a temperature coefficient used to control the scale of the similarity between samples, and $\langle\cdot,\cdot\rangle$ denotes the cosine distance between two vectors.

S44: a memory module is used to store the cluster center of each class in memory. A cluster center is computed as the mean of all feature vectors that share the same pseudo label after clustering, and the cluster centers are updated after each iteration as $c^{i} = \alpha\,c^{i} + (1-\alpha)\,c^{i-1}$, where $c^{i}$ denotes the cluster center computed in the i-th iteration and $c^{i-1}$ denotes the cluster center computed in the (i-1)-th iteration.

S45: the cluster-based loss function is

$$l_{cls} = -\log \frac{\exp(\langle q, c^{+}\rangle/\tau)}{\sum_{i=1}^{C} \exp(\langle q, c_{i}\rangle/\tau)}$$

where C is the number of classes after the pedestrian features are clustered, q is the feature vector of the current sample, $c^{+}$ is the cluster center of the class to which sample q belongs, $c_{i}$ is the feature vector of the i-th cluster center, τ is a temperature coefficient used to control the scale of the similarity between samples, and $\langle\cdot,\cdot\rangle$ denotes the cosine distance between two vectors.

S46: inputting the data set pictures into the network for training and setting the training parameters, including: setting the number of hidden-layer nodes randomly dropped in each training pass (dropout), setting the number of passes over all samples in the training set (epochs), setting the learning rate, selecting an optimizer, and judging from the loss curve whether the loss has converged; then saving the trained model.
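The cluster-center bookkeeping described in S44 can be sketched as follows. This is a minimal illustration in plain Python, assuming features are lists of floats and the memory is a dictionary keyed by pseudo label; the function names and the toy data are hypothetical, and α = 0.75 is the value given later in the embodiment.

```python
def cluster_centers(features, labels):
    """Mean feature vector of every pseudo label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def momentum_update(current, previous, alpha):
    """c_i = alpha * c_i + (1 - alpha) * c_{i-1}, applied element-wise."""
    return [alpha * c + (1.0 - alpha) * p for c, p in zip(current, previous)]

features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1]
centers = cluster_centers(features, labels)        # centres of the current round
previous = {0: [0.0, 1.0], 1: [0.0, 0.0]}          # hypothetical centres of the previous round
alpha = 0.75
updated = {y: momentum_update(centers[y], previous[y], alpha) for y in centers}
```

With a large α the stored center follows the newly computed mean closely, while the previous value contributes a smaller share that smooths changes between rounds.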

S5: inputting a picture of the pedestrian to be identified and a video, extracting the pedestrians in the video to construct a candidate pedestrian picture set, calculating the cosine distance between each candidate picture and the picture of the pedestrian to be identified, sorting the candidate pedestrian picture set by that distance, i.e. by degree of similarity, and outputting the M most similar pictures, which completes pedestrian re-identification.

The invention has the substantive characteristics that:

1. The method uses the spatial attention module to reveal the key region of a picture and occludes the key region in a certain proportion of the pictures. An occluded pedestrian picture retains what it has in common with the other pictures of its class while becoming somewhat less distinguishable from other classes, so it becomes a hard sample with a fuzzy boundary. On this basis, combined with the InfoNCE loss function based on hard-sample mining, the feature encodings produced by the neural network are effectively kept compact within classes and dispersed between classes.

2. The invention improves the way hard samples are selected. Specifically, hard samples are updated by means of the memory module: in each batch, the feature vector $f^{hard}$ farthest from $k[j]$ in the memory module is selected to update the feature stored in the memory module, i.e. $k[j] = \mu\,k[j] - (1-\mu)\,f^{hard}$. Compared with the traditional hard-sample selection strategy, the method updates the hard samples only once per batch and does not search again for a hard sample for each individual sample, which accelerates training and convergence of the model to some extent.
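The once-per-batch hard-sample update in item 2 can be sketched as below. This is an illustration only: the memory is assumed to be a dictionary mapping each class j to its stored vector k[j], the function names are hypothetical, and the update line copies the formula exactly as printed in the text, including its sign.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hardest_sample(batch_features, batch_labels, memory, j):
    """Feature of class j in the batch that lies farthest from memory[j]."""
    candidates = [f for f, y in zip(batch_features, batch_labels) if y == j]
    return max(candidates, key=lambda f: cosine_distance(f, memory[j]))

def update_memory(memory, batch_features, batch_labels, mu):
    """One update per class per batch: k[j] = mu * k[j] - (1 - mu) * f_hard."""
    for j in set(batch_labels):
        f_hard = hardest_sample(batch_features, batch_labels, memory, j)
        # Formula as stated in the text; the sign is not changed here.
        memory[j] = [mu * k - (1.0 - mu) * f for k, f in zip(memory[j], f_hard)]
    return memory

memory = {0: [1.0, 0.0]}
batch_features = [[1.0, 0.0], [0.0, 1.0]]
batch_labels = [0, 0]
picked = hardest_sample(batch_features, batch_labels, memory, 0)
memory = update_memory(memory, batch_features, batch_labels, mu=0.5)
```

Each class is scanned once per batch, so the cost grows with the batch size rather than with a separate hard-sample search for every individual sample.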

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a convolutional neural network architecture diagram of the present invention.

Detailed Description

The technical solution of the present invention is further specifically described below by way of specific examples in conjunction with the accompanying drawings.

Example 1

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. The present invention is not limited to this embodiment.

This embodiment provides an unsupervised pedestrian re-identification method based on occluding key regions. As shown in FIG. 1 and FIG. 2, the method includes the following steps:

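S1: obtaining an unlabeled pedestrian picture data set;

S2: constructing a deep learning model, inputting the preprocessed training data into the network, and extracting the features of the picture samples;

S3: clustering the extracted features to obtain pseudo labels;

S4: updating the cluster centers and the hard sample set according to the clustering result, calculating the total loss, back-propagating the gradient to update the network parameters, and saving the optimal parameters of the network;

S5: putting the trained neural network model into use.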

In S1, the pedestrian data set Market-1501 is used to train the convolutional neural network model, and all pictures are resized to 64x128. Enhancement processing is then applied to the pictures, specifically horizontal flipping, rotation by 30 degrees clockwise and anticlockwise, and standardization.
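The preprocessing of S1 can be illustrated with a small pure-Python sketch. It makes several assumptions not stated in the text: pictures are single-channel 2-D lists, 64x128 is read as 64 wide by 128 high, resizing and rotation use nearest-neighbour sampling, and standardization is done per image to zero mean and unit variance. All function names are hypothetical.

```python
import math

def resize_nearest(img, height, width):
    """Nearest-neighbour resize of a 2-D list image to height x width."""
    h, w = len(img), len(img[0])
    return [[img[r * h // height][c * w // width] for c in range(width)]
            for r in range(height)]

def flip_horizontal(img):
    """Mirror every row left to right."""
    return [row[::-1] for row in img]

def rotate(img, degrees):
    """Rotate about the image centre; pixels mapped from outside become 0."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_t = math.cos(math.radians(degrees))
    sin_t = math.sin(math.radians(degrees))
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Inverse mapping: find the source pixel of each output pixel.
            y, x = r - cy, c - cx
            src_r = round(cy + cos_t * y - sin_t * x)
            src_c = round(cx + sin_t * y + cos_t * x)
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = img[src_r][src_c]
    return out

def standardize(img):
    """Shift to zero mean and scale to unit variance."""
    values = [v for row in img for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [[(v - mean) / std for v in row] for row in img]

raw = [[float(r + c) for c in range(32)] for r in range(48)]   # toy 48 x 32 picture
img = resize_nearest(raw, 128, 64)
views = [standardize(v) for v in
         (img, flip_horizontal(img), rotate(img, 30), rotate(img, -30))]
```

The two calls with +30 and -30 degrees stand for the two rotation directions; which sign corresponds to clockwise depends on the coordinate convention and is not fixed by the sketch.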

In S2, an attention module and a residual network module (ResNet50) are integrated. The attention module outputs a matrix of the same size as the picture, which reflects the image region that the current network model attends to; the occlusion function g determines whether the current image is occluded, and the occluded image is given by the function f(x). The image processed by g(f(x)) is input to the next module, ResNet50.

Further, the attention mechanism of the CBAM model includes two parts, spatial attention and channel attention, and the invention adopts only the spatial attention part. To obtain attention features in the spatial dimension, global max pooling and global average pooling are performed over the width and height of the feature map, converting the feature dimension from 64x128 to 1x1; the dimension of the feature map is then reduced by a convolution with a 7x7 kernel followed by a ReLU activation function, and one further convolution raises it back to the original dimension. The output of the CBAM spatial attention module reflects the region that the current network attends to.

Further, the occlusion function g determines whether to perform occlusion by setting a threshold. The occlusion function g can be expressed as

where q takes a value in the range [0, 1], and the threshold is set to 0.3.
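The gating role of g can be sketched as follows. The text gives only the range of q and the threshold, so two assumptions are made here and should be read as such: q is drawn uniformly at random for each picture, and the picture is occluded when q falls below the threshold, so that roughly 30% of the pictures are occluded. The function name and the stand-in occlusion callable are hypothetical.

```python
import random

def apply_occlusion_gate(image, occlude, threshold=0.3, rng=random):
    """Return occlude(image) when the draw q is below the threshold, else the image."""
    q = rng.random()            # assumption: q is uniform on [0, 1)
    # Assumption: occlusion is applied for q < threshold, not for q >= threshold.
    return occlude(image) if q < threshold else image

rng = random.Random(0)
occluded = sum(
    apply_occlusion_gate("picture", lambda im: "occluded", 0.3, rng) == "occluded"
    for _ in range(10000)
)
fraction = occluded / 10000     # close to the threshold under these assumptions
```

Under the opposite reading of the threshold, about 70% of the pictures would be occluded instead; only the comparison in the return statement would change.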

Further, the function f occludes the region with the highest output weight of the spatial attention module. Specifically, the 64x128 output weight map is divided into 8x16 sub-regions of size 8x8, all values in the sub-region that contains the highest weight are set to 0, and the resulting 64x128 weight map is multiplied by the features of the original image and input into the ResNet network.
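A minimal sketch of f under stated assumptions: the weight map and the picture are single-channel 2-D lists of equal size (128 rows by 64 columns here), "the sub-region that contains the highest weight" is read literally as the 8x8 block holding the single largest value, and the product is element-wise. The function name is hypothetical.

```python
def occlude_key_region(image, attention, block=8):
    """Zero the block that holds the largest attention weight, then apply the map."""
    h, w = len(attention), len(attention[0])
    # Locate the single highest weight.
    top_r, top_c = max(((r, c) for r in range(h) for c in range(w)),
                       key=lambda rc: attention[rc[0]][rc[1]])
    # Top-left corner of the block that contains it.
    r0, c0 = (top_r // block) * block, (top_c // block) * block
    masked = [row[:] for row in attention]
    for r in range(r0, min(r0 + block, h)):
        for c in range(c0, min(c0 + block, w)):
            masked[r][c] = 0.0
    # Element-wise product of the masked weights and the picture.
    return [[image[r][c] * masked[r][c] for c in range(w)] for r in range(h)]

attention = [[0.5] * 64 for _ in range(128)]
attention[20][10] = 0.9                       # peak inside rows 16-23, columns 8-15
image = [[1.0] * 64 for _ in range(128)]
out = occlude_key_region(image, attention)
zeros = sum(v == 0.0 for row in out for v in row)
```

Only the 64 positions of one 8x8 block are zeroed; the rest of the picture is scaled by its attention weight.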

Occluding the region of the original image that receives the highest attention produces a hard sample that keeps the features it shares with its original class while becoming less distinguishable from pictures of other classes. By selecting a certain proportion of pictures for occlusion, the network model attends more to the image as a whole and learns both the local and the global features of the pictures, which improves the generalization of the model.

In S3, the steps include:

S31: calculating the Jaccard distance between every pair of pictures from the output features of all pictures to obtain an N×N distance matrix, where N represents the number of pictures.

S32: clustering with the DBSCAN algorithm on the distance matrix from S31, assigning the same pseudo label to all samples in one cluster, and assigning to each outlier produced by the clustering the pseudo label of its nearest cluster.
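Step S32 can be sketched with a compact DBSCAN that runs on a precomputed distance matrix, followed by the outlier assignment. The values of eps and min_samples are illustrative assumptions because the text does not give them, "nearest cluster" is read here as the cluster of the closest clustered sample, and a toy one-dimensional distance matrix stands in for the Jaccard matrix of S31.

```python
def dbscan(dist, eps, min_samples):
    """DBSCAN on a precomputed distance matrix; outliers are labelled -1."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbours) < min_samples:
            labels[i] = -1                  # provisional outlier
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = [m for m in range(n) if dist[j][m] <= eps]
            if len(more) >= min_samples:
                queue.extend(more)          # j is a core point: keep expanding
    return labels

def assign_outliers(dist, labels):
    """Give each outlier the pseudo label of its closest clustered sample."""
    clustered = [j for j, y in enumerate(labels) if y != -1]
    result = list(labels)
    for i, y in enumerate(labels):
        if y == -1 and clustered:
            nearest = min(clustered, key=lambda j: dist[i][j])
            result[i] = labels[nearest]
    return result

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 3.0]            # toy 1-D positions
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=0.15, min_samples=2)
pseudo_labels = assign_outliers(dist, labels)
```

The isolated point at 3.0 is first marked as an outlier and then receives the label of the cluster around 5.0, whose closest member is nearer than any member of the cluster around 0.0.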

In S4, the steps include:

S41: in order to make the feature encodings of pictures of the same pedestrian more compact and the feature encodings of pictures of different pedestrians farther apart, the invention adopts the InfoNCE loss to construct the loss function, in the form

where q represents the encoded feature of the pedestrian to be queried, $k^{+}$ represents the positive sample of the same class as q, and $\{k_1, k_2, k_3, \ldots, k_K\}$ represents the candidate set of K pedestrian classes.

S42: the Cluster loss is a cluster-level InfoNCE loss. It describes each cluster with a single representative vector (for example, the intra-class mean feature vector), which yields a cluster-level memory dictionary on which the contrastive loss is computed at the cluster level. In this way, the consistency of each class is effectively maintained throughout model training, and GPU memory consumption is significantly reduced.

S43: in order to reduce memory use, the invention adopts the Cluster loss $l_{cls}$ to calculate the loss of the clustering result. At the same time, a Cluster loss function with hard-sample updating, $l_{ins}$, is proposed. The invention combines the two loss functions, thereby taking both the clustering result and the individual differences into account:

$$l_{ReId} = \mu\, l_{ins} + (1-\mu)\, l_{cls}$$

where μ is the balance factor and is set to 0.5.

S44: a memory module is used to store the hard samples in memory, and the hard sample set is updated after each iteration as $k[j] = \mu\,k[j] - (1-\mu)\,f^{hard}$, i.e. the hard positive sample of class j is updated, where $f^{hard}$ denotes the hard sample of the current batch with respect to $k[j]$, i.e. the sample whose feature vector has the largest cosine distance from $k[j]$, and μ is the momentum update coefficient, set to 0.5.

S45: further, the loss function based on hard-sample mining is

$$l_{ins} = -\log \frac{\exp(\langle q, k^{+}\rangle/\tau)}{\sum_{i=1}^{C} \exp(\langle q, k_{i}\rangle/\tau)}$$

where C is the number of classes after the pedestrian features are clustered, q is the feature vector of the current sample, $k^{+}$ is the feature vector of the hard sample of the class to which sample q belongs, $k_{i}$ is the feature vector of the hard sample of the i-th class, τ is a temperature coefficient, set to 0.2, used to control the scale of the similarity between samples, and $\langle\cdot,\cdot\rangle$ denotes the cosine distance between two vectors.

S46: a memory module is used to store the cluster center of each class in memory. A cluster center is computed as the mean of all feature vectors that share the same pseudo label after clustering, and the cluster centers are updated after each iteration as $c^{i} = \alpha\,c^{i} + (1-\alpha)\,c^{i-1}$, where $c^{i}$ denotes the cluster center computed in the i-th iteration, $c^{i-1}$ denotes the cluster center computed in the (i-1)-th iteration, and α is the momentum update coefficient, set to 0.75.

S47: further, the cluster-based loss function is

$$l_{cls} = -\log \frac{\exp(\langle q, c^{+}\rangle/\tau)}{\sum_{i=1}^{C} \exp(\langle q, c_{i}\rangle/\tau)}$$

where C is the number of classes after the pedestrian features are clustered, q is the feature vector of the current sample, $c^{+}$ is the cluster center of the class to which sample q belongs, $c_{i}$ is the feature vector of the i-th cluster center, τ is a temperature coefficient, set to 0.07, used to control the scale of the similarity between samples, and $\langle\cdot,\cdot\rangle$ denotes the cosine distance between two vectors.

S48: inputting the data set pictures into the network for training and setting the training parameters, including: setting the number of hidden-layer nodes randomly dropped in each training pass (dropout), setting the number of passes over all samples in the training set (epochs), setting the learning rate, selecting an optimizer, and judging from the loss curve whether the loss has converged. When the value of the loss function no longer changes appreciably, the algorithm is judged to have converged, and the network parameters at that point are saved, which yields the trained model.
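The two contrastive terms and their weighted sum can be sketched as below, using the parameter values of this embodiment (τ = 0.2 for the hard-sample term, τ = 0.07 for the cluster term, balance factor μ = 0.5). The sketch assumes the pairwise term is computed as cosine similarity, as is usual for InfoNCE-style losses, so that a larger value means a closer match. The two memories are assumed to be dictionaries keyed by pseudo label, and all function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def contrastive_loss(q, keys, positive, tau):
    """-log( exp(s(q, k+) / tau) / sum_i exp(s(q, k_i) / tau) ) over all class keys."""
    logits = {j: cosine_similarity(q, k) / tau for j, k in keys.items()}
    log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
    return log_denominator - logits[positive]

def total_loss(q, label, hard_memory, centers, mu=0.5, tau_ins=0.2, tau_cls=0.07):
    """Weighted sum mu * l_ins + (1 - mu) * l_cls."""
    l_ins = contrastive_loss(q, hard_memory, label, tau_ins)   # hard-sample term
    l_cls = contrastive_loss(q, centers, label, tau_cls)       # cluster-center term
    return mu * l_ins + (1.0 - mu) * l_cls

q = [1.0, 0.0]
hard_memory = {0: [1.0, 0.0], 1: [0.0, 1.0]}    # toy hard-sample memory
centers = {0: [0.9, 0.1], 1: [0.1, 0.9]}        # toy cluster centers
loss_correct = total_loss(q, 0, hard_memory, centers)
loss_wrong = total_loss(q, 1, hard_memory, centers)
```

The loss is small when q is close to the stored vectors of its own pseudo label and large when it is closer to those of another class, which is what drives features to be compact within a class and dispersed between classes.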

S5: performing pedestrian re-identification with the trained network model: inputting a picture of the pedestrian to be identified and a video, extracting the pedestrians in the video to construct a candidate pedestrian picture set, and calculating the cosine distance between each candidate picture and the picture of the pedestrian to be identified. The distance represents the degree of similarity between a candidate and the query picture, with a smaller distance corresponding to a higher similarity. The candidate pedestrian picture set extracted from the video is sorted by similarity, and the M most similar pictures are output, which completes pedestrian re-identification.
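The ranking in S5 can be sketched as follows, assuming the query and the candidate pictures have already been converted to feature vectors by the trained network. The function names and the toy vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query, gallery, m):
    """Indices of the m candidate features closest to the query, nearest first."""
    order = sorted(range(len(gallery)), key=lambda i: cosine_distance(query, gallery[i]))
    return order[:m]

query = [1.0, 0.0]
gallery = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0], [-1.0, 0.0]]
top = retrieve(query, gallery, m=2)
```

Candidate 1 points almost in the same direction as the query and is ranked first, followed by candidate 2.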

Following the steps of the pedestrian re-identification method above, the learning rate was set to 0.04, training was terminated after 80 epochs, and the recognition performance of the method was tested.

When Market-1501 is used as the training set and DukeMTMC-reID as the validation set, the method achieves a Rank-1 of 0.74442 and an mAP of 0.61148. When the model is trained and validated using only Market-1501, Rank-1 is 0.80581 and mAP is 0.63435. Rank-1 is the accuracy of the first picture in the recognition result, also called the first matching rate; mAP is the mean average precision, obtained by summing the average precisions in the multi-class task and then averaging. The results show that the method provided by the invention has good generalization capability and accuracy.
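The two metrics can be made concrete with a short sketch. It assumes that, for every query, a ranked list of candidate ids and the set of correct ids are available, and that every correct id appears somewhere in the ranking. The function name and the toy data are hypothetical and unrelated to the figures reported above.

```python
def rank1_and_map(rankings, relevant):
    """rankings[i]: candidate ids sorted by similarity; relevant[i]: set of correct ids."""
    hits, ap_sum = 0, 0.0
    for ranked, truth in zip(rankings, relevant):
        hits += ranked[0] in truth                  # Rank-1: is the first result correct?
        found, precision_sum = 0, 0.0
        for position, gid in enumerate(ranked, start=1):
            if gid in truth:
                found += 1
                precision_sum += found / position   # precision at each correct hit
        ap_sum += precision_sum / len(truth)        # average precision of this query
    n = len(rankings)
    return hits / n, ap_sum / n

rankings = [[3, 1, 2], [2, 3, 1]]
relevant = [{3, 2}, {1}]
rank1, mean_ap = rank1_and_map(rankings, relevant)
```

For the first query the correct ids sit at positions 1 and 3, giving an average precision of (1/1 + 2/3) / 2 = 5/6; for the second the only correct id is at position 3, giving 1/3. Rank-1 is therefore 1/2 and mAP is (5/6 + 1/3) / 2 = 7/12.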

Claims

1. The unsupervised pedestrian re-identification method based on the occlusion key area is characterized by comprising the following steps of:

S1: obtaining an unlabeled pedestrian picture data set

$$X = \{x_1, x_2, \ldots, x_N\}$$

where N represents the number of pictures in the data set and $x_i$ represents the i-th pedestrian picture in the data set; resizing all pictures to the same height and width, and preprocessing them;

S2: constructing a deep learning model, inputting the preprocessed training data into the network, and extracting the features of the picture samples;

S3: clustering the extracted features to obtain pseudo labels;

S4: updating the cluster centers and the hard sample set according to the clustering result, calculating the total loss, back-propagating the gradient to update the network parameters, and saving the optimal parameters of the network;

S5: performing pedestrian re-identification with the trained network model: inputting the picture and the video of the pedestrian to be queried, and outputting the pedestrian re-identification information.

2. The unsupervised pedestrian re-identification method based on the occlusion key area as claimed in claim 1, wherein the preprocessing resizes all pictures to the same size and applies enhancement processing to the pictures in three ways: horizontal flipping, rotation by an angle, and standardization.

3. The unsupervised pedestrian re-identification method based on the occlusion key area according to claim 1, wherein the deep learning model comprises an attention module and a ResNet50 model;

the attention module adopts the spatial attention module of the CBAM model and outputs a matrix of the same size as the picture, which reflects the image region that the current network model attends to;

global max pooling and global average pooling are performed over the width and height of the feature map, converting the feature dimension from 64x128 to 1x1; the dimension of the feature map is reduced by a convolution with a 7x7 kernel followed by a ReLU activation function; and one further convolution raises the dimension back to the original dimension.

4. The unsupervised pedestrian re-identification method based on the occlusion key area as claimed in claim 1, wherein the pseudo labels are obtained using the DBSCAN algorithm;

the Jaccard distance between every pair of pictures is calculated from the output features of all pictures to obtain an N×N distance matrix, where N represents the number of pictures;

based on the distance matrix, clustering is performed with the DBSCAN algorithm, the same pseudo label is assigned to all samples in one cluster, and each outlier produced by the clustering is assigned the pseudo label of its nearest cluster.

5. The unsupervised pedestrian re-identification method based on the occlusion key area as claimed in claim 1, wherein the loss function consists of a loss function based on hard samples and a cluster-based loss function;

the loss function based on hard samples, $l_{ins}$, and the cluster-based loss function, $l_{cls}$, are integrated into the loss function $l_{ReId} = \mu\, l_{ins} + (1-\mu)\, l_{cls}$, where μ is a balance factor.

6. The unsupervised pedestrian re-identification method based on the occlusion key area as claimed in claim 5, wherein the loss function based on hard samples is updated by means of a memory module;

the hard samples are stored in memory, and the hard sample set is updated as $k[j] = \mu\,k[j] - (1-\mu)\,f^{hard}$, i.e. the hard positive sample of class j is updated, where $f^{hard}$ denotes the hard sample of the current batch with respect to $k[j]$, i.e. the sample whose feature vector has the largest cosine distance from $k[j]$;

the cluster-based loss function is updated by means of a memory module;

the cluster center of each class is stored in memory, a cluster center being computed as the mean of all feature vectors that share the same pseudo label after clustering, and the cluster centers are updated after each iteration as $c^{i} = \alpha\,c^{i} + (1-\alpha)\,c^{i-1}$, where $c^{i}$ denotes the cluster center computed in the i-th iteration and $c^{i-1}$ denotes the cluster center computed in the (i-1)-th iteration.

7. The unsupervised pedestrian re-identification method based on the occlusion key area of claim 6, wherein the loss function based on hard samples has the form:

$$l_{ins} = -\log \frac{\exp(\langle q, k^{+}\rangle/\tau)}{\sum_{i=1}^{C} \exp(\langle q, k_{i}\rangle/\tau)}$$

where C is the number of classes after the pedestrian features are clustered, q is the feature vector of the current sample, $k^{+}$ is the feature vector of the hard sample of the class to which sample q belongs, $k_{i}$ is the feature vector of the hard sample of the i-th class, τ is a temperature coefficient, set to 0.2, used to control the scale of the similarity between samples, and $\langle\cdot,\cdot\rangle$ denotes the cosine distance between two vectors.

8. The unsupervised pedestrian re-identification method based on the occlusion key area as claimed in claim 5, wherein the cluster-based loss function has the form:

$$l_{cls} = -\log \frac{\exp(\langle q, c^{+}\rangle/\tau)}{\sum_{i=1}^{C} \exp(\langle q, c_{i}\rangle/\tau)}$$

where C is the number of classes after the pedestrian features are clustered, q is the feature vector of the current sample, $c^{+}$ is the cluster center of the class to which sample q belongs, $c_{i}$ is the feature vector of the i-th cluster center, τ is a temperature coefficient used to control the scale of the similarity between samples, and $\langle\cdot,\cdot\rangle$ denotes the cosine distance between two vectors.

9. The unsupervised pedestrian re-identification method based on the occlusion key area as claimed in claim 1, wherein saving the optimal parameters of the network comprises: setting the number of hidden-layer nodes randomly dropped in each training pass (dropout), setting the number of passes over all samples in the training set (epochs), setting the learning rate, selecting an optimizer, judging from the loss curve whether the loss has converged, and saving the trained model.

10. The unsupervised pedestrian re-identification method based on the occlusion key area as claimed in claim 1, wherein performing pedestrian re-identification with the trained network model comprises: inputting a picture of the pedestrian to be identified and a video, extracting the pedestrians in the video to construct a candidate pedestrian picture set, calculating the cosine distance between each candidate picture and the picture of the pedestrian to be identified, sorting the candidate pedestrian picture set by that distance, i.e. by degree of similarity, and outputting the M most similar pictures, which completes pedestrian re-identification.