CN109543602B - Pedestrian re-identification method based on multi-view image feature decomposition

Pedestrian re-identification method based on multi-view image feature decomposition

Info

Publication number
CN109543602B
CN109543602B
Authority
CN
China
Prior art keywords
pedestrian
image
feature
network
layer
Prior art date
Legal status
Active
Application number
CN201811388865.4A
Other languages
Chinese (zh)
Other versions
CN109543602A (en)
Inventor
Yang Xiaofeng
Li Haifang
Deng Hongxia
Yao Rong
Guo Hao
Current Assignee
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN201811388865.4A
Publication of CN109543602A
Application granted
Publication of CN109543602B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/30 Scenes; Scene-specific elements in albums, collections or shared content, e.g. social network photos or video

Abstract

The invention relates to the technical field of intelligent image retrieval, in particular to a pedestrian re-identification method based on multi-view image feature decomposition. Starting from the problem of target image classification, a multi-view image feature generation network for pedestrian images is constructed from a capsule network, and any pedestrian image is decomposed into multi-view image features together with the similarity at each view. Features at the same view are used directly for re-identification, while features at different views first undergo feature conversion. Experiments show that decomposing pedestrian images in this way substantially improves the accuracy of pedestrian re-identification.

Description

Pedestrian re-identification method based on multi-view image feature decomposition
Technical Field
The invention relates to the technical field of intelligent image retrieval, in particular to a pedestrian re-identification method based on multi-view image feature decomposition.
Background
As public security draws increasing attention, large numbers of surveillance cameras have been installed in important places, and the video they capture can provide clues for public security departments investigating major criminal cases such as terrorist attacks and brawls. Pedestrian re-identification is an automatic target recognition technology that quickly locates a human target of interest within a surveillance network; it is an important step in applications such as intelligent video surveillance and human behavior analysis, and mainly addresses the pedestrian retrieval problem in the security monitoring field.
Pedestrian re-identification is typically studied from two aspects: feature extraction and distance metric learning. For feature-extraction-based research, the core problem is image recognition: extracting appropriate, robust features greatly improves both the detection results and the execution efficiency. Commonly used features include HOG, SIFT, SURF, covariance descriptors, ELF, Haar-like features, LBP, Gabor filters, and co-occurrence matrices.
For research based on distance metric learning: because of differences in view angle, scale, illumination, clothing, posture, and resolution, and because of occlusion, continuous position and motion information may be lost between different cameras. Measuring the similarity of pedestrian appearance features with standard distances such as the Euclidean or Bhattacharyya distance therefore cannot achieve a good re-identification effect, so researchers have proposed metric learning methods to measure the similarity of pedestrians in different images. Common distance metric learning algorithms include LMNN, PRDC, LDML, KISSME, LFDA, and XQDA.
Since 2014, pedestrian re-identification research has been combined with deep learning, and image-based pedestrian re-identification now mainly uses deep convolutional neural network methods. McLaughlin et al. perform transfer learning with an AlexNet-based convolutional neural network (CNN), extract color and optical-flow features from images, obtain high-level representations through convolutional processing, capture temporal information with a recurrent neural network (RNN), and pool the outputs into sequence features. Xiao et al. train a single CNN on data from multiple domains; some neurons learn representations shared across domains while others are effective only for a specific domain, yielding a robust CNN feature representation.
In image-based pedestrian re-identification research, rank-1 accuracy on VIPeR, the most widely adopted dataset, rose from 12.0% in 2008 to 63.9% in 2015; rank-1 on CUHK01 reached 79.9% by 2017; and on Market-1501, introduced for re-identification research in 2015, the application of deep learning raised rank-1 accuracy from 44.42% to 82.21% in 2017. These results show that image-based pedestrian re-identification has made great progress, but accuracy remains the main limitation and there is room for further improvement. Note: rank-1 means that the highest-probability candidate in the decision result is the correct match.
Disclosure of Invention
The invention aims to provide a pedestrian re-identification method based on multi-view image feature decomposition. Through an improved capsule neural network, any pedestrian image is decomposed into multi-view (front, side, and back) image features; the method generates image feature descriptions and the corresponding probabilities of the pedestrian image at each view, and uses the generated feature descriptions and probabilities for pedestrian image similarity measurement.
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
a pedestrian re-identification method based on multi-view image feature decomposition comprises the following steps:
s1, selecting a standard image in the pedestrian image dataset, dividing the selected standard image into a training dataset and a testing dataset, and establishing a training dataset and a testing dataset of the multi-view image feature decomposition neural network;
s2, zooming or cutting the pedestrian image in the data set into an RGB image with the resolution of 192 multiplied by 64 to be used as a training set and a testing set;
s3, designing a capsule classification network Capsule eNet into two convolution layers and two capsule layers to obtain an improved capsule network;
s4, training the improved capsule classification network in S3 on a pedestrian image training set at a standard visual angle in S2, constructing a multi-visual-angle image feature decomposition neural network, and generating the similarity and image feature vectors of any pedestrian image at three standard visual angles;
s5, generating a cross-view characteristic transformation matrix for pedestrian image characteristic transformation at different views, and calculating a cross-view characteristic transformation coefficient matrix for reducing cross-view pedestrian image characteristic measurement loss;
s6, selecting a pedestrian image similarity measurement function, realizing pedestrian re-identification, realizing same-view angle feature comparison and cross-view angle feature comparison, and using the pedestrian image similarity measurement function in pedestrian re-identification.
Further, the standard images in S1 are a front image, a side image, and a back image with a standard shooting angle.
Further, the improved capsule network in S3 is a three-classification network.
Further, the first layer of the improved capsule network CapsuleNet in S3 is a convolution layer with convolution kernel 5 × 5, 32 output channels, and input dimension B × 3 × 192 × 64, where B is the hyperparameter batch size; the second layer is a convolution layer with convolution kernel 5 × 5 and 256 output channels; the third layer is a primary capsule layer with convolution kernel 5 × 5 and 8 output channels; the fourth layer is a coding capsule layer with 3 output channels, 2 routing iterations, and output dimension B × 3 × 64.
Further, the convolution layers can be replaced by a single-layer neural network, ResNet, HourglassNet, or another convolutional network.
Further, the similarity calculation formula in S4 is:
$$S_{ij} = \frac{e^{\|V_{ij}\|}}{\sum_{k=1}^{3} e^{\|V_{ik}\|}}$$
where e is the natural constant (e ≈ 2.71828), ‖·‖ denotes the length (modulus) of a feature vector, i indexes the pedestrian image, V_{ij} is the feature vector of the i-th pedestrian image at view j, and S_{ij} is the similarity of the i-th pedestrian image at view j.
Further, generating the cross-view feature conversion matrices in S5 specifically includes the following steps:
(1) establishing a data set containing feature pairs {V_i, U_i} of the same pedestrian at two different views, where V_i and U_i have the same dimension D and each represents the pedestrian feature code at one view;
(2) establishing and training a feature conversion network, a two-layer BP neural network whose input layer takes V_i (dimension 1 × D) and whose output layer produces U_i′ (dimension 1 × D), with loss function Loss = 1 − cos(U_i′, U_i), where U_i′ denotes the converted feature output by the network and U_i the target feature;
(3) once training has reduced the loss below 0.07, extracting the conversion matrix W from the network, where U_i′ = W · V_i and W is a D × D matrix;
(4) repeating steps (2) and (3), computing the image feature conversion matrix once for each pair of different views, to obtain all cross-view feature conversion matrices.
Further, the cross-view feature conversion coefficient in S5 is calculated as
[conversion-coefficient formula, given as an image in the original]
where t is the average loss function value;
the transformation coefficient matrixes formed by the transformation coefficients of different visual angles are as follows:
Figure RE-GDA0001949063680000051
wherein A is12Coefficient of transformation, A, representing the frontal to lateral characteristics of a pedestrian13Coefficient of conversion, A, representing the front to back features of a pedestrian21Representing the conversion coefficient from the lateral feature to the frontal feature of the pedestrian, A23Coefficient of transformation, A, representing the lateral to dorsal aspect of a pedestrian31Representing the conversion coefficient of the pedestrian's back features to front features, A32Representing the conversion coefficient of the pedestrian back feature to the side feature.
Further, realizing pedestrian re-identification in S6 specifically includes the following steps:
(1) selecting the cos function to measure the pedestrian feature distance: f(X, Y) = Alpha × cos(X, Y), where Alpha is the conversion coefficient;
(2) for same-view comparison, the pedestrian feature distance is L = f(X_i, Y_j), where X_i and Y_j are the maximum-probability multi-view decomposed features of the two pedestrian images;
(3) for cross-view comparison, the pedestrian feature distance is L = f(W · X_i, Y_j), where X_i and Y_j are the maximum-probability multi-view decomposed features and W is the conversion matrix;
(4) sorting the distance measurements over the searched pedestrian images in descending order; the pedestrian images at the front of the ranking are the re-identification result.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a pedestrian re-identification method based on multi-view image feature decomposition, which is characterized in that a pedestrian image multi-view image feature generation network is constructed by combining a capsule network from the problem of target image classification, and any pedestrian image is decomposed to obtain multi-view image features and the similarity under the view angle. The characteristics of the same visual angle are directly used for re-identification, and the characteristics of different visual angles need to be subjected to characteristic conversion. Experiments show that the decomposition of the pedestrian images greatly helps to improve the accuracy of pedestrian re-identification.
Drawings
FIG. 1 is a schematic diagram of multi-view image features of a pedestrian image;
FIG. 2 is a schematic illustration of a pedestrian image dataset;
FIG. 3 is a standard capsule network;
FIG. 4 is a network structure diagram of pedestrian image multi-view image feature generation;
FIG. 5 is a method of comparing same-view features;
FIG. 6 illustrates a method for comparing features from different viewing angles;
FIG. 7 is a pedestrian feature transformation diagram;
FIG. 8 shows the results of example data tests;
fig. 9 shows the result of pedestrian re-recognition detection.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A pedestrian re-identification method based on multi-view image feature decomposition is carried out according to the following steps:
firstly, constructing a multi-view image feature decomposition neural network to realize the description of pedestrian images from different views (as shown in figure 1)
1. Training data set and testing data set for establishing multi-view image feature decomposition neural network
From the pedestrian image dataset, standard front, side, and back images are selected (as shown in fig. 2, where (a) is a front image of a pedestrian, (b) a side image, and (c) a back image), and the selected standard images are divided into a training data set and a testing data set. The training data set contains 1713 front, 1737 side, and 1722 back pedestrian images, 5172 in total; the testing data set contains 170 front, 256 side, and 124 back pedestrian images, 550 in total.
2. Data set preprocessing
The pedestrian image in the data set is scaled or cropped to an RGB image with a resolution of 192 x 64.
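As a concrete illustration, the preprocessing can be sketched as follows. PyTorch/torchvision and the file name are assumptions for illustration; the patent only fixes the 192 × 64 RGB target resolution.

```python
from PIL import Image
from torchvision import transforms

# Scale any pedestrian image to a 192 x 64 RGB tensor; cropping is an
# alternative the patent also allows (e.g. transforms.CenterCrop).
preprocess = transforms.Compose([
    transforms.Resize((192, 64)),   # (height, width)
    transforms.ToTensor(),          # -> float tensor of shape (3, 192, 64)
])

img = Image.open("pedestrian.jpg").convert("RGB")   # file name is illustrative
x = preprocess(img)
```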
3. Constructing the multi-view image feature decomposition neural network by improving the structure of the capsule classification network CapsuleNet
The first layer of the standard capsule network CapsuleNet is a convolution layer with a 7 × 7 convolution kernel; the second layer is a primary capsule layer with a 7 × 7 convolution kernel and 8 output channels; the third layer is a coding capsule layer with 10 output channels, 3 routing iterations, and output dimension B × 10 × 16, where B is the hyperparameter batch size, as shown in fig. 3.
The first layer of the improved capsule network CapsuleNet is a convolution layer with convolution kernel 5 × 5, 32 output channels, and input dimension B × 3 × 192 × 64, where B is the hyperparameter batch size; the second layer is a convolution layer with convolution kernel 5 × 5 and 256 output channels; the third layer is a primary capsule layer with convolution kernel 5 × 5 and 8 output channels; the fourth layer is a coding capsule layer with 3 output channels, 2 routing iterations, and output dimension B × 3 × 64, as shown in fig. 4.
The first two layers of the improved capsule network CapsuleNet are feature extraction layers and can be replaced by other feature extraction networks such as a single-layer neural network, ResNet, or HourglassNet. Experimental results show that a more complex feature extraction network can improve the pedestrian re-identification result by 1% to 3%.
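To make the four-layer structure concrete, the following is a minimal PyTorch sketch of the improved network under the parameters listed above (5 × 5 kernels, 32 and 256 channels, 8-dimensional primary capsules, three 64-dimensional coding capsules, 2 routing iterations). The strides, the squash non-linearity, and the routing details are assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1):
    # Capsule squashing non-linearity (assumed; the patent does not state it).
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + 1e-8)

class ImprovedCapsuleNet(nn.Module):
    """Two conv layers, a primary capsule layer, and a coding capsule layer
    with three 64-dimensional output capsules (one per standard view)."""
    def __init__(self, num_views=3, out_dim=64, prim_dim=8, routing_iters=2,
                 num_primary=3360):  # 3360 primary capsules for a 192 x 64 input
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5, stride=2)       # layer 1 (stride assumed)
        self.conv2 = nn.Conv2d(32, 256, kernel_size=5, stride=2)     # layer 2
        self.primary = nn.Conv2d(256, 256, kernel_size=5, stride=2)  # layer 3: 8-d capsules
        # Layer 4 routing weights: one (out_dim x prim_dim) matrix per
        # (primary capsule, output capsule) pair.
        self.W = nn.Parameter(
            0.01 * torch.randn(1, num_primary, num_views, out_dim, prim_dim))
        self.prim_dim, self.routing_iters = prim_dim, routing_iters

    def forward(self, x):                       # x: (B, 3, 192, 64)
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        u = self.primary(h)                     # (B, 256, 21, 5)
        u = squash(u.view(x.size(0), -1, self.prim_dim))             # (B, 3360, 8)
        u_hat = (self.W @ u.unsqueeze(2).unsqueeze(-1)).squeeze(-1)  # (B, 3360, 3, 64)
        b = torch.zeros(u_hat.shape[:3], device=x.device)            # routing logits
        for _ in range(self.routing_iters):     # dynamic routing, 2 iterations
            c = F.softmax(b, dim=2)             # coupling coefficients
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))         # (B, 3, 64)
            b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
        return v                                # one 64-d feature capsule per view
```

The length of each output capsule feeds the Softmax in the next step, and the capsule vector itself serves as the 64-dimensional image feature at that view.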
4. Training multi-view image feature decomposition neural network
Training is carried out on the standard-view pedestrian image training set for 10 epochs; at the end of training the multi-view image feature decomposition neural network reaches an accuracy of 98% with an overall loss value of 0.016.
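A training step might then look as follows. The classification loss is an assumption (the standard CapsuleNet margin loss); the patent reports only the final accuracy and loss value. `train_loader` is an illustrative placeholder yielding image batches and view labels, and `ImprovedCapsuleNet` is the sketch above.

```python
import torch
import torch.nn.functional as F

def margin_loss(v, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    # Standard CapsNet margin loss on capsule lengths (an assumption; the
    # patent reports only the final loss value, not the loss form).
    lengths = v.norm(dim=-1)                               # (B, 3)
    t = F.one_hot(labels, num_classes=lengths.size(1)).float()
    loss = (t * F.relu(m_plus - lengths) ** 2
            + lam * (1.0 - t) * F.relu(lengths - m_minus) ** 2)
    return loss.sum(dim=1).mean()

model = ImprovedCapsuleNet()                               # from the sketch above
opt = torch.optim.Adam(model.parameters())                 # optimizer assumed
for epoch in range(10):                                    # 10 epochs as in this embodiment
    for x, labels in train_loader:                         # labels: 0 front, 1 side, 2 back
        loss = margin_loss(model(x), labels)
        opt.zero_grad(); loss.backward(); opt.step()
```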
5. Obtaining the pedestrian's multi-view image features and the corresponding similarities
The preprocessed image is input into the multi-view image feature decomposition neural network, which decomposes it into image feature values at the different views. Applying a Softmax over the feature values of the same pedestrian image at the different views yields the corresponding similarities P_i1, P_i2, P_i3, where i denotes the i-th pedestrian image, as shown in fig. 1.
The view similarity calculation formula is:
$$P_{ij} = \frac{e^{\|V_{ij}\|}}{\sum_{k=1}^{3} e^{\|V_{ik}\|}}$$
where ‖V_{ij}‖ is the modulus of the feature vector of the i-th pedestrian image at view j.
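In code, this Softmax over the three capsule lengths takes a few lines (a sketch; `v` denotes the (B, 3, 64) output of the decomposition network above):

```python
import torch
import torch.nn.functional as F

def view_similarities(v):
    # v: (B, 3, 64) coding-capsule outputs of the decomposition network.
    lengths = v.norm(dim=-1)           # ||V_ij|| for j = 1, 2, 3
    return F.softmax(lengths, dim=1)   # P_i1, P_i2, P_i3, summing to 1 per image
```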
secondly, determining a characteristic transformation matrix under different visual angles and calculating transformation coefficients
1. Determining a characteristic transformation matrix by using a BP neural network, and solving the characteristic transformation matrix by adopting the following steps:
step 1, establishing a data set, wherein the data set comprises feature pairs { V } of the same pedestrian in different visual anglesi,UiFeature ViAnd UiHave the same dimension D, wherein Vi,UiRepresenting the pedestrian feature coding under a certain view angle, in the embodiment, D is selected to be 64, and dimension D may also be selected from other values: 128, 256, etc.;
step 2, schematically showing according to fig. 5, establishing and training a feature transformation network, wherein the feature transformation network is a two-layer BP neural network, and the data of a network input layer is Vi Input layer dimension 1 × D, network output layer data is Ui', output layer dimension 1 × D, loss function: loss (Loss) (x) 1-cos (U)i’,Ui) Wherein U isi' denotes a network switching feature, UiRepresenting target features
Step 3, after the feature transformation network is trained, the loss function is reduced to be below 0.07, and transformation matrixes W and U are extracted from the feature transformation networki’=W*ViW is a matrix of D × D, in this embodiment W is a matrix of 64 × 64;
and 4, repeating the step 2 and the step 3, and calculating all the image characteristic conversion matrixes of the non-visual angles once to obtain all the cross-visual angle characteristic conversion matrixes.
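A minimal sketch of one such conversion network follows. Since U_i′ = W · V_i, the two-layer BP network (input and output layers of dimension D) amounts to a single bias-free linear map; the optimizer and learning rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64                                    # feature dimension used in this embodiment
net = nn.Linear(D, D, bias=False)         # U' = W * V, so the map is a bias-free D x D matrix
opt = torch.optim.Adam(net.parameters(), lr=1e-3)   # optimizer and lr are assumptions

def train_step(V, U):
    """V, U: (batch, D) same-pedestrian feature pairs from two views."""
    U_pred = net(V)
    loss = (1.0 - F.cosine_similarity(U_pred, U, dim=1)).mean()  # Loss = 1 - cos(U', U)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Train until the loss falls below 0.07, then extract the conversion matrix:
W = net.weight.detach().clone()           # (D, D); U_i' = W @ V_i
```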
2. Computing the feature conversion-coefficient matrix
Features are converted through the cross-view feature conversion network. Because the conversion introduces a certain error (the loss function of the conversion network does not reach zero), the result of a cross-view image feature measurement must be multiplied by a conversion coefficient Alpha, which serves to reduce the effect of the conversion error.
The loss function value of a cross-view pedestrian feature conversion network represents the degree of difference after converting features between the two views. With t denoting the average loss function value, the conversion coefficient is taken as
[conversion-coefficient formula, given as an image in the original]
the transformation coefficient matrixes formed by the transformation coefficients of different visual angles are as follows:
Figure RE-GDA0001949063680000092
wherein A is12Coefficient of transformation, A, representing the frontal to lateral characteristics of a pedestrian13Coefficient of conversion, A, representing the front to back features of a pedestrian21Representing the conversion coefficient from the lateral feature to the frontal feature of the pedestrian, A23Coefficient of transformation, A, representing the lateral to dorsal aspect of a pedestrian31Representing the conversion coefficient of the pedestrian's back features to front features, A32Representing the conversion coefficient of the pedestrian back feature to the side feature.
In this embodiment, the conversion-coefficient matrix is:
[numerical 3 × 3 matrix, given as an image in the original]
thirdly, selecting a pedestrian image similarity measurement function to realize pedestrian re-identification
The cos function is selected to measure the pedestrian feature distance, the distance measure being as follows:
f(X,Y)=Alpha×cos(X,Y)
The comparison mode is chosen according to the multi-view feature similarities of the two pedestrian images (as shown in figs. 6 and 7). If the maximum-similarity features are at the same view, direct comparison is selected and the pedestrian feature distance is L = f(X_i, Y_j); if they are not at the same view, feature conversion followed by comparison is selected and the distance is L = f(W · X_i, Y_j), where W is the conversion matrix and Alpha the conversion coefficient.
During re-identification, each pair of pedestrian images (X, Y) is selected from the searched pedestrian image set and the value of the pedestrian feature distance function is computed; the distance measurements over the searched images are sorted in descending order, and the pedestrian images at the front of the ranking are the re-identification result.
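Putting the pieces together, the comparison and ranking logic can be sketched as follows; all names are illustrative, with `W` holding the six cross-view conversion matrices and `A` the conversion-coefficient matrix.

```python
import torch
import torch.nn.functional as F

def pair_score(Sx, vx, Sy, vy, W, A):
    """Similarity score between pedestrian images X and Y.
    Sx, Sy: (3,) view similarities; vx, vy: (3, 64) per-view features;
    W[(i, j)]: 64 x 64 conversion matrix from view i to view j;
    A: 3 x 3 conversion-coefficient matrix (all names are illustrative)."""
    i, j = int(Sx.argmax()), int(Sy.argmax())    # most probable view of each image
    if i == j:                                   # same view: direct comparison
        return float(F.cosine_similarity(vx[i], vy[j], dim=0))
    x_conv = W[(i, j)] @ vx[i]                   # cross view: convert X's feature to view j
    return float(A[i][j] * F.cosine_similarity(x_conv, vy[j], dim=0))

def reidentify(query, gallery, W, A):
    # Rank gallery images by descending similarity to the query image;
    # query and each gallery entry are (S, v) pairs from the decomposition network.
    Sq, vq = query
    scores = [pair_score(Sq, vq, Sy, vy, W, A) for (Sy, vy) in gallery]
    return sorted(range(len(gallery)), key=lambda k: scores[k], reverse=True)
```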
Fourth, determining the sample size of experiment
In the embodiment, 28193 pedestrian image samples are selected, wherein 27145 training samples and 1048 testing samples are selected.
Fifth, the pedestrian re-identification experiment effect
On the data set selected in this embodiment, four feature dimensions D are tested: 16, 32, 48, and 64. Rank-k (Rank1, Rank5, Rank10, Rank15, Rank20) indicates that a correct match for the query appears within the top k returned images. The test results show that with 64-dimensional pedestrian image features, the one-shot hit rate Rank1 exceeds 88%, the top-five hit rate Rank5 exceeds 95%, and the top-twenty hit rate Rank20 exceeds 98%. The pedestrian re-identification detection results are shown in fig. 9 and the data test results in fig. 8.
The method regards the generation of multi-view image features as a multi-classification problem. In a multi-classification problem, the final Softmax normalization layer of the network produces a probability for each class, and this probability expresses the similarity between the input and that class. For example, if the Softmax-normalized output assigns probability 50% to class 1, 49% to class 2, and 1% to class 3, the classification result indicates 50% similarity to class 1, 49% to class 2, and only 1% to class 3.
The invention realizes the three-view image feature decomposition neural network by means of the CapsuleNet multi-classification network. First, view classification of the pedestrian image is performed with the CapsuleNet multi-classification network; the three probability values in the classification result are the similarities to the three standard views. For example, a 60% probability for the front view indicates that the pedestrian image is 60% similar to a standard front image, i.e. 60% of the image carries frontal pedestrian features. Second, because the CapsuleNet multi-classification network produces the feature vector corresponding to each class, the method generates both the similarities to the three standard views and the image features at those views, and judges image similarity from the two together.
In this embodiment, the Cosine distance is the basic criterion for judging pedestrian image similarity. If the multi-view classification results of two pedestrian images agree, that is, the maximum-probability views are the same, the similarity is computed as the Cosine of the corresponding image features at that view. If the classification results differ, the image feature at the maximum-probability view of one image is first converted across views and the Cosine is then computed; because this conversion carries a certain error, the conversion result is multiplied by the conversion coefficient Alpha to reduce the error.
In the embodiment, the pedestrian image feature cross-view angle conversion is calculated in a form of conversion matrix multiplication, and the pedestrian image feature cross-view angle is obtained by multiplying the view angle conversion matrix and the original view angle pedestrian image feature vector. The view transformation matrix is obtained by extracting coefficients in a view transformation neural network, and in the embodiment, there are 3 standard views, and 6 view transformation matrices are required in total.
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims (8)

1. A pedestrian re-identification method based on multi-view image feature decomposition is characterized by comprising the following steps:
s1, selecting a standard image in the pedestrian image dataset, dividing the selected standard image into a training dataset and a testing dataset, and establishing a training dataset and a testing dataset of the multi-view image feature decomposition neural network;
s2, zooming or cutting the pedestrian image in the data set into an RGB image with the resolution of 192 multiplied by 64 to be used as a training set and a testing set;
s3, designing a capsule classification network Capsule eNet into two convolution layers and two capsule layers to obtain an improved capsule network;
s4, training the improved capsule classification network in S3 on a pedestrian image training set at a standard visual angle in S2, constructing a multi-visual-angle image feature decomposition neural network, and generating the similarity and image feature vectors of any pedestrian image at three standard visual angles;
s5, generating a cross-view characteristic transformation matrix, and calculating a cross-view characteristic transformation coefficient matrix;
s6, selecting a pedestrian image similarity measurement function to realize pedestrian re-identification;
the implementation of pedestrian re-identification in S6 specifically includes the following steps:
(1) selecting the cos function to measure the pedestrian feature distance: f(X, Y) = Alpha × cos(X, Y), where Alpha is the conversion coefficient;
(2) for same-view comparison, the pedestrian feature distance is L = f(X_i, Y_j), where X_i and Y_j are the maximum-probability multi-view decomposed features of the two pedestrian images;
(3) for cross-view comparison, the pedestrian feature distance is L = f(W · X_i, Y_j), where X_i and Y_j are the maximum-probability multi-view decomposed features and W is the conversion matrix;
(4) sorting the distance measurements over the searched pedestrian images in descending order; the pedestrian images at the front of the ranking are the re-identification result.
2. The pedestrian re-identification method based on multi-view image feature decomposition according to claim 1, wherein: the standard images in S1 are front images, side images, and back images with standard shooting angles.
3. The pedestrian re-identification method based on multi-view image feature decomposition according to claim 1, wherein: the improved capsule network in the S3 is a three-classification network.
4. The pedestrian re-identification method based on multi-view image feature decomposition according to claim 1, wherein: the first layer of the improved capsule network CapsuleNet in S3 is a convolution layer with convolution kernel 5 × 5, 32 output channels, and input dimension B × 3 × 192 × 64, where B is the hyperparameter batch size; the second layer is a convolution layer with convolution kernel 5 × 5 and 256 output channels; the third layer is a primary capsule layer with convolution kernel 5 × 5 and 8 output channels; the fourth layer is a coding capsule layer with 3 output channels, 2 routing iterations, and output dimension B × 3 × 64.
5. The pedestrian re-identification method based on multi-view image feature decomposition according to claim 1 or 4, wherein: the convolutional layer may be replaced with a single layer neural network, ResNet, or HourglassNet.
6. The pedestrian re-identification method based on multi-view image feature decomposition according to claim 1, wherein the similarity calculation formula in S4 is as follows:
$$S_{ij} = \frac{e^{\|V_{ij}\|}}{\sum_{k=1}^{3} e^{\|V_{ik}\|}}$$
where e is the natural constant, ‖·‖ denotes the length (modulus) of a feature vector, i indexes the pedestrian image, V_{ij} is the feature vector of the i-th pedestrian image at view j, and S_{ij} is the similarity of the i-th pedestrian at view j.
7. The pedestrian re-identification method based on multi-view image feature decomposition according to claim 1, wherein the step of generating the cross-view feature transformation matrix in S5 specifically comprises the following steps:
(1) establishing a data set containing feature pairs {V_i, U_i} of the same pedestrian at two different views, where V_i and U_i have the same dimension D and each represents the pedestrian feature code at one view;
(2) establishing and training a feature conversion network, a two-layer BP neural network whose input layer takes V_i (dimension 1 × D) and whose output layer produces U_i′ (dimension 1 × D), with loss function Loss = 1 − cos(U_i′, U_i), where U_i′ denotes the converted feature output by the network and U_i the pedestrian feature code at the target view;
(3) once training has reduced the loss below 0.07, extracting the conversion matrix W from the network, where U_i′ = W · V_i and W is a D × D matrix;
(4) repeating steps (2) and (3), computing the image feature conversion matrix once for each pair of different views, to obtain all cross-view feature conversion matrices.
8. The pedestrian re-identification method based on multi-view image feature decomposition according to claim 1, wherein: the cross-view feature conversion coefficient in S5 is calculated as
[conversion-coefficient formula, given as an image in the original]
where t is the average loss function value;
the conversion coefficients for the different view pairs form the conversion-coefficient matrix
$$A = \begin{pmatrix} 1 & A_{12} & A_{13} \\ A_{21} & 1 & A_{23} \\ A_{31} & A_{32} & 1 \end{pmatrix}$$
where A_{12} denotes the conversion coefficient from the pedestrian's frontal features to lateral features, A_{13} from frontal to dorsal features, A_{21} from lateral to frontal features, A_{23} from lateral to dorsal features, A_{31} from dorsal to frontal features, and A_{32} from dorsal to lateral features.
CN201811388865.4A 2018-11-21 2018-11-21 Pedestrian re-identification method based on multi-view image feature decomposition Active CN109543602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811388865.4A CN109543602B (en) 2018-11-21 2018-11-21 Pedestrian re-identification method based on multi-view image feature decomposition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811388865.4A CN109543602B (en) 2018-11-21 2018-11-21 Pedestrian re-identification method based on multi-view image feature decomposition

Publications (2)

Publication Number Publication Date
CN109543602A CN109543602A (en) 2019-03-29
CN109543602B (en) 2020-08-14

Family

ID=65848994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811388865.4A Active CN109543602B (en) 2018-11-21 2018-11-21 Pedestrian re-identification method based on multi-view image feature decomposition

Country Status (1)

Country Link
CN (1) CN109543602B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020624B (en) * 2019-04-08 2023-04-18 石家庄铁道大学 Image recognition method, terminal device and storage medium
CN110222589A (en) * 2019-05-16 2019-09-10 五邑大学 A kind of pedestrian recognition methods and its system, device, storage medium again
CN110263855B (en) * 2019-06-20 2021-12-14 深圳大学 Method for classifying images by utilizing common-basis capsule projection
CN110427756B (en) * 2019-06-20 2021-05-04 中国人民解放军战略支援部队信息工程大学 Capsule network-based android malicious software detection method and device
CN110765903A (en) * 2019-10-10 2020-02-07 浙江大华技术股份有限公司 Pedestrian re-identification method and device and storage medium
CN111104867B (en) * 2019-11-25 2023-08-25 北京迈格威科技有限公司 Recognition model training and vehicle re-recognition method and device based on part segmentation
CN111881716A (en) * 2020-06-05 2020-11-03 东北林业大学 Pedestrian re-identification method based on multi-view-angle generation countermeasure network
CN111667001B (en) * 2020-06-05 2023-08-04 平安科技(深圳)有限公司 Target re-identification method, device, computer equipment and storage medium
CN111860331A (en) * 2020-07-21 2020-10-30 北京北斗天巡科技有限公司 Unmanned aerial vehicle is at face identification system in unknown territory of security protection
CN112348038A (en) * 2020-11-30 2021-02-09 江苏海洋大学 Visual positioning method based on capsule network
CN112906557B (en) * 2021-02-08 2023-07-14 重庆兆光科技股份有限公司 Multi-granularity feature aggregation target re-identification method and system under multi-view angle
CN113298037B (en) * 2021-06-18 2022-06-03 重庆交通大学 Vehicle weight recognition method based on capsule network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778527A (en) * 2016-11-28 2017-05-31 中通服公众信息产业股份有限公司 A kind of improved neutral net pedestrian recognition methods again based on triple losses
CN108764308A (en) * 2018-05-16 2018-11-06 中国人民解放军陆军工程大学 A kind of recognition methods again of the pedestrian based on convolution loop network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8204292B2 (en) * 2008-05-21 2012-06-19 Riverain Medical Group, Llc Feature based neural network regression for feature suppression
JP6213950B2 (en) * 2013-06-03 2017-10-18 日本電信電話株式会社 Image processing apparatus, image processing method, and image processing program
CN106991396B (en) * 2017-04-01 2020-07-14 南京云创大数据科技股份有限公司 Target relay tracking algorithm based on intelligent street lamp partner

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778527A (en) * 2016-11-28 2017-05-31 中通服公众信息产业股份有限公司 A kind of improved neutral net pedestrian recognition methods again based on triple losses
CN108764308A (en) * 2018-05-16 2018-11-06 中国人民解放军陆军工程大学 A kind of recognition methods again of the pedestrian based on convolution loop network

Also Published As

Publication number Publication date
CN109543602A (en) 2019-03-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Yang Xiaofeng

Inventor after: Li Haifang

Inventor after: Deng Hongxia

Inventor after: Yao Rong

Inventor after: Guo Hao

Inventor before: Li Haifang

Inventor before: Yang Xiaofeng

Inventor before: Deng Hongxia

Inventor before: Yao Rong

Inventor before: Guo Hao

GR01 Patent grant