CN109271895B

CN109271895B - Pedestrian re-identification method based on multi-scale feature learning and feature segmentation

Info

Publication number: CN109271895B
Application number: CN201811007656.0A
Authority: CN
Inventors: 何立火; 邢志伟; 高新波; 王智康; 路文; 李琪琦; 张怡; 钟炎喆; 武天妍
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2018-08-31
Filing date: 2018-08-31
Publication date: 2022-03-04
Anticipated expiration: 2038-08-31
Also published as: CN109271895A

Abstract

A pedestrian re-identification method based on multi-scale feature learning and feature segmentation mainly solves the problems that in the prior art, the representation is poor due to only two scales, and errors are caused by inaccurate human body feature extraction by human body part identification. The method comprises the following specific steps: (1) constructing a multi-scale feature learning module; (2) constructing a feature segmentation module; (3) constructing a feature learning network; (4) preprocessing a video containing a pedestrian; (5) training a feature learning network; (6) calculating a characteristic distance; (7) obtaining a matching image; the invention utilizes the multi-scale feature learning module to extract the multi-scale features of the pedestrian image, and utilizes the feature segmentation module to extract the local features of the global feature and the coarse and fine granularities, and the extracted features have good distinguishability and high robustness, so that the pedestrian re-identification achieves higher precision.

Description

Pedestrian re-identification method based on multi-scale feature learning and feature segmentation

Technical Field

The invention belongs to the technical field of image processing, and further relates to a pedestrian re-identification method based on multi-scale feature learning and feature segmentation in the technical field of image identification. The invention can be used for identifying whether the pedestrian images obtained by the monitoring videos of different cameras at different angles are the same pedestrian or not.

Background

With the continuous development of the current society, the social public safety becomes a hot topic of people's attention, a large number of monitoring cameras are installed in the public field, massive video data are generated every day, and the intelligent analysis of the data becomes a hot research topic. The pedestrian re-identification is to compare a certain target pedestrian appearing under the camera with all pedestrians under other cameras in the monitoring network, accurately and rapidly identify the target, find all target pedestrian images under all cameras, and then can realize the tracking and positioning of the target pedestrian across the cameras. The pedestrian re-identification is realized by judging whether the pedestrian is the same pedestrian target through comparison among pedestrian images, but because the angles of different cameras are different, the monitoring scene is complex, the background and the illumination are different, and the obtained postures, appearances and the like of the pedestrians are also different; in addition, the shielding of pedestrians or the shielding of pedestrians and other objects in different degrees brings great challenges to the re-identification of the pedestrians.

Shanghai university of transportation proposed a multi-scale feature-fused pedestrian comparison method in its patent document, "a multi-scale feature-fused pedestrian comparison method" (patent application No. 201410635897.5, application publication No. CN 104376334A). The method comprises the following implementation steps: establishing a pedestrian set; extracting color and contour features by using the low-scale image, and cascading and fusing to obtain low-scale features; carrying out semi-supervised SVM learning on the low-scale features, and carrying out first pedestrian comparison screening to obtain a candidate pedestrian set; calculating the similarity between each pedestrian in the screened candidate pedestrian set and the target pedestrian by utilizing the high-scale image and adopting a comparison algorithm based on local feature points; and superposing the pedestrian similarity on the two scales to obtain the final sorting result of the candidate pedestrians in the screened candidate pedestrian set. The method has the disadvantages that only two manually designed scale features of high scale and low scale are extracted, the scale is small, the representation is poor, the manually designed features are not high in universality, and targets are easily missed.

The chinese metrological institute has proposed a pedestrian re-identification method based on transfer learning in the patent document "pedestrian re-identification method based on transfer learning" (patent application No. 201510445055.8, application publication No. CN 105095870A). The method comprises the following steps: segmenting a pedestrian foreground, extracting pedestrian features, learning a source domain model, migrating and learning a target domain, and measuring pedestrian distance; firstly, selecting a pedestrian target in a video by using a GrabCont algorithm; then, dividing the human body into five regions of a head, left and right upper limbs and left and right legs by using a human body symmetry model, and extracting color, edge and texture characteristics; training and optimizing a neural network model by using pedestrian data of a source domain; on the basis of the model parameters, target domain data are utilized for transfer learning; and finally, comparing the pedestrians by using the neural network model improved by the target domain to obtain the sequencing of the pedestrian distance measurement, and finally obtaining the result of pedestrian re-identification. The method has the disadvantages that the postures of pedestrians are different in video monitoring, the accuracy of recognizing human body parts by using the human body symmetric model is difficult to guarantee, the extracted regional characteristics of the head, the limbs and the like of the human body are not accurate, and the accuracy of the pedestrian re-recognition result is low due to errors caused by the extracted regional characteristics.

Disclosure of Invention

The invention aims to provide a pedestrian re-identification method based on multi-scale feature learning and feature segmentation, aiming at the defects of the prior art.

The idea for realizing the purpose of the invention is to construct a multi-scale feature learning network to extract the multi-scale features of the pedestrian image, construct a feature segmentation module, further extract the local features of global and coarse granularity of the features under different scales by utilizing the module, and adaptively fuse the local features of global and coarse granularity under all scales, so that the extracted features are more distinguishable and more robust, thereby improving the algorithm precision.

The method comprises the following specific steps:

(1) constructing a multi-scale feature learning module:

(1a) an 11-layer multi-scale feature learning module is built, and the structure of the module is as follows in sequence: input layer → convolutional layer → max pooling layer → eight hourglass modules; the hourglass module is structurally provided with ten serially-connected residual blocks, wherein the output of a first residual block is connected with the input of a tenth residual block, the output of a second residual block is connected with the input of a ninth residual block, the output of a third residual block is connected with the input of an eighth residual block, and the output of a fourth residual block is connected with the input of a seventh residual block;

(1b) setting parameters of each module of the multi-scale feature learning module;

(2) constructing a feature segmentation module:

(2a) build eight 4 layers of characteristic and cut apart the module, its structure does in proper order: the characteristic segmentation layer → the global pooling fusion layer → the full convolution layer → the SoftMax classification layer;

(2b) the parameters of each layer of the characteristic segmentation module are set as follows: setting 1792 feature maps output by the pooling fusion layer and 256 output feature maps of the full convolution layer;

(3) constructing a feature learning network:

connecting the output of each hourglass module in the multi-scale feature learning module with the input of each feature segmentation module in a one-to-one manner, and connecting the outputs of the seventh, eighth, ninth and tenth residual blocks in each hourglass module with the input of each feature segmentation module in a four-to-one manner;

(4) preprocessing a video containing pedestrians:

(4a) extracting a continuous video image containing multiple pedestrians from video images shot by a camera, selecting a frame of image from each video image, cutting out an image of an area occupied by each pedestrian from each frame of video image, forming a pedestrian image set A by all the cut images, and uniformly setting the sizes of the pedestrian images in the pedestrian image set A to be 384 multiplied by 124 pixels;

(4b) marking all images of the same pedestrian in the pedestrian image set A as real labels of images of the same class of pedestrians, wherein each class at least comprises one pedestrian image, and forming a pedestrian image training set by all the images of the pedestrians with the real labels;

(4c) extracting a continuous video image containing multiple pedestrians from video images shot by a camera, selecting a frame of image from each video image, cutting out an image of an area occupied by each pedestrian from each frame of video image, forming a pedestrian image set B by all the cut images, and uniformly setting the sizes of the pedestrian images in the pedestrian image set B to be 384 multiplied by 124 pixels;

(4d) randomly selecting one pedestrian image from the pedestrian image set B as an inquiry target pedestrian image, and taking the rest pedestrian images in the pedestrian image set B as a candidate pedestrian image set;

(5) training a feature learning network:

(5a) inputting a pedestrian image training set into a feature learning network, taking the probability distribution output by the eighth feature segmentation module SoftMax classification layer as the prediction probability distribution of each pedestrian image, and taking the category to which the maximum value in the prediction probabilities belongs as the prediction label of the pedestrian image;

(5b) calculating the cross entropy of a predicted label of each pedestrian image in the pedestrian image training set and a corresponding real label of the pedestrian image by using a label smoothing cross entropy formula, and taking the sum of all cross entropies as a loss value of the feature learning network;

(5c) training a feature learning network by using a random gradient descent method;

(6) calculating the characteristic distance:

(6a) inputting the inquiry target pedestrian image and each pedestrian image in the candidate pedestrian image set into a feature learning network, and taking a feature mapping image output by a full convolution layer of an eighth feature segmentation module in the feature learning network as each pedestrian image feature;

(6b) calculating the characteristic distance between the image characteristic of the query target pedestrian and the image characteristic of each pedestrian in the candidate pedestrian set by utilizing an Euclidean distance formula;

(7) obtaining a matching image:

and sequencing the pedestrian images in the pedestrian candidate set according to the ascending order of the characteristic distance, and taking the first 20 images as matching images for pedestrian re-identification.

Compared with the prior art, the invention has the following advantages:

firstly, because the invention constructs a multi-scale feature learning module to extract the multi-scale features of the pedestrian image, the features of multiple scales can more fully represent the pedestrian image under different resolutions, and the problems of poor representation and low universality caused by artificial design features in the prior art caused by only two scales are solved, so that the invention has the advantages of good feature representation and high universality.

Secondly, the invention constructs a feature segmentation module to extract two local features with different granularities, namely global feature and thickness under different scales, fully utilizes global and local information of the pedestrian image, and overcomes the problem of error caused by inaccurate human body part identification in the prior art, so that the invention has the advantages of low complexity and high identification precision.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic view of an hourglass module of the present invention;

FIG. 3 is a simulation diagram of the present invention.

Detailed Description

The present invention will be described in further detail with reference to fig. 1.

The implementation steps of the present invention are described in further detail with reference to fig. 1.

Step 1, constructing a multi-scale feature learning module.

An 11-layer multi-scale feature learning module is built, and the structure of the module is as follows in sequence: input layer → convolutional layer → max pooling layer → eight hourglass modules; the hourglass module is structurally provided with ten serially-connected residual blocks, wherein the output of a first residual block is connected with the input of a tenth residual block, the output of a second residual block is connected with the input of a ninth residual block, the output of a third residual block is connected with the input of an eighth residual block, and the output of a fourth residual block is connected with the input of a seventh residual block;

the residual block is nine layers, and the structure thereof is as follows in sequence: the first batch normalization layer → the first ReLU layer → the first convolution layer → the second batch normalization layer → the second convolution layer → the third batch normalization layer → the third ReLU layer → the third convolution layer, the first batch normalization layer and the third convolution layer are connected to form the output of the residual block; the feature maps of the first convolutional layer in the residual block are set to 64, the convolutional kernel size is set to 3 × 3 pixels, the step size is set to 1 pixel, the feature map of the second convolutional layer in the residual block is set to 256, the convolutional kernel size is set to 1 × 1 pixel, the step size is set to 1 pixel, the feature map of the third convolutional layer in the residual block is set to 256, the convolutional kernel size is set to 3 × 3 pixels, and the step size is set to 1 pixel.

And setting parameters of each module of the multi-scale feature learning module.

The parameter setting of each module of the multi-scale feature learning network is as follows: setting the total number of convolutional layer feature mapping graphs in the multi-scale feature learning network to be 64, setting the size of a convolutional kernel to be 7 multiplied by 7 pixels, and setting the step size to be 2 pixels; the step size of the maximum pooling layer in the multi-scale feature learning network is set to 2 pixels.

And 2, constructing a feature segmentation module.

Build eight 4 layers of characteristic and cut apart the module, its structure does in proper order: feature segmentation layer → global pooling fusion layer → full convolution layer → SoftMax classification layer.

The parameters of each layer of the characteristic segmentation module are set as follows: the feature maps output by the pooling layer are set to 1792, and the output feature maps of the full convolution layer are set to 256.

And 3, constructing a feature learning network.

And connecting the output of each hourglass module in the multi-scale feature learning module with the input of each feature segmentation module in a one-to-one manner, and connecting the outputs of the seventh, eighth, ninth and tenth residual blocks in each hourglass module with the input of each feature segmentation module in a four-to-one manner.

The cascading of the hourglass modules and the feature segmenting modules of the present invention is described in further detail with reference to figure 2.

The ten rectangles in fig. 2 represent the ten residual blocks, which make up the hourglass-shaped hourglass module. One rounded rectangle in fig. 2 represents the feature segmentation module.

And taking the outputs of the seventh, eighth, ninth and tenth residual error blocks as the outputs of each hourglass module, wherein the output of each hourglass module is connected with the input of one characteristic segmentation module.

Inputting a feature mapping chart output by a seventh, eighth, ninth and tenth residual block in each hourglass module in a multi-scale feature learning module in a feature learning network into a feature segmentation module; dividing each feature mapping graph input into a feature segmentation layer in a feature segmentation module into 1 part, 2 parts and 4 parts horizontally and inputting the divided feature mapping graphs into a global pooling fusion layer; in the global pooling fusion layer, a global maximum pooling operation and a global average pooling operation are respectively used for the feature maps, and the feature maps output by the two pooling operations are added to serve as the output of the pooling fusion layer.

And 4, preprocessing the video containing the pedestrians.

Extracting a continuous video image containing multiple pedestrians from video images shot by a camera, selecting a frame of image from each video image, cutting out an image of an area occupied by each pedestrian from each frame of video image, forming a pedestrian image set A by all the cut images, and uniformly setting the sizes of the pedestrian images in the pedestrian image set A to be 384 multiplied by 124 pixels;

marking all images of the same pedestrian in the pedestrian image set A as real labels of images of the same class of pedestrians, wherein each class at least comprises one pedestrian image, and forming a pedestrian image training set by all the images of the pedestrians with the real labels;

extracting a continuous video image containing multiple pedestrians from video images shot by a camera, selecting a frame of image from each video image, cutting out an image of an area occupied by each pedestrian from each frame of video image, forming a pedestrian image set B by all the cut images, and uniformly setting the sizes of the pedestrian images in the pedestrian image set B to be 384 multiplied by 124 pixels;

randomly selecting one pedestrian image from the pedestrian image set B as an inquiry target pedestrian image, and taking the rest pedestrian images in the pedestrian image set B as a candidate pedestrian image set;

and 5, training a feature learning network.

Inputting the pedestrian image training set into a feature learning network, taking the probability distribution output by the eighth feature segmentation module SoftMax classification layer as the prediction probability distribution of each pedestrian image, and taking the category to which the maximum value in the prediction probabilities belongs as the prediction label of the pedestrian image.

And calculating the cross entropy of the predicted label of each pedestrian image in the pedestrian image training set and the corresponding real label of the pedestrian image by using a label smoothing cross entropy formula, and taking the sum of all cross entropies as the loss value of the feature learning network.

The cross entropy formula for label smoothing is as follows:

wherein L is_hExpressing the cross entropy of a prediction label of the h-th pedestrian image in the pedestrian image training set and a corresponding real label of the pedestrian image, K expressing the total number of pedestrian image categories in the pedestrian image training set, sigma expressing the summation operation, K expressing the serial number of the pedestrian image categories in the pedestrian image training set, and epsilon expressing the value of 0.1Smoothing parameter, q_h(k) Representing the probability that the h-th pedestrian image in the pedestrian image training set is a K (K is more than or equal to 1 and less than or equal to K) type real label, if the h-th pedestrian image real label is K (K is more than or equal to 1 and less than or equal to K), the real label probability value is 1, if the h-th pedestrian image real label is not K (K is more than or equal to 1 and less than or equal to K), the real label probability value is 0, log represents logarithm operation with 2 as the base, p is_h(k) And the probability of the prediction label of the h-th pedestrian image in the pedestrian image training set as the K-th (K is more than or equal to 1 and less than or equal to K) class is represented.

And training the feature learning network by using a random gradient descent method.

And 6, calculating the characteristic distance.

And inputting the inquiry target pedestrian image and each pedestrian image in the candidate pedestrian image set into the feature learning network, and taking the feature mapping image output by the full convolution layer of the eighth feature segmentation module in the feature learning network as each pedestrian image feature.

And calculating the characteristic distance between the image characteristic of the query target pedestrian and the image characteristic of each pedestrian in the candidate pedestrian set by utilizing an Euclidean distance formula.

The euclidean distance formula is as follows:

wherein d (x, y) represents the distance of vector x from vector y in Euclidean space,

representing an evolution operation, n representing the dimension of the vector, Σ representing a summation operation, j representing the number of dimensions, x_jRepresenting the value of the j-th dimension of the vector x, y_jRepresenting the value of the jth dimension of the vector y.

And 7, obtaining a matching image.

The effect of the present invention will be further described with reference to simulation experiments.

1. Simulation conditions are as follows:

the hardware test platform of the simulation experiment of the invention is as follows: CPU is Intel (R) core (TM) i9-7900X, the main frequency is 3.3GHz, the memory is 32GB, and the GPU is NVIDIA 1080 Ti; the software platform is as follows: ubuntu 16.04 LTS.

2. Simulation content and result analysis:

the simulation experiment of the invention adopts the method of the invention to train and test the pedestrian re-identification model on two public data sets of Market-1501 and DukeMTMC-reiD, and the simulation result is shown in figure 3, wherein figure 3(a) is the image of the query target pedestrian selected during the simulation experiment test of the invention. Fig. 3(b) to (u) show 20 best matching images, where the 20 best matching images are the first 20 images with the smallest characteristic distance obtained by sorting all the images in the candidate pedestrian set in ascending order of the characteristic distance by using the method of the present invention, and the 20 images are the best matching images with the target pedestrian image. Wherein the pedestrian in 7 images in fig. 3(b) to (h) is the same pedestrian as the pedestrian in the query target pedestrian image, and is an accurate recognition result of pedestrian re-recognition, and the pedestrian in 13 images in fig. 3(i) to (u) is not the same pedestrian as the pedestrian in the query target pedestrian image.

In order to evaluate the accuracy of the pedestrian re-identification model obtained by the method of the present invention, the values of four points, Rank-1, Rank-5, Rank-10, and Rank-20, in the cumulative matching curve (CMC curve) were used for evaluation, and the obtained results are shown in table 1.

And the cumulative matching curve (CMC curve) Rank-t is the percentage of the pedestrian images which are the same as the target pedestrian in the first t best matching images in the pedestrian candidate set and all the pedestrian images which are the same as the target pedestrian.

TABLE 1 precision evaluation chart of pedestrian re-identification model

As can be seen from table 1, the pedestrian re-identification result of the present invention has higher accuracy on two data sets, which indicates that the present invention utilizes the multi-scale feature learning module to extract multi-scale features, and utilizes the feature segmentation module to extract two local features of global features and thick and thin different granularities under different scales, and the extracted features have good distinguishability and high robustness, so that the pedestrian re-identification achieves higher accuracy.

Claims

1. A pedestrian re-identification method based on multi-scale feature learning and feature segmentation is characterized in that multi-scale features of a pedestrian image are extracted by using a constructed multi-scale feature learning network, and local features of global features and thick and thin different granularities under different scales are extracted by using a constructed feature segmentation module; the method comprises the following specific steps:

(1) constructing a multi-scale feature learning module:

(2) constructing a feature segmentation module:

(2a) build eight 4 layers of characteristic and cut apart the module, its structure does in proper order: the characteristic segmentation layer → the global pooling fusion layer → the full convolution layer → the SoftMax classification layer; dividing each feature mapping graph input into a feature segmentation layer in a feature segmentation module into 1 part, 2 parts and 4 parts horizontally and inputting the divided feature mapping graphs into a global pooling fusion layer; in the global pooling fusion layer, respectively using global maximum pooling operation and global average pooling operation on the feature maps, and adding the feature maps output by the two pooling operations as the output of the pooling fusion layer;

(3) constructing a feature learning network:

(4) preprocessing a video containing pedestrians:

(5) training a feature learning network:

(6) calculating the characteristic distance:

(7) obtaining a matching image:

2. The pedestrian re-identification method based on multi-scale feature learning and feature segmentation as claimed in claim 1, wherein the residual blocks in step (1a) are nine layers, and the structure thereof is sequentially: the first batch normalization layer → the first ReLU layer → the first convolution layer → the second batch normalization layer → the second convolution layer → the third batch normalization layer → the third ReLU layer → the third convolution layer, the first batch normalization layer and the third convolution layer are connected to form the output of the residual block; the feature maps of the first convolutional layer in the residual block are set to 64, the convolutional kernel size is set to 3 × 3 pixels, the step size is set to 1 pixel, the feature map of the second convolutional layer in the residual block is set to 256, the convolutional kernel size is set to 1 × 1 pixel, the step size is set to 1 pixel, the feature map of the third convolutional layer in the residual block is set to 256, the convolutional kernel size is set to 3 × 3 pixels, and the step size is set to 1 pixel.

3. The pedestrian re-identification method based on multi-scale feature learning and feature segmentation as claimed in claim 1, wherein the parameters of each module of the multi-scale feature learning network in step (1b) are set as follows: setting the total number of convolutional layer feature mapping graphs in the multi-scale feature learning network to be 64, setting the size of a convolutional kernel to be 7 multiplied by 7 pixels, and setting the step size to be 2 pixels; the step size of the maximum pooling layer in the multi-scale feature learning network is set to 2 pixels.

4. The pedestrian re-identification method based on multi-scale feature learning and feature segmentation as claimed in claim 1, wherein the cross entropy formula of the label smoothing in step (5b) is as follows:

wherein L is_hExpressing the cross entropy of a prediction label of the h-th pedestrian image in the pedestrian image training set and a corresponding real label of the pedestrian image, K expressing the total number of pedestrian image categories in the pedestrian image training set, sigma expressing the summation operation, K expressing the serial number of the pedestrian image categories in the pedestrian image training set, epsilon expressing a smoothing parameter with the value of 0.1, q_h(k) Representing the probability that the h-th pedestrian image in the pedestrian image training set is a K-th real label, K is more than or equal to 1 and less than or equal to K, log represents logarithm operation with 2 as a base, and p_h(k) And representing the probability that the h-th pedestrian image in the pedestrian image training set is a K-th class prediction label, wherein K is more than or equal to 1 and less than or equal to K.

5. The pedestrian re-identification method based on multi-scale feature learning and feature segmentation as claimed in claim 1, wherein the euclidean distance formula in step (6b) is as follows: