CN111582154A - Pedestrian re-identification method based on multitask skeleton posture division component

Info

Publication number: CN111582154A
Application number: CN202010377073.8A
Authority: CN (China)
Prior art keywords: pedestrian, network, skeleton, feature, multitask
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 陈海英 (Chen Haiying), 王慧燕 (Wang Huiyan)
Current assignee: Zhejiang Gongshang University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Zhejiang Gongshang University
Application filed by Zhejiang Gongshang University on 2020-05-07; priority date 2020-05-07 (the priority date is an assumption and is not a legal conclusion)
Priority to: CN202010377073.8A
Publication of: CN111582154A (published 2020-08-25)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 - Recognition of whole body movements, e.g. for sport training
    • G06V 40/25 - Recognition of walking or running movements, e.g. gait recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Abstract

The invention discloses a pedestrian re-identification method based on multitask skeleton-posture division of components. A model is built by combining two tasks, pedestrian feature extraction and skeleton key-point detection. The pedestrian feature-extraction network adopts an improved InceptionResNetV2 and fuses its features with those of the skeleton key-point detection branch, which improves the feature-expression capability of the network, allows body regions to be partitioned adaptively according to the human body, and improves the fineness and accuracy of detail-feature extraction. The method is therefore well suited to pedestrian re-identification problems in which appearance features are similar and identification must rely on appearance details.

Description

Pedestrian re-identification method based on multitask skeleton posture division component
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian re-identification method based on a multitask skeleton posture division component.
Background
Pedestrian re-identification means identifying the identity of a pedestrian from pedestrian images captured by different cameras. It aims to make up for the visual limitations of fixed cameras and, combined with pedestrian detection and pedestrian tracking, can be widely applied in intelligent video surveillance, intelligent security and related fields. Given an image containing a target pedestrian (the query), pedestrian re-identification (ReID) attempts to retrieve images containing the same pedestrian from a large set of pedestrian images (the gallery), and is widely regarded as a sub-problem of image retrieval. ReID has received great attention from both academia and industry because of its important theoretical value and broad application prospects.
ReID technology has developed very rapidly in recent years, but it remains a challenging task because of significant changes in camera viewpoint, height, pedestrian pose, complex backgrounds, resolution and so on. Compared with face recognition, ReID scenes are more complex and several difficult problems remain unsolved; in particular, recognition is difficult when pedestrians' appearance features, such as clothing, are similar. Existing detail-feature extraction methods are mostly based on uniform partitioning, whose fineness is insufficient.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing pedestrian re-identification technology by providing a pedestrian re-identification method based on multitask skeleton-posture division of components, which comprises the following steps:
Step (1): data preprocessing.
The sample images are normalized to a fixed input size, for example 512 × 512: if a sample image is larger than this size it is randomly cropped; if it is smaller, it is enlarged proportionally and then cropped.
Step (2): designing the feature-extraction network model.
The pedestrian re-identification model based on multitask skeleton-posture division comprises two branches: a pedestrian feature-extraction branch and a skeleton key-point detection branch.
The pedestrian feature-extraction branch is the main network. An improved InceptionResNetV2 is adopted as the backbone: the last downsampling layer of the original InceptionResNetV2 is discarded, yielding a spatial tensor feature set (Tensor T) that provides the global features of the pedestrian.
The skeleton key-point detection branch adopts a VGG network structure; at the end of the network a confidence map is output through a 1 × 1 convolution. The number of layers of the confidence map equals the number of human joint points, each layer representing the heat map of one joint point. The skeleton key points obtained by this branch are used to divide the body into components: seven parts in the horizontal direction, i.e. seven spatial tensors α, giving the local features of the pedestrian.
The global and local features are fused by vector concatenation. If the two feature vectors have the same dimension they are concatenated directly; if their dimensions differ, a linear transformation first maps them to vectors of the same dimension and they are then concatenated, enhancing the expressive power of the features. This yields seven spatial tensors μ.
Finally, the seven spatial tensors μ are average-pooled to obtain seven column vectors β, a 1 × 1 convolution reduces the channel dimension to obtain seven column vectors γ, and the seven vectors γ are connected to seven fully connected (FC) layers and classified by Softmax to obtain seven feature vectors; the weights throughout this process are not shared.
Step (3): training the model with a label-smoothing loss function to optimize the network parameters.
The network is first pre-trained on the ImageNet database; the seven feature vectors generated in step (2) (with unshared weights) are then fed into the label-smoothing loss function to obtain seven losses, and the model parameters of the defined pedestrian re-identification network with skeleton-posture part division are trained with the back-propagation algorithm until the whole network model converges.
Step (4): during testing, the seven column vectors γ are combined into a single feature vector by element-wise addition, the Euclidean distance between the specified object in the query set and each object in the candidate set is computed, and the computed distances are sorted in ascending order to obtain the recognition result.
The beneficial effects of the invention are as follows: the proposed method divides body regions adaptively according to the shape of the human body, improves the fineness of detail-feature extraction compared with existing methods, and is suitable for the pedestrian ReID problem in which appearance features are similar and identification must rely on appearance details.
Drawings
FIG. 1 is a flow chart according to the present invention;
FIG. 2 is a diagram of the overall network architecture according to the present invention.
Detailed Description
In order to describe the present invention more specifically, the technical solution is described in detail below with reference to the accompanying drawings and specific embodiments; the flow of one embodiment of the method is shown in FIG. 1. The pedestrian re-identification method based on skeleton-posture division of components comprises the following steps:
step (1), data preprocessing
A sufficient number of sample images are acquired (100); they can be downloaded from public datasets (Market1501, DukeMTMC-reID, CUHK03) or captured by the user.
The sample images are normalized (101) to a fixed input size, for example 512 × 512: if a sample image is larger than this size it is randomly cropped; if it is smaller, it is enlarged proportionally and then cropped.
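As a concrete illustration of this normalization step, the following is a minimal sketch in Python/PyTorch; the 512 × 512 target size is the example given above, while the helper name and the use of torchvision are illustrative assumptions rather than part of the patent.

```python
from PIL import Image
from torchvision import transforms

TARGET = 512  # example input size from step (1)

def normalize_sample(img: Image.Image) -> Image.Image:
    """If the image is smaller than TARGET, enlarge it proportionally; then take a random TARGET x TARGET crop."""
    w, h = img.size
    if min(w, h) < TARGET:
        scale = TARGET / min(w, h)                               # proportional enlargement
        img = img.resize((round(w * scale), round(h * scale)))   # default resampling filter
    return transforms.RandomCrop(TARGET)(img)                    # random crop for images larger than TARGET

preprocess = transforms.Compose([
    transforms.Lambda(normalize_sample),
    transforms.ToTensor(),                                       # HWC uint8 -> CHW float in [0, 1]
])
```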
Step (2) designing a network model to extract features
The input picture data are fed into a backbone that adopts a modified InceptionResNetV2; during training the InceptionResNetV2 can fuse feature maps of different scales.
The input to the modified InceptionResNetV2 first passes through the stem structure (202): the input has 3 channels (the RGB channels of the picture), and the stem outputs 256 channels.
The 256-channel output of the stem is then fed into 5 Inception-ResNet-A blocks (203); the output still has 256 channels.
The output of the 5 Inception-ResNet-A blocks is fed into Reduction-A (204) with 256 input channels; the output feature maps have 896 channels.
The result of Reduction-A is fed into 10 Inception-ResNet-B blocks (205), giving feature maps with 896 channels.
The output of the Inception-ResNet-B blocks is fed into the second reduction module (206), giving feature maps with 1792 channels.
The result of this reduction module is fed into 5 Inception-ResNet-C blocks (207), giving feature maps with 1792 channels; this produces the spatial tensor feature set Tensor T, i.e. the global features of the pedestrian.
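The channel flow of this modified backbone can be summarised with the schematic PyTorch sketch below. The block bodies are plain residual convolutions standing in for the actual Inception-ResNet-A/B/C and Reduction modules (which are considerably more elaborate); only the stage counts, the channel progression 3 → 256 → 896 → 1792 and the omission of the final downsampling layer follow the description above, and the input size in the usage line is illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for an Inception-ResNet-A/B/C block: a channel-preserving residual convolution."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

def reduction(in_ch, out_ch):
    """Stand-in for a Reduction module: halves the spatial size and changes the channel count."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    """Modified backbone: the last downsampling layer is omitted so that the output
    Tensor T keeps enough spatial resolution for the later part division."""
    def __init__(self):
        super().__init__()
        self.stem = reduction(3, 256)                                   # (202) 3 -> 256 channels
        self.stage_a = nn.Sequential(*[Block(256) for _ in range(5)])   # (203) 5 x Inception-ResNet-A
        self.red_a = reduction(256, 896)                                # (204) 256 -> 896 channels
        self.stage_b = nn.Sequential(*[Block(896) for _ in range(10)])  # (205) 10 x Inception-ResNet-B
        self.red_b = reduction(896, 1792)                               # (206) 896 -> 1792 channels
        self.stage_c = nn.Sequential(*[Block(1792) for _ in range(5)])  # (207) 5 x Inception-ResNet-C
        # no further downsampling here: the result is the spatial tensor feature set Tensor T

    def forward(self, x):
        x = self.stage_a(self.stem(x))
        x = self.stage_b(self.red_a(x))
        return self.stage_c(self.red_b(x))                              # Tensor T (global features)

tensor_t = Backbone()(torch.randn(1, 3, 256, 128))                      # e.g. -> [1, 1792, 32, 16]
```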
The skeleton key points obtained by the skeleton key-point detection branch (208) are then used to divide the body into 7 parts in the horizontal direction, i.e. 7 spatial tensors α, giving the local features of the pedestrian. The method divides the pedestrian into 7 parts using the 14 key points of the human body, which improves the accuracy of re-identification through local features: the head is one part; the upper body is divided into two parts at the pedestrian's elbow key points; the crotch is one part; the legs are divided into two parts at the knee joints; and the feet are one part, giving 7 parts in total. This division helps extract the local features of the pedestrian without destroying important pedestrian features.
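A minimal sketch of this adaptive horizontal division is shown below. It assumes that six cut heights (e.g. at the neck, elbows, crotch, knees and ankles) have already been derived from the detected key points in image coordinates; the function name and the example values are illustrative, and degenerate cases (several cuts falling on the same feature-map row) would need extra clamping in practice.

```python
import torch

def split_into_parts(tensor_t, cut_heights, img_h):
    """Split Tensor T [C, H, W] into 7 horizontal strips (the spatial tensors alpha).

    cut_heights: six y-coordinates in the input image, derived from the skeleton key points,
    that separate head, two upper-body parts, crotch, two leg parts and feet.
    """
    c, h, w = tensor_t.shape
    # map the image-space cut heights onto feature-map rows (assumed ordered and well separated)
    rows = [0] + [round(y / img_h * h) for y in sorted(cut_heights)] + [h]
    return [tensor_t[:, top:bottom, :] for top, bottom in zip(rows[:-1], rows[1:])]

# usage with a 1792 x 32 x 16 Tensor T from a 256-pixel-high crop; cut heights are illustrative
alpha = split_into_parts(torch.randn(1792, 32, 16),
                         cut_heights=[30, 70, 110, 130, 170, 210], img_h=256)
assert len(alpha) == 7
```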
The global and local features are then fused by vector concatenation. If the two feature vectors have the same dimension they are concatenated directly; if their dimensions differ, a linear transformation first maps them to vectors of the same dimension and they are then concatenated, enhancing the expressive power of the features. This yields 7 spatial tensors μ.
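The text leaves the exact tensor shapes of this fusion open; the sketch below shows one plausible reading in which the global Tensor T is pooled to the spatial size of each local strip α and the two are concatenated along the channel axis, with a 1 × 1 convolution (a per-location linear transform) used when the channel counts differ. The module name and the shapes in the usage line are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFusion(nn.Module):
    """Fuse the global Tensor T with one local strip alpha by channel-wise concatenation;
    when the channel counts differ, a 1 x 1 convolution first maps the global features
    to the local channel count (the linear transformation mentioned above)."""
    def __init__(self, global_ch, local_ch):
        super().__init__()
        self.proj = nn.Identity() if global_ch == local_ch else nn.Conv2d(global_ch, local_ch, 1)

    def forward(self, tensor_t, alpha):
        g = F.adaptive_avg_pool2d(tensor_t, alpha.shape[-2:])   # match the strip's spatial size
        return torch.cat([self.proj(g), alpha], dim=1)          # one of the seven spatial tensors mu

fusion = ConcatFusion(global_ch=1792, local_ch=1792)
mu = fusion(torch.randn(1, 1792, 32, 16), torch.randn(1, 1792, 5, 16))   # -> [1, 3584, 5, 16]
```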
Finally, the 7 spatial tensors μ are average-pooled to obtain 7 column vectors β. A 1 × 1 convolution then reduces the number of channels, giving 7 column vectors γ, which are connected to 7 fully connected layers and classified by Softmax to obtain 7 feature vectors (209). The weights throughout this process are not shared, so the training process is equivalent to 7 losses.
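One of these seven unshared heads could look like the sketch below (average pooling to β, a 1 × 1 convolution to γ, a fully connected layer and Softmax). The channel sizes and the 751-identity example (the number of training identities in Market-1501) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PartHead(nn.Module):
    """One of the seven unshared heads: mu -> average pooling (beta) -> 1x1 convolution
    for channel reduction (gamma) -> fully connected layer -> Softmax scores (209)."""
    def __init__(self, in_channels, reduced, num_ids):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                           # spatial tensor mu -> column vector beta
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)  # beta -> gamma
        self.fc = nn.Linear(reduced, num_ids)                         # one FC layer per part

    def forward(self, mu):
        gamma = self.reduce(self.pool(mu)).flatten(1)                 # [B, reduced]; gamma is reused at test time
        return gamma, torch.softmax(self.fc(gamma), dim=1)

# seven heads with independent (unshared) weights, e.g. for the 751 Market-1501 training identities
heads = nn.ModuleList(PartHead(3584, 256, 751) for _ in range(7))
gamma, scores = heads[0](torch.randn(2, 3584, 5, 16))
```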
The input picture is also passed through the skeleton key-point detection branch (208): it goes through a classic VGG structure followed by a 1 × 1 convolution, and confidence maps are output. If the human body has p joint points, the confidence map has p layers, each layer representing the heat map of one joint point. The loss of each stage is computed from the confidence maps and the labels and stored; at the end of the network the losses of all stages are summed to form the total loss for back-propagation. This realizes intermediate supervision and avoids vanishing gradients.
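A compact sketch of such a multi-stage confidence-map branch with intermediate supervision is given below. The trunk is reduced to a few VGG-style 3 × 3 convolutions, the number of stages and channels are illustrative, and the mean-squared error against ground-truth heat maps is an assumed choice of per-stage loss; the point being illustrated is the per-stage losses being stored and summed into the total loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vgg_stage(in_ch, num_joints):
    """A small VGG-style stage: 3x3 convolutions followed by a 1x1 convolution that
    outputs one confidence map (heat map) per joint point."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(128, num_joints, 1),                   # 1x1 convolution -> p-channel confidence map
    )

class KeypointBranch(nn.Module):
    def __init__(self, num_joints=14, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            [vgg_stage(3, num_joints)] +
            [vgg_stage(3 + num_joints, num_joints) for _ in range(num_stages - 1)]
        )

    def forward(self, image, target_heatmaps=None):
        maps, stage_losses = None, []
        for stage in self.stages:
            inp = image if maps is None else torch.cat([image, maps], dim=1)
            maps = stage(inp)
            if target_heatmaps is not None:              # intermediate supervision:
                stage_losses.append(F.mse_loss(maps, target_heatmaps))
        # the stored per-stage losses are summed at the end as the total loss, so gradients
        # reach the early stages directly and vanishing gradients are avoided
        return maps, (sum(stage_losses) if stage_losses else None)

branch = KeypointBranch()
heatmaps, loss = branch(torch.randn(2, 3, 256, 128), torch.zeros(2, 14, 256, 128))
```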
Step (3), model training (102)
Joint training is carried out with the pedestrian feature-extraction branch and the skeleton key-point detection branch (208). The feature vectors generated by the network are fused by vector concatenation and fed into the label-smoothing loss function, and the defined network model parameters for pedestrian re-identification are trained with the back-propagation algorithm to optimize the model; label-smoothing loss is adopted during model training.
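To make the joint optimisation concrete, the sketch below runs one training step in which the seven per-part classification losses and the skeleton-branch loss are summed and back-propagated together. The tiny two-branch model, the equal weighting of the two tasks, the optimiser settings and the smoothing value ε = 0.1 are all illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchToyModel(nn.Module):
    """Toy stand-in for the joint model: a shared trunk, seven unshared identity heads,
    and a key-point head whose output yields the skeleton-branch loss."""
    def __init__(self, num_ids=751, num_joints=14):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.id_heads = nn.ModuleList(nn.Linear(64, num_ids) for _ in range(7))   # unshared weights
        self.kp_head = nn.Conv2d(3, num_joints, 1)

    def forward(self, images, gt_heatmaps):
        feat = self.trunk(images)
        part_logits = [head(feat) for head in self.id_heads]          # seven per-part outputs
        kp_loss = F.mse_loss(self.kp_head(images), gt_heatmaps)       # skeleton key-point branch loss
        return part_logits, kp_loss

model = TwoBranchToyModel()
id_criterion = nn.CrossEntropyLoss(label_smoothing=0.1)               # label-smoothing classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, id_labels, gt_heatmaps):
    part_logits, kp_loss = model(images, gt_heatmaps)                           # joint forward pass
    id_loss = sum(id_criterion(logits, id_labels) for logits in part_logits)    # seven losses
    loss = id_loss + kp_loss                                                    # combined objective
    optimizer.zero_grad()
    loss.backward()                                                             # back-propagation
    optimizer.step()
    return loss.item()

train_step(torch.randn(4, 3, 256, 128), torch.randint(0, 751, (4,)), torch.zeros(4, 14, 256, 128))
```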
Classification in pedestrian re-identification commonly uses a cross-entropy loss function:

$$L = -\sum_{i=1}^{N} y_i \log p_i$$

where N is the total number of pedestrian identities and, when an image is input, y_i is the ground-truth label indicator: y_i = 1 if the pedestrian in the image belongs to class i and y_i = 0 otherwise; p_i is the probability predicted by the network that the pedestrian belongs to label i.
The label-smoothing loss function is introduced because the cross-entropy loss depends too strongly on the correct pedestrian label, which easily leads to over-fitting during training, and label smoothing helps avoid this over-fitting. A small number of incorrect labels may also exist in the pedestrian training samples and can influence the prediction result to some extent; the label-smoothing loss likewise prevents the model from relying too heavily on the labels during training. Pedestrian label smoothing therefore sets an error rate ε for the labels during training and uses 1 - ε as the value of the true label.
$$L_{LS} = -\sum_{i=1}^{N} q_i \log p_i, \qquad q_i = \begin{cases} 1-\varepsilon, & i = y \\ \dfrac{\varepsilon}{N-1}, & i \neq y \end{cases}$$

where N is the total number of pedestrian identities, y is the ground-truth pedestrian label of the input image, p_i is the probability predicted by the network that the pedestrian belongs to label i, and ε is the label error rate; the true label is trained with value 1 - ε and the remaining probability mass ε is distributed over the other N - 1 identities.
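As a sketch, the smoothed loss above can be implemented as follows; the ε value and the tensor sizes in the usage line are illustrative, and PyTorch's built-in nn.CrossEntropyLoss(label_smoothing=...) provides a closely related variant that spreads ε over all N classes rather than only the other N - 1.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, labels, epsilon=0.1):
    """Cross entropy with smoothed targets: the true identity gets 1 - epsilon and the
    remaining mass epsilon is spread over the other N - 1 identities (the formula above)."""
    n = logits.size(1)                                       # N: total number of pedestrian identities
    log_p = F.log_softmax(logits, dim=1)                     # log p_i
    q = torch.full_like(log_p, epsilon / (n - 1))            # q_i for i != y
    q.scatter_(1, labels.unsqueeze(1), 1.0 - epsilon)        # q_y = 1 - epsilon
    return -(q * log_p).sum(dim=1).mean()

# usage: one such loss is computed for each of the seven unshared feature vectors
loss = label_smoothing_loss(torch.randn(8, 751), torch.randint(0, 751, (8,)))
```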
Step (4), model testing (103)
For the query set and the candidate set (gallery) contained in the pedestrian re-identification dataset, the 7 column vectors γ are merged into a single feature vector by vector concatenation at test time, the Euclidean distance between the specified object in the query set and each object in the candidate set is computed as the similarity, and the computed distances are sorted in ascending order to obtain the ranking and thus the pedestrian re-identification result.
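A minimal sketch of this test-time ranking is given below; it merges the seven per-part vectors γ by concatenation as described in this step, and the descriptor dimension and gallery size in the usage line are illustrative.

```python
import torch

def rank_gallery(query_parts, gallery_parts):
    """Merge the seven per-part vectors gamma into one descriptor per image and rank the
    candidate set by ascending Euclidean distance to the query.

    query_parts:   list of 7 tensors, each of shape [D]      (one query image)
    gallery_parts: list of 7 tensors, each of shape [M, D]   (M candidate images)
    """
    query = torch.cat(query_parts, dim=0)                         # merged query descriptor [7*D]
    gallery = torch.cat(gallery_parts, dim=1)                     # merged candidate descriptors [M, 7*D]
    dists = torch.cdist(query.unsqueeze(0), gallery).squeeze(0)   # Euclidean distance to every candidate
    return torch.argsort(dists)                                   # ascending order = ReID ranking

# usage with random descriptors: 256 dimensions per part, 100 candidate images
ranking = rank_gallery([torch.randn(256) for _ in range(7)],
                       [torch.randn(100, 256) for _ in range(7)])
```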
In conclusion, in view of the challenges ReID faces from a large number of uncontrolled sources of variation, such as significant changes in pose and viewpoint, complex changes in illumination and poor image quality, the present invention provides a new method that divides pedestrian components based on the skeleton posture to extract local features, without resorting to segmentation estimation for re-identification. The method uses a convolutional neural network to implicitly learn the human posture from a monocular RGB image, using the image's features and the associated spatial model to divide the body into parts. The proposed pedestrian re-identification method based on skeleton-posture division of components brings a certain improvement in accuracy and is a reasonable approach to pedestrian identification.

Claims (4)

1. The pedestrian re-identification method based on the multitask skeleton posture division component is characterized by comprising the following steps:
step (1), data preprocessing;
acquiring a sufficient number of sample images and performing normalization preprocessing on the sample images;
step (2) designing a network model for feature extraction;
the network model consists of two branches: a pedestrian feature extraction branch and a skeleton key point detection branch;
the pedestrian feature extraction branch is the main network, and an improved InceptionResNetV2 is used as the backbone: the last downsampling layer of the original InceptionResNetV2 is discarded to obtain a spatial tensor feature set, giving the global features of the pedestrian;
the skeleton key point detection branch adopts a VGG network structure; at the end of the network a confidence map is output through a 1 × 1 convolution, the number of layers of the confidence map is the same as the number of human joint points, and each layer represents the heat map of one joint point; the skeleton key points obtained by this branch are used to divide the body into seven parts in the horizontal direction, namely seven spatial tensors α, so as to obtain the local features of the pedestrian;
the global features and the local features are fused by vector concatenation to obtain seven spatial tensors μ; the seven spatial tensors μ are then average-pooled to obtain seven column vectors β; channel dimensionality reduction is performed with a 1 × 1 convolution to obtain seven column vectors γ; and the seven column vectors γ are connected to seven fully connected layers and classified by Softmax to obtain seven feature vectors;
step (3), training the network model with a label-smoothing loss function so that the network parameters are optimal;
and (4), during testing, combining the seven column vectors γ into one feature vector by element-wise addition, calculating the Euclidean distance between the specified object in the query set and each object in the candidate set, and then sorting the calculated distances in ascending order to obtain the recognition result.
2. The pedestrian re-identification method based on the multitask skeleton posture division component according to claim 1, characterized in that the preprocessing in step (1) is specifically: setting the size of the input image; if a sample image is larger than this size, it is randomly cropped; and if a sample image is smaller than this size, it is enlarged proportionally and then cropped.
3. The pedestrian re-identification method based on the multitask skeleton posture division component according to claim 1, characterized in that, in step (2), if the two feature vectors have the same dimension they are fused directly by vector concatenation; if their dimensions differ, they are converted to vectors of the same dimension by a linear transformation and then fused by vector concatenation to enhance the expressive power of the features.
4. The pedestrian re-identification method based on the multitask skeleton posture division component according to claim 1, characterized in that step (3) is specifically: pre-training on the ImageNet database to obtain a pre-trained network, inputting the seven feature vectors generated in step (2) into the label-smoothing loss function to obtain seven losses, and training the network model parameters with the back-propagation algorithm until the whole network model converges.
Application CN202010377073.8A, priority date 2020-05-07, filing date 2020-05-07: Pedestrian re-identification method based on multitask skeleton posture division component. Status: Pending. Publication: CN111582154A (en).

Priority Applications (1)

Application CN202010377073.8A (published as CN111582154A, en), priority date 2020-05-07, filing date 2020-05-07: Pedestrian re-identification method based on multitask skeleton posture division component.

Applications Claiming Priority (1)

Application CN202010377073.8A (published as CN111582154A, en), priority date 2020-05-07, filing date 2020-05-07: Pedestrian re-identification method based on multitask skeleton posture division component.

Publications (1)

Publication number: CN111582154A (en). Publication date: 2020-08-25.

Family

ID=72112062

Family Applications (1)

Application CN202010377073.8A (published as CN111582154A, en, status Pending), priority date 2020-05-07, filing date 2020-05-07: Pedestrian re-identification method based on multitask skeleton posture division component.

Country Status (1)

Country: CN (1) | Link: CN111582154A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832672A (en) * 2017-10-12 2018-03-23 北京航空航天大学 A kind of pedestrian's recognition methods again that more loss functions are designed using attitude information
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109784258A (en) * 2019-01-08 2019-05-21 华南理工大学 A kind of pedestrian's recognition methods again cut and merged based on Analysis On Multi-scale Features
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic
CN110717411A (en) * 2019-09-23 2020-01-21 湖北工业大学 Pedestrian re-identification method based on deep layer feature fusion
CN110796026A (en) * 2019-10-10 2020-02-14 湖北工业大学 Pedestrian re-identification method based on global feature stitching

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HUANG, H., et al.: "EANet: Enhancing Alignment for Cross-Domain Person Re-identification" *
WU, X., et al.: "Person Re-identification Based on Semantic Segmentation" *
XIE, Y., et al.: "Cross-Camera Person Re-Identification With Body-Guided Attention Network" *
QIN, Xiaofei, et al. (秦晓飞 等): "Person Re-identification Based on Siamese Network and Multi-Distance Fusion" (基于孪生网络和多距离融合的行人再识别) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200093A (en) * 2020-10-13 2021-01-08 北京邮电大学 Pedestrian re-identification method based on uncertainty estimation
CN112966574A (en) * 2021-02-22 2021-06-15 厦门艾地运动科技有限公司 Human body three-dimensional key point prediction method and device and electronic equipment
CN114359970A (en) * 2022-01-12 2022-04-15 平安科技(深圳)有限公司 Pedestrian re-identification method and device, electronic equipment and storage medium
WO2023134071A1 (en) * 2022-01-12 2023-07-20 平安科技(深圳)有限公司 Person re-identification method and apparatus, electronic device and storage medium

Similar Documents

Publication | Publication date | Title
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN108764065B (en) Pedestrian re-recognition feature fusion aided learning method
CN109325952B (en) Fashionable garment image segmentation method based on deep learning
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111325111A (en) Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision
CN111310668B (en) Gait recognition method based on skeleton information
CN111582154A (en) Pedestrian re-identification method based on multitask skeleton posture division component
KR101917354B1 (en) System and Method for Multi Object Tracking based on Reliability Assessment of Learning in Mobile Environment
CN108764019A (en) A kind of Video Events detection method based on multi-source deep learning
CN111582126B (en) Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion
CN113221770A (en) Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
CN111488766A (en) Target detection method and device
CN111985332A (en) Gait recognition method for improving loss function based on deep learning
Akanksha et al. A Feature Extraction Approach for Multi-Object Detection Using HoG and LTP.
Wang et al. Summary of object detection based on convolutional neural network
CN114973305B (en) Accurate human body analysis method for crowded people
CN112101154B (en) Video classification method, apparatus, computer device and storage medium
Kavimandan et al. Human action recognition using prominent camera
CN111401286B (en) Pedestrian retrieval method based on component weight generation network
CN114663835A (en) Pedestrian tracking method, system, equipment and storage medium
CN111046861B (en) Method for identifying infrared image, method for constructing identification model and application
Cheng et al. Automatic Data Cleaning System for Large-Scale Location Image Databases Using a Multilevel Extractor and Multiresolution Dissimilarity Calculation
Isayev et al. Investigation of optimal configurations of a convolutional neural network for the identification of objects in real-time
Kamaleswari et al. An Assessment of Object Detection in Thermal (Infrared) Image Processing

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination