CN111797813B - Partial pedestrian re-identification method based on visible perception texture semantic alignment


Info

Publication number
CN111797813B
CN111797813B (application CN202010708118.5A)
Authority
CN
China
Prior art keywords
pedestrian
human body
texture
alignment
partial
Prior art date
Legal status
Active
Application number
CN202010708118.5A
Other languages
Chinese (zh)
Other versions
CN111797813A (en)
Inventor
Zan Gao (高赞)
Lishuai Gao (高立帅)
Hua Zhang (张桦)
Current Assignee
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202010708118.5A
Publication of CN111797813A
Application granted
Publication of CN111797813B


Classifications

    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/24 Classification techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06V 10/267 Segmentation of patterns in the image field, or detection of occlusion, by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/40 Extraction of image or video features

Abstract

A partial pedestrian re-identification method (TSA) based on visible perception texture semantic alignment that efficiently addresses two problems at once: pedestrian occlusion and changes in pose or observation viewpoint. The method comprises the following steps: (1) designing a local area alignment network based on human body pose, mainly to solve the occlusion problem; (2) designing a texture alignment network based on semantic visibility, mainly to solve pose or viewpoint change; (3) training the two networks jointly so that the model generalizes better and can cope with occlusion and pose or viewpoint change simultaneously. The method performs efficient partial pedestrian re-identification based on visible perception texture semantic alignment and human-body-pose local area alignment; it effectively handles occlusion and pose change in pedestrian re-identification, converges quickly, and achieves efficient re-identification under pedestrian occlusion.

Description

Partial pedestrian re-identification method based on visible perception texture semantic alignment
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and relates to a partial pedestrian re-identification method (TSA) based on visible perception texture semantic alignment, which aligns textures and local areas simultaneously and addresses the problems of occlusion and pose change in pedestrian re-identification.
Background
In recent years, with the development of infrastructure, tens of millions of non-overlapping cameras have been installed throughout many cities to secure public safety. In some cases, when a target person disappears from one camera, it is desirable to quickly re-identify that person in the other cameras. This is the pedestrian re-identification (ReID) task, a research focus in computer vision and machine learning. Because of its importance to public security and safety, many approaches have been proposed. These methods all assume that the whole person is fully visible to every camera, but in real monitoring environments occlusion occurs frequently. Owing to the marked difference between a partial figure and a whole figure, most existing ReID methods cannot identify the target person when applied to partial pedestrian re-identification, and their performance drops sharply. Since a partial view of a person may show any part of the body, that part usually must be scaled to a fixed-size image to be matched against whole-body images; unwanted distortion arises during the scaling, degrading performance. Some studies therefore consider how to match arbitrary parts of an occluded pedestrian image, such as SWM, AMC, DSR, and DCR. These methods divide the pedestrian image into independent blocks and compute image similarity from those blocks, but unshared regions remain; the unshared blocks become noise when matching a partial body against a whole body, causing misalignment and harming matching performance.
Disclosure of Invention
The invention aims to solve the problems of inconsistent input image sizes in the classical algorithms SWM [1] and AMC [1], the interference of non-shared region features in the DSR [2] and DCR [3] algorithms, and the human pose changes present in real scenes, and provides a partial pedestrian re-identification method (TSA) based on visible perception texture semantic alignment.
Technical scheme of the invention
A partial pedestrian re-identification method (TSA) based on visible perception texture semantic alignment specifically comprises the following steps:
1, designing a local area alignment network based on human body postures;
step 1.1, proposing a scheme for aligning local regions of human body parts to address the occlusion problem;
for a complete pedestrian picture, 17 human body key points are obtained using pose estimation, determined as the eyes, ears, mouth, shoulders, elbows, hands, hips, knees and feet; the pedestrian is divided longitudinally into 5 regions: head, trunk, upper leg, lower leg and foot, recorded as V_i, i = 1, 2, 3, 4, 5; which region is occluded is then judged from the missing key points; V_i equals 0 if the region is occluded and 1 otherwise;
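As an illustration of step 1.1, the following minimal sketch derives the region-visibility flags V_i from pose-estimation keypoint confidences; the COCO-style keypoint ordering, the confidence threshold, and the helper name `region_visibility` are assumptions for illustration, not part of the patent:

```python
# Map 17 pose keypoints to the 5 longitudinal regions of step 1.1 and mark a
# region occluded (V_i = 0) when none of its keypoints is detected.
# The region-to-keypoint assignment below is an assumed COCO-style ordering.
REGIONS = {
    "head": [0, 1, 2, 3, 4],                 # nose/mouth, eyes, ears
    "trunk": [5, 6, 7, 8, 9, 10, 11, 12],    # shoulders, elbows, hands, hips
    "upper_leg": [11, 12, 13, 14],           # hips, knees
    "lower_leg": [13, 14, 15, 16],           # knees, ankles
    "foot": [15, 16],                        # ankles/feet
}

def region_visibility(keypoint_conf, thresh=0.2):
    """keypoint_conf: list of 17 detection confidences; returns [V_1..V_5]."""
    return [
        1 if any(keypoint_conf[k] > thresh for k in ks) else 0
        for ks in REGIONS.values()
    ]
```

With the upper nine keypoints detected and the rest missing, only the head and trunk regions stay visible.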
step 1.2, further solving the pose transformation problem by using a pixel-level classification scheme;
classifying each pixel point with a softmax classifier, the classes being the 5 region classes obtained in step 1.1; performing softmax classification on the corresponding region of each pedestrian picture, the number of classes being the number of pedestrians in the training set; using the V_i information obtained in step 1.1 to calculate the Euclidean distance between query and gallery, i.e. the distance between the visible blocks of a partial pedestrian picture and the corresponding blocks of a complete pedestrian picture; for the base network, ResNet is selected as the backbone of this step;
step 1.3, calculating the cross entropy loss and the Euclidean distance on the basis of the step 1.2:
a. during training, a classification cross-entropy loss is applied to each pixel point, the class labels being the 5 region indices of step 1.1; b. during training, a classification cross-entropy loss is applied to each picture, the class label being the label of the image in the training set; c. a triplet loss function is designed from the Euclidean distance between pictures computed in step 1.2;
2, designing a texture alignment network based on semantic visibility;
the human body is represented by a 3D mesh and a texture map in UV coordinates; the texture alignment scheme computes the distance between pedestrians from the texture-map features of corresponding body parts, thereby handling changes in human pose and camera viewpoint;
step 2.1, generating a pedestrian image texture map with a texture generator; the texture map in UV coordinates gives the feature map invariance to viewing angle; a human body semantic segmentation model trained with the EANet method classifies the parts of every pedestrian in the ReID data set; the model, trained on the COCO-Part14 data set, divides pedestrian pictures into 14 body-part classes: head, trunk, left upper arm, right upper arm, left lower arm, right lower arm, left hand, right hand, left upper leg, right upper leg, left lower leg, right lower leg, left foot and right foot; the human semantic segmentation information also reveals which parts are missing or occluded;
step 2.2, the human body semantic segmentation model trained in step 2.1 judges which parts of the human body are occluded in a half-body picture, recorded as V_j; V_j equals 0 if occluded and 1 otherwise;
step 2.3, according to the texture map obtained in step 2.1 and the occlusion information V_j obtained in step 2.2, calculating the texture map corresponding to each body part; then concatenating the parts, performing softmax classification on each composed part, and calculating the Euclidean distance between query and gallery; for the base network, ResNet is again selected as the backbone of step 2;
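As a sketch of the masking in step 2.3 (an illustration, not the patent's implementation; the array shapes and the helper name `masked_part_features` are assumptions), the features of occluded parts can be zeroed with the V_j flags before the parts are concatenated:

```python
import numpy as np

def masked_part_features(part_feats, visibility):
    """part_feats: (14, d) array, one feature vector per body part;
    visibility: 14 flags V_j in {0, 1}.
    Returns the concatenated (14*d,) vector with occluded parts zeroed."""
    v = np.asarray(visibility).reshape(-1, 1)   # (14, 1) broadcastable mask
    return (part_feats * v).reshape(-1)         # occluded rows become zeros
```

Zeroing rather than dropping keeps a fixed feature length, so partial and whole-body images stay directly comparable.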
3, joint learning of the two networks;
the local area alignment network based on human body pose of step 1 is designed for the occlusion problem, and the texture alignment network based on semantic visibility of step 2 is designed for pose change; each of the two networks can perform re-identification on its own, but occlusion and pose diversity often occur together in partial pedestrian re-identification; it is therefore necessary to solve both problems at the same time, so the two networks are trained by joint learning;
step 3.1, performing an element-wise addition on the feature maps produced by the two branch networks of steps 1 and 2 to obtain fused features, then combining global and local features to improve ReID performance;
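A minimal sketch of the step-3.1 fusion (the feature-map shapes and the use of global average pooling for the global feature are assumptions for illustration):

```python
import numpy as np

def fuse_branches(pra_feat, tea_feat):
    """pra_feat, tea_feat: (C, H, W) feature maps from the PRA and TEA
    branches. Returns (fused_map, global_feature)."""
    fused = pra_feat + tea_feat              # element-wise addition
    global_feat = fused.mean(axis=(1, 2))    # (C,) global descriptor
    return fused, global_feat
```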
step 3.2, performing softmax classification with the global features; performing softmax classification on the fused features obtained in step 3.1 and computing the cross-entropy loss, the number of classes being the number of pedestrians;
step 3.3, matching human body part blocks with the local features, computing the Euclidean distances between the body-part blocks of the query and gallery images using the features fused in step 3.1, and designing a triplet loss function;
4, selecting a model training data set and a model testing data set, and verifying the effectiveness of the algorithm on the testing data set;
to approximate real scenes, Market1501 is used as the training set, and half-body pictures are obtained by cropping 0-50% off the whole-body pictures; the test sets are two half-body data sets, Partial REID and Partial-iLIDS; Partial REID has 600 pictures of 60 pedestrians, each with 5 whole-body and 5 half-body pictures; Partial-iLIDS has 476 pictures of 119 pedestrians, each with 3 whole-body and 1 half-body picture. On the two test sets, the method improves Rank-1 by 5% and 6.4%, respectively, over VPM, the best prior method.
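The training-data preparation above can be sketched as follows; cropping from the bottom of the image is one plausible reading of "cutting a whole-body picture in a proportion of 0-50%", and the helper name `make_partial` is an assumption:

```python
import random

def make_partial(img_h, img_w, max_ratio=0.5):
    """Simulate a partial pedestrian from a whole-body Market1501 image:
    remove a random 0-50% of the image height from the bottom.
    Returns a (top, left, height, width) crop box."""
    ratio = random.uniform(0.0, max_ratio)            # fraction to cut away
    new_h = max(1, int(round(img_h * (1.0 - ratio))))  # remaining height
    return (0, 0, new_h, img_w)
```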
Advantages and beneficial effects of the invention:
1) texture semantic alignment (TEA) gives the features spatial invariance; 2) local area alignment based on human body pose (PRA) adaptively aligns partial pedestrian images with whole-body images, minimizing the negative effect of irrelevant or occluded regions; 3) pixel-level classification effectively handles pose change; 4) the joint learning strategy optimizes the model and improves the convergence rate.
Drawings
Fig. 1 is a flow chart of a part of a pedestrian re-identification method TSA according to the present invention.
Fig. 2 is a structural diagram of adaptive alignment of a human posture local region.
FIG. 3 is a block diagram of texture semantic alignment.
FIG. 4 shows the correspondence between a texture map in the TEA sub-network branch and a human region in the PRA sub-network, where: a. the original pedestrian picture; b. the image divided longitudinally into 5 parts by pose estimation; c. the class distribution of each pixel in the image; d. the 14 classes obtained by human semantic segmentation; e. the generated texture map divided into the 14 corresponding human body parts according to the semantic segmentation information, after which the parts are merged to align with the features of a and b. After human semantic segmentation the pedestrian image is divided into the 14 classes shown in d; the body parts are then stitched as in e so that they align semantically with b and c.
Fig. 5 is a comparison between the existing method for solving the problem of pedestrian re-identification occlusion and the method proposed in the present invention on rank-k, wherein the corresponding documents of the comparison method in fig. 5 are as follows:
[1]Wei Shi Zheng,Li Xiang,Xiang Tao,Shengcai Liao,Jianhuang Lai,and Shaogang Gong.Par-tial person re-identification.In IEEE International Con-ference on Computer Vision(CVPR),2016.
[2]Lingxiao He,Jian Liang,Haiqing Li,and Zhenan Sun.Deep spatial feature reconstruction for par-tial person re-identification:Alignment-free approach.In Computer Vision and Pattern Recognition(CVPR),2018.
[3]Zan Gao,Lishuai Gao,Hua Zhang,Zhiyong Cheng,and Richang Hong.Deep spatial pyramid features collaborative reconstruction for partial person reid.In ACM International Conference on Multimedia,2019.
[4]Xin Jin,Cuiling Lan,Wenjun Zeng,Guo-qiang Wei,and Zhibo Chen.Semantics-aligned repre-sentation learning for person re-identification.In Thirty-Fourth AAAI Conference on Artificial Intelligence,2020.
[5]Hao Luo,Xing Fan,Chi Zhang,and Wei Jiang.Stnreid:Deep convolutional networks with pair-wise spatial transformer networks for partial person re-identification.In IEEE Conference on Computer Vision and Pattern Recognition(CVPR),2019.
[6]Yifan Sun,Qin Xu,and Yali et al.Li.Perceive where to focus:Learning visibility-aware part-level features for partial person re-identification.In IEEE Conference on Computer Vision and Pattern Recognition(CVPR),2019.
FIG. 6 compares the ROC curves of existing methods for the occluded pedestrian re-identification problem with the proposed method, showing a clear advantage for TSA. A is the comparison on the Partial-iLIDS data set and B the comparison on the Partial REID data set.
FIG. 7 compares the joint learning strategy with the texture alignment network based on semantic visibility and the local area alignment network based on human body pose used separately. A is the comparison on the Partial REID data set and B the comparison on the Partial-iLIDS data set; the jointly learned model generalizes better.
FIG. 8: graphs A and B show the effect of visibility information on part-feature alignment on the Partial REID and Partial-iLIDS data sets, respectively, demonstrating its positive effect on the robustness of the algorithm; graphs C and D show its effect on texture-feature alignment on the same two data sets.
Fig. 9 shows the rapid convergence of TSA during training.
Detailed Description
The network design proceeds in 3 specific steps: first, a local area alignment network based on human body pose, aimed mainly at the occlusion problem; then, a texture alignment network based on the visibility of human semantic information, aimed mainly at pose and camera-angle changes; finally, a joint learning network that unifies the two directions. The invention is further described below with reference to the accompanying drawings.
Example 1
Fig. 1 shows the workflow of the partial pedestrian re-identification method (TSA) based on visible perception texture semantic alignment; it comprises 3 parts: 1. a local area alignment network based on human body pose (PRA); 2. a texture alignment network based on the visibility of human semantic information (TEA); 3. the joint learning strategy. The operation steps are as follows:
step 1, designing a local area alignment network based on human body posture
As shown in the lower branch of Fig. 2, the pedestrian is divided into 5 regions using the 17 key points obtained by pose estimation (KD); which region is occluded is then judged from the missing key points, recorded as V_i (0 if occluded, 1 otherwise). The ID classification loss over the visible regions is therefore

L_{id}^{pra} = \sum_{i=1}^{5} V_i \, L_{id}^{i}

where L_{id}^{i} denotes the IDE classification loss of the feature of region i, so that only the regions visible for each pedestrian contribute.
The pose transformation problem is then further addressed with a pixel-level classification scheme. Each pixel point passes through the lower branch and is softmax-classified into the 5 regions; the loss of this step is

L_{pixel} = -\sum_{j} \sum_{i=1}^{5} \Gamma \, \log P(R_i \mid h_j)

where h_j denotes the j-th pixel value, R_i the index of the i-th divided region, and P(R_i | h_j) the probability that h_j belongs to R_i, obtained by softmax after the feature map passes through a 1x1 convolution layer W; Γ = 1 when pixel h_j belongs to region R_i, otherwise Γ = 0.
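The pixel-level loss can be sketched as below; this is an illustration, not the patent's code, and the dense array shapes and the name `pixel_region_loss` are assumptions (the 1x1 convolution is modeled as a per-pixel matrix multiply, which is equivalent):

```python
import numpy as np

def pixel_region_loss(feat, region_labels, W):
    """feat: (H, W, C) pixel features; region_labels: (H, W) ints in 0..4
    giving the region each pixel falls in; W: (C, 5) weights of the 1x1-conv
    classifier. Returns the mean cross-entropy over all pixels."""
    logits = feat @ W                                    # (H, W, 5)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    h, w = region_labels.shape
    # log-probability of the true region at every pixel
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], region_labels]
    return -picked.mean()
```

With untrained (zero) weights the prediction is uniform over the 5 regions, so the loss equals log 5.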
According to the V_i signal produced by the PRA branch, the Euclidean distance between query and gallery is calculated as

D_{pra} = \sum_{i=1}^{p} V_i \, \lVert f_{q}^{R_i} - f_{g}^{R_i} \rVert_2

where f_q^{R_i} denotes the feature of region R_i in the query image, f_g^{R_i} the feature of region R_i in the gallery image, and p the number of longitudinal blocks of the pedestrian image; in this work p = 5.
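A sketch of a visibility-aware part distance in this spirit; note the mutual-visibility mask, the normalisation by the number of shared regions, and the helper name `visible_part_distance` are assumptions added for illustration, not stated in the patent:

```python
import numpy as np

def visible_part_distance(fq, fg, vq, vg):
    """fq, fg: (p, d) region features for query and gallery; vq, vg: p
    visibility flags V_i. Only regions visible in BOTH images contribute;
    returns the mean Euclidean distance over those shared regions."""
    vq, vg = np.asarray(vq), np.asarray(vg)
    shared = (vq * vg).astype(bool)         # regions visible in both images
    if not shared.any():
        return np.inf                       # no comparable region at all
    diffs = fq[shared] - fg[shared]
    return np.linalg.norm(diffs, axis=1).mean()
```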
Analysis of the experimental results of step 1 (the TSA-PRA entry in Fig. 5) shows that the PRA network performs well on the occlusion problem.
Step 2, designing texture alignment network based on semantic visibility
As shown in the texture semantic alignment flow of Fig. 3, a pedestrian image texture map is generated in the lower branch with a texture generator; the upper branch then uses human semantic information to determine whether each part is occluded; multiplying the features obtained by the two branches yields the texture map corresponding to each body part. To semantically align the image features generated by the lower branch (TEA) and the upper branch (PRA) of Fig. 1, i.e. the correspondence explained for Fig. 4 in the description of the drawings, the texture map is fused using the strategy of Fig. 4.
The upper-branch pedestrian part semantic segmentation model (PPS) judges which parts of the human body are occluded, recorded as V_j (0 if occluded, 1 otherwise). The ID classification loss over the visible parts is therefore

L_{id}^{tea} = \sum_{j} V_j \, L_{id}^{j}

where L_{id}^{j} denotes the IDE classification loss of the feature of part j, so that only the parts visible for each pedestrian contribute.
With the V_j signal obtained from the PPS, the features of invisible parts are set to the zero matrix, so the Euclidean distance between query and gallery is

D_{tea} = \sum_{j=1}^{p} V_j \, \lVert f_{q}^{R_j} - f_{g}^{R_j} \rVert_2

where f_q^{R_j} denotes the feature of block R_j in the query image, f_g^{R_j} the feature of block R_j in the gallery image, and p the number of blocks into which the pedestrian texture maps are merged following the strategy of Fig. 4; in this work p = 5.
Analysis of the experimental results of step 2 (the TSA-TEA entry in Fig. 5) shows that the TEA network also performs well on the occlusion problem.
Step 3, the two networks are subjected to joint learning
The texture alignment network based on semantic visibility (TEA) is designed for pose change, and the local area alignment network based on human body pose (PRA) for the occlusion problem, but occlusion and pose diversity often occur together in partial pedestrian re-identification; both problems must therefore be solved at once.
First, the ID classification losses of step 1 and step 2 are summed:

L_{id} = L_{id}^{pra} + L_{id}^{tea}

Then the hard-sample triplet loss is constructed. For each batch, P identities are chosen at random and Q pictures are picked for each, giving P x Q pictures per batch; for each anchor in the batch we take the hardest positive sample (the farthest of all positives) and the hardest negative sample (the closest of all negatives) in the batch. The distance between the anchor and a positive or negative picture is the sum of the Euclidean distances of step 1 and step 2:

d_{a,p} = D_{pra}(a, p) + D_{tea}(a, p)

d_{a,n} = D_{pra}(a, n) + D_{tea}(a, n)

The hard-sample triplet loss thus constructed is

L_{tri} = \max\bigl(0, \; m + \max_{p} d_{a,p} - \min_{n} d_{a,n}\bigr)

where m denotes the margin.
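The batch-hard mining described above can be sketched as follows; this is an illustrative implementation of standard batch-hard triplet mining, not the patent's code, and the margin value and function name are assumptions:

```python
import numpy as np

def batch_hard_triplet(dist, labels, margin=0.3):
    """dist: (N, N) pairwise distances within a P x Q batch (e.g. the sum of
    the PRA and TEA distances); labels: (N,) identity labels.
    For each anchor, take the farthest positive and the nearest negative,
    then average the hinge losses over the batch."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    pos = np.where(same, dist, -np.inf)
    np.fill_diagonal(pos, -np.inf)              # an anchor is not its own positive
    hardest_pos = pos.max(axis=1)               # farthest positive per anchor
    neg = np.where(~same, dist, np.inf)
    hardest_neg = neg.min(axis=1)               # nearest negative per anchor
    return np.maximum(0.0, margin + hardest_pos - hardest_neg).mean()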
In the analysis of the experimental results of fig. 5 and fig. 7, it is found that the performance of the model involved by the combined learning strategy is further obviously improved.
Step 4 model training and testing
The model built in step 3 is trained with Market1501 as the training set; the loss function of the whole training process is the sum of the losses above:

L = L_{id} + L_{pixel} + L_{tri}
Testing proceeds in the following steps:
1. compared with the current most advanced method;
2. evaluating the superiority of the joint learning;
3. and analyzing the advantages of the visible perception method.
The experimental tests were performed on two public half-body data sets, Partial REID and Partial-iLIDS. Following common practice, the model is evaluated with the average Cumulative Matching Characteristics (CMC) curve at Rank-k and the Receiver Operating Characteristic (ROC) curve. As shown in Fig. 5 and Fig. 6, our method achieves higher accuracy. In Fig. 7, TSA-PRA denotes training with the PRA branch only, TSA-TEA training with the TEA branch only, and TSA the jointly learned model; the comparison of the results shows the effectiveness of joint learning. The comparison in Fig. 8 shows that the visibility-aware scheme matters greatly for the generalization ability of the model. Fig. 9 shows that the TSA method converges well during training.
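The Rank-k metric used above can be sketched as a simple CMC computation; the function name and the single-gallery-shot simplification are assumptions for illustration:

```python
import numpy as np

def cmc_rank_k(dist, q_ids, g_ids, k=1):
    """Rank-k matching rate from a (num_query, num_gallery) distance matrix:
    a query counts as a hit if any of its k nearest gallery images shares
    its identity."""
    order = np.argsort(dist, axis=1)       # gallery indices sorted per query
    hits = 0
    for i, row in enumerate(order):
        hits += q_ids[i] in [g_ids[j] for j in row[:k]]
    return hits / len(q_ids)
```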
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the invention, not to limit them. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (1)

1. A partial pedestrian re-identification method based on semantic alignment of visible perception textures specifically comprises the following steps:
1, designing a local area alignment network based on human body postures;
step 1.1, proposing a scheme for aligning local regions of human body parts to address the occlusion problem;
for a complete pedestrian picture, 17 human body key points are obtained using pose estimation, determined as the eyes, ears, mouth, shoulders, elbows, hands, hips, knees and feet; the pedestrian is divided longitudinally into 5 regions: head, trunk, upper leg, lower leg and foot, recorded as V_i, i = 1, 2, 3, 4, 5; which region is occluded is then judged from the missing key points; V_i equals 0 if the region is occluded and 1 otherwise;
step 1.2, further solving the pose transformation problem by using a pixel-level classification scheme;
classifying each pixel point with a softmax classifier, the classes being the 5 region classes obtained in step 1.1; performing softmax classification on the corresponding region of each pedestrian picture, the number of classes being the number of pedestrians in the training set; using the V_i information obtained in step 1.1 to calculate the Euclidean distance between query and gallery, i.e. the distance between the visible blocks of a partial pedestrian picture and the corresponding blocks of a complete pedestrian picture; for the base network, ResNet is selected as the backbone of this step;
step 1.3, calculating the cross entropy loss and the Euclidean distance on the basis of the step 1.2:
a. during training, a classification cross-entropy loss is applied to each pixel point, the class labels being the 5 region indices of step 1.1; b. during training, a classification cross-entropy loss is applied to each picture, the class label being the label of the image in the training set; c. a triplet loss function is designed from the Euclidean distance between pictures computed in step 1.2;
2, designing a texture alignment network based on semantic visibility;
the human body is represented by a 3D mesh and a texture map in UV coordinates; the texture alignment scheme computes the distance between pedestrians from the texture-map features of corresponding body parts, thereby handling changes in human pose and camera viewpoint;
step 2.1, generating a pedestrian image texture map with a texture generator; the texture map in UV coordinates gives the feature map invariance to viewing angle; a human body semantic segmentation model trained with the EANet method classifies the parts of every pedestrian in the ReID data set; the model, trained on the COCO-Part14 data set, divides pedestrian pictures into 14 body-part classes: head, trunk, left upper arm, right upper arm, left lower arm, right lower arm, left hand, right hand, left upper leg, right upper leg, left lower leg, right lower leg, left foot and right foot; the human semantic segmentation information also reveals which parts are missing or occluded;
step 2.2, the human body semantic segmentation model trained in step 2.1 judges which parts of the human body are occluded in a half-body picture, recorded as V_j; V_j equals 0 if occluded and 1 otherwise;
Step 2.3, using the texture maps obtained in step 2.1 and the per-part visibility flags obtained in step 2.2, the texture-map features corresponding to each body part are computed; the part features are then concatenated, each synthesized part is classified with softmax, and the Euclidean distance between query and gallery is computed; for the basic network of step 2, ResNet is chosen as the backbone;
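The visibility-aware matching of steps 2.2–2.3 can be sketched as follows: each image carries a visibility flag per body part (0 = occluded, 1 = visible), and the query-gallery distance is averaged only over the 14 parts visible in both images. The function name and the averaging rule are illustrative assumptions, not the patent's exact formulation:

```python
import numpy as np

def visible_part_distance(q_feats, q_vis, g_feats, g_vis):
    """Euclidean distance between a query and a gallery image, computed
    only over body parts visible in BOTH images.
    q_feats, g_feats: (14, D) per-part texture-map features
    q_vis, g_vis:     (14,)  visibility flags, 1 = visible, 0 = occluded"""
    shared = (q_vis * g_vis).astype(bool)     # parts visible in both images
    if not shared.any():                      # no common visible part
        return np.inf
    d = np.linalg.norm(q_feats[shared] - g_feats[shared], axis=1)
    return d.mean()
```

Averaging over shared parts keeps the distance comparable between pairs with different numbers of visible parts.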
Step 3, joint learning of the two networks;
The human-pose-based local region alignment network of step 1 is designed specifically to handle occlusion, while the semantic-visibility-based texture alignment network of step 2 is designed specifically to handle pose changes. Each network can perform re-identification on its own, but in partial pedestrian re-identification tasks occlusion and pose diversity usually occur together; both problems therefore need to be solved at the same time, so the two networks are trained together by joint learning;
Step 3.1, the feature maps produced by the two branch networks of steps 1 and 2 are fused by an element-wise add operation; the global features are then combined with the local features to improve ReID performance;
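The fusion step can be sketched as follows; the element-wise add matches the description above, while the global pooling and the horizontal-stripe local features are illustrative stand-ins for the patent's part regions:

```python
import numpy as np

def fuse_branches(pose_feat, texture_feat):
    """Step 3.1: element-wise add of the two branch feature maps.
    Both are (C, H, W); matching shapes are assumed here."""
    assert pose_feat.shape == texture_feat.shape
    return pose_feat + texture_feat

def global_and_local(fused, n_parts=14):
    """Global feature: average over all spatial positions.
    Local features: average within n_parts horizontal stripes
    (a simple stand-in for the 14 body-part regions)."""
    c, h, w = fused.shape
    global_feat = fused.mean(axis=(1, 2))                       # (C,)
    stripes = np.array_split(np.arange(h), n_parts)
    local_feats = np.stack([fused[:, rows, :].mean(axis=(1, 2))
                            for rows in stripes])               # (n_parts, C)
    return global_feat, local_feats
```

The global feature feeds the softmax classifier of step 3.2, and the per-part local features feed the part matching of step 3.3.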
Step 3.2, softmax classification with the global features; softmax classification is performed on the fused features obtained in step 3.1 and a cross-entropy loss is computed, where the number of classes is the number of pedestrian identities;
Step 3.3, the local features are used to match body-part blocks; the Euclidean distances between corresponding body-part blocks of the query and gallery images are computed on the fused features from step 3.1, and a triplet loss function is designed;
Step 4, selecting training and test datasets, and verifying the effectiveness of the algorithm on the test sets;
To approximate real scenes, Market-1501 is used as the training set, and half-body pictures are obtained by cropping 0-50% off the whole-body pictures. Two partial-person datasets, Partial-REID and Partial-iLIDS, serve as test sets: Partial-REID contains 600 pictures of 60 pedestrians, each with 5 whole-body and 5 half-body pictures; Partial-iLIDS contains 476 pictures of 119 pedestrians, each with 3 whole-body pictures and 1 half-body picture. On the two test sets, the method improves Rank-1 accuracy by 5% and 6.4%, respectively, over VPM, the best prior method.
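The half-body synthesis described above can be sketched as a simple row crop; the patent does not say which side of the picture is removed, so dropping rows from the bottom is an assumption made here:

```python
import numpy as np

def make_half_body(img, rng):
    """Synthesize a partial (half-body) training picture by removing
    0-50% of the rows from a whole-body picture (H, W, 3).
    Cropping from the bottom is an assumption, not stated in the patent."""
    h = img.shape[0]
    ratio = rng.uniform(0.0, 0.5)        # fraction of the height to remove
    keep = h - int(round(h * ratio))
    return img[:keep]
```

Applying this to every Market-1501 training image yields partial pictures whose occlusion ratio varies continuously between 0% and 50%.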
CN202010708118.5A 2020-07-21 2020-07-21 Partial pedestrian re-identification method based on visible perception texture semantic alignment Active CN111797813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010708118.5A CN111797813B (en) 2020-07-21 2020-07-21 Partial pedestrian re-identification method based on visible perception texture semantic alignment

Publications (2)

Publication Number Publication Date
CN111797813A CN111797813A (en) 2020-10-20
CN111797813B true CN111797813B (en) 2022-08-02

Family

ID=72827271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010708118.5A Active CN111797813B (en) 2020-07-21 2020-07-21 Partial pedestrian re-identification method based on visible perception texture semantic alignment

Country Status (1)

Country Link
CN (1) CN111797813B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580525B (en) * 2020-12-22 2023-05-23 南京信息工程大学 Case activity track monitoring method based on pedestrian re-identification
CN113139504B (en) * 2021-05-11 2023-02-17 支付宝(杭州)信息技术有限公司 Identity recognition method, device, equipment and storage medium
CN113743239A (en) * 2021-08-12 2021-12-03 青岛图灵科技有限公司 Pedestrian re-identification method and device and electronic equipment
CN114842512B (en) * 2022-07-01 2022-10-14 山东省人工智能研究院 Shielded pedestrian re-identification and retrieval method based on multi-feature cooperation and semantic perception

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224937A (en) * 2015-11-13 2016-01-06 Wuhan University Fine-grained semantic-color person re-identification method based on human part position constraints
CN107832672A (en) * 2017-10-12 2018-03-23 Beihang University Person re-identification method using pose information to design multiple loss functions
CN108960127A (en) * 2018-06-29 2018-12-07 Xiamen University Occluded person re-identification method based on adaptive deep metric learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Pedestrian Alignment Network for ...; Zhedong Zheng et al.; arXiv; 2017-07-03; pp. 1-13 *
Person re-identification based on multi-scale convolutional feature fusion; Xu Longzhuang et al.; Laser & Optoelectronics Progress; July 2019; vol. 56, no. 14; pp. 141504-1-7 *
Person re-identification method based on feature fusion; Zhang Gengning et al.; Computer Engineering and Applications; 2016-05-10; vol. 53, no. 12; pp. 185-189 *

Also Published As

Publication number Publication date
CN111797813A (en) 2020-10-20

Similar Documents

Publication Publication Date Title
CN111797813B (en) Partial pedestrian re-identification method based on visible perception texture semantic alignment
CN111339903B (en) Multi-person human body posture estimation method
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
CN112597941B (en) Face recognition method and device and electronic equipment
Zhang et al. Semantic-aware dehazing network with adaptive feature fusion
WO2018133119A1 (en) Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera
CN111160291A (en) Human eye detection method based on depth information and CNN
CN114119739A (en) Binocular vision-based hand key point space coordinate acquisition method
CN111597978B (en) Method for automatically generating pedestrian re-identification picture based on StarGAN network model
CN116092190A (en) Human body posture estimation method based on self-attention high-resolution network
Zhang et al. Removing Foreground Occlusions in Light Field using Micro-lens Dynamic Filter.
Gong et al. Dark-channel based attention and classifier retraining for smoke detection in foggy environments
CN114120389A (en) Network training and video frame processing method, device, equipment and storage medium
Zhu et al. Occlusion-free scene recovery via neural radiance fields
Zhu et al. Spectral Dual-Channel Encoding for Image Dehazing
Chen et al. Face recognition with masks based on spatial fine-grained frequency domain broadening
Vidyamol et al. An improved dark channel prior for fast dehazing of outdoor images
CN111709997B (en) SLAM implementation method and system based on point and plane characteristics
Fu et al. CBAM-SLAM: A Semantic SLAM Based on Attention Module in Dynamic Environment
CN113609993A (en) Attitude estimation method, device and equipment and computer readable storage medium
CN113269089A (en) Real-time gesture recognition method and system based on deep learning
Zhang et al. Light field occlusion removal network via foreground location and background recovery
Chen et al. Labelled silhouettes for human pose estimation
Wei et al. On active camera control and camera motion recovery with foveate wavelet transform
Lei et al. Human Pose Estimation of Diver Based on Improved Stacked Hourglass Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant