CN116229580A - Pedestrian re-identification method based on multi-granularity pyramid intersection network - Google Patents

Pedestrian re-identification method based on multi-granularity pyramid intersection network

Info

Publication number
CN116229580A
CN116229580A (application CN202310285479.7A)
Authority
CN
China
Prior art keywords
granularity
pyramid
pedestrian
network
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310285479.7A
Other languages
Chinese (zh)
Inventor
苗夺谦 (Miao Duoqian)
李燕平 (Li Yanping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202310285479.7A
Publication of CN116229580A
Status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a pedestrian re-identification method based on a multi-granularity pyramid intersection network, which progressively learns the salient features of different local structures in a global context. The network consists essentially of two new designs: a multi-granularity convolution layer and a pyramid cross-Transformer learning layer. The former is intended to simulate human vision and learn salient pedestrian characteristics at different granularities. The latter aims at mining local information in the global structure from a coarse-to-fine perspective. Considering that deep layers focus on more semantic information, the method introduces a hierarchical aggregation module to integrate the features learned at different stages of cross-attention learning, and the pedestrian features learned in shallow layers serve as a global prior for the deep semantic information. The invention maintains good generality and robustness even when pedestrian images suffer from problems such as occlusion.

Description

Pedestrian re-identification method based on multi-granularity pyramid intersection network
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian re-identification method based on a multi-granularity pyramid intersection network.
Background
Pedestrian Re-Identification (Re-ID) plays a vital role in modern intelligent surveillance applications such as pedestrian retrieval and behavior analysis. However, Re-ID faces challenges such as occlusion, low resolution, and view, pose, scene, clothing, and illumination changes. It has therefore attracted the attention of many researchers.
To learn discriminative features of pedestrian images, many attempts have been made to design effective structures that are robust to the challenges described above. Some studies (X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue. Multi-scale deep learning architectures for person re-identification. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 5399–5408, 2017; Z. Zhang, C. Lan, W. Zeng, X. Jin, and Z. Chen. Relation-aware global attention for person re-identification. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 3186–3195, 2020.) adapted convolutional neural network (CNN) architectures that had succeeded in other computer vision tasks to extract robust features of pedestrian images. Although CNN-based methods achieve good results in some specific cases, they are not sufficiently robust because of the limited receptive field of CNNs. Furthermore, the downsampling operations (e.g., pooling) of CNNs reduce the spatial resolution of feature maps and lose fine-grained detail features, which is detrimental to distinguishing pedestrians of similar appearance. More importantly, architectures that perform well in other vision areas do not fit well with some of the specific challenges of pedestrian re-identification. It is therefore imperative to propose a design specific to pedestrian re-identification.
In recent years, Transformers have been incorporated into a variety of computer vision tasks, including image classification, object detection, and recognition, because they can model long-range dependencies. Transformer-based approaches in Re-ID also produce matching results comparable to CNN-based algorithms. However, pure Transformers cannot guarantee the translation and scale invariance that Re-ID tasks often require. To exploit the long-range dependency modeling capability of Transformers while maintaining the translation and scale invariance of CNNs, Zhang et al. (G. Zhang, P. Zhang, J. Qi, and H. Lu. HAT: Hierarchical aggregation transformers for person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, pages 516–525, 2021.) proposed an architecture consisting of a CNN for global feature learning and a Transformer for local feature learning, which achieves good results on public large-scale Re-ID datasets. However, the input of a pure Transformer is a single-granularity full feature map of the pedestrian image, which limits the Transformer's ability to extract rich local information. Previous studies (Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 480–496, 2018; M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. H. Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell., 44(6):2872–2893, 2021.) have shown that horizontal partitioning helps networks extract rich local information from pedestrian images. In addition, the network should also mine the discriminative semantic information implied by the various local structures in the global feature map.
Although deep learning has been introduced into pedestrian re-identification and has achieved breakthrough results, there is still a long way to go before it can be applied in real-world scenarios. To address the problem that occlusion, illumination, pose, and similar factors can make different pedestrians appear more alike than images of the same pedestrian, a generally robust network structure suited to pedestrian re-identification needs to be designed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a pedestrian re-identification method based on a multi-granularity pyramid cross network. It introduces a multi-granularity approach to counter environmental factors such as illumination, occlusion, and pose variations in pedestrian images, and exploits the strong long-range dependency modeling capability of the Transformer to build a discriminative multi-granularity pyramid cross network.
The aim of the invention can be achieved by the following technical scheme:
the invention provides a pedestrian re-identification method based on a multi-granularity pyramid intersection network, which comprises the following steps:
acquiring image information, inputting the preprocessed image information into a pre-trained multi-granularity pyramid cross network to acquire output features, and acquiring images matching the target pedestrian from a preset gallery based on the output features to realize pedestrian re-identification,
wherein the multi-granularity pyramid intersection network comprises multi-level cascaded hierarchical aggregation units, and each hierarchical aggregation unit comprises:
a multi-granularity convolution layer, used for acquiring salient pedestrian features at different granularities;
and a pyramid cross-Transformer learning layer, used for capturing discriminative local features from coarse to fine granularity based on the input features, as the output of the hierarchical aggregation unit of this stage; except for the first stage, the input features are obtained from the output of the multi-granularity convolution layer of this stage and the output of the previous stage.
As a preferred technical solution, each hierarchical aggregation unit further includes a scale transformation layer disposed between the multi-granularity convolution layer and the pyramid cross-Transformer learning layer, configured to apply global max pooling to the output of the multi-granularity convolution layer to suppress background information.
As a preferred technical solution, the multi-granularity pyramid cross network further comprises a backbone network connected with each hierarchical aggregation unit, and the output features are obtained based on the output of the backbone network and the output of the last hierarchical aggregation unit.
As a preferred technical solution, the backbone network is a ResNet50 network.
As a preferred technical solution, the processing of features by the pyramid cross-Transformer learning layer comprises the following steps:
performing embedding processing on the input features to obtain an input feature map;
splitting the input feature map into a plurality of local feature vectors, acquiring the corresponding local attention features, and combining the local attention features to obtain an overall feature map;
and performing channel MLP processing on the overall feature map to obtain the discriminative local features.
As a preferred embodiment, the local attention feature is obtained by the following formula:
Yij = σ(Qij Ki^T / √d) Vi
where Yij is the local attention feature, σ(·) is the softmax activation function, Qij is the discriminative feature of that scale, obtained by linearly transforming the split local feature vector Xij of the input feature map, Ki and Vi are vectors obtained from the input feature map Xi by linear transformation, Ki^T is the transpose of Ki, d is the embedding dimension, i denotes the pyramid level, and j = 1, 2, …, m denotes the index of the local feature.
As a preferred technical solution, the channel MLP processing is implemented by the following formula:
Z = ζ(Norm(Y) W1) W2 + Y
where Z is the discriminative output feature, ζ(·) is the GELU activation function, Norm(·) denotes layer normalization, Y denotes the overall feature map, W1 ∈ R^(d×τd) and W2 ∈ R^(τd×d) are learnable parameters, and τ is the expansion ratio.
As a preferred technical solution, the process for obtaining the pre-trained multi-granularity pyramid cross network includes the following steps:
acquiring a training sample set, training the multi-granularity pyramid intersection network on it, and obtaining the pre-trained multi-granularity pyramid intersection network after the value of a loss function reaches a preset convergence condition;
wherein the loss function is obtained based on a verification loss, a classification loss, and an auxiliary loss.
As a preferred solution, the auxiliary loss is calculated by the following formula:
L_aux = Σ_{s=1}^{S} (L_tri^(s) + L_cls^(s))
where L_aux is the auxiliary loss, L_tri is the verification loss, L_cls is the classification loss, and S is the total number of stages.
As a preferred technical solution, in each hierarchical aggregation unit, the multi-granularity convolution layer is sequentially connected with the pyramid cross-Transformer learning layer.
Compared with the prior art, the invention has the following advantages:
(1) Generality and robustness are effectively improved: Aiming at the problem that existing methods ignore the discriminative semantic information implied by various local structures in the global feature map, the method provides a multi-granularity pyramid cross network (Multi-granularity Cross Transformer Network, MCTN) comprising multi-level hierarchical aggregation units, each of which comprises a multi-granularity convolution layer and a pyramid cross-Transformer learning layer. Since the multi-level hierarchical aggregation units form a multi-level cascade structure, the model can better exploit the content of different levels and further mine the latent semantic information in pedestrian images. In a complex multi-camera scene, the method can rapidly locate and retrieve all occurrences of a specific pedestrian, improving generality and robustness.
(2) Good model training effect: In the training stage, to facilitate learning strong feature representations, an auxiliary loss composed of a verification loss and a classification loss is added at different stages of the hierarchical aggregation module, and the verification loss, classification loss, and auxiliary loss jointly supervise the learning of the multi-granularity cross-Transformer network, improving the model training effect.
Drawings
FIG. 1 is a flow chart of a pedestrian re-recognition method based on a multi-granularity pyramid intersection network in embodiment 1;
fig. 2 is a schematic diagram of a multi-granularity pyramid crossover network.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
Example 1
As shown in fig. 1, this embodiment provides a pedestrian re-identification method based on a multi-granularity pyramid intersection network, which addresses the problem that existing methods ignore the discriminative semantic information implied by various local structures in the global feature map, and can quickly locate and retrieve all occurrences of a specific pedestrian in a complex multi-camera scene. The method comprises the following steps:
step S1, inputting a pedestrian image I to be queried 1
Step S2, global features are obtained through processing of a backbone network ResNet 50;
step S3, inserting a multi-granularity convolution layer and a pyramid cross-transducer learning layer in the first three stages of the backbone network;
s4, aggregating the feature graphs at different stages by using a multi-level aggregation module, and mining more abundant local semantic information;
step S5, in the training stage, auxiliary losses are added in the first three stages to carry out supervision training in addition to the loss of the main network;
step S6, in the testing stage, the characteristics obtained by the main network and the multi-level aggregation module are fused to be used as final characteristic representation for testing;
and S7, outputting a final pedestrian image characteristic representation.
As illustrated in fig. 2, the multi-granularity pyramid cross network mainly consists of five parts:
Backbone network: Considering the richness of the ImageNet dataset, a ResNet trained on it has strong feature representation capability. We therefore adopt a ResNet50 network pre-trained on ImageNet as the backbone architecture to extract global features of pedestrian images.
Multi-granularity convolution layer: The multi-granularity convolution layer simulates the process by which human vision perceives things from different perspectives. Through this layer, finer-granularity features can be extracted, enhancing the performance of the model.
Scale transformation: To reduce the model's parameters and facilitate the integration of subsequent networks, global max pooling (GMP) is applied after the multi-granularity convolution layer. GMP suppresses background information and extracts salient pedestrian features, making them more compact.
Pyramid cross-Transformer learning layer: This layer is designed as a pyramid, guiding the network to find salient pedestrian features from a coarse-to-fine perspective. The attention features obtained at coarse granularity are further partitioned and finer-granularity features are learned, so that pedestrian information at different granularities complements each other.
Hierarchical aggregation module: It is composed of multiple hierarchical aggregation units and acquires more comprehensive pedestrian features. Since shallow layers contain more detail information, features computed in shallow layers can be aggregated into deep layers, guiding the deep layers to pay more attention to fine-grained local features. At the same time, through the interaction between shallow and deep layers, the module also helps mine the latent semantic information of the shallow layers.
The three core parts, the multi-granularity convolution layer, the pyramid cross-Transformer learning layer, and the hierarchical aggregation module, are implemented as follows:
(1) Multi-granularity convolution layer: It is well known that not all discriminative features in pedestrian images can be obtained directly through the backbone network, even through ever-deeper networks. Intuitively, the human visual system typically observes things from different perspectives and granularities. To this end, we apply a multi-granularity convolution layer that mimics the human visual system to the first three stages of the backbone network, to more fully mine the semantic information contained in pedestrian images. The input data are analyzed with receptive fields (i.e., granularities) of 3 different sizes: 3×3, 5×5, and 7×7. To capture richer feature information, weights are not shared when computing the feature representations of different granularities. To reduce the parameters required by the network and increase its nonlinear transformation capability, a 5×5 kernel can be decomposed into a cascade of two 3×3 kernels, and a 7×7 kernel into a cascade of three 3×3 kernels. The features extracted at different granularities are fused as the final feature representation of the layer. Note that the convolution operations at the different granularities are implemented as residual blocks: the input is first passed through a 1×1 convolution to compress the number of channels, then a 3×3 convolution to extract features, and then a 1×1 convolution to restore the original number of channels. Finally, the obtained feature is combined with the shortcut output as the final feature representation.
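The kernel-decomposition claim above can be checked with simple receptive-field arithmetic. The sketch below uses two illustrative helpers (`receptive_field`, `conv_params`, not from the patent) to verify that stacked 3×3 convolutions cover the same field as a single 5×5 or 7×7 kernel with fewer parameters:

```python
# Sketch (assumption: standard effective receptive-field arithmetic for
# stacked stride-1 convolutions, and square kernels with equal channel counts).
def receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 convolutions."""
    r = 1
    for k in kernel_sizes:
        r += k - 1
    return r

def conv_params(kernel_sizes, channels):
    """Weight count of stacked conv layers with `channels` in/out channels."""
    return sum(k * k * channels * channels for k in kernel_sizes)

# Two 3x3 kernels see a 5x5 field; three 3x3 kernels see a 7x7 field.
assert receptive_field([3, 3]) == receptive_field([5]) == 5
assert receptive_field([3, 3, 3]) == receptive_field([7]) == 7
# The cascade also needs fewer parameters than the single large kernel.
assert conv_params([3, 3], 64) < conv_params([5], 64)
assert conv_params([3, 3, 3], 64) < conv_params([7], 64)
```

The cascade additionally inserts an extra nonlinearity between kernels, which is the "nonlinear transformation capability" gain the text refers to.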
(2) Pyramid cross-Transformer learning layer: In fact, the performance of the Re-ID task largely depends on the pedestrian image semantic information extracted by the neural network. By deepening the network, the deep layers can learn the semantic information of the pedestrian image to a certain extent. However, the shallow layers contain mostly detail information and only a small amount of semantic information, so a large amount of semantic information remains unmined. For this reason, we propose a pyramid cross-Transformer learning layer based on the Transformer architecture to enrich the diversity of features. It mainly comprises three components: input embedding, cross attention, and channel MLP.
Input embedding: Given a feature map X, it is first subjected to an input encoding process similar to the patch embedding of ViT (A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.). The formula can be expressed as:
X_emb = Norm(InputEmb(X))
where Norm(·) denotes layer normalization, and X_emb ∈ R^(U×Ce) is the embedded sequence with sequence length U and encoding dimension Ce.
Cross attention: to capture discriminative local information in the global structure from a coarse to fine perspective, we explore the local-global relationship using a pyramid structure.
1) Pyramid. The pyramid contains attention calculations at different levels; level i splits the feature map horizontally into 2^(i-1) parts, where 2 is the base and i is the current level. Notably, the horizontal division granularity differs between pyramid levels: as the pyramid level increases, so does the number of partitions. Features learned at the coarse-grained levels are used to guide the network to mine finer-grained local information. Furthermore, considering that too fine a division would compromise the integrity of the local semantic information, we set the number of pyramid levels to 3.
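As an illustration of the horizontal split (reading the split count at level i as 2^(i-1), consistent with the base-2 description above; `pyramid_split` is a hypothetical helper, not the patent's code):

```python
# Sketch (assumption: "horizontal split" divides the feature map along its
# height into 2**(i-1) equal stripes at pyramid level i, with levels 1..3).
import numpy as np

def pyramid_split(feature_map, level):
    """Split a (H, W) feature map into 2**(level-1) horizontal stripes."""
    stripes = 2 ** (level - 1)
    return np.split(feature_map, stripes, axis=0)

x = np.arange(24 * 8).reshape(24, 8).astype(float)
assert len(pyramid_split(x, 1)) == 1          # coarsest: whole map
assert len(pyramid_split(x, 2)) == 2
assert len(pyramid_split(x, 3)) == 4          # finest of the 3 levels
assert pyramid_split(x, 3)[0].shape == (6, 8)
```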
2) Crossing. The feature map X_i^emb ∈ R^(U×Ce) is horizontally split into m local feature vectors X_ij^emb, where i denotes the pyramid level and j = 1, 2, …, m denotes the index of the local feature. The undivided feature map X_i^emb is converted into Ki and Vi by two different linear transformations. To obtain more representative and discriminative features, we apply a further linear transformation to each X_ij^emb to produce the discriminative feature Qij. The feature calculated by cross-attention can be defined as:
Yij = σ(Qij Ki^T / √d) Vi
where Ki, Vi ∈ R^(U×d), Qij ∈ R^(U/m×d), d is the embedding dimension, and Ki^T is the transpose of Ki. We use Qij Ki^T to study the relationship between local and global features; it is similar to cosine similarity and can therefore be used to measure correlation. The resulting cross-attention weights are normalized with the softmax activation function σ(·) and multiplied with Vi to extract locally significant features in the global context. Finally, the obtained features are concatenated with the original local feature vectors to obtain the cross-attention feature representation.
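The cross-attention step described above can be sketched in numpy as follows (an assumption-laden sketch: the linear transformations are plain matrix multiplies with illustrative random weights, and only one stripe's query is shown):

```python
# Minimal numpy sketch of cross attention between one local stripe (query)
# and the full feature map (keys/values). Shapes follow the text:
# K_i, V_i in R^(U x d), Q_ij in R^(U/m x d).
import numpy as np

rng = np.random.default_rng(0)
U, d, m = 8, 16, 2                      # sequence length, embed dim, stripes

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = rng.standard_normal((U, d))         # embedded feature map X_i
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

k, v = x @ w_k, x @ w_v                 # global keys/values: (U, d)
x_local = np.split(x, m, axis=0)        # m local stripes, each (U/m, d)
q = x_local[0] @ w_q                    # query from one stripe: (U/m, d)

attn = softmax(q @ k.T / np.sqrt(d))    # (U/m, U) local-to-global weights
y = attn @ v                            # locally significant global features

assert q.shape == (U // m, d)
assert attn.shape == (U // m, U)
assert np.allclose(attn.sum(axis=1), 1.0)   # softmax rows sum to 1
assert y.shape == (U // m, d)
```

In the full layer this is repeated for every stripe j and the results are concatenated with the original local vectors, as the text describes.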
3) Merging. Each local attention feature is incorporated into the global feature map of the current level by concatenation:
Y_i = Concat(Y_i1, Y_i2, …, Y_im)
the feature map is then sent to the next layer of the pyramid, directing the network to mine finer granularity of local semantic information in the global context. Through the learning of pyramid cross convertors, the network can better find the association between local and global features, highlight the discriminant features in the local area and suppress irrelevant features.
4) Channel MLP. The channels of the feature map contain rich details and semantic information about the same pedestrian parts. We retain the channel multi-layer perceptron (MLP) of the conventional Transformer to compress these similar channels. The channel MLP consists of two linear transformation layers and a GELU activation function ζ(·). Assuming the feature map output by cross-attention learning is Y, the process can be expressed as:
Z = ζ(Norm(Y) W1) W2 + Y
where Z denotes the output feature of the channel MLP, Norm(·) denotes layer normalization, W1 ∈ R^(d×τd) and W2 ∈ R^(τd×d) are learnable parameters, and τ is the expansion ratio.
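The channel MLP can be sketched in numpy as follows (assumptions: the tanh approximation of GELU, standard per-token layer normalization, illustrative random weights, and an expansion ratio τ = 4):

```python
# Numpy sketch of the channel MLP: Z = GELU(LayerNorm(Y) W1) W2 + Y.
import numpy as np

rng = np.random.default_rng(1)
U, d, tau = 8, 16, 4                    # tokens, embed dim, expansion ratio

def layer_norm(a, eps=1e-6):
    mu = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    return (a - mu) / np.sqrt(var + eps)

def gelu(a):
    # tanh approximation of GELU
    return 0.5 * a * (1 + np.tanh(np.sqrt(2 / np.pi) * (a + 0.044715 * a**3)))

y = rng.standard_normal((U, d))
w1 = rng.standard_normal((d, tau * d)) * 0.02   # expand: d -> tau*d
w2 = rng.standard_normal((tau * d, d)) * 0.02   # project back: tau*d -> d

z = gelu(layer_norm(y) @ w1) @ w2 + y   # residual connection keeps the shape

assert z.shape == (U, d)
```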
(3) Hierarchical aggregation module: Considering that the shallow layers hold more not-yet-mined semantic information than the deep layers, we insert the multi-granularity convolution layer and the pyramid cross-Transformer learning layer into the first three stages of the backbone network simultaneously. In particular, the feature maps calculated at each stage are neither directly sent back to the backbone network for training nor directly stitched together as the final feature representation. Instead, they are fused with the features extracted by the multi-granularity convolution layer of the next stage and fed into its pyramid cross-Transformer learning layer, guiding the network to find diversified cues in pedestrian images. Through the hierarchical aggregation module, the content of different levels can be better exploited, and the latent semantic information in pedestrian images can be further mined.
In the training stage, the invention uses a verification loss, a classification loss, and an auxiliary loss to supervise the learning of the multi-granularity cross-Transformer network.
Verification loss: This loss is used to enhance intra-class compactness and inter-class separability, i.e., to make the distance between positive pairs smaller than the distance between negative pairs by a predefined margin ξ. That is, after training, the distance between images of the same pedestrian is as small as possible, while the distance between different pedestrians is as large as possible. We use the hard triplet loss (A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.) as the verification loss, defined as:
L_tri = [ξ + max ||fa − fp||2 − min ||fa − fn||2]+
where fa denotes an anchor sample, fp a positive sample with the same identity, and fn a negative sample with a different identity; ||·|| denotes the L2 norm, and [·]+ denotes the max(·, 0) function.
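A sketch of the hard triplet loss for a single anchor (assumptions: hardest-positive/hardest-negative mining as in Hermans et al., and an illustrative margin of 0.3; the sample vectors are made up for demonstration):

```python
# Numpy sketch: hinge on (margin + hardest-positive dist - hardest-negative dist).
import numpy as np

def hard_triplet_loss(anchor, positives, negatives, xi=0.3):
    d_pos = np.linalg.norm(positives - anchor, axis=1).max()  # hardest positive
    d_neg = np.linalg.norm(negatives - anchor, axis=1).min()  # hardest negative
    return max(xi + d_pos - d_neg, 0.0)                       # [.]_+ hinge

a = np.zeros(4)
pos = np.array([[0.1, 0, 0, 0], [0.2, 0, 0, 0]])   # same identity, close
neg = np.array([[5.0, 0, 0, 0], [4.0, 0, 0, 0]])   # other identities, far
assert hard_triplet_loss(a, pos, neg, xi=0.3) == 0.0  # already well separated

neg_close = np.array([[0.25, 0, 0, 0]])             # a negative intrudes
assert hard_triplet_loss(a, pos, neg_close, xi=0.3) > 0.0
```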
Classification loss: Since the pedestrian re-identification task can be regarded as an image classification problem with each pedestrian identity as a class, the classification loss used here is the cross-entropy loss, defined as:
L_cls = −Σ_{i=1}^{M} yi log(ŷi)
where M is the number of pedestrian identities, and ŷi and yi are the predicted label and the ground-truth label of the i-th pedestrian, respectively.
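A numpy sketch of the identity classification loss (assumptions: softmax over identity logits and a one-hot ground-truth label, so the sum reduces to the negative log-probability of the true identity; the logit values are illustrative):

```python
# Numpy sketch of cross-entropy over M pedestrian identities.
import numpy as np

def cross_entropy(logits, label_idx):
    e = np.exp(logits - logits.max())   # numerically stable softmax
    probs = e / e.sum()
    return -np.log(probs[label_idx])    # one-hot label picks a single term

logits = np.array([4.0, 1.0, 0.5])      # scores over M = 3 identities
loss_correct = cross_entropy(logits, 0) # confident and correct: small loss
loss_wrong = cross_entropy(logits, 2)   # confident but wrong: large loss
assert 0.0 < loss_correct < loss_wrong
```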
Auxiliary loss: To facilitate learning strong feature representations of the network, an auxiliary loss L_aux consisting of the verification loss and the classification loss is added at the different stages of the hierarchical aggregation module, defined as:
L_aux = Σ_{s=1}^{S} (L_tri^(s) + L_cls^(s))
where S is the total number of stages. Finally, the total loss is:
L_total = L_tri + L_cls + γ L_aux
where γ is a hyperparameter used to balance the backbone network losses and the auxiliary loss.
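The combined objective can be sketched with illustrative scalar losses (assumption: the backbone contributes one triplet and one classification term, each stage contributes a (triplet, classification) pair, and γ weights the stage-wise sum; all numbers below are made up for demonstration):

```python
# Sketch of the total training objective with per-stage auxiliary terms.
def total_loss(l_tri, l_cls, stage_losses, gamma=0.5):
    """stage_losses: list of (l_tri_s, l_cls_s) pairs, one per stage."""
    l_aux = sum(t + c for t, c in stage_losses)
    return l_tri + l_cls + gamma * l_aux

stages = [(0.2, 0.9), (0.15, 0.7), (0.1, 0.5)]   # S = 3 stages
loss = total_loss(0.1, 0.4, stages, gamma=0.5)
assert abs(loss - 1.775) < 1e-9   # 0.5 backbone + 0.5 * 2.55 auxiliary
```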
The method progressively learns the salient features of different local structures in a global context. The proposed network mainly consists of two new designs: a multi-granularity convolution layer and a pyramid cross-Transformer learning layer. The former is intended to simulate human vision and learn salient pedestrian characteristics at different granularities. The latter aims at mining local information in the global structure from a coarse-to-fine perspective. Furthermore, deep layers focus on more semantic information and need no finer-grained attention learning, which avoids overfitting, while shallow layers focus on details but leave a large amount of semantic information unmined. Therefore, a hierarchical aggregation module is introduced to fuse the features learned at different stages of cross-attention learning, and the pedestrian features learned in shallow layers serve as a global prior for the deep semantic information. The invention introduces a multi-granularity approach to counter environmental factors such as illumination, occlusion, and pose variations in pedestrian images, and exploits the strong long-range dependency modeling capability of the Transformer to design a discriminative network structure. Even when pedestrian images suffer from problems such as occlusion, the invention maintains good generality and robustness.
Example 2
The present embodiment provides an electronic device, comprising: one or more processors and a memory storing one or more programs, the one or more programs including instructions for performing the pedestrian re-identification method based on the multi-granularity pyramid intersection network described in Embodiment 1.
Example 3
The present embodiment provides a computer-readable storage medium storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing the pedestrian re-identification method based on the multi-granularity pyramid intersection network described in Embodiment 1.
While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is defined by the claims.

Claims (10)

1. A pedestrian re-identification method based on a multi-granularity pyramid intersection network, characterized by comprising the following steps:
acquiring image information, inputting the preprocessed image information into a pre-trained multi-granularity pyramid intersection network to obtain output features, and acquiring images matching a target pedestrian from a preset gallery based on the output features to realize pedestrian re-identification,
wherein the multi-granularity pyramid intersection network comprises multiple cascaded hierarchical aggregation units, each hierarchical aggregation unit comprising:
a multi-granularity convolution layer, used for acquiring salient pedestrian features at different granularities; and
a pyramid cross-Transformer learning layer, used for capturing discriminative local features from coarse granularity to fine granularity based on input features, as the output of the hierarchical aggregation unit of the current stage, wherein except for the first stage, the input features are obtained from the output of the multi-granularity convolution layer of the current stage and the output of the previous stage.
2. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein each hierarchical aggregation unit further comprises a scale transformation layer disposed between the multi-granularity convolution layer and the pyramid cross-Transformer learning layer, which applies global max pooling to the output of the multi-granularity convolution layer to suppress background information.
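As a toy illustration of the pooling step in claim 2 (representing the feature map as nested Python lists is for illustration only; in practice this would be a tensor operation):

```python
def global_max_pool(feature_map):
    """Global max pooling over the spatial dimensions.

    feature_map: nested list indexed [channel][height][width].
    Returns one value per channel; keeping only the strongest
    spatial response per channel tends to preserve foreground
    activations and suppress weaker background responses.
    """
    return [max(max(row) for row in channel) for channel in feature_map]
```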
3. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein the multi-granularity pyramid intersection network further comprises a backbone network connected with each hierarchical aggregation unit, and the output features are obtained based on the output of the backbone network and the output of the last hierarchical aggregation unit.
4. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 3, wherein the backbone network is a ResNet50 network.
5. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein the processing of features by the pyramid cross-Transformer learning layer comprises the following steps:
performing embedding processing on the input features to obtain an input feature map;
splitting the input feature map into a plurality of local feature vectors, acquiring the corresponding local attention features, and merging the local attention features to obtain an overall feature map; and
performing channel MLP processing on the overall feature map to obtain the discriminative local features.
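A minimal sketch of the split/merge bookkeeping in the steps above, assuming the embedded feature map is a list of token vectors and local groups are contiguous (the patent does not specify the split scheme, so this is an assumption; the per-group attention and channel-MLP steps are omitted here):

```python
def split_tokens(feature_map, m):
    """Split an embedded feature map (list of n token vectors) into
    m contiguous groups of local feature vectors. For simplicity this
    sketch assumes n is divisible by m and drops any remainder."""
    n = len(feature_map)
    size = n // m
    return [feature_map[k * size:(k + 1) * size] for k in range(m)]

def merge_tokens(groups):
    """Concatenate per-group outputs back into one overall map."""
    return [tok for g in groups for tok in g]
```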
6. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 5, wherein the local attention features are obtained by the following formula:

Y_ij = σ(Q_ij · K_i^T / √d) · V_i

wherein Y_ij is the local attention feature, σ(·) is the softmax activation function, Q_ij is the j-th local feature vector split from the input feature map, K_i and V_i are vectors obtained by linear transformation of the input feature map, K_i^T is the transpose of K_i, d is the embedding dimension, i denotes the pyramid level, and j = 1, 2, …, m is the index of the local feature.
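The claimed attention formula can be sketched numerically as follows, for a single local query vector against the transformed maps (pure Python; all names are illustrative, not from the patent):

```python
import math

def softmax(xs):
    # numerically stable softmax
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def local_attention(q, K, V, d):
    """Y_ij = softmax(Q_ij . K_i^T / sqrt(d)) . V_i for one local
    query vector q against K, V given as lists of d-dim row vectors."""
    scores = [sum(qe * ke for qe, ke in zip(q, row)) / math.sqrt(d)
              for row in K]
    w = softmax(scores)
    # weighted sum of the value rows
    return [sum(wi * row[t] for wi, row in zip(w, V)) for t in range(d)]
```

Because the weights come from a softmax, each output vector is a convex combination of the value vectors.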
7. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 5, wherein the channel MLP processing is implemented by the following formula:

Z = ζ(Norm(Y)W_1)W_2 + Y

wherein Z is the discriminative local feature, ζ(·) is the GELU activation function, Norm(·) denotes layer normalization, Y denotes the overall feature map, W_1 ∈ R^{d×τd} and W_2 ∈ R^{τd×d} are learnable parameters, and τ is the expansion ratio.
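A minimal numeric sketch of this channel MLP for a single d-dimensional token, using the tanh approximation of GELU (the exact activation variant is an assumption; weight matrices are passed in as nested lists):

```python
import math

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def layer_norm(y, eps=1e-5):
    mu = sum(y) / len(y)
    var = sum((v - mu) ** 2 for v in y) / len(y)
    return [(v - mu) / math.sqrt(var + eps) for v in y]

def channel_mlp(y, W1, W2):
    """Z = GELU(Norm(y) W1) W2 + y for one d-dim token y.
    W1: d x (tau*d), W2: (tau*d) x d, as nested lists."""
    h = layer_norm(y)
    # expand: h W1, then GELU
    h1 = [gelu(sum(h[i] * W1[i][k] for i in range(len(h))))
          for k in range(len(W1[0]))]
    # project back and add the residual connection (+ y)
    return [sum(h1[k] * W2[k][j] for k in range(len(h1))) + y[j]
            for j in range(len(y))]
```

With zero weights the MLP branch vanishes and the residual connection passes the token through unchanged, which is a quick sanity check on the formula.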
8. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein the pre-trained multi-granularity pyramid intersection network is obtained by the following process:
acquiring a training sample set, training the multi-granularity pyramid intersection network based on the training sample set, and obtaining the pre-trained multi-granularity pyramid intersection network after the value of a loss function satisfies a preset convergence condition;
wherein the loss function is obtained based on a verification loss, a classification loss, and an auxiliary loss.
9. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 8, wherein the auxiliary loss is calculated by the following formula:

L_aux = Σ_{s=1}^{S} (L_ver^s + L_cls^s)

wherein L_aux is the auxiliary loss, L_ver^s is the verification loss of the s-th stage, L_cls^s is the classification loss of the s-th stage, and S is the total number of stages.
10. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein in each hierarchical aggregation unit, the multi-granularity convolution layer is connected in sequence with the pyramid cross-Transformer learning layer.
CN202310285479.7A 2023-03-22 2023-03-22 Pedestrian re-identification method based on multi-granularity pyramid intersection network Pending CN116229580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310285479.7A CN116229580A (en) 2023-03-22 2023-03-22 Pedestrian re-identification method based on multi-granularity pyramid intersection network

Publications (1)

Publication Number Publication Date
CN116229580A true CN116229580A (en) 2023-06-06

Family

ID=86569507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310285479.7A Pending CN116229580A (en) 2023-03-22 2023-03-22 Pedestrian re-identification method based on multi-granularity pyramid intersection network

Country Status (1)

Country Link
CN (1) CN116229580A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115571A (en) * 2023-10-25 2023-11-24 成都阿加犀智能科技有限公司 Fine-grained intelligent commodity identification method, device, equipment and medium
CN117115571B (en) * 2023-10-25 2024-01-26 成都阿加犀智能科技有限公司 Fine-grained intelligent commodity identification method, device, equipment and medium

Similar Documents

Publication Publication Date Title
Cao et al. An attention enhanced bidirectional LSTM for early forest fire smoke recognition
Wu et al. Face detection with different scales based on faster R-CNN
Zuo et al. Learning contextual dependence with convolutional hierarchical recurrent neural networks
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN113936339A (en) Fighting identification method and device based on double-channel cross attention mechanism
Ming et al. Simple triplet loss based on intra/inter-class metric learning for face verification
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN112183468A (en) Pedestrian re-identification method based on multi-attention combined multi-level features
Yang et al. Diffusion model as representation learner
Jiang et al. An efficient attention module for 3d convolutional neural networks in action recognition
CN116311105B (en) Vehicle re-identification method based on inter-sample context guidance network
CN116721458A (en) Cross-modal time sequence contrast learning-based self-supervision action recognition method
An Pedestrian Re‐Recognition Algorithm Based on Optimization Deep Learning‐Sequence Memory Model
CN112836637A (en) Pedestrian re-identification method based on space reverse attention network
Wang et al. A novel multiface recognition method with short training time and lightweight based on ABASNet and H-softmax
CN116229580A (en) Pedestrian re-identification method based on multi-granularity pyramid intersection network
Xie et al. Object Re-identification Using Teacher-Like and Light Students.
Huang et al. Pedestrian detection using RetinaNet with multi-branch structure and double pooling attention mechanism
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
Aakur et al. Action localization through continual predictive learning
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
CN112613474A (en) Pedestrian re-identification method and device
Hu et al. Multi-manifolds discriminative canonical correlation analysis for image set-based face recognition
Wang et al. Image splicing tamper detection based on deep learning and attention mechanism
Luo et al. Cross-Domain Person Re-Identification Based on Feature Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination