CN116229580A - Pedestrian re-identification method based on multi-granularity pyramid intersection network - Google Patents

Pedestrian re-identification method based on multi-granularity pyramid intersection network

Info

Publication number
CN116229580A
CN116229580A (application CN202310285479.7A)
Authority
CN
China
Prior art keywords
granularity
pyramid
pedestrian
network
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310285479.7A
Other languages
Chinese (zh)
Inventor
苗夺谦 (Miao Duoqian)
李燕平 (Li Yanping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202310285479.7A
Publication of CN116229580A
Status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a pedestrian re-identification method based on a multi-granularity pyramid intersection network, which progressively learns the salient features of different local structures in a global context. The network consists essentially of two new designs: a multi-granularity convolution layer and a pyramid cross-Transformer learning layer. The former is intended to simulate human vision and learn salient pedestrian characteristics at different granularities. The latter aims at mining local information in the global structure from a coarse-to-fine perspective. Considering that deep layers focus on more semantic information, the method introduces a hierarchical aggregation module to integrate the features learned at different stages of cross-attention learning, and the pedestrian features learned in shallow layers serve as a global prior for the deep semantic information. The invention maintains good generality and robustness even when pedestrian images suffer from problems such as occlusion.

Description

Pedestrian re-identification method based on multi-granularity pyramid intersection network
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian re-identification method based on a multi-granularity pyramid intersection network.
Background
Pedestrian Re-Identification (Re-ID) plays a vital role in modern intelligent surveillance applications such as pedestrian retrieval and behavior analysis. However, Re-ID faces challenges such as occlusion, low resolution, and view, pose, scene, clothing, and illumination changes. It has therefore attracted the attention of many researchers.
To learn discriminative features of pedestrian images, many attempts have been made to design effective structures that are robust to the challenges described above. Some studies (X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue. Multi-scale deep learning architectures for person re-identification. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 5399–5408, 2017; Z. Zhang, C. Lan, W. Zeng, X. Jin, and Z. Chen. Relation-aware global attention for person re-identification. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 3186–3195, 2020.) adapted convolutional neural network (CNN) architectures that had succeeded in other computer vision tasks to extract robust features of pedestrian images. Although CNN-based methods achieve good results in some specific cases, they are not sufficiently robust because of the limited receptive field of CNNs. Furthermore, the downsampling operations (e.g., pooling) of CNNs reduce the spatial resolution of feature maps and lose fine-grained detail features, which is detrimental to distinguishing pedestrians of similar appearance. More importantly, architectures that perform well in other vision areas do not fit well with some of the specific challenges of pedestrian re-identification. It is therefore imperative to propose a design specific to pedestrian re-identification.
In recent years, Transformers have been incorporated into a variety of computer vision tasks, including image classification, object detection, and recognition, because they can model long-range dependencies. Transformer-based approaches in Re-ID also produce matching results comparable to CNN-based algorithms. However, pure Transformers cannot guarantee the translation and scale invariance that Re-ID tasks often require. To exploit the long-range dependency modeling capability of Transformers while maintaining the translation and scale invariance of CNNs, Zhang et al. (G. Zhang, P. Zhang, J. Qi, and H. Lu. HAT: Hierarchical aggregation transformers for person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, pages 516–525, 2021.) proposed an architecture consisting of a CNN for global feature learning and a Transformer for local feature learning, which achieves good results on public large-scale Re-ID datasets. However, the input of a pure Transformer is a single-granularity full feature map of the pedestrian image, which limits the Transformer's ability to extract rich local information. Previous studies (Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 480–496, 2018; M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. H. Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell., 44(6):2872–2893, 2021.) have shown that horizontal partitioning helps networks extract rich local information from pedestrian images. In addition, the network should also mine the discriminative semantic information implied by the various local structures in the global feature map.
Although deep learning has been introduced into pedestrian re-identification and has achieved breakthrough results, there is still a long way to go before it can be applied in real-world scenarios. To address the problem that occlusion, illumination, pose, and similar factors can make different pedestrians appear more alike than images of the same pedestrian, a generally robust network structure suited to pedestrian re-identification needs to be designed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a pedestrian re-identification method based on a multi-granularity pyramid cross network. It introduces a multi-granularity approach to counter environmental factors such as illumination, occlusion, and pose variations in pedestrian images, and exploits the strong long-range dependency modeling capability of the Transformer to build a discriminative multi-granularity pyramid cross network.
The aim of the invention can be achieved by the following technical scheme:
the invention provides a pedestrian re-identification method based on a multi-granularity pyramid intersection network, which comprises the following steps:
acquiring image information, inputting the preprocessed image information into a pre-trained multi-granularity pyramid cross network to acquire output features, and acquiring images matching the target pedestrian from a preset gallery based on the output features to realize pedestrian re-identification,
wherein the multi-granularity pyramid intersection network comprises multi-level cascaded hierarchical aggregation units, and each hierarchical aggregation unit comprises:
a multi-granularity convolution layer, used for acquiring salient pedestrian features at different granularities;
and a pyramid cross-Transformer learning layer, used for capturing discriminative local features from coarse to fine granularity based on the input features, as the output of the hierarchical aggregation unit of this stage; except for the first stage, the input features are obtained from the output of the multi-granularity convolution layer of this stage and the output of the previous stage.
As a preferred technical solution, each hierarchical aggregation unit further includes a scale transformation layer disposed between the multi-granularity convolution layer and the pyramid cross-Transformer learning layer, configured to apply global max pooling to the output of the multi-granularity convolution layer to suppress background information.
As a preferred technical solution, the multi-granularity pyramid cross network further comprises a backbone network connected with each hierarchical aggregation unit, and the output features are obtained based on the output of the backbone network and the output of the last hierarchical aggregation unit.
As a preferred technical solution, the backbone network is a ResNet50 network.
As a preferred technical solution, the processing of features by the pyramid cross-Transformer learning layer comprises the following steps:
performing embedding processing on the input features to obtain an input feature map;
splitting the input feature map into a plurality of local feature vectors, acquiring the corresponding local attention features, and combining the local attention features to obtain an overall feature map;
and performing channel MLP processing on the overall feature map to obtain the discriminative local features.
As a preferred embodiment, the local attention feature is obtained by the following formula:
Yij = σ(Qij Ki^T / √d) Vi
where Yij is the local attention feature, σ(·) is the softmax activation function, Qij is the discriminative feature of that scale, obtained by linearly transforming the split local feature vector Xij of the input feature map, Ki and Vi are vectors obtained from the input feature map Xi by linear transformation, Ki^T is the transpose of Ki, d is the embedding dimension, i denotes the pyramid level, and j = 1, 2, …, m denotes the index of the local feature.
As a preferred technical solution, the channel MLP processing is implemented by the following formula:
Z = ζ(Norm(Y) W1) W2 + Y
where Z is the discriminative output feature, ζ(·) is the GELU activation function, Norm(·) denotes layer normalization, Y denotes the overall feature map, W1 ∈ R^(d×τd) and W2 ∈ R^(τd×d) are learnable parameters, and τ is the expansion ratio.
As a preferred technical solution, the process for obtaining the pre-trained multi-granularity pyramid cross network includes the following steps:
acquiring a training sample set, training the multi-granularity pyramid intersection network on it, and obtaining the pre-trained multi-granularity pyramid intersection network after the value of a loss function reaches a preset convergence condition;
wherein the loss function is obtained based on a verification loss, a classification loss, and an auxiliary loss.
As a preferred solution, the auxiliary loss is calculated by the following formula:
L_aux = Σ_{s=1}^{S} (L_tri^(s) + L_cls^(s))
where L_aux is the auxiliary loss, L_tri is the verification loss, L_cls is the classification loss, and S is the total number of stages.
As a preferred technical solution, in each hierarchical aggregation unit, the multi-granularity convolution layer is sequentially connected with the pyramid cross-Transformer learning layer.
Compared with the prior art, the invention has the following advantages:
(1) Generality and robustness are effectively improved: Aiming at the problem that existing methods ignore the discriminative semantic information implied by various local structures in the global feature map, the method provides a multi-granularity pyramid cross network (Multi-granularity Cross Transformer Network, MCTN) comprising multi-level hierarchical aggregation units, each of which comprises a multi-granularity convolution layer and a pyramid cross-Transformer learning layer. Since the multi-level hierarchical aggregation units form a multi-level cascade structure, the model can better exploit the content of different levels and further mine the latent semantic information in pedestrian images. In a complex multi-camera scene, the method can rapidly locate and retrieve all occurrences of a specific pedestrian, improving generality and robustness.
(2) Good model training effect: In the training stage, to facilitate learning strong feature representations, an auxiliary loss composed of a verification loss and a classification loss is added at different stages of the hierarchical aggregation module, and the verification loss, classification loss, and auxiliary loss jointly supervise the learning of the multi-granularity cross-Transformer network, improving the model training effect.
Drawings
FIG. 1 is a flow chart of a pedestrian re-recognition method based on a multi-granularity pyramid intersection network in embodiment 1;
fig. 2 is a schematic diagram of a multi-granularity pyramid crossover network.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
Example 1
As shown in fig. 1, this embodiment provides a pedestrian re-identification method based on a multi-granularity pyramid intersection network, which addresses the problem that existing methods ignore the discriminative semantic information implied by various local structures in the global feature map, and can quickly locate and retrieve all occurrences of a specific pedestrian in a complex multi-camera scene. The method comprises the following steps:
step S1, inputting a pedestrian image I to be queried 1
Step S2, global features are obtained through processing of a backbone network ResNet 50;
step S3, inserting a multi-granularity convolution layer and a pyramid cross-transducer learning layer in the first three stages of the backbone network;
s4, aggregating the feature graphs at different stages by using a multi-level aggregation module, and mining more abundant local semantic information;
step S5, in the training stage, auxiliary losses are added in the first three stages to carry out supervision training in addition to the loss of the main network;
step S6, in the testing stage, the characteristics obtained by the main network and the multi-level aggregation module are fused to be used as final characteristic representation for testing;
and S7, outputting a final pedestrian image characteristic representation.
As illustrated in fig. 2, the multi-granularity pyramid cross network mainly consists of five parts:
Backbone network: Considering the richness of the ImageNet dataset, a ResNet trained on it has strong feature representation capability. We therefore adopt a ResNet50 network pre-trained on ImageNet as the backbone architecture to extract global features of pedestrian images.
Multi-granularity convolution layer: The multi-granularity convolution layer simulates the process by which human vision perceives things from different perspectives. Through this layer, finer-granularity features can be extracted, enhancing the performance of the model.
Scale transformation: To reduce the model's parameters and facilitate the integration of subsequent networks, global max pooling (GMP) is applied after the multi-granularity convolution layer. GMP suppresses background information and extracts salient pedestrian features, making them more compact.
Pyramid cross-Transformer learning layer: This layer is designed as a pyramid, guiding the network to find salient pedestrian features from a coarse-to-fine perspective. The attention features obtained at coarse granularity are further partitioned and finer-granularity features are learned, so that pedestrian information at different granularities complements each other.
Hierarchical aggregation module: It is composed of multiple hierarchical aggregation units and acquires more comprehensive pedestrian features. Since shallow layers contain more detail information, features computed in shallow layers can be aggregated into deep layers, guiding the deep layers to pay more attention to fine-grained local features. At the same time, through the interaction between shallow and deep layers, the module also helps mine the latent semantic information of the shallow layers.
The three core parts, the multi-granularity convolution layer, the pyramid cross-Transformer learning layer, and the hierarchical aggregation module, are implemented as follows:
(1) Multi-granularity convolution layer: It is well known that not all discriminative features in pedestrian images can be obtained directly through the backbone network, even through ever-deeper networks. Intuitively, the human visual system typically observes things from different perspectives and granularities. To this end, we apply a multi-granularity convolution layer that mimics the human visual system to the first three stages of the backbone network, to more fully mine the semantic information contained in pedestrian images. The input data are analyzed with receptive fields (i.e., granularities) of 3 different sizes: 3×3, 5×5, and 7×7. To capture richer feature information, weights are not shared when computing the feature representations of different granularities. To reduce the parameters required by the network and increase its nonlinear transformation capability, a 5×5 kernel can be decomposed into a cascade of two 3×3 kernels, and a 7×7 kernel into a cascade of three 3×3 kernels. The features extracted at different granularities are fused as the final feature representation of the layer. Note that the convolution operations at the different granularities are implemented as residual blocks: the input is first passed through a 1×1 convolution to compress the number of channels, then a 3×3 convolution to extract features, and then a 1×1 convolution to restore the original number of channels. Finally, the obtained feature is combined with the shortcut output as the final feature representation.
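The kernel-decomposition claim above can be checked with simple receptive-field arithmetic. The sketch below uses two illustrative helpers (`receptive_field`, `conv_params`, not from the patent) to verify that stacked 3×3 convolutions cover the same field as a single 5×5 or 7×7 kernel with fewer parameters:

```python
# Sketch (assumption: standard effective receptive-field arithmetic for
# stacked stride-1 convolutions, and square kernels with equal channel counts).
def receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 convolutions."""
    r = 1
    for k in kernel_sizes:
        r += k - 1
    return r

def conv_params(kernel_sizes, channels):
    """Weight count of stacked conv layers with `channels` in/out channels."""
    return sum(k * k * channels * channels for k in kernel_sizes)

# Two 3x3 kernels see a 5x5 field; three 3x3 kernels see a 7x7 field.
assert receptive_field([3, 3]) == receptive_field([5]) == 5
assert receptive_field([3, 3, 3]) == receptive_field([7]) == 7
# The cascade also needs fewer parameters than the single large kernel.
assert conv_params([3, 3], 64) < conv_params([5], 64)
assert conv_params([3, 3, 3], 64) < conv_params([7], 64)
```

The cascade additionally inserts an extra nonlinearity between kernels, which is the "nonlinear transformation capability" gain the text refers to.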
(2) Pyramid cross-Transformer learning layer: In fact, the performance of the Re-ID task largely depends on the pedestrian image semantic information extracted by the neural network. By deepening the network, the deep layers can learn the semantic information of the pedestrian image to a certain extent. However, the shallow layers contain mostly detail information and only a small amount of semantic information, so a large amount of semantic information remains unmined. For this reason, we propose a pyramid cross-Transformer learning layer based on the Transformer architecture to enrich the diversity of features. It mainly comprises three components: input embedding, cross attention, and channel MLP.
Input embedding: Given a feature map X, it is first subjected to an input encoding process similar to the patch embedding of ViT (A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.). The formula can be expressed as:
X_emb = Norm(InputEmb(X))
where Norm(·) denotes layer normalization, and X_emb ∈ R^(U×Ce) is the embedded sequence with sequence length U and encoding dimension Ce.
Cross attention: to capture discriminative local information in the global structure from a coarse to fine perspective, we explore the local-global relationship using a pyramid structure.
1) Pyramid. The pyramid contains attention calculations at different levels; level i splits the feature map horizontally into 2^(i-1) parts, where 2 is the base and i is the current level. Notably, the horizontal division granularity differs between pyramid levels: as the pyramid level increases, so does the number of partitions. Features learned at the coarse-grained levels are used to guide the network to mine finer-grained local information. Furthermore, considering that too fine a division would compromise the integrity of the local semantic information, we set the number of pyramid levels to 3.
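As an illustration of the horizontal split (reading the split count at level i as 2^(i-1), consistent with the base-2 description above; `pyramid_split` is a hypothetical helper, not the patent's code):

```python
# Sketch (assumption: "horizontal split" divides the feature map along its
# height into 2**(i-1) equal stripes at pyramid level i, with levels 1..3).
import numpy as np

def pyramid_split(feature_map, level):
    """Split a (H, W) feature map into 2**(level-1) horizontal stripes."""
    stripes = 2 ** (level - 1)
    return np.split(feature_map, stripes, axis=0)

x = np.arange(24 * 8).reshape(24, 8).astype(float)
assert len(pyramid_split(x, 1)) == 1          # coarsest: whole map
assert len(pyramid_split(x, 2)) == 2
assert len(pyramid_split(x, 3)) == 4          # finest of the 3 levels
assert pyramid_split(x, 3)[0].shape == (6, 8)
```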
2) Crossing. The feature map X_i^emb ∈ R^(U×Ce) is horizontally split into m local feature vectors X_ij^emb, where i denotes the pyramid level and j = 1, 2, …, m denotes the index of the local feature. The undivided feature map X_i^emb is converted into Ki and Vi by two different linear transformations. To obtain more representative and discriminative features, we apply a further linear transformation to each X_ij^emb to produce the discriminative feature Qij. The feature calculated by cross-attention can be defined as:
Yij = σ(Qij Ki^T / √d) Vi
where Ki, Vi ∈ R^(U×d), Qij ∈ R^(U/m×d), d is the embedding dimension, and Ki^T is the transpose of Ki. We use Qij Ki^T to study the relationship between local and global features; it is similar to cosine similarity and can therefore be used to measure correlation. The resulting cross-attention weights are normalized with the softmax activation function σ(·) and multiplied with Vi to extract locally significant features in the global context. Finally, the obtained features are concatenated with the original local feature vectors to obtain the cross-attention feature representation.
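The cross-attention step described above can be sketched in numpy as follows (an assumption-laden sketch: the linear transformations are plain matrix multiplies with illustrative random weights, and only one stripe's query is shown):

```python
# Minimal numpy sketch of cross attention between one local stripe (query)
# and the full feature map (keys/values). Shapes follow the text:
# K_i, V_i in R^(U x d), Q_ij in R^(U/m x d).
import numpy as np

rng = np.random.default_rng(0)
U, d, m = 8, 16, 2                      # sequence length, embed dim, stripes

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = rng.standard_normal((U, d))         # embedded feature map X_i
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

k, v = x @ w_k, x @ w_v                 # global keys/values: (U, d)
x_local = np.split(x, m, axis=0)        # m local stripes, each (U/m, d)
q = x_local[0] @ w_q                    # query from one stripe: (U/m, d)

attn = softmax(q @ k.T / np.sqrt(d))    # (U/m, U) local-to-global weights
y = attn @ v                            # locally significant global features

assert q.shape == (U // m, d)
assert attn.shape == (U // m, U)
assert np.allclose(attn.sum(axis=1), 1.0)   # softmax rows sum to 1
assert y.shape == (U // m, d)
```

In the full layer this is repeated for every stripe j and the results are concatenated with the original local vectors, as the text describes.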
3) Merging. Each local attention feature is incorporated into the global feature map of the current level by concatenation:
Y_i = Concat(Y_i1, Y_i2, …, Y_im)
the feature map is then sent to the next layer of the pyramid, directing the network to mine finer granularity of local semantic information in the global context. Through the learning of pyramid cross convertors, the network can better find the association between local and global features, highlight the discriminant features in the local area and suppress irrelevant features.
4) Channel MLP. The channels of the feature map contain rich details and semantic information about the same pedestrian parts. We retain the channel multi-layer perceptron (MLP) of the conventional Transformer to compress these similar channels. The channel MLP consists of two linear transformation layers and a GELU activation function ζ(·). Assuming the feature map output by cross-attention learning is Y, the process can be expressed as:
Z = ζ(Norm(Y) W1) W2 + Y
where Z denotes the output feature of the channel MLP, Norm(·) denotes layer normalization, W1 ∈ R^(d×τd) and W2 ∈ R^(τd×d) are learnable parameters, and τ is the expansion ratio.
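The channel MLP can be sketched in numpy as follows (assumptions: the tanh approximation of GELU, standard per-token layer normalization, illustrative random weights, and an expansion ratio τ = 4):

```python
# Numpy sketch of the channel MLP: Z = GELU(LayerNorm(Y) W1) W2 + Y.
import numpy as np

rng = np.random.default_rng(1)
U, d, tau = 8, 16, 4                    # tokens, embed dim, expansion ratio

def layer_norm(a, eps=1e-6):
    mu = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    return (a - mu) / np.sqrt(var + eps)

def gelu(a):
    # tanh approximation of GELU
    return 0.5 * a * (1 + np.tanh(np.sqrt(2 / np.pi) * (a + 0.044715 * a**3)))

y = rng.standard_normal((U, d))
w1 = rng.standard_normal((d, tau * d)) * 0.02   # expand: d -> tau*d
w2 = rng.standard_normal((tau * d, d)) * 0.02   # project back: tau*d -> d

z = gelu(layer_norm(y) @ w1) @ w2 + y   # residual connection keeps the shape

assert z.shape == (U, d)
```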
(3) Hierarchical aggregation module: Considering that the shallow layers hold more not-yet-mined semantic information than the deep layers, we insert the multi-granularity convolution layer and the pyramid cross-Transformer learning layer into the first three stages of the backbone network simultaneously. In particular, the feature maps calculated at each stage are neither directly sent back to the backbone network for training nor directly stitched together as the final feature representation. Instead, they are fused with the features extracted by the multi-granularity convolution layer of the next stage and fed into its pyramid cross-Transformer learning layer, guiding the network to find diversified cues in pedestrian images. Through the hierarchical aggregation module, the content of different levels can be better exploited, and the latent semantic information in pedestrian images can be further mined.
In the training stage, the invention uses a verification loss, a classification loss, and an auxiliary loss to supervise the learning of the multi-granularity cross-Transformer network.
Verification loss: This loss is used to enhance intra-class compactness and inter-class separability, i.e., to make the distance between positive pairs smaller than the distance between negative pairs by a predefined margin ξ. That is, after training, the distance between images of the same pedestrian is as small as possible, while the distance between different pedestrians is as large as possible. We use the hard triplet loss (A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.) as the verification loss, defined as:
L_tri = [ξ + max ||fa − fp||2 − min ||fa − fn||2]+
where fa denotes an anchor sample, fp a positive sample with the same identity, and fn a negative sample with a different identity; ||·|| denotes the L2 norm, and [·]+ denotes the max(·, 0) function.
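A sketch of the hard triplet loss for a single anchor (assumptions: hardest-positive/hardest-negative mining as in Hermans et al., and an illustrative margin of 0.3; the sample vectors are made up for demonstration):

```python
# Numpy sketch: hinge on (margin + hardest-positive dist - hardest-negative dist).
import numpy as np

def hard_triplet_loss(anchor, positives, negatives, xi=0.3):
    d_pos = np.linalg.norm(positives - anchor, axis=1).max()  # hardest positive
    d_neg = np.linalg.norm(negatives - anchor, axis=1).min()  # hardest negative
    return max(xi + d_pos - d_neg, 0.0)                       # [.]_+ hinge

a = np.zeros(4)
pos = np.array([[0.1, 0, 0, 0], [0.2, 0, 0, 0]])   # same identity, close
neg = np.array([[5.0, 0, 0, 0], [4.0, 0, 0, 0]])   # other identities, far
assert hard_triplet_loss(a, pos, neg, xi=0.3) == 0.0  # already well separated

neg_close = np.array([[0.25, 0, 0, 0]])             # a negative intrudes
assert hard_triplet_loss(a, pos, neg_close, xi=0.3) > 0.0
```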
Classification loss: Since the pedestrian re-identification task can be regarded as an image classification problem with each pedestrian identity as a class, the classification loss used here is the cross-entropy loss, defined as:
L_cls = −Σ_{i=1}^{M} yi log(ŷi)
where M is the number of pedestrian identities, and ŷi and yi are the predicted label and the ground-truth label of the i-th pedestrian, respectively.
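A numpy sketch of the identity classification loss (assumptions: softmax over identity logits and a one-hot ground-truth label, so the sum reduces to the negative log-probability of the true identity; the logit values are illustrative):

```python
# Numpy sketch of cross-entropy over M pedestrian identities.
import numpy as np

def cross_entropy(logits, label_idx):
    e = np.exp(logits - logits.max())   # numerically stable softmax
    probs = e / e.sum()
    return -np.log(probs[label_idx])    # one-hot label picks a single term

logits = np.array([4.0, 1.0, 0.5])      # scores over M = 3 identities
loss_correct = cross_entropy(logits, 0) # confident and correct: small loss
loss_wrong = cross_entropy(logits, 2)   # confident but wrong: large loss
assert 0.0 < loss_correct < loss_wrong
```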
Auxiliary loss: To facilitate learning strong feature representations of the network, an auxiliary loss L_aux consisting of the verification loss and the classification loss is added at the different stages of the hierarchical aggregation module, defined as:
L_aux = Σ_{s=1}^{S} (L_tri^(s) + L_cls^(s))
where S is the total number of stages. Finally, the total loss is:
L_total = L_tri + L_cls + γ L_aux
where γ is a hyperparameter used to balance the backbone network losses and the auxiliary loss.
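The combined objective can be sketched with illustrative scalar losses (assumption: the backbone contributes one triplet and one classification term, each stage contributes a (triplet, classification) pair, and γ weights the stage-wise sum; all numbers below are made up for demonstration):

```python
# Sketch of the total training objective with per-stage auxiliary terms.
def total_loss(l_tri, l_cls, stage_losses, gamma=0.5):
    """stage_losses: list of (l_tri_s, l_cls_s) pairs, one per stage."""
    l_aux = sum(t + c for t, c in stage_losses)
    return l_tri + l_cls + gamma * l_aux

stages = [(0.2, 0.9), (0.15, 0.7), (0.1, 0.5)]   # S = 3 stages
loss = total_loss(0.1, 0.4, stages, gamma=0.5)
assert abs(loss - 1.775) < 1e-9   # 0.5 backbone + 0.5 * 2.55 auxiliary
```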
The method progressively learns the salient features of different local structures in a global context. The proposed network mainly consists of two new designs: a multi-granularity convolution layer and a pyramid cross-Transformer learning layer. The former is intended to simulate human vision and learn salient pedestrian characteristics at different granularities. The latter aims at mining local information in the global structure from a coarse-to-fine perspective. Furthermore, deep layers focus on more semantic information and need no finer-grained attention learning, which avoids overfitting, while shallow layers focus on details but leave a large amount of semantic information unmined. Therefore, a hierarchical aggregation module is introduced to fuse the features learned at different stages of cross-attention learning, and the pedestrian features learned in shallow layers serve as a global prior for the deep semantic information. The invention introduces a multi-granularity approach to counter environmental factors such as illumination, occlusion, and pose variations in pedestrian images, and exploits the strong long-range dependency modeling capability of the Transformer to design a discriminative network structure. Even when pedestrian images suffer from problems such as occlusion, the invention maintains good generality and robustness.
Example 2
The present embodiment provides an electronic device, comprising: one or more processors and a memory storing one or more programs, the one or more programs including instructions for performing the pedestrian re-identification method based on the multi-granularity pyramid intersection network described in Embodiment 1.
Example 3
The present embodiment provides a computer-readable storage medium storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing the pedestrian re-identification method based on the multi-granularity pyramid intersection network described in Embodiment 1.
While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is defined by the claims.

Claims (10)

1. A pedestrian re-identification method based on a multi-granularity pyramid intersection network, characterized by comprising the following steps:
acquiring image information, inputting the preprocessed image information into a pre-trained multi-granularity pyramid intersection network to obtain output features, and acquiring images matching a target pedestrian from a preset gallery based on the output features to realize pedestrian re-identification,
wherein the multi-granularity pyramid intersection network comprises multiple cascaded hierarchical aggregation units, each hierarchical aggregation unit comprising:
a multi-granularity convolution layer, used for acquiring salient pedestrian features at different granularities; and
a pyramid cross-Transformer learning layer, used for capturing discriminative local features from coarse granularity to fine granularity based on input features, as the output of the hierarchical aggregation unit of the current stage, wherein except for the first stage, the input features are obtained from the output of the multi-granularity convolution layer of the current stage and the output of the previous stage.
2. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein each hierarchical aggregation unit further comprises a scale transformation layer disposed between the multi-granularity convolution layer and the pyramid cross-Transformer learning layer, which applies global max pooling to the output of the multi-granularity convolution layer to suppress background information.
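As a toy illustration of the pooling step in claim 2 (representing the feature map as nested Python lists is for illustration only; in practice this would be a tensor operation):

```python
def global_max_pool(feature_map):
    """Global max pooling over the spatial dimensions.

    feature_map: nested list indexed [channel][height][width].
    Returns one value per channel; keeping only the strongest
    spatial response per channel tends to preserve foreground
    activations and suppress weaker background responses.
    """
    return [max(max(row) for row in channel) for channel in feature_map]
```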
3. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein the multi-granularity pyramid intersection network further comprises a backbone network connected with each hierarchical aggregation unit, and the output features are obtained based on the output of the backbone network and the output of the last hierarchical aggregation unit.
4. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 3, wherein the backbone network is a ResNet50 network.
5. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein the processing of features by the pyramid cross-Transformer learning layer comprises the following steps:
performing embedding processing on the input features to obtain an input feature map;
splitting the input feature map into a plurality of local feature vectors, acquiring the corresponding local attention features, and merging the local attention features to obtain an overall feature map; and
performing channel MLP processing on the overall feature map to obtain the discriminative local features.
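A minimal sketch of the split/merge bookkeeping in the steps above, assuming the embedded feature map is a list of token vectors and local groups are contiguous (the patent does not specify the split scheme, so this is an assumption; the per-group attention and channel-MLP steps are omitted here):

```python
def split_tokens(feature_map, m):
    """Split an embedded feature map (list of n token vectors) into
    m contiguous groups of local feature vectors. For simplicity this
    sketch assumes n is divisible by m and drops any remainder."""
    n = len(feature_map)
    size = n // m
    return [feature_map[k * size:(k + 1) * size] for k in range(m)]

def merge_tokens(groups):
    """Concatenate per-group outputs back into one overall map."""
    return [tok for g in groups for tok in g]
```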
6. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 5, wherein the local attention features are obtained by the following formula:

Y_ij = σ(Q_ij · K_i^T / √d) · V_i

wherein Y_ij is the local attention feature, σ(·) is the softmax activation function, Q_ij is the j-th local feature vector split from the input feature map, K_i and V_i are vectors obtained by linear transformation of the input feature map, K_i^T is the transpose of K_i, d is the embedding dimension, i denotes the pyramid level, and j = 1, 2, …, m is the index of the local feature.
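The claimed attention formula can be sketched numerically as follows, for a single local query vector against the transformed maps (pure Python; all names are illustrative, not from the patent):

```python
import math

def softmax(xs):
    # numerically stable softmax
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def local_attention(q, K, V, d):
    """Y_ij = softmax(Q_ij . K_i^T / sqrt(d)) . V_i for one local
    query vector q against K, V given as lists of d-dim row vectors."""
    scores = [sum(qe * ke for qe, ke in zip(q, row)) / math.sqrt(d)
              for row in K]
    w = softmax(scores)
    # weighted sum of the value rows
    return [sum(wi * row[t] for wi, row in zip(w, V)) for t in range(d)]
```

Because the weights come from a softmax, each output vector is a convex combination of the value vectors.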
7. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 5, wherein the channel MLP processing is implemented by the following formula:

Z = ζ(Norm(Y)W_1)W_2 + Y

wherein Z is the discriminative local feature, ζ(·) is the GELU activation function, Norm(·) denotes layer normalization, Y denotes the overall feature map, W_1 ∈ R^{d×τd} and W_2 ∈ R^{τd×d} are learnable parameters, and τ is the expansion ratio.
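A minimal numeric sketch of this channel MLP for a single d-dimensional token, using the tanh approximation of GELU (the exact activation variant is an assumption; weight matrices are passed in as nested lists):

```python
import math

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def layer_norm(y, eps=1e-5):
    mu = sum(y) / len(y)
    var = sum((v - mu) ** 2 for v in y) / len(y)
    return [(v - mu) / math.sqrt(var + eps) for v in y]

def channel_mlp(y, W1, W2):
    """Z = GELU(Norm(y) W1) W2 + y for one d-dim token y.
    W1: d x (tau*d), W2: (tau*d) x d, as nested lists."""
    h = layer_norm(y)
    # expand: h W1, then GELU
    h1 = [gelu(sum(h[i] * W1[i][k] for i in range(len(h))))
          for k in range(len(W1[0]))]
    # project back and add the residual connection (+ y)
    return [sum(h1[k] * W2[k][j] for k in range(len(h1))) + y[j]
            for j in range(len(y))]
```

With zero weights the MLP branch vanishes and the residual connection passes the token through unchanged, which is a quick sanity check on the formula.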
8. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein the pre-trained multi-granularity pyramid intersection network is obtained by the following process:
acquiring a training sample set, training the multi-granularity pyramid intersection network based on the training sample set, and obtaining the pre-trained multi-granularity pyramid intersection network after the value of a loss function satisfies a preset convergence condition;
wherein the loss function is obtained based on a verification loss, a classification loss, and an auxiliary loss.
9. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 8, wherein the auxiliary loss is calculated by the following formula:

L_aux = Σ_{s=1}^{S} (L_ver^s + L_cls^s)

wherein L_aux is the auxiliary loss, L_ver^s is the verification loss of the s-th stage, L_cls^s is the classification loss of the s-th stage, and S is the total number of stages.
10. The pedestrian re-identification method based on the multi-granularity pyramid intersection network according to claim 1, wherein in each hierarchical aggregation unit, the multi-granularity convolution layer is connected in sequence with the pyramid cross-Transformer learning layer.
CN202310285479.7A 2023-03-22 2023-03-22 Pedestrian re-identification method based on multi-granularity pyramid intersection network Pending CN116229580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310285479.7A CN116229580A (en) 2023-03-22 2023-03-22 Pedestrian re-identification method based on multi-granularity pyramid intersection network

Publications (1)

Publication Number Publication Date
CN116229580A true CN116229580A (en) 2023-06-06

Family

ID=86569507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310285479.7A Pending CN116229580A (en) 2023-03-22 2023-03-22 Pedestrian re-identification method based on multi-granularity pyramid intersection network

Country Status (1)

Country Link
CN (1) CN116229580A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115571A (en) * 2023-10-25 2023-11-24 成都阿加犀智能科技有限公司 Fine-grained intelligent commodity identification method, device, equipment and medium
CN117115571B (en) * 2023-10-25 2024-01-26 成都阿加犀智能科技有限公司 Fine-grained intelligent commodity identification method, device, equipment and medium

Similar Documents

Publication Publication Date Title
Cao et al. An attention enhanced bidirectional LSTM for early forest fire smoke recognition
Wu et al. Face detection with different scales based on faster R-CNN
Zuo et al. Learning contextual dependence with convolutional hierarchical recurrent neural networks
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN113936339A (en) Fighting identification method and device based on double-channel cross attention mechanism
Ming et al. Simple triplet loss based on intra/inter-class metric learning for face verification
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN112183468A (en) Pedestrian re-identification method based on multi-attention combined multi-level features
Yang et al. Diffusion model as representation learner
Jiang et al. An efficient attention module for 3d convolutional neural networks in action recognition
CN116311105B (en) Vehicle re-identification method based on inter-sample context guidance network
CN116721458A (en) Cross-modal time sequence contrast learning-based self-supervision action recognition method
An Pedestrian Re‐Recognition Algorithm Based on Optimization Deep Learning‐Sequence Memory Model
CN112836637A (en) Pedestrian re-identification method based on space reverse attention network
Wang et al. A novel multiface recognition method with short training time and lightweight based on ABASNet and H-softmax
CN116229580A (en) Pedestrian re-identification method based on multi-granularity pyramid intersection network
Xie et al. Object Re-identification Using Teacher-Like and Light Students.
Huang et al. Pedestrian detection using RetinaNet with multi-branch structure and double pooling attention mechanism
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
Aakur et al. Action localization through continual predictive learning
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
CN112613474A (en) Pedestrian re-identification method and device
Hu et al. Multi-manifolds discriminative canonical correlation analysis for image set-based face recognition
Wang et al. Image splicing tamper detection based on deep learning and attention mechanism
Luo et al. Cross-Domain Person Re-Identification Based on Feature Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination