CN116434143A - Cross-modal pedestrian re-identification method and system based on feature reconstruction - Google Patents

Cross-modal pedestrian re-identification method and system based on feature reconstruction Download PDF

Info

Publication number
CN116434143A
CN116434143A (application CN202310406803.6A)
Authority
CN
China
Prior art keywords
pedestrian
feature
cross
modal
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310406803.6A
Other languages
Chinese (zh)
Inventor
陈思
邱刘翔
王大寒
朱顺痣
吴芸
许华荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University of Technology
Original Assignee
Xiamen University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University of Technology filed Critical Xiamen University of Technology
Priority to CN202310406803.6A priority Critical patent/CN116434143A/en
Publication of CN116434143A publication Critical patent/CN116434143A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/457: Local feature extraction by analysing connectivity, e.g. edge linking, connected component analysis or slices
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a cross-modal pedestrian re-identification method based on feature reconstruction, comprising the following steps: 1) visible-light pictures and infrared pictures of a plurality of pedestrians are extracted in pairs from a dataset to form a visible-light training dataset and an infrared training dataset; 2) a cross-modal pedestrian re-identification network model based on feature reconstruction is constructed, mainly comprising a specific feature extraction module, a multi-scale feature extraction module, a Token-aware multi-scale feature fusion module and a cross-modal feature reconstruction module; the model is trained on the visible-light and infrared training datasets to obtain generalizable model parameters; 3) the trained cross-modal pedestrian re-identification network model is used for cross-modal retrieval to realize cross-modal pedestrian re-identification. The method and system help to obtain more stable, robust and accurate cross-modal pedestrian re-identification results.

Description

Cross-modal pedestrian re-identification method and system based on feature reconstruction
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a cross-modal pedestrian re-identification method and system based on feature reconstruction.
Background
Pedestrian re-identification is one of the key technologies of intelligent video surveillance systems, and aims to retrieve the same pedestrian across non-overlapping cameras. It has a wide range of application scenarios, such as large-scale venues like airports, shopping malls and campuses. Much earlier work focused on pedestrian re-identification in visible-light scenes without considering the illumination changes of real scenes. Current camera systems can switch automatically between visible-light and infrared modes according to real-time illumination conditions to ensure round-the-clock monitoring. Comparatively little work has addressed cross-modal pedestrian re-identification, which refers to finding infrared (visible-light) images of the same pedestrian in an infrared (visible-light) gallery according to a visible-light (infrared) query image. Cross-modal pedestrian re-identification must not only cope with the problems encountered by conventional pedestrian re-identification, such as pose variation, occlusion, camera viewpoint differences, background clutter and illumination changes, but must also overcome the modality gap between images.
At present, cross-modal pedestrian re-identification methods can be divided into two types: image-based methods and feature-based methods. Image-based methods aim to transfer images from one modality to the other. Li et al. (D. Li, X. Wei, X. Hong, Y. Gong, Infrared-visible cross-modal person re-identification with an X modality, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 4610-4617.) designed a lightweight shared network to learn modality cues from visible-light pictures and then used these cues to generate intermediate-modality images. Wang et al. (Z. Wang, Z. Wang, Y. Zheng, Y. Chuang, S. Satoh, Learning to reduce dual-level discrepancy for infrared-visible person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 618-626.) proposed the D²RL model, which extracts the modality information of pedestrians under different modalities by adversarial learning and then transfers the learned modality information through a generative network to form intermediate-modality pedestrian pictures; these are provided to the network as additional modality pictures for learning, thereby reducing the modality gap. However, with adversarial learning, the generated pictures suffer from discontinuities and loss of semantic information, and the network is also difficult to converge.
Feature-learning-based methods aim to learn features shared by pedestrians across modalities, thereby reducing the negative impact of the modality gap. To obtain robust shared pedestrian features, many methods employ convolutional networks or Transformer networks as the underlying backbone. For example, Zhu et al. (Y. Zhu, Z. Yang, L. Wang, S. Zhao, X. Hu, D. Tao, Hetero-center loss for cross-modality person re-identification, Neurocomputing 386 (2020) 97-109.) designed a simple but high-performance CNN-based network with a hetero-center loss to reduce intra-class cross-modal differences and obtain discriminative pedestrian features. Furthermore, Liang et al. (T. Liang, Y. Jin, Y. Gao, W. Liu, S. Feng, T. Wang, Y. Li, CMTR: Cross-modality Transformer for visible-infrared person re-identification, arXiv preprint arXiv:2110.08994 (2021).) introduced a pure Transformer structure into cross-modal pedestrian re-identification to discover discriminative pedestrian features. Hybrid models of convolution and Transformer also compensate for the lack of long-range modelling capability of convolutional networks and the insensitivity of Transformers to local features. Chen et al. (C. Chen, M. Ye, M. Qi, J. Wu, J. Jiang, C. Lin, Structure-aware positional Transformer for visible-infrared person re-identification, IEEE Trans. Image Process. 31 (2022) 2352-2364.) proposed the structure-aware positional Transformer model SPOT, combined with a CNN, to explore the structure of the human body under different modalities and obtain modality-invariant features. However, existing cross-modal pedestrian re-identification methods are still deficient in multi-scale feature extraction, and the relationship between pedestrian features in different modalities has not been well explored.
Disclosure of Invention
The invention aims to provide a cross-modal pedestrian re-identification method and system based on feature reconstruction, which help to obtain more stable, robust and accurate cross-modal pedestrian re-identification results.
To achieve the above purpose, the invention adopts the following technical scheme: a cross-modal pedestrian re-identification method based on feature reconstruction, comprising the following steps:
1) Visible-light pictures and infrared pictures of a plurality of pedestrians are extracted in pairs from a dataset to form a visible-light training dataset and an infrared training dataset;
2) A cross-modal pedestrian re-identification network model based on feature reconstruction is constructed, mainly comprising a specific feature extraction module, a multi-scale feature extraction module, a Token-aware multi-scale feature fusion module and a cross-modal feature reconstruction module; the model is trained on the visible-light and infrared training datasets to obtain generalizable model parameters;
3) The trained cross-modal pedestrian re-identification network model is used for cross-modal retrieval to realize cross-modal pedestrian re-identification.
Further, in step 1), the dataset is the RegDB cross-modal pedestrian re-identification dataset, and M visible-light pictures and M infrared pictures of N pedestrians are extracted in pairs from the RegDB dataset.
Further, in step 2), the cross-modal pedestrian re-identification network model is implemented as follows:
A) Pedestrian features are extracted from the input visible-light picture and infrared picture by independent specific feature extraction modules, and the extracted pedestrian features are then fed simultaneously into the multi-scale feature extraction module;
B) The multi-scale feature extraction module extracts multi-scale pedestrian features of the visible-light picture and the infrared picture through several feature extraction modules at different scales;
C) The multi-scale pedestrian features are sent to the Token-aware multi-scale feature fusion module, which models the relationship between the multi-scale pedestrian features through bidirectional interaction between a learnable Token sequence and the local and global views, reducing the interference of pedestrian-irrelevant features at different scales; this bidirectional local-global interaction is repeated several times to obtain the final visible-light and infrared multi-scale feature relation maps and the visible-light and infrared Token sequences containing multi-scale information;
D) The obtained multi-scale feature relation map is combined with the original pedestrian features and sent to the last feature extraction module of the multi-scale feature extraction module for further feature learning, followed by pooling and horizontal partitioning to obtain the visible-light and infrared global and local features of pedestrians;
E) The visible-light and infrared global and local pedestrian features, together with the visible-light and infrared Token sequences containing multi-scale information, are input into the cross-modal feature reconstruction module to reconstruct cross-modal features and discover the relationship between pedestrian features in different modalities;
F) To reduce the noise introduced into the pedestrian features during reconstruction, a feature reconstruction loss is constructed; the loss is computed between the reconstructed features and the target-modality features, and an optimizer minimizes this error to strengthen the relationship between the features of the two modalities.
Further, in step B), the multi-scale feature extraction module comprises four feature extraction modules Stage-1, Stage-2, Stage-3 and Stage-4; the pedestrian features extracted by the specific feature extraction module have size 3×288×144, the feature map size is 256×72×36 after the first feature extraction module Stage-1, 512×36×18 after the second feature extraction module Stage-2, and 1024×18×9 after the third feature extraction module Stage-3.
Further, in step C), adaptive pooling is used to unify the scales of the different-scale pedestrian features before splicing them, and a bidirectional hybrid structure of convolution and Transformer is used to model the multi-scale pedestrian features, reducing the interference of pedestrian-irrelevant features at different scales; a learnable Token sequence is used to mine the relationships of the multi-scale pedestrian features under local and global views; for the visible-light pedestrian multi-scale feature M_vis, the local-to-global interaction is expressed as:
T′_vis = LN(FFN(MHA(T, FL(M_vis), FL(M_vis)))) + T
where T denotes the learnable Token sequence (its number is set to 6), FL denotes the operation of flattening three-dimensional pedestrian features into two-dimensional features, MHA denotes the multi-head attention mechanism, FFN denotes a feed-forward operation, and LN denotes layer normalization;
the global-to-local interaction is expressed as:
M′_vis = Conv(RS(MHA(FL(M_vis), T′_vis, T′_vis)) + M_vis)
where Conv denotes a convolution operation and RS denotes the operation of converting two-dimensional features back into three-dimensional features.
Further, in step E), the cross-modal feature reconstruction module is implemented as follows:
The visible-light and infrared Token sequences (T′_vis and its infrared counterpart) containing pedestrian multi-scale information obtained in step C) are used to reconstruct the visible-light and infrared global features and local features of pedestrians, so as to strengthen the relationship between the features of the two modalities. The cross-modal reconstruction of the global feature yields the reconstructed global feature, expressed as:
[formula available only as an image in the original publication]
where Attn denotes the attention mechanism, T′_ir[0] denotes the infrared Token sequence of the first pedestrian, and W_Q^h together with the corresponding Key and Value projection matrices convert the respective features into the Query, Key and Value matrices; similarly, substituting the local features for the global features in the above formula yields the reconstructed local features.
Further, in step F), the feature reconstruction loss is constructed as follows: the difference between the reconstructed pedestrian features and the target-modality features is calculated to obtain the feature reconstruction loss, and the network model is updated with an optimizer. The loss formula is given in the original publication only as an image; in it, L1 denotes the Manhattan distance and N_p denotes the number of horizontal partitions of the pedestrian features.
The invention also provides a cross-modal pedestrian re-identification system based on feature reconstruction, comprising a memory, a processor, and computer program instructions stored on the memory and executable by the processor, the computer program instructions, when executed by the processor, implementing the above method steps.
Compared with the prior art, the invention has the following beneficial effects: the method and system make effective use of multi-scale feature learning and cross-modal feature reconstruction, obtain generalizable and robust pedestrian features, effectively handle pose changes and occlusion, and alleviate the performance degradation caused by the modality gap.
Drawings
FIG. 1 is a schematic diagram of a cross-modality pedestrian re-recognition network model based on feature reconstruction in an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiment provides a cross-modal pedestrian re-identification method based on feature reconstruction, comprising the following steps:
1) Visible-light pictures and infrared pictures of a plurality of pedestrians are extracted in pairs from a dataset to form a visible-light training dataset and an infrared training dataset.
2) A cross-modal pedestrian re-identification network model based on feature reconstruction is constructed, mainly comprising a specific feature extraction module, a multi-scale feature extraction module, a Token-aware multi-scale feature fusion module and a cross-modal feature reconstruction module; the architecture of the network model is shown in Fig. 1. The model is trained on the visible-light and infrared training datasets to obtain generalizable model parameters.
3) The trained cross-modal pedestrian re-identification network model is used for cross-modal retrieval to realize cross-modal pedestrian re-identification, as sketched below.
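Step 3) amounts to a standard cross-modal retrieval protocol: features are extracted for the visible-light query set and the infrared gallery (or vice versa), and the gallery is ranked by feature similarity. The sketch below (PyTorch assumed) only illustrates that protocol and is not the patented implementation; `model`, `loader` and the cosine-similarity ranking are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(model, loader, device="cuda"):
    """Run the trained re-identification model over one modality's data
    loader and stack L2-normalised pedestrian feature vectors."""
    model.eval()
    feats, pids = [], []
    for images, labels in loader:
        f = model(images.to(device))                 # assumed to return (B, D) features
        feats.append(F.normalize(f, dim=1).cpu())
        pids.append(labels)
    return torch.cat(feats), torch.cat(pids)

def cross_modal_ranking(query_feats, gallery_feats):
    """Rank the infrared gallery for every visible-light query (or the
    other way round) by cosine similarity of the extracted features."""
    sim = query_feats @ gallery_feats.t()            # (Nq, Ng) similarity matrix
    return sim.argsort(dim=1, descending=True)       # ranked gallery indices per query
```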
In step 1), the dataset is the RegDB cross-modal pedestrian re-identification dataset, and M visible-light pictures and M infrared pictures of N pedestrians are extracted in pairs from the RegDB dataset; a minimal sketch of such paired sampling follows.
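The sketch below assumes the RegDB images have already been grouped by identity into two dictionaries mapping pedestrian ID to image paths (`vis_by_pid`, `ir_by_pid`); these names and the sampling routine are illustrative rather than taken from the patent.

```python
import random

def sample_paired_training_sets(vis_by_pid, ir_by_pid, num_pids, imgs_per_pid, seed=0):
    """Draw M visible-light and M infrared images for each of N pedestrians so
    that the two training sets stay identity-aligned (paired extraction)."""
    rng = random.Random(seed)
    shared_pids = sorted(set(vis_by_pid) & set(ir_by_pid))
    pids = rng.sample(shared_pids, num_pids)          # N pedestrians
    vis_set, ir_set = [], []
    for pid in pids:
        vis_set += [(path, pid) for path in rng.sample(vis_by_pid[pid], imgs_per_pid)]
        ir_set += [(path, pid) for path in rng.sample(ir_by_pid[pid], imgs_per_pid)]
    return vis_set, ir_set                            # visible-light / infrared training sets
```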
In step 2), the cross-modal pedestrian re-identification network model is implemented as follows:
A) Pedestrian features are extracted from the input visible-light pictures and infrared pictures by independent specific feature extraction modules, and the extracted pedestrian features are then fed simultaneously into the multi-scale feature extraction module.
B) The multi-scale feature extraction module extracts the multi-scale pedestrian features of the visible-light picture and the infrared picture through several feature extraction modules at different scales.
C) The multi-scale pedestrian features are sent to the Token-aware multi-scale feature fusion module, which models the relationship between the multi-scale pedestrian features through bidirectional interaction between a small number of learnable Token sequences and the local and global views, reducing the interference of pedestrian-irrelevant features at different scales; this bidirectional local-global interaction is repeated several times to obtain the final visible-light and infrared multi-scale feature relation maps and the visible-light and infrared Token sequences containing multi-scale information.
D) The obtained multi-scale feature relation map is combined with the original pedestrian features and sent to the last feature extraction module of the multi-scale feature extraction module for further feature learning, followed by pooling and horizontal partitioning to obtain the visible-light and infrared global and local features of pedestrians.
E) The visible-light and infrared global and local pedestrian features, together with the visible-light and infrared Token sequences containing multi-scale information, are input into the cross-modal feature reconstruction module to reconstruct cross-modal features and discover the relationship between pedestrian features in different modalities.
F) To reduce the noise introduced into the pedestrian features during reconstruction, a feature reconstruction loss is constructed; the loss is computed between the reconstructed features and the target-modality features, and an optimizer minimizes this error to strengthen the relationship between the features of the two modalities.
In step B), the multi-scale feature extraction module comprises four feature extraction modules Stage-1, Stage-2, Stage-3 and Stage-4; the pedestrian features extracted by the specific feature extraction module have size 3×288×144, the feature map size is 256×72×36 after the first feature extraction module Stage-1, 512×36×18 after the second feature extraction module Stage-2, and 1024×18×9 after the third feature extraction module Stage-3.
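The feature-map sizes quoted above (256×72×36 → 512×36×18 → 1024×18×9 for a 3×288×144 input) match a ResNet-50 whose four residual blocks play the roles of Stage-1 to Stage-4. The sketch below makes that assumption explicit; it is not asserted to be the exact backbone of the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiScaleBackbone(nn.Module):
    """ResNet-50 split into a stem plus four stages, exposing the
    intermediate multi-scale pedestrian features (assumed backbone)."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        # stem; in the full model a separate stem per modality would act as the
        # "specific feature extraction module"
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stage1, self.stage2 = r.layer1, r.layer2   # 256- and 512-channel features
        self.stage3, self.stage4 = r.layer3, r.layer4   # 1024- and 2048-channel features

    def forward(self, x):
        x = self.stem(x)
        f1 = self.stage1(x)     # (B, 256, 72, 36)
        f2 = self.stage2(f1)    # (B, 512, 36, 18)
        f3 = self.stage3(f2)    # (B, 1024, 18, 9)
        return f1, f2, f3       # Stage-4 is applied later, after the fusion module

# shape check for a 3×288×144 input
feats = MultiScaleBackbone()(torch.randn(1, 3, 288, 144))
print([tuple(f.shape) for f in feats])
```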
In step C), adaptive pooling is used to unify the scales of the different-scale pedestrian features before splicing them, and a bidirectional hybrid structure of convolution and Transformer is used to model the multi-scale pedestrian features, reducing the interference of pedestrian-irrelevant features at different scales. A learnable Token sequence is used to discover the relationships of the multi-scale pedestrian features under local and global views. Taking the visible-light pedestrian multi-scale feature M_vis as an example, the local-to-global interaction is expressed as:
T′_vis = LN(FFN(MHA(T, FL(M_vis), FL(M_vis)))) + T
where T denotes the learnable Token sequence (its number is set to 6), FL denotes the operation of flattening three-dimensional pedestrian features into two-dimensional features, MHA denotes the multi-head attention mechanism, FFN denotes a feed-forward operation, and LN denotes layer normalization.
The global-to-local interaction is expressed as:
M′_vis = Conv(RS(MHA(FL(M_vis), T′_vis, T′_vis)) + M_vis)
where Conv denotes a convolution operation and RS denotes the operation of converting two-dimensional features back into three-dimensional features.
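Following the two formulas above, one round of the bidirectional interaction can be sketched as below (PyTorch assumed). The sketch presumes the spliced multi-scale features have already been adaptively pooled and projected to a common channel dimension `dim`; that projection and the repetition of the round are omitted.

```python
import torch
import torch.nn as nn

class TokenAwareFusion(nn.Module):
    """One round of local->global / global->local interaction:
    T' = LN(FFN(MHA(T, FL(M), FL(M)))) + T
    M' = Conv(RS(MHA(FL(M), T', T')) + M)."""
    def __init__(self, dim=256, num_tokens=6, heads=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.attn_l2g = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_g2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, m):                                   # m: (B, dim, H, W)
        b, c, h, w = m.shape
        fl = m.flatten(2).transpose(1, 2)                   # FL: (B, H*W, dim)
        t = self.tokens.expand(b, -1, -1)
        # local -> global: tokens query the flattened feature map
        t = self.norm(self.ffn(self.attn_l2g(t, fl, fl)[0])) + t
        # global -> local: feature map queries the updated tokens
        g2l = self.attn_g2l(fl, t, t)[0]                    # (B, H*W, dim)
        rs = g2l.transpose(1, 2).reshape(b, c, h, w)        # RS back to (B, dim, H, W)
        return self.conv(rs + m), t                         # relation map and Token sequence
```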
In step E), the cross-modal feature reconstruction module is implemented as follows:
The visible-light and infrared Token sequences (T′_vis and its infrared counterpart) containing pedestrian multi-scale information obtained in step C) are used to reconstruct the visible-light and infrared global features and local features of pedestrians, so as to strengthen the relationship between the features of the two modalities. The cross-modal reconstruction of the global feature yields the reconstructed global feature, expressed as:
[formula available only as an image in the original publication]
where Attn denotes the attention mechanism, T′_ir[0] denotes the infrared Token sequence of the first pedestrian, and W_Q^h together with the corresponding Key and Value projection matrices convert the respective features into the Query, Key and Value matrices; similarly, substituting the local features for the global features in the above formula yields the reconstructed local features. A hedged code sketch of one possible instantiation follows.
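Because the reconstruction formula itself is reproduced in the original filing only as an image, the block below is a sketch of one plausible instantiation: the global (or horizontally-split local) feature of one modality attends, via learned Query/Key/Value projections, to the Token sequence of the other modality for the same pedestrian. The choice of which tensor plays Query and which plays Key/Value, and the dimensions, are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalReconstruction(nn.Module):
    """Hedged sketch: reconstruct a visible-light pedestrian feature in the
    infrared modality by letting it attend to the infrared Token sequence
    (T'_ir) of the same pedestrian; Query/Key/Value roles are assumptions."""
    def __init__(self, dim=2048, token_dim=256):
        super().__init__()
        self.w_q = nn.Linear(dim, token_dim)
        self.w_k = nn.Linear(token_dim, token_dim)
        self.w_v = nn.Linear(token_dim, dim)
        self.scale = token_dim ** -0.5

    def forward(self, feat, tokens_other):
        # feat: (B, dim) global or horizontally-split local feature
        # tokens_other: (B, T, token_dim) Token sequence of the other modality
        q = self.w_q(feat).unsqueeze(1)                          # (B, 1, token_dim)
        k = self.w_k(tokens_other)                               # (B, T, token_dim)
        v = self.w_v(tokens_other)                               # (B, T, dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return (attn @ v).squeeze(1)                             # reconstructed (B, dim) feature
```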
In step F), the feature reconstruction loss is constructed as follows: the difference between the reconstructed pedestrian features and the target-modality features is calculated to obtain the feature reconstruction loss, and the network model is updated with an optimizer. The loss formula is given in the original publication only as an image; in it, L1 denotes the Manhattan distance and N_p denotes the number of horizontal partitions of the pedestrian features. A hedged sketch of such a loss follows.
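Since the exact loss formula is available only as an image, the following is a minimal sketch consistent with the surrounding description: an L1 (Manhattan) distance between each reconstructed feature and its target-modality counterpart, averaged over the global feature and the N_p local parts. Whether the target features are detached during optimisation is left open here.

```python
import torch.nn.functional as F

def feature_reconstruction_loss(recon_feats, target_feats):
    """L1 (Manhattan) distance between reconstructed features and the
    corresponding target-modality features, averaged over the global
    feature and the N_p horizontally-split local features.
    Both arguments are lists of (B, D) tensors: [global, part_1, ..., part_Np]."""
    losses = [F.l1_loss(r, t) for r, t in zip(recon_feats, target_feats)]
    return sum(losses) / len(losses)
```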
The embodiment also provides a cross-modal pedestrian re-identification system based on feature reconstruction, comprising a memory, a processor, and computer program instructions stored on the memory and executable by the processor, the computer program instructions, when executed by the processor, implementing the above method steps.
In this embodiment, the RegDB dataset is used for comparative verification under the setting of retrieving infrared pictures with visible-light pictures of pedestrians. Table 1 compares the method proposed by the invention with other cross-modal pedestrian re-identification methods on the RegDB dataset. As can be seen from Table 1, the proposed method is more accurate and robust than the other cross-modal pedestrian re-identification methods, achieving the best Rank-1 and mAP.
TABLE 1
[Table 1 is provided only as an image in the original publication; it reports Rank-1 and mAP on RegDB for the proposed method and the compared methods listed below.]
In Table 1, MAUM corresponds to the method proposed by J. Liu et al. (J. Liu, Y. Sun, F. Zhu, H. Pei, Y. Yang, W. Li, Learning memory-augmented unidirectional metrics for cross-modality person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 19366-19375.);
MPANet corresponds to the method proposed by Q. Wu et al. (Q. Wu, P. Dai, J. Chen, C. Lin, Y. Wu, F. Huang, B. Zhong, R. Ji, Discover cross-modality nuances for visible-infrared person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 4330-4339.);
NFS corresponds to the method proposed by Y. Chen et al. (Y. Chen, L. Wan, Z. Li, Q. Jing, Z. Sun, Neural feature search for RGB-infrared person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 587-597.);
SPOT corresponds to the method proposed by C. Chen et al. (C. Chen, M. Ye, M. Qi, J. Wu, J. Jiang, C. Lin, Structure-aware positional Transformer for visible-infrared person re-identification, IEEE Trans. Image Process. 31 (2022) 2352-2364.);
AGW corresponds to the method proposed by M. Ye et al. (M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, S. C. Hoi, Deep learning for person re-identification: A survey and outlook, IEEE Trans. Pattern Anal. Mach. Intell. 44 (6) (2022) 2872-2893.);
DDAG corresponds to the method proposed by M. Ye et al. (M. Ye, J. Shen, D. J. Crandall, L. Shao, J. Luo, Dynamic dual-attentive aggregation learning for visible-infrared person re-identification, in: Proceedings of the European Conference on Computer Vision, 2020, pp. 229-247.);
D-HSME corresponds to the method proposed by Y. Hao et al. (Y. Hao, N. Wang, J. Li, X. Gao, HSME: Hypersphere manifold embedding for visible thermal person re-identification, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 8385-8392.);
MSPAC corresponds to the method proposed by C. Zhang et al. (C. Zhang, H. Liu, W. Guo, M. Ye, Multi-scale cascading network with compact feature learning for RGB-infrared person re-identification, in: Proceedings of the IEEE International Conference on Pattern Recognition, 2021, pp. 8679-8686.);
CMGN corresponds to the method proposed by J. Jiang et al. (J. Jiang, K. Jin, M. Qi, Q. Wang, J. Wu, C. Chen, A cross-modal multi-granularity attention network for RGB-IR person re-identification, Neurocomputing 406 (2020) 59-67.);
SDL corresponds to the method proposed by K. Kansal et al. (K. Kansal, A. V. Subramanyam, Z. Wang, S. Satoh, SDL: Spectrum-disentangled representation learning for visible-infrared person re-identification, IEEE Trans. Circuits Syst. Video Technol. 30 (10) (2020) 3422-3432.).
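Rank-1 and mAP in Table 1 are the standard person re-identification metrics. For reference, a minimal NumPy sketch of how they can be computed from a query-to-gallery distance matrix is given below (standard definitions, without camera/junk filtering; not taken from the patent).

```python
import numpy as np

def rank1_and_map(distmat, q_pids, g_pids):
    """Rank-1 accuracy and mean average precision from an (Nq, Ng) distance
    matrix between query and gallery features, with identity labels."""
    ranks = np.argsort(distmat, axis=1)              # gallery sorted by ascending distance
    matches = (g_pids[ranks] == q_pids[:, None])     # (Nq, Ng) boolean hit matrix
    rank1 = matches[:, 0].mean()
    aps = []
    for row in matches:
        hits = np.where(row)[0]
        if hits.size == 0:
            continue                                 # query identity absent from gallery
        precision_at_hits = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision_at_hits.mean())
    return float(rank1), float(np.mean(aps))
```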
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention in any way; any person skilled in the art may modify or adapt the disclosed technical content into equivalent embodiments. However, any simple modification, equivalent change or adaptation of the above embodiments made according to the technical substance of the present invention still falls within the protection scope of the technical solution of the present invention.

Claims (8)

1. A cross-modal pedestrian re-identification method based on feature reconstruction, characterized by comprising the following steps:
1) Visible-light pictures and infrared pictures of a plurality of pedestrians are extracted in pairs from a dataset to form a visible-light training dataset and an infrared training dataset;
2) A cross-modal pedestrian re-identification network model based on feature reconstruction is constructed, mainly comprising a specific feature extraction module, a multi-scale feature extraction module, a Token-aware multi-scale feature fusion module and a cross-modal feature reconstruction module; the model is trained on the visible-light and infrared training datasets to obtain generalizable model parameters;
3) The trained cross-modal pedestrian re-identification network model is used for cross-modal retrieval to realize cross-modal pedestrian re-identification.
2. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 1, wherein in step 1), the dataset is the RegDB cross-modal pedestrian re-identification dataset, and M visible-light pictures and M infrared pictures of N pedestrians are extracted in pairs from the RegDB dataset.
3. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 1, wherein in step 2), the cross-modal pedestrian re-identification network model is implemented as follows:
A) Pedestrian features are extracted from the input visible-light picture and infrared picture by independent specific feature extraction modules, and the extracted pedestrian features are then fed simultaneously into the multi-scale feature extraction module;
B) The multi-scale feature extraction module extracts multi-scale pedestrian features of the visible-light picture and the infrared picture through several feature extraction modules at different scales;
C) The multi-scale pedestrian features are sent to the Token-aware multi-scale feature fusion module, which models the relationship between the multi-scale pedestrian features through bidirectional interaction between a learnable Token sequence and the local and global views, reducing the interference of pedestrian-irrelevant features at different scales; this bidirectional local-global interaction is repeated several times to obtain the final visible-light and infrared multi-scale feature relation maps and the visible-light and infrared Token sequences containing multi-scale information;
D) The obtained multi-scale feature relation map is combined with the original pedestrian features and sent to the last feature extraction module of the multi-scale feature extraction module for further feature learning, followed by pooling and horizontal partitioning to obtain the visible-light and infrared global and local features of pedestrians;
E) The visible-light and infrared global and local pedestrian features, together with the visible-light and infrared Token sequences containing multi-scale information, are input into the cross-modal feature reconstruction module to reconstruct cross-modal features and discover the relationship between pedestrian features in different modalities;
F) To reduce the noise introduced into the pedestrian features during reconstruction, a feature reconstruction loss is constructed; the loss is computed between the reconstructed features and the target-modality features, and an optimizer minimizes this error to strengthen the relationship between the features of the two modalities.
4. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 3, wherein in step B), the multi-scale feature extraction module comprises four feature extraction modules Stage-1, Stage-2, Stage-3 and Stage-4; the pedestrian features extracted by the specific feature extraction module have size 3×288×144, the feature map size is 256×72×36 after the first feature extraction module Stage-1, 512×36×18 after the second feature extraction module Stage-2, and 1024×18×9 after the third feature extraction module Stage-3.
5. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 3, wherein in step C), adaptive pooling is used to unify the scales of the different-scale pedestrian features before splicing them, and a bidirectional hybrid structure of convolution and Transformer is used to model the multi-scale pedestrian features, reducing the interference of pedestrian-irrelevant features at different scales; a learnable Token sequence is used to mine the relationships of the multi-scale pedestrian features under local and global views; for the visible-light pedestrian multi-scale feature M_vis, the local-to-global interaction is expressed as:
T′_vis = LN(FFN(MHA(T, FL(M_vis), FL(M_vis)))) + T
where T denotes the learnable Token sequence (its number is set to 6), FL denotes the operation of flattening three-dimensional pedestrian features into two-dimensional features, MHA denotes the multi-head attention mechanism, FFN denotes a feed-forward operation, and LN denotes layer normalization;
the global-to-local interaction is expressed as:
M′_vis = Conv(RS(MHA(FL(M_vis), T′_vis, T′_vis)) + M_vis)
where Conv denotes a convolution operation and RS denotes the operation of converting two-dimensional features back into three-dimensional features.
6. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 3, wherein in step E), the cross-modal feature reconstruction module is implemented as follows:
the visible-light and infrared Token sequences (T′_vis and its infrared counterpart) containing pedestrian multi-scale information obtained in step C) are used to reconstruct the visible-light and infrared global features and local features of pedestrians, so as to strengthen the relationship between the features of the two modalities; the cross-modal reconstruction of the global feature yields the reconstructed global feature, expressed as:
[formula available only as an image in the original publication]
where Attn denotes the attention mechanism, T′_ir[0] denotes the infrared Token sequence of the first pedestrian, and W_Q^h together with the corresponding Key and Value projection matrices convert the respective features into the Query, Key and Value matrices; similarly, substituting the local features for the global features in the above formula yields the reconstructed local features.
7. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 3, wherein in step F), the feature reconstruction loss is constructed as follows: the difference between the reconstructed pedestrian features and the target features is calculated to obtain the feature reconstruction loss, and the network model is updated with an optimizer; the loss formula is given in the original publication only as an image, where L1 denotes the Manhattan distance and N_p denotes the number of horizontal partitions of the pedestrian features.
8. A cross-modal pedestrian re-identification system based on feature reconstruction, comprising a memory, a processor, and computer program instructions stored on the memory and executable by the processor, the computer program instructions, when executed by the processor, implementing the method steps of any one of claims 1 to 7.
CN202310406803.6A 2023-04-17 2023-04-17 Cross-modal pedestrian re-identification method and system based on feature reconstruction Pending CN116434143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310406803.6A CN116434143A (en) 2023-04-17 2023-04-17 Cross-modal pedestrian re-identification method and system based on feature reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310406803.6A CN116434143A (en) 2023-04-17 2023-04-17 Cross-modal pedestrian re-identification method and system based on feature reconstruction

Publications (1)

Publication Number Publication Date
CN116434143A true CN116434143A (en) 2023-07-14

Family

ID=87085064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310406803.6A Pending CN116434143A (en) 2023-04-17 2023-04-17 Cross-modal pedestrian re-identification method and system based on feature reconstruction

Country Status (1)

Country Link
CN (1) CN116434143A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118072252A (en) * 2024-04-17 2024-05-24 武汉大学 Pedestrian re-recognition model training method suitable for arbitrary multi-mode data combination


Similar Documents

Publication Publication Date Title
JP6745328B2 (en) Method and apparatus for recovering point cloud data
CN113240179B (en) Method and system for predicting orbital pedestrian flow by fusing spatio-temporal information
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
Qin et al. Monogrnet: A general framework for monocular 3d object detection
Wang et al. Storm: Structure-based overlap matching for partial point cloud registration
Yang et al. Spatio-temporal domain awareness for multi-agent collaborative perception
Liu et al. Skeleton-based human action recognition via large-kernel attention graph convolutional network
KR102305230B1 (en) Method and device for improving accuracy of boundary information from image
CN116434143A (en) Cross-modal pedestrian re-identification method and system based on feature reconstruction
Zhong et al. No pain, big gain: classify dynamic point cloud sequences with static models by fitting feature-level space-time surfaces
Qin et al. PointSkelCNN: Deep Learning‐Based 3D Human Skeleton Extraction from Point Clouds
Wu et al. HPGCN: Hierarchical poselet-guided graph convolutional network for 3D pose estimation
Wani et al. Deep learning-based video action recognition: a review
Tong et al. Edge-assisted epipolar transformer for industrial scene reconstruction
Wu et al. Deep learning for LiDAR-only and LiDAR-fusion 3D perception: A survey
Zhou et al. Retrieval and localization with observation constraints
Lei et al. Recent advances in multi-modal 3D scene understanding: A comprehensive survey and evaluation
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
Zhang et al. Exploring semantic information extraction from different data forms in 3D point cloud semantic segmentation
Mahjoub et al. A flexible high-level fusion for an accurate human action recognition system
Zhang et al. Dyna-depthformer: Multi-frame transformer for self-supervised depth estimation in dynamic scenes
Miao et al. Pseudo-lidar for visual odometry
Escalera et al. Guest editors’ introduction to the special issue on multimodal human pose recovery and behavior analysis
Xu et al. MRFTrans: Multimodal Representation Fusion Transformer for monocular 3D semantic scene completion
Dai et al. An investigation of gcn-based human action recognition using skeletal features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination