CN116434143A - Cross-modal pedestrian re-identification method and system based on feature reconstruction - Google Patents
Cross-modal pedestrian re-identification method and system based on feature reconstruction
- Publication number
- CN116434143A (application number CN202310406803.6A)
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- feature
- cross
- modal
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/457—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by analysing connectivity, e.g. edge linking, connected component analysis or slices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a cross-modal pedestrian re-identification method based on feature reconstruction, which comprises the following steps: 1) extracting paired visible-light and infrared pictures of a plurality of pedestrians from a dataset to form a visible-light training set and an infrared training set; 2) constructing a cross-modal pedestrian re-identification network model based on feature reconstruction, which mainly comprises a modality-specific feature extraction module, a multi-scale feature extraction module, a Token-aware multi-scale feature fusion module and a cross-modal feature reconstruction module, and training the model on the visible-light and infrared training sets to obtain generalizable model parameters; 3) using the trained cross-modal pedestrian re-identification network model for cross-modal retrieval to realize cross-modal pedestrian re-identification. The method and the system help obtain more stable, robust and accurate cross-modal pedestrian re-identification results.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a cross-modal pedestrian re-identification method and system based on feature reconstruction.
Background
Pedestrian re-identification is one of the key technologies of intelligent video surveillance systems, and aims to retrieve the same pedestrian across cameras with non-overlapping views. Pedestrian re-identification has very wide application scenarios, such as large-scale venues like airports, shopping malls and campuses. Previously, many efforts focused on the pedestrian re-identification task in visible-light scenes without considering illumination changes in real scenes. Current camera systems can automatically switch to a visible-light or infrared mode according to real-time illumination conditions to ensure all-weather surveillance. Less work has focused on the cross-modal pedestrian re-identification task, which refers to finding infrared (visible-light) images of the same pedestrian in an infrared (visible-light) gallery according to a visible-light (infrared) query image. Cross-modal pedestrian re-identification must not only address the problems encountered by ordinary pedestrian re-identification, such as variable pedestrian poses, occlusion, camera-angle differences, background clutter and illumination changes, but also the modality difference between images.
At present, cross-modal pedestrian re-identification methods can be divided into two types: image-based methods and feature-based methods. Image-based methods aim to transfer images of one modality to the other modality. Li et al. (D. Li, X. Wei, X. Hong, Y. Gong, Infrared-visible cross-modal person re-identification with an X modality, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 4610-4617.) designed a lightweight shared network to learn modality cues in visible-light pictures and then used these cues to generate intermediate-modality images. Wang et al. (Z. Wang, Z. Wang, Y. Zheng, Y.-Y. Chuang, S. Satoh, Learning to reduce dual-level discrepancy for infrared-visible person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 618-626.) proposed the D2RL model, which extracts the modality information of pedestrians under different modalities with an adversarial learning method and then mutually transfers the learned modality information through a generative network to form intermediate-modality pedestrian pictures, which are provided to the network as additional modality pictures for learning, thereby reducing the modality difference. However, with adversarial learning, not only do the generated pictures suffer from discontinuity and loss of semantic information, but the network is also difficult to converge.
Feature-learning-based methods aim to learn features shared by pedestrians across modalities, thereby reducing the negative influence of the modality difference. Currently, to obtain robust shared pedestrian features, many methods employ convolutional networks or Transformer networks as the underlying backbone. For example, Zhu et al. (Y. Zhu, Z. Yang, L. Wang, S. Zhao, X. Hu, D. Tao, Hetero-center loss for cross-modality person re-identification, Neurocomputing 386 (2020) 97-109.) designed a simple but high-performance CNN-based network with a hetero-center loss to reduce intra-class cross-modal differences and obtain discriminative pedestrian features. Furthermore, Liang et al. (T. Liang, Y. Jin, Y. Gao, W. Liu, S. Feng, T. Wang, Y. Li, CMTR: Cross-modality Transformer for visible-infrared person re-identification, arXiv preprint arXiv:2110.08994 (2021).) introduced a pure Transformer structure into cross-modal pedestrian re-identification to discover discriminative pedestrian features. Hybrid convolution-Transformer models also compensate for the lack of long-range modeling capability of convolutional networks and the insensitivity of Transformers to local features. Chen et al. (C. Chen, M. Ye, M. Qi, J. Wu, J. Jiang, C. Lin, Structure-aware positional Transformer for visible-infrared person re-identification, IEEE Trans. Image Process. 31 (2022) 2352-2364.) proposed SPOT, a structure-aware positional Transformer model combined with a CNN, to explore the structural features of the human body under different modalities and obtain modality-invariant features. However, existing cross-modal pedestrian re-identification methods are still deficient in multi-scale feature extraction; in addition, the relationship between pedestrian features in different modalities has not been well explored.
Disclosure of Invention
The invention aims to provide a cross-modal pedestrian re-identification method and system based on feature reconstruction, which help obtain more stable, robust and accurate cross-modal pedestrian re-identification results.
In order to achieve the above purpose, the invention adopts the following technical scheme: a cross-modal pedestrian re-identification method based on feature reconstruction comprises the following steps:
1) Extracting paired visible-light and infrared pictures of a plurality of pedestrians from a dataset to form a visible-light training set and an infrared training set;
2) Constructing a cross-modal pedestrian re-identification network model based on feature reconstruction, which mainly comprises a modality-specific feature extraction module, a multi-scale feature extraction module, a Token-aware multi-scale feature fusion module and a cross-modal feature reconstruction module; training the cross-modal pedestrian re-identification network model on the visible-light and infrared training sets to obtain generalizable model parameters;
3) Using the trained cross-modal pedestrian re-identification network model for cross-modal retrieval to realize cross-modal pedestrian re-identification.
Further, in step 1), the dataset is the RegDB cross-modal pedestrian re-identification dataset, and M visible-light pictures and M infrared pictures of N pedestrians are extracted from it in pairs.
Further, in step 2), the cross-modal pedestrian re-identification network model is implemented as follows:
A) Pedestrian features are extracted from the input visible-light picture and infrared picture by independent modality-specific feature extraction modules, and the extracted pedestrian features are then fed simultaneously into the multi-scale feature extraction module;
B) The multi-scale feature extraction module extracts the multi-scale pedestrian features of the visible-light and infrared pictures through several feature extraction modules of different scales;
C) The multi-scale pedestrian features are sent to the Token-aware multi-scale feature fusion module, which uses a learnable Token sequence to model the relationship between the multi-scale pedestrian features through bidirectional interaction between local and global views, reducing the interference of pedestrian-irrelevant features at different scales; the bidirectional local-global interaction is repeated several times to obtain the final visible-light and infrared multi-scale feature relation maps and the visible-light and infrared Token sequences containing multi-scale information;
D) The obtained multi-scale feature relation maps are combined with the original pedestrian features and sent to the last feature extraction module of the multi-scale feature extraction module for further feature learning, followed by pooling and horizontal segmentation to obtain the visible-light and infrared global and local features of pedestrians;
E) The visible-light and infrared global and local features of pedestrians, together with the visible-light and infrared Token sequences containing multi-scale information, are input into the cross-modal feature reconstruction module to reconstruct cross-modal features and discover the relationship between pedestrian features under different modalities;
F) To reduce the noise introduced into pedestrian features during reconstruction, a feature reconstruction loss is constructed: the loss between the reconstructed features and the target-modality features is computed, and their error is minimized by an optimizer to strengthen the relationship between the features of the two modalities.
Further, in step B), the multi-scale feature extraction module comprises four feature extraction modules, Stage-1, Stage-2, Stage-3 and Stage-4; the pedestrian feature extracted by the modality-specific feature extraction module has size 3×288×144; after the first feature extraction module Stage-1 the feature map size is 256×72×36, after the second feature extraction module Stage-2 it is 512×36×18, and after the third feature extraction module Stage-3 it is 1024×18×9.
Further, in step C), adaptive pooling is used to unify the sizes of the pedestrian features at different scales before splicing them, and a bidirectional hybrid convolution-Transformer structure is used to model the multi-scale pedestrian features, reducing the interference of pedestrian-irrelevant features at different scales; the learnable Token sequence is used to mine the relationships of the multi-scale pedestrian features under local and global views. For the visible-light multi-scale pedestrian feature M_vis, the process of turning from the local view to the global view is expressed as:
T′_vis = LN(FFN(MHA(T, FL(M_vis), FL(M_vis))) + T)
where T denotes the learnable Token sequence, whose length is set to 6, FL denotes the operation of flattening a three-dimensional pedestrian feature into a two-dimensional sequence, MHA denotes the multi-head attention mechanism, FFN denotes a feed-forward operation, and LN denotes layer normalization;
the process of turning from the global view to the local view is expressed as:
M′_vis = Conv(RS(MHA(FL(M_vis), T′_vis, T′_vis)) + M_vis)
where Conv denotes a convolution operation and RS denotes an operation of converting a two-dimensional feature into a three-dimensional feature.
Further, in step E), the cross-modal feature reconstruction module is implemented as follows:
using the visible-light and infrared Token sequences T′_vis and T′_ir containing pedestrian multi-scale information obtained in step C), the visible-light and infrared global feature g and local features l^p of pedestrians are reconstructed to strengthen the relationship between the two modal features; the cross-modal reconstruction of the visible-light global feature g_vis yields the feature ĝ_vis→ir, expressed as:
ĝ_vis→ir = Attn(W_Q g_vis, W_K T′_ir[0], W_V T′_ir[0])
where Attn denotes the attention mechanism, T′_ir[0] denotes the infrared Token sequence of the first pedestrian, and W_Q, W_K and W_V convert the corresponding features into the Query, Key and Value matrices; similarly, replacing g_vis in the above formula with the local feature l^p_vis yields the reconstructed local feature l̂^p_vis→ir.
Further, in step F), the feature reconstruction loss is constructed as follows: the difference between the reconstructed pedestrian features and the target-modality features is computed to obtain the feature reconstruction loss L_rec, and the network model is updated with an optimizer; the loss is expressed as:
L_rec = L1(ĝ_vis→ir, g_ir) + (1/N_p) Σ_{p=1}^{N_p} L1(l̂^p_vis→ir, l^p_ir)
where L1 denotes the Manhattan distance and N_p denotes the number of horizontal slices of the pedestrian feature.
The invention also provides a cross-modal pedestrian re-identification system based on feature reconstruction, comprising a memory, a processor, and computer program instructions stored on the memory and executable by the processor; when the processor executes the computer program instructions, the steps of the above method are implemented.
Compared with the prior art, the invention has the following beneficial effects: the method and the system effectively utilize multi-scale feature learning and cross-modal feature reconstruction to obtain generalizable and robust pedestrian features, can effectively cope with pose variation and occlusion, and alleviate the performance degradation caused by the modality difference.
Drawings
FIG. 1 is a schematic diagram of the cross-modal pedestrian re-identification network model based on feature reconstruction in an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiment provides a cross-modal pedestrian re-identification method based on feature reconstruction, which comprises the following steps:
1) Paired visible-light and infrared pictures of a plurality of pedestrians are extracted from the dataset to form a visible-light training set and an infrared training set.
2) A cross-modal pedestrian re-identification network model based on feature reconstruction is constructed, which mainly comprises a modality-specific feature extraction module, a multi-scale feature extraction module, a Token-aware multi-scale feature fusion module and a cross-modal feature reconstruction module; the architecture of the network model is shown in FIG. 1. The model is trained on the visible-light and infrared training sets to obtain generalizable model parameters.
3) The trained cross-modal pedestrian re-identification network model is used for cross-modal retrieval to realize cross-modal pedestrian re-identification.
In step 1), the dataset is the RegDB cross-modal pedestrian re-identification dataset, and M visible-light pictures and M infrared pictures of N pedestrians are extracted from it in pairs.
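As an illustration of this paired sampling step, the sketch below builds the two training sets from a toy dataset. The dictionary layout, field names and the `build_paired_training_sets` helper are assumptions made for the example, not part of the patent.

```python
import random

def build_paired_training_sets(dataset, num_ids, imgs_per_modality, seed=0):
    """Sample M visible and M infrared pictures for each of N pedestrian IDs.

    `dataset` maps pedestrian_id -> {"visible": [...], "infrared": [...]};
    this layout is illustrative only.
    """
    rng = random.Random(seed)
    visible_set, infrared_set = [], []
    for pid in sorted(dataset)[:num_ids]:
        vis = rng.sample(dataset[pid]["visible"], imgs_per_modality)
        ir = rng.sample(dataset[pid]["infrared"], imgs_per_modality)
        # keep the sets index-aligned so each visible sample has an
        # infrared counterpart with the same identity
        visible_set += [(pid, p) for p in vis]
        infrared_set += [(pid, p) for p in ir]
    return visible_set, infrared_set
```

Identity labels travel with each picture so that later losses can match pairs across modalities.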
In step 2), the cross-modal pedestrian re-identification network model is implemented as follows:
A) Pedestrian features are extracted from the input visible-light and infrared pictures by independent modality-specific feature extraction modules, and the extracted pedestrian features are then fed simultaneously into the multi-scale feature extraction module.
B) The multi-scale feature extraction module extracts the multi-scale pedestrian features of the visible-light and infrared pictures through several feature extraction modules of different scales.
C) The multi-scale pedestrian features are sent to the Token-aware multi-scale feature fusion module, which uses a small set of learnable Tokens to model the relationship between the multi-scale pedestrian features through bidirectional interaction between local and global views, reducing the interference of pedestrian-irrelevant features at different scales; the bidirectional local-global interaction is repeated several times to obtain the final visible-light and infrared multi-scale feature relation maps and the visible-light and infrared Token sequences containing multi-scale information.
D) The obtained multi-scale feature relation maps are combined with the original pedestrian features and sent to the last feature extraction module of the multi-scale feature extraction module for further feature learning, followed by pooling and horizontal segmentation to obtain the visible-light and infrared global and local features of pedestrians.
E) The visible-light and infrared global and local features of pedestrians, together with the visible-light and infrared Token sequences containing multi-scale information, are input into the cross-modal feature reconstruction module to reconstruct cross-modal features and discover the relationship between pedestrian features under different modalities.
F) To reduce the noise introduced into pedestrian features during reconstruction, a feature reconstruction loss is constructed: the loss between the reconstructed features and the target-modality features is computed, and their error is minimized by an optimizer to strengthen the relationship between the features of the two modalities.
In step B), the multi-scale feature extraction module comprises four feature extraction modules, Stage-1, Stage-2, Stage-3 and Stage-4; the pedestrian feature extracted by the modality-specific feature extraction module has size 3×288×144; after the first feature extraction module Stage-1 the feature map size is 256×72×36, after the second feature extraction module Stage-2 it is 512×36×18, and after the third feature extraction module Stage-3 it is 1024×18×9.
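The feature-map sizes listed above follow a ResNet-50-style downsampling schedule (overall stride 4 in Stage-1, stride 2 in each later stage — an assumption inferred from the stated sizes, since the patent does not name the backbone). A minimal shape-arithmetic check:

```python
def stage_output_shape(in_shape, out_channels, stride):
    """One stage changes the channel count and divides H and W by its stride."""
    c, h, w = in_shape
    return (out_channels, h // stride, w // stride)

shape = (3, 288, 144)                    # input pedestrian picture
s1 = stage_output_shape(shape, 256, 4)   # Stage-1 -> (256, 72, 36)
s2 = stage_output_shape(s1, 512, 2)      # Stage-2 -> (512, 36, 18)
s3 = stage_output_shape(s2, 1024, 2)     # Stage-3 -> (1024, 18, 9)
```

The three computed shapes reproduce exactly the sizes stated in the paragraph above.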
In step C), adaptive pooling is used to unify the sizes of the pedestrian features at different scales before splicing them; a bidirectional hybrid convolution-Transformer structure is used to model the multi-scale pedestrian features, reducing the interference of pedestrian-irrelevant features at different scales. The learnable Token sequence is used to discover the relationships of the multi-scale pedestrian features under local and global views. Taking the visible-light multi-scale pedestrian feature M_vis as an example, the process of turning from the local view to the global view is expressed as:
T′_vis = LN(FFN(MHA(T, FL(M_vis), FL(M_vis))) + T)
where T denotes the learnable Token sequence, whose length is set to 6, FL denotes the operation of flattening a three-dimensional pedestrian feature into a two-dimensional sequence, MHA denotes the multi-head attention mechanism, FFN denotes a feed-forward operation, and LN denotes layer normalization.
The process of turning from the global view to the local view is expressed as:
M′_vis = Conv(RS(MHA(FL(M_vis), T′_vis, T′_vis)) + M_vis)
where Conv denotes a convolution operation and RS denotes an operation of converting a two-dimensional feature into a three-dimensional feature.
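The two interaction directions above can be sketched in NumPy with single-head attention standing in for MHA, and with the FFN and the final Conv omitted for brevity — a simplified illustration of the local-to-global and global-to-local passes, not the patent's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention (single head)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def local_to_global(tokens, feat_map):
    """Tokens query the flattened feature map: T' = LN(attn(T, FL(M), FL(M)) + T)."""
    c, h, w = feat_map.shape
    seq = feat_map.reshape(c, h * w).T          # FL(.): (C,H,W) -> (H*W, C)
    return layer_norm(attention(tokens, seq, seq) + tokens)

def global_to_local(tokens, feat_map):
    """Feature map queries the Tokens: M' = RS(attn(FL(M), T', T')) + M (Conv omitted)."""
    c, h, w = feat_map.shape
    seq = feat_map.reshape(c, h * w).T
    out = attention(seq, tokens, tokens) + seq  # residual as in the formula
    return out.T.reshape(c, h, w)               # RS(.): back to 3-D
```

Repeating these two passes corresponds to the "repeated several times" bidirectional interaction described above.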
In step E), the cross-modal feature reconstruction module is implemented as follows:
using the visible-light and infrared Token sequences T′_vis and T′_ir containing pedestrian multi-scale information obtained in step C), the visible-light and infrared global feature g and local features l^p of pedestrians are reconstructed to strengthen the relationship between the two modal features; the cross-modal reconstruction of the visible-light global feature g_vis yields the feature ĝ_vis→ir, expressed as:
ĝ_vis→ir = Attn(W_Q g_vis, W_K T′_ir[0], W_V T′_ir[0])
where Attn denotes the attention mechanism, T′_ir[0] denotes the infrared Token sequence of the first pedestrian, and W_Q, W_K and W_V convert the corresponding features into the Query, Key and Value matrices; similarly, replacing g_vis in the above formula with the local feature l^p_vis yields the reconstructed local feature l̂^p_vis→ir.
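A toy version of this cross-modal reconstruction, with random matrices standing in for the learned W_Q, W_K and W_V projections (the feature dimension and token count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # feature dimension, chosen for the example

# random stand-ins for the learned Query/Key/Value projections
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reconstruct(feature, token_seq):
    """Cross-modal reconstruction: the query is one modality's pedestrian
    feature, the keys/values are the other modality's Token sequence."""
    q = feature[None, :] @ W_q           # (1, d)
    k = token_seq @ W_k                  # (n_tokens, d)
    v = token_seq @ W_v
    attn = softmax(q @ k.T / np.sqrt(d))
    return (attn @ v)[0]                 # reconstructed feature, shape (d,)

g_vis = rng.standard_normal(d)           # visible-light global feature
t_ir = rng.standard_normal((6, d))       # infrared Token sequence (6 tokens)
g_vis2ir = reconstruct(g_vis, t_ir)      # visible feature reconstructed toward infrared
```

The same `reconstruct` call applies unchanged to each horizontal local feature.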
In step F), the feature reconstruction loss is constructed as follows: the difference between the reconstructed pedestrian features and the target-modality features is computed to obtain the feature reconstruction loss L_rec, and the network model is updated with an optimizer; the loss is expressed as:
L_rec = L1(ĝ_vis→ir, g_ir) + (1/N_p) Σ_{p=1}^{N_p} L1(l̂^p_vis→ir, l^p_ir)
where L1 denotes the Manhattan distance and N_p denotes the number of horizontal slices of the pedestrian feature.
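The feature reconstruction loss can be sketched as a Manhattan (L1) distance between reconstructed and target-modality features; averaging over the N_p local parts, as done here, is an assumption about the exact weighting, which the patent text does not pin down:

```python
import numpy as np

def l1(a, b):
    # Manhattan distance between two feature vectors
    return np.abs(a - b).sum()

def reconstruction_loss(rec_global, tgt_global, rec_locals, tgt_locals):
    """L1 between reconstructed features and target-modality features:
    one global term plus the mean over the N_p horizontal parts."""
    n_p = len(rec_locals)
    loss = l1(rec_global, tgt_global)
    loss += sum(l1(r, t) for r, t in zip(rec_locals, tgt_locals)) / n_p
    return loss
```

Minimizing this loss with any gradient-based optimizer pulls the reconstructed features toward the target modality, as step F) describes.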
The embodiment also provides a cross-modal pedestrian re-identification system based on feature reconstruction, comprising a memory, a processor, and computer program instructions stored on the memory and executable by the processor; when the processor executes the computer program instructions, the steps of the above method are implemented.
In this embodiment, the RegDB dataset is used for comparative evaluation under the setting of retrieving infrared pictures with visible-light pictures of pedestrians. Table 1 shows the comparison between the method proposed by the invention and other cross-modal pedestrian re-identification methods on the RegDB dataset. As can be seen from Table 1, compared with the other cross-modal pedestrian re-identification methods, the proposed method has higher accuracy and robustness, achieving the best Rank-1 and mAP.
TABLE 1
In Table 1, MAUM corresponds to the method proposed by J. Liu et al. (J. Liu, Y. Sun, F. Zhu, H. Pei, Y. Yang, W. Li, Learning memory-augmented unidirectional metrics for cross-modality person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 19366-19375.)
MPANet corresponds to the method proposed by Q. Wu et al. (Q. Wu, P. Dai, J. Chen, C. Lin, Y. Wu, F. Huang, B. Zhong, R. Ji, Discover cross-modality nuances for visible-infrared person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 4330-4339.)
NFS corresponds to the method proposed by Y. Chen et al. (Y. Chen, L. Wan, Z. Li, Q. Jing, Z. Sun, Neural feature search for RGB-infrared person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 587-597.)
SPOT corresponds to the method proposed by C. Chen et al. (C. Chen, M. Ye, M. Qi, J. Wu, J. Jiang, C. Lin, Structure-aware positional Transformer for visible-infrared person re-identification, IEEE Trans. Image Process. 31 (2022) 2352-2364.)
AGW corresponds to the method proposed by M. Ye et al. (M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, S. C. Hoi, Deep learning for person re-identification: A survey and outlook, IEEE Trans. Pattern Anal. Mach. Intell. 44 (6) (2022) 2872-2893.)
DDAG corresponds to the method proposed by M. Ye et al. (M. Ye, J. Shen, D. J. Crandall, L. Shao, J. Luo, Dynamic dual-attentive aggregation learning for visible-infrared person re-identification, in: Proceedings of the European Conference on Computer Vision, 2020, pp. 229-247.)
D-HSME corresponds to the method proposed by Y. Hao et al. (Y. Hao, N. Wang, J. Li, X. Gao, HSME: Hypersphere manifold embedding for visible thermal person re-identification, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 8385-8392.)
MSPAC corresponds to the method proposed by C. Zhang et al. (C. Zhang, H. Liu, W. Guo, M. Ye, Multi-scale cascading network with compact feature learning for RGB-infrared person re-identification, in: Proceedings of the IEEE International Conference on Pattern Recognition, 2021, pp. 8679-8686.)
CMGN corresponds to the method proposed by J. Jiang et al. (J. Jiang, K. Jin, M. Qi, Q. Wang, J. Wu, C. Chen, A cross-modal multi-granularity attention network for RGB-IR person re-identification, Neurocomputing 406 (2020) 59-67.)
SDL corresponds to the method proposed by K. Kansal et al. (K. Kansal, A. V. Subramanyam, Z. Wang, S. Satoh, SDL: Spectrum disentangled representation learning for visible-infrared person re-identification, IEEE Trans. Circuits Syst. Video Technol. 30 (10) (2020) 3422-3432.)
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any person skilled in the art may modify or alter the disclosed technical content to obtain equivalent embodiments. Any simple modification, equivalent variation or alteration of the above embodiments made according to the technical substance of the present invention still falls within the protection scope of the technical solution of the present invention.
Claims (8)
1. A cross-modal pedestrian re-identification method based on feature reconstruction, characterized by comprising the following steps:
1) Extracting visible light pictures and infrared pictures of a plurality of pedestrians in pairs from the data set to form a visible light training data set and an infrared training data set;
2) A cross-modal pedestrian re-recognition network model based on feature reconstruction is constructed, mainly comprising a specific feature extraction module, a multi-scale feature extraction module, a Token-aware multi-scale feature fusion module and a cross-modal feature reconstruction module; the cross-modal pedestrian re-recognition network model is trained with the visible light training dataset and the infrared training dataset to obtain generalizable model parameters;
3) And using the trained cross-modal pedestrian re-recognition network model for cross-modal retrieval to realize cross-modal pedestrian re-recognition.
2. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 1, wherein in step 1), the dataset is the RegDB cross-modal pedestrian re-identification dataset, and M visible light pictures and M infrared pictures of N pedestrians are extracted in pairs from the RegDB cross-modal pedestrian re-identification dataset.
3. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 1, wherein in step 2), the cross-modal pedestrian re-recognition network model is implemented as follows:
a) Pedestrian features are extracted from the input visible light picture and the input infrared picture through independent specific feature extraction modules respectively, and the extracted pedestrian features are then input simultaneously into a multi-scale feature extraction module;
b) The multi-scale feature extraction module extracts multi-scale pedestrian features of the visible light picture and the infrared picture through a plurality of feature extraction modules of different scales;
c) The multi-scale pedestrian features are sent to the Token-aware multi-scale feature fusion module, which models the relationships between the multi-scale pedestrian features through bidirectional interaction between the local and global views of a learnable Token sequence, reducing the interference of pedestrian-irrelevant features at different scales; the bidirectional local-global interaction process is repeated several times to obtain the final visible light and infrared multi-scale feature relationship diagrams and visible light and infrared Token sequences containing multi-scale information;
d) The obtained multi-scale feature relationship diagram is combined with the original pedestrian features and sent to the last feature extraction module of the multi-scale feature extraction module for further feature learning, after which pooling and horizontal segmentation are performed to obtain visible light and infrared global features and local features of pedestrians;
e) Inputting visible light and infrared global features and local features of pedestrians and visible light and infrared Token sequences containing multi-scale information into a cross-modal feature reconstruction module to reconstruct cross-modal features and discover the connection of the features of pedestrians under different modes;
f) In order to reduce noise generated by pedestrian features in the reconstruction process, feature reconstruction loss is constructed, loss calculation is performed on the reconstructed features and target modal features, and errors of the reconstructed features and the target modal features are minimized through an optimizer so as to enhance the connection of the features between the two modalities.
4. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 3, wherein in step B), the multi-scale feature extraction module comprises four feature extraction modules Stage-1, Stage-2, Stage-3 and Stage-4; the pedestrian feature extracted by the specific feature extraction module has a size of 3×288×144, the feature map size becomes 256×72×36 after the first feature extraction module Stage-1, 512×36×18 after the second feature extraction module Stage-2, and 1024×18×9 after the third feature extraction module Stage-3.
5. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 3, wherein in step C), adaptive pooling is used to unify and then concatenate the different-scale features of pedestrians, a two-way hybrid structure of convolution and Transformer is used to model the multi-scale pedestrian features, and the interference of pedestrian-irrelevant features at different scales is reduced; a learnable Token sequence is used for relation mining on the multi-scale pedestrian features under the local and global views; for the visible light pedestrian multi-scale feature M_vis, the process of turning from the local view to the global view is expressed as:
T′_vis = LN(FFN(MHA(T, FL(M_vis), FL(M_vis))) + T)
wherein T represents the learnable Token sequence, whose number is set to 6, FL represents the operation of flattening the three-dimensional pedestrian feature into a two-dimensional feature, MHA represents the multi-head attention mechanism, FFN represents the feed-forward operation, and LN represents the layer normalization operation;
the process of turning from the global view to the local view is expressed as:
M′_vis = Conv(RS(MHA(FL(M_vis), T′_vis, T′_vis)) + M_vis)
wherein Conv represents a convolution operation and RS represents the operation of converting a two-dimensional feature into a three-dimensional feature.
6. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 3, wherein in step E), the cross-modal feature reconstruction module is implemented as follows:
the visible light and infrared Token sequences T′_vis and T′_ir containing pedestrian multi-scale information obtained in step C) are utilized to reconstruct the visible light and infrared global features f^g_vis, f^g_ir and local features f^l_vis, f^l_ir of pedestrians, so as to enhance the relationship between the features of the two modalities; the cross-modal reconstruction of the global feature yields the reconstructed feature, expressed as:
f̂^g_vis = Attn(W_Q f^g_vis, W_K T′_ir[0], W_V T′_ir[0])
wherein Attn represents the attention mechanism, T′_ir[0] denotes the infrared Token sequence of the first pedestrian, and W_Q, W_K and W_V represent the matrices converting the corresponding features into the Query, Key and Value; similarly, replacing f^g_vis in the above formula with the local feature f^l_vis yields the reconstructed local feature f̂^l_vis.
7. The cross-modal pedestrian re-identification method based on feature reconstruction according to claim 3, wherein in step F), the specific method for constructing the feature reconstruction loss is as follows: the difference between the reconstructed pedestrian features and the target-modality features is calculated to obtain the feature reconstruction loss L_rec, and the network model is updated with an optimizer, expressed as:
L_rec = L1(f̂^g_vis, f^g_ir) + (1/N_p) Σ_{i=1}^{N_p} L1(f̂^l_vis,i, f^l_ir,i)
wherein L1 represents the Manhattan distance and N_p represents the number of horizontal splits of the pedestrian features.
8. A cross-modal pedestrian re-identification system based on feature reconstruction, comprising a memory, a processor, and computer program instructions stored on the memory and executable by the processor, wherein the method steps of any one of claims 1 to 7 are implemented when the processor executes the computer program instructions.
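The bidirectional local-global Token interaction formulas in claim 5 can be sketched in PyTorch as below. This is a minimal sketch under stated assumptions: the embedding size, head count, FFN expansion ratio, and the class name `TokenAwareFusion` are illustrative and not specified by the patent.

```python
import torch
import torch.nn as nn


class TokenAwareFusion(nn.Module):
    """Sketch of one round of bidirectional Token/feature interaction.

    Local -> global: T' = LN(FFN(MHA(T, FL(M), FL(M))) + T), where the
    learnable Token sequence queries the flattened feature map.
    Global -> local: M' = Conv(RS(MHA(FL(M), T', T')) + M), projecting
    Token information back onto the feature map. All hyperparameters
    here are illustrative assumptions.
    """

    def __init__(self, dim: int, num_tokens: int = 6, heads: int = 4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))  # learnable T, 6 tokens
        self.mha_l2g = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mha_g2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, m: torch.Tensor):
        # m: (B, D, H, W) multi-scale pedestrian feature map
        b, d, h, w = m.shape
        flat = m.flatten(2).transpose(1, 2)             # FL: (B, H*W, D)
        t = self.tokens.unsqueeze(0).expand(b, -1, -1)  # (B, num_tokens, D)
        # local -> global: Tokens attend over the flattened feature map
        t_out, _ = self.mha_l2g(t, flat, flat)
        t_new = self.norm(self.ffn(t_out) + t)          # T' = LN(FFN(MHA(...)) + T)
        # global -> local: feature map attends over the updated Tokens
        m_out, _ = self.mha_g2l(flat, t_new, t_new)
        m_new = m_out.transpose(1, 2).reshape(b, d, h, w)  # RS: back to 3D
        return self.conv(m_new + m), t_new              # M' = Conv(RS(...) + M)
```

In the described network this round would be repeated several times, once per modality, before the relationship diagrams and Token sequences are passed on to Stage-4 and the cross-modal reconstruction module.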
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310406803.6A CN116434143A (en) | 2023-04-17 | 2023-04-17 | Cross-modal pedestrian re-identification method and system based on feature reconstruction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116434143A true CN116434143A (en) | 2023-07-14 |
Family
ID=87085064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310406803.6A Pending CN116434143A (en) | 2023-04-17 | 2023-04-17 | Cross-modal pedestrian re-identification method and system based on feature reconstruction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116434143A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118072252A (en) * | 2024-04-17 | 2024-05-24 | 武汉大学 | Pedestrian re-recognition model training method suitable for arbitrary multi-mode data combination |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||