CN116798070A - Cross-mode pedestrian re-recognition method based on spectrum sensing and attention mechanism - Google Patents
Cross-mode pedestrian re-recognition method based on spectrum sensing and attention mechanism
- Publication number
- CN116798070A (application CN202310537794.4A)
- Authority
- CN
- China
- Prior art keywords: attention; images; pedestrian; cross-modal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
- G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
- G06V10/74 - Image or video pattern matching; proximity measures in feature spaces
- G06V10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
Abstract
The invention relates to a cross-modal pedestrian re-identification method based on spectrum sensing and an attention mechanism, aimed at the cross-modality discrepancy between RGB images and infrared images in pedestrian re-identification. An additional input group consisting of visible-light, infrared, and homogeneously enhanced grayscale images is added, and the network model is jointly trained on the two input groups, which further strengthens the use of features within the limited images and improves matching accuracy. Features of images from different modalities are extracted and merged by a network that combines single-stream and dual-stream designs; tri-modal feature learning is then addressed from the perspectives of multi-modal classification and multi-view retrieval. Finally, an attention mechanism captures richer local features of pedestrian images and realizes the interactive fusion of modal information, improving cross-modal pedestrian re-identification.
Description
Technical Field
The invention relates to the field of computer vision, and in particular to a cross-modal pedestrian re-identification method based on spectrum sensing and an attention mechanism, built on existing deep learning techniques.
Background
Pedestrian re-identification (Re-ID) is a typical image retrieval problem: retrieving images of a target pedestrian from a given gallery of pedestrian images collected across devices. Pedestrian re-identification can compensate for the visual limitations of face recognition and fixed cameras in intelligent security, video surveillance, and related fields, and can be combined with pedestrian detection and pedestrian tracking to form a complete pedestrian re-identification system.
Most current work focuses on pedestrian re-identification under visible-light cameras. In practical deployments, however, cameras must operate around the clock. Because visible-light cameras are of limited use for nighttime surveillance, cameras that can switch to an infrared mode are increasingly applied in intelligent monitoring systems. The visible-light mode yields RGB images and the infrared mode yields infrared images, that is, data from two different modalities, which raises the cross-modal pedestrian re-identification problem and the attention it has since received. Solving cross-modal pedestrian re-identification effectively is of great significance for public safety, crime prevention, criminal investigation, and similar applications.
Cross-modal pedestrian re-identification studies the following problem: given a visible-light or infrared image of a specific individual, search the image galleries of both modalities for matching images of the same individual. The difficulty lies in reducing the discrepancy between the two modalities well during modeling.
Against this background, a cross-modal pedestrian re-identification method based on spectrum sensing and attention mechanisms is proposed, which can effectively address the cross-modality problem.
Disclosure of Invention
To address the above shortcomings of the prior art, the invention provides a cross-modal pedestrian re-identification method based on spectrum sensing and an attention mechanism, aimed at the cross-modality discrepancy between RGB images and infrared images in pedestrian re-identification.
To achieve the above object, the invention is realized by the following technical scheme. The cross-modal pedestrian re-identification method based on spectrum sensing and an attention mechanism comprises the following steps:
1. Acquire a cross-modal training data set with identity labels, where each training sample comprises an infrared image and an RGB image of the corresponding identity. Generate a grayscale image from each visible-light image for data enhancement, add a second input group consisting of the visible-light, infrared, and homogeneously enhanced grayscale images, and jointly train the network model on the two input groups.
2. Construct a backbone network based on ResNet-50 as the feature extractor, and combine a single-stream and a dual-stream design to extract and merge the features of images of different modalities; then address tri-modal feature learning from the perspectives of multi-modal classification and multi-view retrieval.
3. An intra-modality weighted part aggregation (IWPA) module coordinates three attention modules in a multi-module, multi-level embedding manner to capture multi-scale context information, preserving salient detail features, strengthening the interdependence among features, and mining latent relations among pedestrian images.
4. The feature representations of images of different modalities that share the same identity reinforce each other. A cross-modal graph structured attention (CGSA) module learns the structural relation between the two modalities to strengthen the feature representation.
Specifically, step 1 includes:
1.1) The neural network model takes two input groups: one combines visible-light and infrared images; the other consists of visible-light images, infrared images, and grayscale images generated from the visible-light images. Each training batch contains n visible-light images, n homogeneously enhanced grayscale images, and n infrared images. Both groups participate in the computation of the label loss and the triplet loss, so that the limited image resources are fully exploited to learn better image features. To gray the visible-light image, the pixel values of its R, G, B channels are directly accumulated, yielding the enhancement data; this generation scheme adds essentially no training time, and this form of data enhancement introduces no extra noise that would degrade model training.
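The homogeneous grayscale enhancement above can be sketched as follows. The patent only states that the R, G, B pixel values are "directly accumulated"; dividing by three to stay in the uint8 range, and replicating the result to three channels so the image fits the visible-light stream, are assumptions:

```python
import numpy as np

def homogeneous_grayscale(rgb):
    """Generate the grayscale enhancement of an H x W x 3 uint8 RGB image.
    Equal-weight accumulation of the three channels (divided by 3 to keep
    the uint8 range) is an assumption; the source only says the channels
    are directly accumulated."""
    gray = rgb.astype(np.float32).sum(axis=2) / 3.0   # accumulate R, G, B
    gray = np.clip(np.rint(gray), 0, 255).astype(np.uint8)
    # Replicate to 3 channels so the image can share the visible-light stream.
    return np.stack([gray] * 3, axis=2)
```

Each training batch would then hold n visible, n grayscale, and n infrared images.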
Specifically, step 2 includes:
2.1) ResNet-50 is used as the backbone network. A dual-stream design is adopted in the early stages to learn features that discriminate between images of different modalities; the convolution modules in different streams have independent parameters, so low-level modality-specific features can be captured better. Since the visible-light and grayscale images are similar in structural information, they are fed into the same network stream for learning; the infrared image has more distinctive characteristics than the other two, so it passes through its own stream. To learn features that can be shared across modalities, the later convolution modules of the network share parameters.
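The parameter layout just described (modality-specific early stages, shared later stages) can be sketched with stand-in linear blocks; using small matrices instead of real ResNet-50 convolution stages, and the specific dimensions, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_block(d_in, d_out):
    # Stand-in for one ResNet-50 convolution stage: a weight matrix + ReLU.
    return (rng.standard_normal((d_in, d_out)) * 0.1).astype(np.float32)

# Early-stage parameters are modality specific: visible and grayscale
# images share one stream; infrared gets its own independent parameters.
stem_vis_gray = linear_block(128, 64)
stem_infrared = linear_block(128, 64)
# Later stages are parameter-shared to learn modality-invariant features.
shared_stage = linear_block(64, 32)

def extract(x, modality):
    """Map a batch of flattened inputs through its modality stream, then
    through the shared stage ('vis'/'gray' share a stem, 'ir' does not)."""
    stem = stem_infrared if modality == "ir" else stem_vis_gray
    h = np.maximum(x @ stem, 0.0)             # modality-specific low-level features
    return np.maximum(h @ shared_stage, 0.0)  # shared embedding
```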
2.2) For multi-modal classification, a modality-shared identity classifier is learned through the parameter-sharing network, and homogeneous and heterogeneous identification losses are designed to mine shared identity-invariant representations. For multi-view retrieval, a weighted six-direction ranking loss is developed to optimize the relative distances across multiple modalities. To strengthen robustness to modality changes, homogeneous invariant regularization is introduced; its main idea is that the features of the original visible-light image and of its homogeneously enhanced grayscale image should remain consistent after extraction by the feature network. Specifically, a smooth L1 loss is employed as the regularizer:
$$L_r=\sum_{i\in B}\operatorname{smooth}_{L1}\!\left(f_i^{v},\,f_i^{g}\right)$$
where $B$ denotes the current batch image set, $f_i^{v}$ the feature vector that the network extracts from the visible-light image with identity label $i$, and $f_i^{g}$ the feature vector extracted from the corresponding grayscale-enhanced image.
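The smooth L1 homogeneous invariant regularizer can be sketched as follows; the batch-mean reduction is an assumption, since the source only names the smooth L1 form:

```python
import numpy as np

def smooth_l1(x):
    # Elementwise smooth L1 (Huber with delta = 1).
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def homogeneous_invariant_reg(feat_visible, feat_gray):
    """L_r: features of a visible image and of its grayscale-enhanced copy
    should coincide after the feature network.  Inputs are (n, d) feature
    matrices for the same identities; mean reduction over the batch is an
    assumption."""
    return smooth_l1(feat_visible - feat_gray).sum(axis=1).mean()
```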
Specifically, step 3 includes:
3.1) An intra-modality weighted part aggregation (IWPA) module applies attention to suppress irrelevant background noise and focus on pedestrian-related features. The module is divided into two branches: one contains a channel attention module (CAM) followed by a spatial attention module (SAM); the other branch is a position attention module (PAM). Given input features $X$, the first branch passes $X$ through the CAM to obtain the feature map $X_C$ and then through the SAM to obtain $X_S$; in the other branch, $X$ passes through the PAM to obtain the feature map $X_P$. Finally the obtained feature maps $X_S$ and $X_P$ are added, giving the final feature map $X_{OUT}=X_S+X_P$.
Specifically, step 4 includes:
4.1) In feature matching, an adaptive graph structure is used to aggregate pedestrian features that belong to the same individual across modalities, combining same-identity context information with the relations among cross-modal images; the finally output feature vectors are obtained by learning the relation between the two different modalities. The adaptive graph structure is expressed as the normalized adjacency matrix $A_g$ of an undirected graph, where the graph nodes $l_i$ and $l_j$ carry the corresponding one-hot identity codes and an identity matrix supplies the self-connections.
4.2) The dynamic multi-attention aggregation learning strategy treats pixel-level part-aggregated feature learning as the main loss and gradually adds the image-level global feature learning loss for optimization. The training objective loss function is expressed as $L=L_b+L_e$, where
$$L_b=L_{tir}+L_{id}$$
$L_{tir}$ is the cross-modal triplet loss, $L_{id}$ the identity loss, and $e$ the number of training epochs; $L_e$ combines the average loss value of the previous training epoch, the intra-modality aggregated feature learning loss value of the current epoch, and the cross-modal adaptive graph structure constraint value of the current epoch.
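The dynamic loss combination can be sketched as follows. The exact epoch-dependent weighting is not fully recoverable from the text, so the linear ramp below is an assumption consistent with gradually adding the image-level terms:

```python
def dynamic_total_loss(l_tir, l_id, l_p, l_g, epoch, max_epoch):
    """L = L_b + L_e with L_b = L_tir + L_id.  The image-level terms
    (intra-modality aggregated loss l_p and graph constraint l_g) are
    ramped in linearly with the epoch; this schedule is an assumption
    matching the idea of part-level learning as the main loss first."""
    base = l_tir + l_id
    ramp = min(1.0, epoch / max_epoch)   # grows from 0 to 1 over training
    return base + ramp * (l_p + l_g)
```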
The invention has the beneficial effects that:
1. By additionally adding an input group consisting of visible-light, infrared, and homogeneously enhanced grayscale images and jointly training the network model on the two input groups, the invention further strengthens the use of features within the limited images and improves matching accuracy.
2. Using the attention mechanism, richer local features of pedestrian images are obtained and the interactive fusion of modal information is realized, improving cross-modal pedestrian re-identification.
3. With the above effective measures, the invention can greatly improve the retrieval rate of cross-modal pedestrian re-identification.
Description of the drawings:
FIG. 1 is a cross-modal pedestrian re-recognition model framework illustration in an embodiment;
FIG. 2 is a schematic diagram of an aggregate feature attention mechanism module in an embodiment;
fig. 3 is a schematic illustration of a position attention module in an embodiment.
The specific embodiment is as follows:
To make the objects, technical schemes, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and the technical scheme.
As shown in FIG. 1, the cross-modal pedestrian re-identification model in this embodiment comprises: two input image groups, one consisting of visible-light and infrared images and the other of visible-light images, infrared images, and grayscale images generated from the visible-light images; a ResNet-50 backbone network for feature extraction; an aggregated-feature attention module for mining local features of the model (see FIG. 2); and a cross-modal adaptive graph structure module for the interactive fusion of feature information across modalities.
The method comprises the following steps:
1. The neural network model takes two input groups: one composed of visible-light and infrared images, the other of visible-light images, infrared images, and grayscale images generated from the visible-light images. Each training batch contains n visible-light images, n homogeneously enhanced grayscale images, and n infrared images. To gray the visible-light image, the pixel values of its R, G, B channels are directly accumulated to obtain the enhancement data; this generation scheme adds essentially no training time, and this form of data enhancement introduces no extra noise that would degrade model training.
2. The ResNet-50 network model takes the two image groups as input; both participate in the computation of the label loss and the triplet loss, fully exploiting the limited image resources to learn better image features. The network combines a single-stream and a dual-stream design to extract and merge the features of images of different modalities. A dual-stream design is used in the early stages to learn features that discriminate between modalities; the convolution modules in different streams have independent parameters, so low-level modality-specific features are captured better. The visible-light and grayscale images are similar in structural information and are fed into the same stream for learning; the infrared image is more distinctive and passes through its own stream. Tri-modal feature learning is then addressed from the perspectives of multi-modal classification and multi-view retrieval. For multi-modal classification, a modality-shared identity classifier is learned through the parameter-sharing network, and homogeneous and heterogeneous identification losses are designed to mine shared identity-invariant representations. For multi-view retrieval, a weighted six-direction ranking loss is developed to optimize the relative distances across modalities.
2.1) To learn features that can be shared across modalities, the later convolution modules of the network share parameters. After the convolution features are obtained and globally average-pooled, a shared batch normalization layer is added to learn the shared feature embedding.
2.2) The modality-shared identity classifier learns a single classifier $\theta_p$ for the three different modality features. $p(y_i\mid x_i^{v};\theta_p)$ denotes the output probability that classifier $\theta_p$ predicts the label $y_i$ for the visible-light image feature $x_i^{v}$; similarly, $x_i^{g}$ and $x_i^{r}$ denote the grayscale and infrared image features, where $\{v, r, g\}$ indexes the modality. Assuming each training batch contains n visible-light images, n homogeneously enhanced grayscale images, and n infrared images, the label loss is
$$L_{id}=-\frac{1}{3n}\sum_{m\in\{v,g,r\}}\sum_{i=1}^{n}\log p\!\left(y_i\mid x_i^{m};\theta_p\right)$$
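The shared-classifier label loss can be sketched as follows; averaging over the 3n batch images is an assumption consistent with the description:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def shared_identity_loss(logits_by_modality, labels):
    """Cross-entropy label loss with one classifier theta_p shared by the
    visible, grayscale, and infrared features.  `logits_by_modality` is a
    list of three (n, num_ids) logit matrices produced by the same
    classifier head; `labels` holds the n identity labels."""
    total, count = 0.0, 0
    for logits in logits_by_modality:
        p = softmax(logits)
        total += -np.log(p[np.arange(len(labels)), labels] + 1e-12).sum()
        count += len(labels)
    return total / count
```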
2.3) To strengthen robustness to modality changes, homogeneous invariant regularization is introduced. Its main idea is that the features of the original visible-light image and of the homogeneously enhanced grayscale image should remain consistent after extraction by the feature network. Specifically, a smooth L1 loss is employed as the regularizer:
$$L_r=\sum_{i\in B}\operatorname{smooth}_{L1}\!\left(f_i^{v},\,f_i^{g}\right)$$
where $B$ denotes the current batch image set, $f_i^{v}$ the feature vector that the network extracts from the visible-light image with identity label $i$, and $f_i^{g}$ the feature vector extracted from the grayscale-enhanced image. This part of the total loss is called the dual homogeneous and heterogeneous identification (DHHI) loss, $L_{dhhi}=L_d+L_t+L_r$; through the two input groups, DHHI fully exploits the limited picture resources and learns fuller image features.
2.4) To optimize the relations in cross-modal multi-view retrieval (visible-infrared-grayscale), a weighted six-direction triplet ranking (WSDR) loss is designed. Let $D(x_i,x_j)$ denote the Euclidean distance between two samples, where the subscripts $\{i,j,k\}$ index the images. The visible-light modality is first taken as the anchor modality; positives are then searched in the infrared modality and negatives in the grayscale modality. Formally, let $x_i^{v}$ be an anchor visible-light sample; a triplet $(x_i^{v},x_j^{r},x_k^{g})$ is mined when it violates the ranking constraint, i.e. when the positive pair is not closer than the negative pair by the margin. For each anchor $x_i^{v}$, the strategy selects the furthest visible-infrared positive pair and the nearest visible-grayscale negative pair; the cross-view ranking thus yields the mined informative triplet $(x_i^{v},x_j^{r},x_k^{g})$, where $x_j^{r}$ and $x_k^{g}$ are respectively the infrared positive and grayscale negative corresponding to $x_i^{v}$. The visible-infrared-grayscale triplet loss, which contains the margin parameter $\rho$, is defined as
$$L_{vrg}=\sum_{i}\left[\rho+D\!\left(x_i^{v},x_j^{r}\right)-D\!\left(x_i^{v},x_k^{g}\right)\right]_{+}$$
From the ranking point of view, informative triplets of the infrared-grayscale-visible relation and of the grayscale-visible-infrared relation can be mined in the same way; the six-direction triplet ranking loss $L_{SDR}$ sums the triplet losses over all of these cross-modal orderings.
2.5) The SDR loss treats all mined triplets equally: each triplet's individual loss contributes equally to the total. Because the sample types differ, however, the contribution of each triplet should differ. To embody each triplet's contribution, a triplet weighting strategy is adopted: each mined triplet is assigned a weight computed from its distance difference $\left[\rho+D(\text{anchor},\text{positive})-D(\text{anchor},\text{negative})\right]_{+}$, where $[\,\cdot\,]_{+}=\max(\cdot,0)$, and the weights are then normalized across the mined triplets. The normalized triplet weights measure the importance of each mined triplet to the overall learning objective: triplets more important to network learning are assigned larger weights, which are dynamically updated as the network optimizes. At the same time, each triplet's weight reflects its relation to all other mined triplets and provides additional supervision of the similarity within triplets. Using the normalized triplet weights, the WSDR loss is the weight-normalized sum of the individual triplet losses.
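The triplet weighting step can be sketched as follows; the softmax form of the normalization is an assumption, since the source only states that the weights are computed from the distance differences and normalized:

```python
import numpy as np

def weighted_triplet_loss(d_pos, d_neg, margin=0.3):
    """Weighted ordering loss over mined triplets.  d_pos[k]/d_neg[k] are
    the anchor-positive / anchor-negative Euclidean distances of triplet
    k.  Each triplet's violation [margin + d_pos - d_neg]_+ is softmax-
    normalized into a weight, so harder triplets contribute more."""
    viol = np.maximum(margin + d_pos - d_neg, 0.0)
    w = np.exp(viol) / np.exp(viol).sum()   # normalized triplet weights
    return (w * viol).sum()
```

With equal violations the weights are uniform and the loss reduces to the plain mean violation.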
This partial loss is applied only to the second feature group, which contains images of all three modalities; for the first, bimodal input group an ordinary triplet loss is computed, denoted $L_{tir}$.
3. In the pedestrian re-identification task, attention mechanisms can suppress irrelevant background noise and focus on pedestrian-related features. However, when the image discrepancy between the two modalities is large, the network model is disturbed by noise, which affects re-identification accuracy. To solve this problem, this embodiment employs an intra-modality weighted part aggregation (IWPA) module, which coordinates three attention modules in a multi-module, multi-level embedding manner to capture multi-scale context information and thereby further preserve salient detail features. As shown in FIG. 2, the module is split into two branches: one contains a channel attention module (CAM) followed by a spatial attention module (SAM); the other branch is a position attention module (PAM). Given input features $X$, the first branch passes $X$ through the CAM to obtain the feature map $X_C$ and then through the SAM to obtain $X_S$; in the other branch, $X$ passes through the PAM to obtain the feature map $X_P$. Finally the obtained feature maps $X_S$ and $X_P$ are added, giving the final feature map $X_{OUT}=X_S+X_P$.
3.1) The attention mechanism can reduce the interference caused by irrelevant information while introducing only a small number of parameters, improving the discriminability of pedestrian features. A channel attention module is therefore designed to obtain the importance of each channel. The module comprises two pooling layers, two fully connected layers, and a sigmoid layer. Given an input feature map $X\in\mathbb{R}^{C\times H\times W}$, where $C$, $H$, $W$ denote the number of channels, height, and width of the feature map, the channel attention map $A_c$ generated by the module is
$$A_c=\sigma\!\left[W_1 F_{MP};\,W_2 F_{AP}\right]$$
where $[\,;\,]$ denotes a splice operation along the channel, $\sigma$ the sigmoid function, $F_{MP}$ and $F_{AP}$ the feature mappings after global max pooling and global average pooling, and $W_1$, $W_2$ the parameters of the fully connected layers, with $r$ the dimension reduction ratio. After the channel attention map $A_c$ is obtained, the input feature map is multiplied with $A_c$ to obtain the final output of the channel attention module, $X_C=X\odot A_c$, where $\odot$ denotes element-wise multiplication.
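A channel attention sketch for a (C, H, W) feature map follows. The patent writes a channel-wise splice of the two branches; combining them additively so that $A_c$ keeps exactly C entries (CBAM-style) is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W1, W2):
    """CAM sketch.  X has shape (C, H, W); F_MP / F_AP are the global max /
    average pooled channel descriptors; W1, W2 play the role of the fully
    connected layers (the bottleneck with reduction ratio r is folded into
    a single C x C matrix here for brevity)."""
    C = X.shape[0]
    F_MP = X.reshape(C, -1).max(axis=1)
    F_AP = X.reshape(C, -1).mean(axis=1)
    A_c = sigmoid(W1 @ F_MP + W2 @ F_AP)   # one attention weight per channel
    return X * A_c[:, None, None]          # X_C = X (element-wise) A_c
```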
3.2) Spatial attention lets the network suppress background interference information while focusing on the most salient region features of the pedestrian image. The spatial attention module proposed here contains two pooling layers (global max pooling and global average pooling), one 1×1 convolution layer, and one sigmoid layer. Given an input feature map $X_C\in\mathbb{R}^{C\times H\times W}$, where $C$, $H$, $W$ denote the number of channels, height, and width of the feature map, the spatial attention map $A_s$ generated by the module is
$$A_s=\sigma\!\left(f^{1\times 1}\!\left(\left[F_{MP};\,F_{AP}\right]\right)\right)$$
where $[\,;\,]$ denotes a splice operation along the channel, $f^{1\times 1}$ a 1×1 convolution, $\sigma$ the sigmoid function, and $F_{MP}$, $F_{AP}$ the feature mappings after global max pooling and global average pooling. After the spatial attention map $A_s$ is obtained, the input feature map is multiplied with $A_s$ to obtain the module's final output feature map, $X_S=X_C\odot A_s$.
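A spatial attention sketch follows; representing the 1×1 convolution over the two stacked pooling maps as a length-2 weight vector is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(X_C, w):
    """SAM sketch.  X_C has shape (C, H, W); max and average pooling along
    the channel axis give two H x W maps, combined by a 1x1 convolution
    (weight vector w of length 2) and a sigmoid; the result scales every
    channel of X_C."""
    F_MP = X_C.max(axis=0)
    F_AP = X_C.mean(axis=0)
    A_s = sigmoid(w[0] * F_MP + w[1] * F_AP)  # H x W attention map
    return X_C * A_s[None, :, :]              # X_S = X_C (element-wise) A_s
```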
3.3) Position attention integrates differentiated feature information under a principled constraint and improves the network's ability to mine similar feature information, yielding richer local features. As shown in FIG. 3, the module comprises three convolution layers with 1×1 kernels, $m(\cdot)$, $n(\cdot)$, $o(\cdot)$, a batch normalization layer (BN), and a learnable weight vector $W_p$. Given a feature map $X\in\mathbb{R}^{C\times H\times W}$, where $C$ is the channel dimension and $H$, $W$ the feature map size, the position attention map $A_p$ is formed as follows: $a_{i,j}$ denotes the effect of position $i$ on position $j$; the feature map is divided into $p$ non-overlapping parts that are convolved by $m(\cdot)$, $n(\cdot)$, $o(\cdot)$; the embeddings from $m(\cdot)$ and $n(\cdot)$ are multiplied to obtain a local attention map for each part; and $W_p$ holds the learnable weights of the different parts.
3.3.1) After the position attention map $A_p$ is obtained, the input feature map undergoes global adaptive pooling and normalization and is then added to $A_p$; the final output feature map of the position attention module is
$$X_p=B\!\left(X_o\right)+A_p$$
where $X_o$ is the global adaptive pooling output of the input feature map $X$ and $B(\cdot)$ is the batch normalization operation. Cross-modal pedestrian re-identification data sets typically contain many wrongly annotated images, as well as image pairs with large visual discrepancy between the two modalities, which makes the distinguishing local features difficult to mine and disrupts the optimization process.
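A simplified position attention sketch over flattened spatial positions follows; the part splitting and the learnable part weights $W_p$ are omitted, and the softmax normalization of $a_{i,j}$ is an assumption:

```python
import numpy as np

def position_attention(X, Wm, Wn, Wo):
    """PAM sketch.  X has shape (C, N) with N flattened spatial positions;
    Wm, Wn, Wo stand for the three 1x1 convolutions m(.), n(.), o(.).
    a[i, j], the effect of position i on position j, comes from a softmax
    over the embedded similarities; the result aggregates the o(.)
    embeddings across positions."""
    q, k, v = Wm @ X, Wn @ X, Wo @ X
    sim = q.T @ k                                         # N x N similarities
    sim = sim - sim.max(axis=0, keepdims=True)            # numerical stability
    a = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)
    return v @ a                                          # attention-aggregated map
```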
4. An adaptive graph structure constraint module is introduced to address the cross-modal pedestrian re-identification problem; it is mainly used to aggregate pedestrian samples that belong to the same individual across modalities.
The adaptive graph structure is expressed as the normalized adjacency matrix $A_g$ of an undirected graph: the graph nodes $l_i$ and $l_j$ carry the corresponding one-hot identity codes, two nodes are connected when their codes match, and an identity matrix supplies the self-connections.
4.1 Graph attention measures the importance of a single node to the nodes in the other modality. The input node features are represented here by the output of the pooling layer. The graph attention coefficient is as follows:
e_{i,j} = Γ(w_g^T [h(x_i), h(x_j)])
wherein Γ(·) represents the LeakyReLU operation, [·,·] represents the concatenation operation, h(·) represents a transformation matrix that reduces the input node feature dimension C to d (set to 256 in the experiments), w_g ∈ R^(2d×1) represents a learnable weight vector for measuring the importance of different feature dimensions in the concatenated features, and A_g is the undirected graph given by the normalized adjacency matrix. Combining contextual information with the relationship between cross-modality images of the same identity can be used to enhance the feature representation. By learning the relation between the two different modalities and combining their cross-modal structural relationship, the feature representation is enhanced, and the final output feature representation is as follows:
where m represents the number of samples in the current input batch and C represents the feature dimension of the last pooling layer output.
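A minimal sketch of the graph attention coefficient of 4.1, assuming a GAT-style formulation: the learnable vector w_g scores the concatenation of two reduced node features through LeakyReLU, and the scores are softmax-normalized over the other modality's nodes before aggregating cross-modal context. The names Xv, Xi, H and wg are illustrative assumptions, not from the patent.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def cross_modal_graph_attention(Xv, Xi, H, wg):
    """GAT-style coefficient: w_g^T [h(x_i), h(x_j)] decomposes into two
    dot products because the concatenated pair is scored by one vector.
    Gamma is LeakyReLU; attention is normalized over the other modality."""
    hv, hi = Xv @ H, Xi @ H                  # h(.) reduces C to d dimensions
    d = hv.shape[1]
    s = leaky_relu(hv @ wg[:d][:, None] + (hi @ wg[d:])[None, :])
    att = softmax(s, axis=1)                 # coefficients over cross-modal nodes
    return att @ hi, att                     # aggregated context + coefficients
```

Splitting w_g into its first and second halves avoids materializing every concatenated pair while computing exactly the same scores.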
4.2 In order to better address the large differences in pedestrian image features in the cross-modal pedestrian re-recognition task, the intra-modal weighted aggregation module and the cross-modal graph-structured attention module can be integrated. Since these two parts focus on different learning objectives, the cross-modal adaptive graph structure constraint part becomes quite unstable if their loss functions are simply combined directly for supervised network training. To solve this problem, a dynamic multi-attention aggregation learning method may be employed. Specifically, the whole joint learning framework is decomposed into two different tasks, which act respectively on the intra-modality aggregated feature learning loss L_p and the cross-modal adaptive graph structure constraint loss L_g. Here L_p is the combination of the baseline learning objective L_b and the loss L_p-c proposed by the aggregated feature attention (AFA) mechanism, namely:
L_p = L_b + L_p-c
wherein N represents the number of pictures per batch, p represents the probability that a feature is correctly classified, and y_i represents the output feature of the i-th input picture.
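A sketch of the combination L_p = L_b + L_p-c, assuming both terms take the usual identity cross-entropy form, i.e. the average negative log-probability of the correct identity over the N images in a batch. The patent's exact expressions are not reproduced here; the function names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Average negative log-probability of the correct identity over the
    N images in the batch (the assumed form of both L_b and L_p-c)."""
    probs = softmax(logits, axis=1)
    n = len(labels)
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

def loss_p(logits_b, logits_pc, labels):
    # L_p = L_b + L_p-c: baseline head plus aggregated-feature-attention head
    return cross_entropy(logits_b, labels) + cross_entropy(logits_pc, labels)
```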
4.3 To guide the learning of the cross-modal adaptive graph structure, a negative log-likelihood loss may be selected as the loss of the cross-modal adaptive graph structure constraint module, whose loss function is defined as:
wherein the argument of the loss is the output feature obtained after the graph convolution operation.
4.4 Inspired by multitask learning, the dynamic multi-attention aggregation learning strategy treats L_p as the main loss and then gradually adds the learning loss L_g for optimization. The main reason is that the features produced by image-level part aggregation can be learned more easily in the early training phase. After the supervised network has been trained for a period of time, the cross-modal global feature learning loss L_g is introduced, so that the network can be further optimized without causing excessive oscillation. The dynamic multi-attention aggregation learning loss is expressed as:
wherein e is the training epoch, L̄^(e-1) represents the average loss value of the previous training round, L_p^e represents the intra-modality aggregated feature learning loss value of the current round, and L_g^e represents the cross-modal adaptive graph structure constraint value of the current round.
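The phased introduction of L_g can be sketched as an epoch-gated weighting. The linear ramp below is purely illustrative: the patent's schedule also involves the previous round's average loss value L̄^(e-1), which is not modeled here, and the names are assumptions.

```python
def dynamic_loss(e, warmup, loss_p_e, loss_g_e):
    """Epoch-gated combination: only L_p before `warmup` epochs, then L_g
    is phased in with a weight ramping linearly to 1 (illustrative ramp)."""
    w = 0.0 if e < warmup else min(1.0, (e - warmup) / warmup)
    return loss_p_e + w * loss_g_e
```

Early epochs optimize L_p alone, matching the observation that image-level part aggregation is easier to learn first; the graph constraint then enters gradually rather than abruptly.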
4.5 The final total loss is composed of L_dhhi, L_wsdr, L_tri, L_p and L_t, and is defined as follows:
L_total = L_dhhi + β·L_wsdr + L_tri + L_p + L_t
wherein β is a hyper-parameter controlling the contribution of the WSDR loss. The DHHI loss optimizes a parameter-sharing network with identity supervision, so that the network learns multi-modal identity-invariant features; the WSDR loss L_wsdr provides supervision to optimize the relative distances retrieved from the six views; and L_p and L_t learn intra-modality and inter-modality feature relationships at the pixel level and the image level, respectively, enhancing the feature representation. These components are jointly optimized for cross-modal pedestrian re-recognition model learning.
Claims (9)
1. A cross-mode pedestrian re-identification method based on spectrum sensing and attention mechanism is characterized in that:
the network model is jointly trained by adding an additional set of input consisting of visible light, infrared and homogeneous enhanced gray images and using two sets of input images. Further enhancing the utilization of features in the limited image;
and by utilizing the attention mechanism, the richer local features of the pedestrian image are obtained to realize the interactive fusion of the modal information, so that the inter-modal pedestrian re-recognition effect is improved.
2. The method for cross-modal pedestrian re-recognition based on spectrum sensing and attention mechanisms as claimed in claim 1, characterized in that the cross-modal pedestrian re-recognition model comprises: two groups of images for model input, one group consisting of visible light and infrared images, the other group consisting of visible light images, infrared images, and grayscale images generated from the visible light images; a ResNet-50 backbone network for feature extraction; an aggregated feature attention module for mining local features of the model; and a cross-modal adaptive graph structure module for interactive fusion of feature information under different modalities.
3. The method for cross-modal pedestrian re-recognition based on spectrum sensing and attention mechanisms as claimed in claim 2, characterized in that: both sets of input images of the ResNet-50 network model participate in the calculation of the label loss and the triplet loss; the network extracts and combines the features of images of different modalities by combining a single-stream network and a dual-stream network.
4. A method of cross-modal pedestrian re-recognition based on spectrum sensing and attention mechanisms as claimed in claim 3, characterized in that: for multi-modal classification, a multi-modal shared identity classifier is learned through a parameter-sharing network, and homogeneous and heterogeneous recognition losses are designed to mine shared identity-invariant representations; for multi-view retrieval, a weighted six-directional ranking loss is developed to optimize relative distances across multiple modalities; and to enhance robustness to modal changes, homogeneous invariant regularization is introduced.
5. The method for cross-modal pedestrian re-recognition based on spectrum sensing and attention mechanisms as claimed in claim 2, characterized in that: an intra-modal weighted aggregation (IWPA) module is utilized, which cooperates with three attention modules in a multi-module, multi-level embedding manner and is divided into two branches, one branch containing a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), and the other branch being a Position Attention Module (PAM). Given input features X, in the first branch X first passes through the CAM to obtain a feature map X_C, and then through the SAM to obtain a feature map X_S; in the other branch, X passes through the PAM to obtain a feature map X_P; finally, the obtained feature maps X_S and X_P are added, so that the final feature map is X_OUT = X_S + X_P.
6. The method for cross-modal pedestrian re-recognition based on spectrum sensing and attention mechanisms as claimed in claim 5, characterized in that: the attention feature map generated by the channel attention module is A_c = σ[W_1·F_MP; W_2·F_AP], wherein [;] represents a splice operation along the channel, σ represents the sigmoid function, F_MP and F_AP respectively represent the features after global max pooling and global average pooling, and W_1 and W_2 respectively represent the parameters of the fully connected layers; the attention feature map generated by the spatial attention module is A_s = σ(f^(1×1)([F_MP; F_AP])), wherein f^(1×1) represents a 1×1 convolution.
7. The method for cross-modal pedestrian re-recognition based on spectrum sensing and attention mechanisms as claimed in claim 2, characterized in that: the position attention map A_p is as follows:
a_{i,j} = exp(m(x_i)·n(x_j)) / Σ_k exp(m(x_i)·n(x_k)),  A_p = W_p ⊙ (a · o(X))
wherein a_{i,j} refers to the effect of position i on position j; m(X), n(X) and o(X) respectively represent dividing the feature map X into p non-overlapping parts and then applying the corresponding convolution; the outputs of m and n are multiplied to obtain the local attention map a; and W_p is a learnable weight vector representing the different parts.
8. The method for cross-modal pedestrian re-recognition based on spectrum sensing and attention mechanisms as claimed in claim 2, characterized in that: an adaptive graph structure constraint module is introduced to address the cross-modal pedestrian re-recognition problem and is mainly used to aggregate pedestrian samples belonging to the same individual across modalities; wherein the adaptive graph structure is expressed as a normalized adjacency matrix whose entries are determined by matching one-hot identity codes, A_g is the undirected graph given by the normalized adjacency matrix, l_i and l_j are the graph nodes and their corresponding one-hot identity codes, and I_K is the identity matrix formed by the nodes' self-connections.
9. The method for cross-modal pedestrian re-recognition based on spectrum sensing and attention mechanisms as claimed in claim 2, characterized in that: the dynamic multi-attention aggregation learning strategy treats L_p as the main loss and then gradually adds the learning loss L_g for optimization; the main reason is that the features produced by image-level part aggregation can be learned more easily in the early training phase; after the supervised network has been trained for a period of time, the cross-modal global feature learning loss L_g is introduced, so that the network can be further optimized without causing excessive oscillation; the dynamic multi-attention aggregation learning loss is expressed in terms of the training epoch e, the average loss value L̄^(e-1) of the previous training round, the intra-modality aggregated feature learning loss value L_p^e of the current round, and the cross-modal adaptive graph structure constraint value L_g^e of the current round.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310537794.4A CN116798070A (en) | 2023-05-15 | 2023-05-15 | Cross-mode pedestrian re-recognition method based on spectrum sensing and attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116798070A true CN116798070A (en) | 2023-09-22 |
Family
ID=88036970
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310537794.4A Pending CN116798070A (en) | 2023-05-15 | 2023-05-15 | Cross-mode pedestrian re-recognition method based on spectrum sensing and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116798070A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117542084A (en) * | 2023-12-06 | 2024-02-09 | 湖南大学 | Cross-modal pedestrian re-recognition method based on semantic perception |
CN117542084B (en) * | 2023-12-06 | 2024-08-20 | 湖南大学 | Cross-modal pedestrian re-recognition method based on semantic perception |
CN117746467A (en) * | 2024-01-05 | 2024-03-22 | 南京信息工程大学 | Modal enhancement and compensation cross-modal pedestrian re-recognition method |
CN117746467B (en) * | 2024-01-05 | 2024-05-28 | 南京信息工程大学 | Modal enhancement and compensation cross-modal pedestrian re-recognition method |
CN117935172A (en) * | 2024-03-21 | 2024-04-26 | 南京信息工程大学 | Visible light infrared pedestrian re-identification method and system based on spectral information filtering |
CN118230059A (en) * | 2024-04-15 | 2024-06-21 | 江苏优埃唯智能科技有限公司 | Abnormal state detection method for long-distance pipeline interior through correlation analysis of different spectrum data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109949317B (en) | Semi-supervised image example segmentation method based on gradual confrontation learning | |
CN110458844B (en) | Semantic segmentation method for low-illumination scene | |
CN108133188B (en) | Behavior identification method based on motion history image and convolutional neural network | |
US10002313B2 (en) | Deeply learned convolutional neural networks (CNNS) for object localization and classification | |
CN108460356B (en) | Face image automatic processing system based on monitoring system | |
US9965719B2 (en) | Subcategory-aware convolutional neural networks for object detection | |
CN111639544B (en) | Expression recognition method based on multi-branch cross-connection convolutional neural network | |
Yuan et al. | Gated CNN: Integrating multi-scale feature layers for object detection | |
Yu et al. | Multi-attribute adaptive aggregation transformer for vehicle re-identification | |
CN111639564B (en) | Video pedestrian re-identification method based on multi-attention heterogeneous network | |
CN116798070A (en) | Cross-mode pedestrian re-recognition method based on spectrum sensing and attention mechanism | |
Zhang et al. | Vehicle color recognition using Multiple-Layer Feature Representations of lightweight convolutional neural network | |
CN110853074B (en) | Video target detection network system for enhancing targets by utilizing optical flow | |
CN112395951B (en) | Complex scene-oriented domain-adaptive traffic target detection and identification method | |
CN111582092B (en) | Pedestrian abnormal behavior detection method based on human skeleton | |
CN114255403A (en) | Optical remote sensing image data processing method and system based on deep learning | |
Liu et al. | Robust salient object detection for RGB images | |
CN114663371A (en) | Image salient target detection method based on modal unique and common feature extraction | |
CN116342894A (en) | GIS infrared feature recognition system and method based on improved YOLOv5 | |
Lei et al. | Local and global feature learning with kernel scale-adaptive attention network for VHR remote sensing change detection | |
Alsanad et al. | Real-time fuel truck detection algorithm based on deep convolutional neural network | |
Niu et al. | Boundary-aware RGBD salient object detection with cross-modal feature sampling | |
Huang et al. | Pedestrian detection using RetinaNet with multi-branch structure and double pooling attention mechanism | |
Yang et al. | Fine-grained lip image segmentation using fuzzy logic and graph reasoning | |
CN113361475A (en) | Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||