CN116523969B - MSCFM and MGFE-based infrared-visible light cross-mode pedestrian re-identification method - Google Patents
- Publication number
- CN116523969B (application CN202310772990.XA)
- Authority
- CN
- China
- Legal status: Active
Classifications
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06V10/143 — Sensing or illuminating at different wavelengths
- G06V10/44 — Local feature extraction by analysis of parts of the pattern
- G06V10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- G06T2207/10048 — Infrared image
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- Y02T10/40 — Engine management systems
Abstract
The invention relates to an infrared-visible light cross-modality pedestrian re-identification method based on MSCFM and MGFE. The method comprises the following steps: constructing a visible light modality image set and an infrared modality image set; inputting images of the same pedestrian in the two modalities into a convolutional network simultaneously to extract feature maps; splicing the visible light modality feature map and the infrared modality feature map; mining the shared features of the same pedestrian across the two modalities; performing global context modeling on the pedestrian structure; splicing the resulting self-attention-enhanced feature maps in the batch dimension and feeding them into a shared feature embedding network for encoding; using a multi-granularity feature enhancement module to perform blocking operations of different granularities on the feature map, guiding the network to attend to the discriminative fine-grained information it contains, and performing iterative training to obtain the final model; and acquiring the image data to be identified and identifying it with the resulting pedestrian re-identification model.
Description
Technical Field
The invention relates to an infrared-visible light cross-mode pedestrian re-identification method based on MSCFM and MGFE, and belongs to the technical field of image identification.
Background
Pedestrian re-identification technology aims to solve the problem of retrieving pedestrians across cameras. In recent years it has produced many research results and plays an important role in practical applications. However, most pedestrian re-identification research so far retrieves pedestrians only in the visible light modality and cannot achieve good recognition performance in special environments such as insufficient illumination and night scenes. With the continuing spread of dual-mode surveillance systems, pedestrian video information can be acquired effectively through the visible light and infrared modes of such systems under both normal and insufficient illumination. To meet the requirements of specific scenes and environments, researchers have in recent years conducted a great deal of work on infrared-visible cross-modality pedestrian re-identification (VI-ReID). In VI-ReID, given a query image of a pedestrian in one modality, features are extracted by a Re-ID model and a pedestrian target with the same identity is then retrieved from a gallery image set in the other modality, completing the cross-modality pedestrian retrieval task. Because visible light and infrared images are formed by different imaging processes, pedestrian features differ greatly between the two modalities. In addition, within the same modality, the appearance of a pedestrian varies considerably under uncontrollable factors such as shooting viewpoint, posture change and occlusion. How to overcome these problems and enable the network to effectively extract discriminative pedestrian features in both modalities is therefore the key to infrared-visible cross-modality pedestrian re-identification research.
Summarizing existing work, methods for effectively extracting discriminative pedestrian features in the two modalities fall mainly into adversarial-generation-based methods, "intermediate modality"-guided methods, high-order semantic information mining methods, attention-mechanism-based methods, fine-grained pedestrian feature mining methods, and hybrids of the above. Adversarial-generation-based methods transfer modality information through an adversarial game between a generator and a discriminator, gradually eliminating the modality discrepancy. However, such methods depend heavily on the quality of the generated images and easily introduce extra modality information that aggravates the discrepancy, limiting further performance gains. "Intermediate modality"-guided methods reduce the gap between the infrared and visible modalities by introducing "intermediate modality" information as a bridge, and are a common VI-ReID approach. However, introducing "intermediate modality" information inevitably brings in interference information as well as additional computational cost.
High-order semantic information mining methods escape the influence of modality information by mining the high-order relations contained among pedestrian features of different modalities, such as structural relations and high-frequency information. These methods typically introduce additional auxiliary models when extracting high-order feature relations, for example using a keypoint model to obtain structural relation features. However, the auxiliary model is highly susceptible to real-world conditions in the samples such as occlusion and low resolution, so its detection accuracy is limited, which not only increases algorithmic complexity but also reduces the efficiency of training the model.
Attention-mechanism-based methods use an attention mechanism to guide the feature map to highlight important discriminative information and suppress unimportant information, thereby extracting modality-consistent features. However, in this process, features with weak attention scores and little discriminative power cause serious interference when extracting salient discriminative features.
Fine-grained pedestrian feature mining methods aim to solve the problem that uncontrollable factors within the same modality, such as shooting viewpoint, posture change and occlusion, prevent the model from extracting discriminative pedestrian information. Many works guide the network to focus on the discriminative information of each local feature by blocking the feature map. However, when the blocking scheme is too uniform, it is not conducive to extracting discriminative pedestrian features.
Disclosure of Invention
To solve the above problems, the invention provides an infrared-visible light cross-modality pedestrian re-identification method based on MSCFM and MGFE, which identifies pedestrians more accurately and improves recognition accuracy.
The technical scheme of the invention is as follows: an infrared-visible light cross-mode pedestrian re-identification method based on MSCFM and MGFE comprises the following steps:
step 1: constructing a visible light mode image set and an infrared mode image set;
step 2: images of the same pedestrian in two modes are simultaneously input into a convolution network to extract feature images;
step 3: splicing a visible light mode characteristic diagram and an infrared mode characteristic diagram;
step 4: using K-largest-value masked cross-attention to mine the shared features of the same pedestrian across the two modalities;
step 5: using a self-attention method to model the global context of the pedestrian structure within each modality;
step 6: splicing the obtained self-attention-enhanced feature graphs in batch dimensions, and sending the spliced feature graphs into a shared feature embedding network for encoding;
step 7: using the multi-granularity feature enhancement module to perform blocking operations of different granularities on the feature map, guiding the network to attend to the discriminative fine-grained information it contains, and obtaining the final pedestrian re-identification model;
step 8: and acquiring the image data to be identified, and identifying the image data to be identified by utilizing the finally obtained pedestrian re-identification model.
Further, in step 2, images of the same pedestrian in the two modalities must be input into the network together for processing; that is, the identities of the pedestrian images processed by the two convolutional branches at each step correspond one to one.
Further, in step 3, the query feature map Q_vis of the visible light modality and the key feature map K_ir of the infrared modality are combined by matrix multiplication to obtain the semantic relationship attention map A_cross:

A_cross = σ(Q_vis ⊗ (K_ir)^T)

where σ is the sigmoid activation function, (·)^T is the matrix transposition operation, ⊗ is the matrix multiplication operation, and each element of A_cross represents the semantic relationship between pixels of the query modality feature map and the queried modality feature map.
Further, in step 4, a top-K mask operation is performed on the element values of each row of the attention map; that is, the K largest semantic relation values in each row are retained and the values below the K-th largest are set to zero, so as to extract the salient cross-modality shared information of the pedestrian.
Further, in step 5, the enhanced cross-modality feature map output by the K-largest-value mask cross-attention module is sent into a self-attention module. This attention module obtains the relationships between pixel points in the feature map and applies the relationship weights back to the feature map, enhancing its discriminative features from a global perspective. For the enhanced visible light modality feature map F_vis, F_vis is input to a value convolution layer W_V, a query convolution layer W_Q and a key convolution layer W_K to obtain the value feature map V_vis, the query feature map Q_vis and the key feature map K_vis of the visible light modality. The query feature map Q_vis and the key feature map K_vis are matrix-multiplied to obtain the self-attention relationship map A_vis:

A_vis = σ(Q_vis ⊗ (K_vis)^T)

where σ is the sigmoid activation function, (·)^T is the matrix transposition operation, ⊗ is the matrix multiplication operation, and each row of A_vis represents the semantic relationships between the pixel points of the input feature map F_vis.

The self-attention relationship map A_vis is matrix-multiplied with the value feature map V_vis, the result is input into a convolution layer W_O to restore the original channel size, and a residual operation with the value feature map V_vis is then performed, finally yielding the self-attention-enhanced feature map F'_vis:

F'_vis = W_O(A_vis ⊗ V_vis) + V_vis

The infrared modality feature map F'_ir is obtained in the same way.
Further, in step 6, the modality salient consistency feature mining module (MSCFM) splices the mined visible light modality feature map F'_vis and infrared modality feature map F'_ir in the batch dimension, sends the result into a shared feature embedding network E for encoding, obtains the feature vector through a global average pooling layer GAP and a batch normalization layer BN, and applies an identity constraint with a cross-entropy loss:

L_mscfm = −(1/n) Σ_{i=1}^{n} y_i · log(p_i),  p_i = C(BN(GAP(E(x_i))))

where y_i is the true identity label corresponding to the i-th sample, p_i is the probability score obtained by the classifier for the i-th sample, C is the identity classifier of the MSCFM module, BN is the batch normalization layer and GAP is the global average pooling layer.
Further, in step 7, in order to extract comprehensive pedestrian discriminative features and alleviate intra-class differences within the same modality, the method provides a multi-granularity feature enhancement module (MGFE). The MGFE module takes the feature map encoded by the shared feature embedding network E and performs horizontal blocking operations of different granularities along the height dimension.
The beneficial effects of the invention are as follows:
1. A modality salient consistency feature mining module is provided. By performing a mask operation on the attention scores, it eliminates the adverse effects of weakly discriminative and non-discriminative information on cross-modality feature extraction, improves the salience of discriminative features in the attention scores, and realizes pedestrian feature extraction with salient cross-modality consistency;
2. A multi-granularity feature enhancement module is provided. It extracts multi-granularity local features with a diversified blocking scheme and adds an identity constraint for each local feature, eliminating the adverse effects of local semantic misalignment caused by occlusion, low resolution and similar problems and improving the identity discriminability of the features, thereby enhancing the discriminative features of pedestrians;
3. An effective inter-modality re-identification framework is constructed, realizing mutual re-identification between visible light and infrared pedestrian images;
4. The invention can be applied to pedestrian recognition and used in fields such as video surveillance, intelligent security and site management. By introducing the modality salient consistency feature mining module and the multi-granularity feature enhancement module, pedestrians can be identified more accurately and recognition accuracy is improved; compared with other existing pedestrian re-identification methods, the method achieves higher accuracy.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a general frame diagram of the present invention;
FIG. 3 is a flow chart of a mode salient consistency feature mining module of the present invention;
FIG. 4 is a flow chart of a multi-granularity feature enhancement module of the present invention;
FIG. 5 compares the distance distributions of positive and negative sample pairs for the baseline and the proposed method; the left side shows the experimental effect of the baseline and the right side that of the proposed method; the abscissa is the distance between the samples of a pair and the ordinate is the number of sample pairs;
FIG. 6 is a graph comparing performance of the proposed method on pedestrian retrieval results;
FIG. 7 is a class activation map comparison between the baseline and the proposed method.
Detailed Description
Example 1: as shown in fig. 1-7, an infrared-visible light cross-mode pedestrian re-identification method based on MSCFM and MGFE comprises the following steps:
step 1: constructing a visible light mode image set and an infrared mode image set; the picture data in the specific embodiment is taken from the common data sets SYSU-MM01 and RegDB;
SYSU-MM01 is a large public dataset proposed for infrared-visible cross-modality pedestrian re-identification research. The dataset is divided into a training set and a test set to train and evaluate the proposed cross-modality model. The training set comprises 22258 visible light images captured by 4 visible light cameras and 11909 infrared images captured by 2 infrared cameras, covering 395 pedestrians; the test set covers 96 pedestrians. In the test set, the query set consists of 3803 infrared images, and the gallery set is sampled randomly ten times, each gallery set containing 301 visible images. In the evaluation phase, the dataset has two test modes, all-search and indoor-search. In the all-search mode, the gallery images are taken by both indoor and outdoor cameras; in the indoor-search mode, the gallery uses only images taken by indoor cameras. The final performance on this dataset is the average over 10 test experiments.
The RegDB dataset consists of 8240 images of 254 female and 158 male identities, each pedestrian having 10 infrared and 10 visible light images. In the training phase, a randomly selected set of 206 identities is used for training and the remaining 206 identities are used for testing. In the evaluation phase, 10 random experiments are repeated and the average of the 10 runs is taken as the final performance.
Step 2: images of the same pedestrian in the two modalities are input into the convolutional network simultaneously to extract feature maps. The method adopts a dual-stream network framework, so in step 2 images of the same pedestrian in the two modalities must be input into the network together; that is, the identities of the pedestrian images processed by the two convolutional branches at each step correspond one to one.
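The identity-aligned pairing described above can be sketched as a simple batch sampler. The helper below is hypothetical (the patent gives no code); NumPy arrays stand in for images, and `vis_by_id`/`ir_by_id` are assumed to map each identity to its stack of images:

```python
import numpy as np

def build_paired_batch(vis_by_id, ir_by_id, ids, n_per_id, rng):
    """Sample a batch in which the visible and infrared streams carry the
    same identities in one-to-one correspondence (hypothetical helper)."""
    vis_batch, ir_batch, labels = [], [], []
    for pid in ids:
        v_idx = rng.choice(len(vis_by_id[pid]), n_per_id, replace=True)
        r_idx = rng.choice(len(ir_by_id[pid]), n_per_id, replace=True)
        vis_batch.extend(vis_by_id[pid][i] for i in v_idx)
        ir_batch.extend(ir_by_id[pid][i] for i in r_idx)
        labels.extend([pid] * n_per_id)
    # both streams share the same label vector, so identities line up
    return np.stack(vis_batch), np.stack(ir_batch), np.array(labels)
```

Feeding both returned stacks through the two convolutional branches keeps the per-position identities identical, as step 2 requires.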
Step 3: splicing a visible light mode characteristic diagram and an infrared mode characteristic diagram;
in step 3, the query feature map Q_vis of the visible light modality and the key feature map K_ir of the infrared modality are combined by matrix multiplication to obtain the semantic relationship attention map A_cross:

A_cross = σ(Q_vis ⊗ (K_ir)^T)

where σ is the sigmoid activation function, (·)^T is the matrix transposition operation, ⊗ is the matrix multiplication operation, and each element of A_cross represents the semantic relationship between pixels of the query modality feature map and the queried modality feature map.
Step 4: the K large-value mask cross attention is utilized to mine sharing characteristics of the same pedestrian under different modes;
In step 4, a top-K mask operation is performed on the element values of each row; that is, the K largest semantic relation values in each row are retained and the remaining values are set to zero, extracting the salient cross-modality shared information of the pedestrian. During training, the K value ranges from 0 to 2592.
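A minimal NumPy sketch of the K-largest-value mask described above, assuming flattened feature-map pixels as rows and a sigmoid relation map; all function and variable names are illustrative, not from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def topk_mask_cross_attention(q_vis, k_ir, v_ir, k):
    """K-largest-value masked cross-attention (illustrative sketch).
    q_vis: (N, C) query pixels from the visible feature map;
    k_ir, v_ir: (M, C) key/value pixels from the infrared feature map."""
    attn = sigmoid(q_vis @ k_ir.T)               # semantic relation map, (N, M)
    kth = -np.sort(-attn, axis=1)[:, k - 1:k]    # k-th largest value per row
    masked = np.where(attn >= kth, attn, 0.0)    # keep top-k, zero the rest
    return masked, masked @ v_ir                 # mask and aggregated features
```

Zeroing everything below the k-th largest relation value per row is exactly the "retain the top K, set the rest to zero" rule of step 4.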
Step 5: using a self-attention method to model the global context relation of the pedestrian structure under the same mode;
Further, in step 5, the enhanced cross-modality feature map output by the K-largest-value mask cross-attention module is sent into a self-attention module. This attention module obtains the relationships between pixel points in the feature map and applies the relationship weights back to the feature map, enhancing its discriminative features from a global perspective. Taking the enhanced visible light modality feature map F_vis as an example, F_vis is input to a value convolution layer W_V, a query convolution layer W_Q and a key convolution layer W_K to obtain the value feature map V_vis, the query feature map Q_vis and the key feature map K_vis of the visible light modality. The query feature map Q_vis and the key feature map K_vis are matrix-multiplied to obtain the self-attention relationship map A_vis:

A_vis = σ(Q_vis ⊗ (K_vis)^T)

where σ is the sigmoid activation function, (·)^T is the matrix transposition operation, ⊗ is the matrix multiplication operation, and each row of A_vis represents the semantic relationships between the pixel points of the input feature map F_vis.

The self-attention relationship map A_vis is matrix-multiplied with the value feature map V_vis, the result is input into a convolution layer W_O to restore the original channel size, and a residual operation with the value feature map V_vis is then performed, finally yielding the self-attention-enhanced feature map F'_vis:

F'_vis = W_O(A_vis ⊗ V_vis) + V_vis

Similarly, the infrared modality yields:

F'_ir = W_O(A_ir ⊗ V_ir) + V_ir

where V_ir is the value feature map in the infrared modality and A_ir is the self-attention relationship map in the infrared modality.
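The self-attention step above can be sketched with plain weight matrices standing in for the 1×1 convolution layers; this is a hypothetical NumPy illustration under those assumptions, not the patent's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_attention_enhance(x, w_q, w_k, w_v, w_out):
    """Global self-attention with a residual to the value map, following the
    description above. x: (N, C) flattened feature-map pixels; w_*: (C, C)
    projections standing in for the query/key/value/output conv layers."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    a = sigmoid(q @ k.T)          # pixel-to-pixel self-attention relation map
    out = (a @ v) @ w_out         # restore the original channel size
    return out + v                # residual operation with the value map
```

Applying the same function to the infrared feature map with its own projections gives the enhanced infrared map.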
step 6: splicing the obtained self-attention-enhanced feature graphs in batch dimensions, and sending the spliced feature graphs into a shared feature embedding network for encoding;
Further, in step 6, the modality salient consistency feature mining module (MSCFM) splices the mined visible light modality feature map F'_vis and infrared modality feature map F'_ir in the batch dimension, sends the result into a shared feature embedding network E for encoding, obtains the feature vector through a global average pooling layer GAP and a batch normalization layer BN, and applies an identity constraint with a cross-entropy loss:

L_mscfm = −(1/n) Σ_{i=1}^{n} y_i · log(p_i),  p_i = C(BN(GAP(E(x_i))))

where y_i is the true identity label corresponding to the i-th sample, p_i is the probability score obtained by the classifier for the i-th sample, C is the identity classifier of the MSCFM module, BN is the batch normalization layer and GAP is the global average pooling layer.
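A small NumPy sketch of the GAP / BN / cross-entropy chain used for the identity constraint; the exact layer order, the absence of affine BN parameters, and the classifier form are assumptions, since the text only names the components:

```python
import numpy as np

def gap(feat):
    """Global average pooling over spatial dims: (B, C, H, W) -> (B, C)."""
    return feat.mean(axis=(2, 3))

def batchnorm(x, eps=1e-5):
    """Batch normalization without affine parameters (simplified)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def cross_entropy(logits, labels):
    """Identity cross-entropy loss averaged over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()
```

With a linear identity classifier `C`, the MSCFM loss would be `cross_entropy(C(batchnorm(gap(feat))), labels)`.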
Step 7: in order to extract comprehensive pedestrian discriminative features and alleviate intra-class differences within the same modality, the method provides a multi-granularity feature enhancement module (MGFE). The module performs blocking operations of different granularities on the feature map to guide the network to attend to the discriminative fine-grained information it contains, and the final pedestrian re-identification model is obtained.
The multi-granularity feature enhancement module MGFE takes the feature map F encoded by the shared feature embedding network E and performs horizontal blocking operations of different granularities along the height dimension, obtaining f^l, f^m_j and f^s_j, where l, m and s indicate dividing the feature map F horizontally into 1, 3 and 6 blocks respectively and j is the block index.
The blocked-feature convolution channel of the MGFE module is set to 256 and the batch size is set to 96, composed of pedestrian images in the two modalities, each modality containing 48 images of 6 pedestrians. In addition, all network parameters in the experiments are optimized with an SGD optimizer combined with a warmup strategy, where momentum is set to 0.9 and weight_decay to 5×10^-4; training lasts 60 epochs in total.
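The horizontal blocking at granularities 1, 3 and 6 can be illustrated as follows; this NumPy sketch pools each strip by averaging, which is an assumption (the patent does not state the pooling used per block):

```python
import numpy as np

def horizontal_blocks(feat, n_blocks):
    """Split a (B, C, H, W) feature map into n_blocks horizontal strips and
    average-pool each strip (H is assumed divisible by n_blocks)."""
    b, c, h, w = feat.shape
    strips = feat.reshape(b, c, n_blocks, h // n_blocks, w)
    return strips.mean(axis=(3, 4))              # (B, C, n_blocks)

def multi_granularity(feat):
    """Blocking at the three granularities 1, 3 and 6 described for MGFE."""
    return {n: horizontal_blocks(feat, n) for n in (1, 3, 6)}
```

Each of the resulting 1 + 3 + 6 strip features would then receive its own identity constraint, as the loss description below lists.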
Step 8: and acquiring the image data to be identified, and identifying the image data to be identified by utilizing the finally obtained pedestrian re-identification model.
The overall loss function of the invention is:

L_total = λ_0·L_mscfm + λ_1·L_id^1 + λ_2·L_id^3 + λ_3·L_id^6 + L_tri^1 + L_tri^6

where λ_0, λ_1, λ_2 and λ_3 are the weights of the identity losses of the respective modules, L_mscfm is the cross-entropy loss constraining the shared feature vector, L_id^1, L_id^3 and L_id^6 are the cross-entropy losses when the number of horizontal blocks is 1, 3 and 6 respectively, and L_tri^1 and L_tri^6 are the triplet losses when the number of horizontal blocks is 1 and 6 respectively. Experimental tests showed that a triplet loss on the 3-block partition has little influence on the result, so it is omitted.
In the present invention, two common pedestrian re-identification indicators are used to evaluate experimental performance, namely the Cumulative Matching Curve (CMC) and the mean average precision (mAP).
The Cumulative Matching Curve (CMC), also known as the Rank-K curve, measures the hit rate within the top K retrieval results. For example, Rank-1 reflects the probability that the top-ranked result is the search target, and Rank-5 reflects the probability that the search target is contained in the first 5 retrieval results.
The mean average precision (mAP) is the mean of the average precision over all query images and evaluates the overall effect of the pedestrian re-identification algorithm: AP denotes the average precision of one query sample and measures the model's effect on that sample, while mAP is the mean of the APs of all query samples and measures the model's overall effect on all queries.
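The two metrics above can be computed as follows. This is a minimal sketch of the standard Rank-K and mAP definitions; the function names and the list-of-rankings input format are assumptions for illustration, where each ranking is one query's gallery identity list ordered by predicted similarity.

```python
# Hedged sketch of the evaluation metrics: Rank-K hit rate (points on the
# CMC curve) and mean average precision (mAP).

def rank_k(rankings, true_ids, k):
    """Fraction of queries whose true identity appears in the top K results."""
    hits = sum(1 for r, t in zip(rankings, true_ids) if t in r[:k])
    return hits / len(rankings)

def average_precision(ranking, true_id):
    """AP for one query: mean precision at each position holding a hit."""
    hits, precisions = 0, []
    for pos, gallery_id in enumerate(ranking, start=1):
        if gallery_id == true_id:
            hits += 1
            precisions.append(hits / pos)
    return sum(precisions) / hits if hits else 0.0

def mean_ap(rankings, true_ids):
    """mAP: mean of the per-query APs."""
    aps = [average_precision(r, t) for r, t in zip(rankings, true_ids)]
    return sum(aps) / len(aps)
```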
Table 1 shows the comparison with other methods on the SYSU-MM01 dataset
As shown in Table 1, where (19), (20), (21) and (22) in the left column denote the years 2019, 2020, 2021 and 2022 respectively, the performance of the proposed method exceeds that of the other methods in both the all-search and indoor-search modes. Specifically, in the all-search mode the proposed method achieves 64.40% Rank-1 and 60.70% mAP, exceeding the suboptimal method DTRM by 1.37% and 2.07% respectively. In the indoor-search mode it achieves 68.08% on the Rank-1 indicator and 72.26% on the mAP indicator, higher than the suboptimal method DTRM by 1.73% and 0.5% respectively.
To verify the performance of the proposed method more fully, it is also compared with other methods on the RegDB dataset.
Table 2 shows the comparison with other methods on the RegDB dataset
Overall, the method of the present invention achieves significant advantages under both the "visible light querying infrared" (Visible To Infrared) and "infrared querying visible light" (Infrared To Visible) experimental settings. Under the Visible To Infrared setting, the method reaches 95.45% on Rank-1 and 99.02% on mAP, exceeding the suboptimal method DTRM by 13.22% and 20.57% respectively. Under the Infrared To Visible setting, it obtains 94.20% and 98.81% recognition precision on Rank-1 and mAP respectively, exceeding the suboptimal method DTRM by 15.27% and 23.23%.
FIG. 5 compares the feature distance distributions of positive and negative sample pairs for the baseline (left) and the proposed method (right). As shown in FIG. 5, the overlap between the positive-pair and negative-pair distance distributions is smaller for the proposed method, indicating that it extracts discriminative, modality-consistent pedestrian features.
FIG. 6 compares pedestrian retrieval results of the proposed method. In FIGS. 6 and 7, the face regions are intentionally blurred to avoid showing clear faces; this blurring is not an effect of applying the invention and should not be taken as a judgment of its performance. As shown in FIG. 6, given the same pedestrian query image, after adding part6 and part3 the top-5 results of "Base+MGFE" are more accurate than those of the Base method; after further adding the MSCFM module on top of Base+MGFE, the top 4 results are all hits. Here part6 and part3 denote the cases where the multi-granularity feature enhancement module (MGFE) splits the feature map horizontally into 6 and 3 blocks respectively.
FIG. 7 compares the class activation maps of the baseline and the proposed method; the method visualizes the pedestrian regions attended to by the model through class activation mapping. As shown in FIG. 7, compared with the baseline, the proposed method prompts the network to attend to more comprehensive pedestrian identification features in both the visible light and infrared modes, whereas the baseline network is disturbed by the background and ultimately fails to extract discriminative pedestrian features.
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (5)
1. The infrared-visible light cross-mode pedestrian re-identification method based on MSCFM and MGFE, characterized by comprising the following steps:
step 1: constructing a visible light mode image set and an infrared mode image set;
step 2: images of the same pedestrian in two modes are simultaneously input into a convolution network to extract feature images;
step 3: splicing a visible light mode characteristic diagram and an infrared mode characteristic diagram;
step 4: the K large-value mask cross attention is utilized to mine sharing characteristics of the same pedestrian under different modes;
step 5: using a self-attention method to model the global context relation of the pedestrian structure under the same mode;
step 6: splicing the obtained self-attention-enhanced feature graphs in batch dimensions, and sending the spliced feature graphs into a shared feature embedding network for encoding;
step 7: using a multi-granularity feature enhancement module to perform blocking operations of different granularities on the feature map, guiding the network to attend to the discriminative fine-grained information contained in the feature map, and obtaining a final pedestrian re-identification model;
step 8: acquiring image data to be identified, and identifying the image data to be identified by utilizing the finally obtained pedestrian re-identification model;
in the step 3, the query feature map Q_v of the visible light mode and the key feature map K_i of the infrared mode are subjected to matrix multiplication to obtain a semantic relationship attention map A, the formula being as follows:

A = sigmoid(Q_v · K_i^T);

wherein sigmoid is the sigmoid activation function, T denotes the matrix transposition operation, and · denotes matrix multiplication; each row of A represents the semantic relation between the pixel points of the query-mode feature map and those of the queried-mode feature map;
in the step 4, a K-value masking operation is performed on the element values of each row: the semantic relation values of the top K elements in each row are retained and the values below the K-th largest are set to zero, so as to extract the salient cross-modal shared information of pedestrians; during training, the K value ranges from 0 to 2592.
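The per-row K-large-value masking of step 4 can be sketched as below. This is an illustrative toy on a small list-based attention map; the function names and the tie-handling choice (keeping all values tied with the K-th largest) are assumptions.

```python
# Hedged sketch of K-large-value masking: in each row of the semantic
# relationship attention map, keep the K largest values and zero the rest,
# so only salient cross-modal correspondences survive.

def topk_mask_row(row, k):
    if k >= len(row):
        return list(row)
    threshold = sorted(row, reverse=True)[k - 1]
    return [v if v >= threshold else 0.0 for v in row]

def topk_mask(attention_map, k):
    return [topk_mask_row(row, k) for row in attention_map]

attn = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
print(topk_mask(attn, 2))  # [[0.9, 0.0, 0.5], [0.0, 0.8, 0.3]]
```

In the patent, K is a tunable hyperparameter with a range of 0 to 2592 during training; here it is simply a function argument.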
2. The infrared-visible cross-modal pedestrian re-identification method based on MSCFM and MGFE of claim 1, wherein: in the step 2, the images of the same pedestrian in two modes need to be input into the network together for processing, namely, the identities of the pedestrian images processed by the two-path convolution network each time are in one-to-one correspondence.
3. The infrared-visible cross-modal pedestrian re-identification method based on MSCFM and MGFE of claim 1, wherein: in the step 5, the enhanced cross-modal feature map output by the K-value mask cross-attention module is sent into a self-attention module, which acquires the relations between pixel points in the feature map and applies the relation weights to the feature map, thereby enhancing its discriminative features from a global perspective; for the enhanced visible light modal feature map F_v, F_v is input into a value convolution layer W_v, a query convolution layer W_q and a key convolution layer W_k to obtain a value feature map V_v, a query feature map Q_v and a key feature map K_v; the query feature map Q_v and the key feature map K_v of the visible light mode are subjected to matrix multiplication to obtain the self-attention relationship map S_v, the formula being as follows:

S_v = sigmoid(Q_v · K_v^T);

wherein sigmoid is the sigmoid activation function, T denotes the matrix transposition operation, and · denotes matrix multiplication; each row of S_v represents the semantic relations between the pixel points of the input feature map F_v;

the self-attention relationship map S_v and the value feature map V_v are subjected to matrix multiplication, the result is input into a convolution layer to restore the original channel size, and a residual operation is then performed with the value feature map V_v, finally yielding the self-attention-enhanced feature map F'_v; similarly, F'_i is obtained for the infrared mode.
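The sigmoid-activated relation map of claim 3 can be sketched in miniature as follows. This is an illustrative pure-Python toy: real feature maps would be projected by 1x1 convolutions, whereas here the query and key matrices are given directly (rows = pixels, columns = channels), and all names are assumptions.

```python
import math

# Hedged sketch of the self-attention relation map: multiply query and key
# matrices and pass the result through a sigmoid, giving pairwise semantic
# relations between pixel positions within one modality.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matmul_t(q, k):
    # Q (n x c) times K^T (c x n) -> n x n relation matrix
    return [[sum(qi * ki for qi, ki in zip(qrow, krow)) for krow in k]
            for qrow in q]

def self_attention_map(q, k):
    return [[sigmoid(v) for v in row] for row in matmul_t(q, k)]

# Two "pixels" with orthogonal 2-channel features relate strongly to
# themselves (sigmoid(1)) and neutrally to each other (sigmoid(0) = 0.5).
S = self_attention_map([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```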
4. The infrared-visible cross-modal pedestrian re-identification method based on MSCFM and MGFE of claim 1, wherein: in the step 6, the modal salient consistency feature mining module splices the feature map F'_v of the visible light mode and the feature map F'_i of the infrared light mode in the batch dimension and sends the result into the shared feature embedding network E for encoding; the feature vector f is obtained through a batch normalization layer BN and a global average pooling layer GAP, and identity constraint is applied with a cross-entropy loss, the loss function being:

L_id = -(1/N) · Σ_{i=1}^{N} y_i · log(p_i), with p_i = W(BN(GAP(f_i)));

wherein y_i is the true identity label corresponding to the i-th sample, p_i denotes the probability score of the i-th sample through the classifier, W is the identity classifier of the MSCFM module, BN denotes the batch normalization layer, and GAP denotes the global average pooling layer.
5. The infrared-visible cross-modal pedestrian re-identification method based on MSCFM and MGFE of claim 1, wherein: in the step 7, the multi-granularity feature enhancement module MGFE performs horizontal blocking operations of different granularities, in the height dimension, on the feature map obtained by encoding with the shared feature embedding network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310772990.XA CN116523969B (en) | 2023-06-28 | 2023-06-28 | MSCFM and MGFE-based infrared-visible light cross-mode pedestrian re-identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116523969A CN116523969A (en) | 2023-08-01 |
CN116523969B true CN116523969B (en) | 2023-10-03 |
Family
ID=87406666
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310772990.XA Active CN116523969B (en) | 2023-06-28 | 2023-06-28 | MSCFM and MGFE-based infrared-visible light cross-mode pedestrian re-identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116523969B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112446305A (en) * | 2020-11-10 | 2021-03-05 | 云南联合视觉科技有限公司 | Pedestrian re-identification method based on classification weight equidistant distribution loss model |
CN114220124A (en) * | 2021-12-16 | 2022-03-22 | 华南农业大学 | Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system |
CN114550315A (en) * | 2022-01-24 | 2022-05-27 | 云南联合视觉科技有限公司 | Identity comparison and identification method and device and terminal equipment |
CN115100678A (en) * | 2022-06-10 | 2022-09-23 | 河南大学 | Cross-modal pedestrian re-identification method based on channel recombination and attention mechanism |
CN115331112A (en) * | 2022-08-30 | 2022-11-11 | 中国电子科技集团公司第三十八研究所 | Infrared and visible light image fusion method and system based on multi-granularity word elements |
CN116311364A (en) * | 2023-03-07 | 2023-06-23 | 西安电子科技大学 | Multispectral pedestrian detection method based on cross-modal feature enhancement and confidence fusion |
Non-Patent Citations (3)
Title |
---|
AFT: Adaptive Fusion Transformer for visible and infrared images; Chang ZH et al.; IEEE Transactions on Image Processing; 2077–2092 *
Infrared dim and small targets detection via self-attention mechanism and pipeline correlator; Yong Lan et al.; Digital Signal Processing; 1–10 *
Occluded Visible-Infrared Person Re-Identification; Yujian Feng et al.; IEEE Transactions on Multimedia; 1401–1413 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110717411A (en) | Pedestrian re-identification method based on deep layer feature fusion | |
CN110598543B (en) | Model training method based on attribute mining and reasoning and pedestrian re-identification method | |
CN112597866B (en) | Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method | |
WO2021082168A1 (en) | Method for matching specific target object in scene image | |
CN114067444A (en) | Face spoofing detection method and system based on meta-pseudo label and illumination invariant feature | |
CN114973317A (en) | Pedestrian re-identification method based on multi-scale adjacent interaction features | |
CN115841683B (en) | Lightweight pedestrian re-identification method combining multi-level features | |
CN112084895B (en) | Pedestrian re-identification method based on deep learning | |
CN116523969B (en) | MSCFM and MGFE-based infrared-visible light cross-mode pedestrian re-identification method | |
CN111241943B (en) | Scene recognition and loopback detection method based on background target and triple loss | |
CN115830637B (en) | Method for re-identifying blocked pedestrians based on attitude estimation and background suppression | |
CN115830643A (en) | Light-weight pedestrian re-identification method for posture-guided alignment | |
CN115050044B (en) | Cross-modal pedestrian re-identification method based on MLP-Mixer | |
CN116246305A (en) | Pedestrian retrieval method based on hybrid component transformation network | |
CN107358200B (en) | Multi-camera non-overlapping vision field pedestrian matching method based on sparse learning | |
CN112417961B (en) | Sea surface target detection method based on scene prior knowledge | |
CN114972146A (en) | Image fusion method and device based on generation countermeasure type double-channel weight distribution | |
Wu et al. | Research on license plate detection algorithm based on ssd | |
CN115690669A (en) | Cross-modal re-identification method based on feature separation and causal comparison loss | |
CN112487927A (en) | Indoor scene recognition implementation method and system based on object associated attention | |
Niu et al. | Real-time recognition and location of indoor objects | |
CN117078967B (en) | Efficient and lightweight multi-scale pedestrian re-identification method | |
CN114581984B (en) | Mask face recognition algorithm based on low-rank attention mechanism | |
CN117351518B (en) | Method and system for identifying unsupervised cross-modal pedestrian based on level difference | |
CN117351533A (en) | Attention knowledge distillation-based lightweight pedestrian re-identification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||