CN114550208A - Cross-modal pedestrian re-identification method based on global level and local level combined constraint - Google Patents
Cross-modal pedestrian re-identification method based on global level and local level combined constraint
- Publication number: CN114550208A (application CN202210123546.0A)
- Authority
- CN
- China
- Prior art keywords
- local
- global
- level
- pedestrian
- loss
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—Physics; G06—Computing; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture; G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses a cross-modal pedestrian re-identification method based on joint global-level and local-level constraints. First, a residual network with a non-local attention mechanism is proposed to extract shared features and reduce the cross-modal difference. Second, a joint constraint at the global level and the local level is proposed, which improves robustness to noise such as background and occlusion and reduces the intra-modal difference. In particular, a partitioning strategy for the local features avoids the non-local attention mechanism's lack of position correlation, further improving the robustness of the model.
Description
Technical Field
The invention relates to the technical fields of artificial intelligence and computer vision, and in particular to a cross-modal pedestrian re-identification method based on joint global-level and local-level constraints.
Background
Pedestrian re-identification aims to recognize a specific pedestrian in images captured by cameras with non-overlapping fields of view. It can be regarded as a sub-problem of image retrieval and has broad application prospects in intelligent surveillance. For example, a canteen may have cameras covering the dining area, the entrances and exits, and the periphery of the building, with no overlap between their fields of view; given an image of a pedestrian, the task is to search the footage captured by all surveillance cameras in the canteen and retrieve the images showing the same pedestrian.
Early studies treated pedestrian re-identification as a single-modality problem that only works during the day: the task assumes all pedestrian images are color images. In a real-world environment this assumption is idealized, because a color camera cannot capture clear pedestrian images under low illumination. In the canteen example above, if only color cameras were used, images captured in the daytime would be of good quality while those captured at night would be poor; in particular, the cameras deployed around the entrances and the outside of the building could not obtain clear pedestrian images due to insufficient illumination.
As the technology developed, most surveillance devices became equipped with both a color camera and an infrared camera: the color camera captures pedestrians under good illumination, while the infrared camera captures them in low-illumination environments. In the canteen example, at night the cameras around the entrances and the outside of the building face poor illumination, and their infrared cameras can be used to capture pedestrians instead. This frees the pedestrian re-identification task from the illumination limitation and gives rise to the cross-modal task based on the color and infrared modalities, namely visible-infrared person re-identification (VI-ReID). For instance, if a suspicious pedestrian is captured by a surveillance camera at night and one must determine whether that pedestrian was present in the canteen during the day, the infrared image must be matched against the set of color images captured in the daytime. Visible-infrared pedestrian re-identification is thus a multi-modal recognition problem and poses greater challenges than the single-modality one.
Fig. 1 shows some color-modality and infrared-modality pedestrian images from the SYSU-MM01 dataset, covering two identities; each row contains five color images and five infrared images of the same identity. A color-modality pedestrian image has 3 channels while an infrared one contains only 1. Moreover, the two image types cover different wavelength ranges, and the infrared image lacks important information such as color and exposure level. As can be seen from Fig. 1, one of the challenges of visible-infrared pedestrian re-identification is the huge cross-modal difference.
Fig. 2 shows images of the same pedestrian taken by all color-modality cameras in the SYSU-MM01 dataset, which contains four color cameras in total. Because the cameras have different viewpoints, pedestrians take different poses, and noise such as background clutter and occlusion is present, the intra-modal difference is large: images of the same identity can differ more than images of different identities, i.e. the intra-class difference can exceed the inter-class difference, which makes the visible-infrared re-identification task challenging.
In recent years, many visible-infrared pedestrian re-identification methods have been proposed. Wu et al. [1] proposed three network architectures for the task, a one-stream network, a two-stream network, and an asymmetric fully-connected layer, together with a deep zero-padding method that makes the number of channels in the color modality equal to that in the infrared modality, reducing the cross-modal difference. However, the method lacks a distance-metric learning process, which hurts identification accuracy, and it does not address the intra-modal difference.
Ye et al. [2] therefore used a dual-constrained top-ranking loss based on a two-stream network to address the cross-modal difference, and further reduced the intra-class difference with an identity loss function that includes a distance-learning process. However, the method presumes that the data distributions of the training samples and the test samples are consistent. In a practical cross-modal pedestrian re-identification task, the two distributions often differ, because images of different modalities cannot be captured by the same device at the same time.
Wu et al. [3] then considered the case where the training and test data distributions differ: they converted the shared-knowledge-mining problem in cross-modal matching into a cross-modal similarity-preservation problem, using the similarity of samples within the same modality to constrain the similarity of samples across modalities, further alleviating the cross-modal difference.
Further, Ye et al. [4] designed a baseline with a non-local attention mechanism, generalized-mean pooling, and a weighted regularization triplet loss. The non-local attention mechanism aggregates information from middle and high layers and enhances the discriminative ability of the features; thanks to it, the shared features obtained by this model are more global and capture longer-range dependencies than those of Wu et al. [3]. However, the model extracts shared features from global features only; since samples contain much interfering noise such as background and occlusion, a model using only global features is not robust to such noise. Moreover, the non-local attention mechanism adopted there attends only to pixel-level global correlation without considering position correlation, so noise can also receive global attention. All of this makes it harder for the model to learn discriminative global features.
Reference to the literature
[1] Wu A, Zheng W S, Yu H X, et al. RGB-Infrared Cross-Modality Person Re-identification. Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Venice, Italy, 2017: 5390-.
[2] Ye M, Wang Z, Lan X, et al. Visible Thermal Person Re-Identification via Dual-Constrained Top-Ranking. Proc. 27th Int. Joint Conf. Artif. Intell. (IJCAI), Stockholm, Sweden: International Joint Conferences on Artificial Intelligence Organization, 2018: 1092-.
[3] Wu A, Zheng W S, Gong S G, et al. RGB-IR Person Re-identification by Cross-Modality Similarity Preservation. International Journal of Computer Vision, 2020, 128(6): 1765-.
[4] Ye M, Shen J, Lin G, et al. Deep Learning for Person Re-identification: A Survey and Outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021: 1-25.
Disclosure of Invention
The purpose of the invention is as follows: most existing visible-infrared pedestrian re-identification methods extract the shared features of the two modalities through global representation learning and are not robust to noise such as background and occlusion. To improve the model's robustness to such noise and enhance the expressive power of the features, the invention provides a joint constraint model based on global features and local features.
The technical scheme is as follows: to achieve the above purpose, the invention proposes a visible-infrared pedestrian re-identification method that reduces both the cross-modal difference and the intra-modal difference in the cross-modal re-identification task. Specifically, the invention provides an end-to-end feature-learning framework, GLoC-Net (Global-level and Local-level Constraints Network), built on a two-stream network structure; the framework of the network is shown in Fig. 3.
The GLoC-Net training flow chart is shown in Fig. 4. Training uses mini-batches: each time, p pedestrians are selected at random, and for each of them k color images and k infrared images are selected at random. The training process is described below using the input of one color image and one infrared image (i.e. p = 1, k = 1) as an example:
Step 1: input 1 color image and 1 infrared image into the GLoC-Net model; go to Step 2;
Step 2: using a ResNet50 network with non-local attention blocks, generate shared features rich in global information for the images of the two modalities input in Step 1; go to Step 3;
Step 3: extract the corresponding global features from the shared features; go to Step 4;
Step 4: extract the corresponding local features from the shared features; go to Step 5;
Step 5: apply the joint global-level and local-level constraints to the global features from Step 3 and the local features from Step 4, and update the model parameters by back-propagation; go to Step 6;
Step 6: if the specified number of training epochs has been reached, go to Step 7; otherwise return to Step 1 and continue training;
Step 7: end.
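The mini-batch sampling behind the loop above (p identities, k color plus k infrared images each) can be sketched as follows. This is a minimal illustration, not the patent's code; the dictionary layout and identity names are synthetic.

```python
import random

def sample_batch(color_by_id, infrared_by_id, p, k, rng=random.Random(0)):
    """Randomly pick p identities, then k color and k infrared images each,
    yielding a batch of 2*p*k (identity, modality, image) samples."""
    ids = rng.sample(sorted(color_by_id), p)
    batch = []
    for pid in ids:
        batch += [(pid, 'rgb', im) for im in rng.sample(color_by_id[pid], k)]
        batch += [(pid, 'ir', im) for im in rng.sample(infrared_by_id[pid], k)]
    return batch

# synthetic dataset: 8 identities, 10 images per modality each
color = {i: [f"c{i}_{j}" for j in range(10)] for i in range(8)}
infra = {i: [f"i{i}_{j}" for j in range(10)] for i in range(8)}
batch = sample_batch(color, infra, p=4, k=2)
assert len(batch) == 2 * 4 * 2
```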
Preferably, the color and infrared images in Step 1 both come from a standard cross-modal pedestrian re-identification dataset such as SYSU-MM01 or RegDB. A color image consists of the three primary colors red (R), green (G), and blue (B) and has three channels, one per primary color, while an infrared image records the heat radiated by objects and, in the present invention, takes the form of a single channel. Suppose there are n1 color pedestrian images and n2 infrared pedestrian images in total; the color-modality samples can be written as {x_i^RGB} (i = 1, ..., n1) and the infrared-modality samples as {x_j^IR} (j = 1, ..., n2), where x_i^RGB denotes the i-th color pedestrian image, x_j^IR the j-th infrared pedestrian image, and y_i^RGB and y_j^IR their corresponding identities. Each color and infrared image undergoes zero-padded random cropping and random horizontal flipping, finally forming a pedestrian image of height 256 and width 128. Consistent with the above, the working principle of the invention during training is described using the input of a color-modality sample x_i^RGB and an infrared-modality sample x_j^IR as an example.
Preferably, the method adopts a two-stream network structure and extracts shared features using a Residual Network with a Non-Local Attention Mechanism. ResNet50, a typical residual network, contains 50 layers with two-dimensional convolution operations; adopting it increases the depth of the network and improves the feature expression ability. Meanwhile, the non-local attention mechanism is embedded into the ResNet50 network in the form of non-local attention blocks, which enlarges the receptive field of the features and enriches them with global information. The structure of the non-local attention block is shown in Fig. 5.
Preferably, the shared features in Step 2 of the training flow are extracted as follows:
Step 2-1: feed the color image x_i^RGB and the infrared image x_j^IR into the network; one convolutional layer extracts the shallow features of the two modalities, f_i^{Ori-RGB} and f_j^{Ori-IR};
Step 2-2: feed f_i^{Ori-RGB} and f_j^{Ori-IR} into a network composed of the last four convolutional stages of ResNet50 and the non-local attention blocks, forming the shared features of the two modalities, f_i^{Share-RGB} and f_j^{Share-IR}.
preferably, in step 2-1, the convolution layers through which the images of the two modalities pass have the same structure and different parameters.
Preferably, the invention embeds two non-local attention blocks at the second and third of the four convolutional stages of ResNet50, respectively, to form the shared-feature extraction network of Step 2-2.
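The non-local attention block of Fig. 5 computes a pixel-level affinity over all spatial positions and adds the attended response back through a residual connection. A NumPy sketch under simplifying assumptions (1×1 convolutions modeled as channel-wise matrix multiplications, random weights standing in for learned ones):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Non-local attention over a (C, H, W) feature map.
    theta/phi/g are 1x1-conv projections; attn is an N x N pixel affinity."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                       # C x N
    theta, phi, g = w_theta @ flat, w_phi @ flat, w_g @ flat
    attn = softmax(theta.T @ phi, axis=-1)           # N x N, rows sum to 1
    y = g @ attn.T                                   # attended response, C' x N
    return x + (w_z @ y).reshape(c, h, w)            # residual connection

rng = np.random.default_rng(0)
c, c_half = 8, 4
x = rng.standard_normal((c, 6, 3))
w_theta, w_phi, w_g = (rng.standard_normal((c_half, c)) * 0.1 for _ in range(3))
w_z = rng.standard_normal((c, c_half)) * 0.1
out = non_local_block(x, w_theta, w_phi, w_g, w_z)
assert out.shape == x.shape  # block preserves the feature-map shape
```

Note how the attention is global: every output position mixes information from all H×W positions, which is exactly the "lack of position correlation" the local partitioning strategy later compensates for.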
Preferably, the global features in Step 3 of the training flow are extracted as follows:
Step 3-1: apply global average pooling to the shared features f_i^{Share-RGB} and f_j^{Share-IR} to obtain the quasi-global features f_i^{GP-RGB} and f_j^{GP-IR}, which are rich in global information;
Step 3-2: feed f_i^{GP-RGB} and f_j^{GP-IR} into a BN (batch normalization) layer to generate the final global features f_i^{G-RGB} and f_j^{G-IR}.
preferably, in step 3-2, the BN layer is used to make the data distribution approximate to a normal distribution, thereby avoiding the problem of gradient disappearance.
Preferably, the local features in Step 4 of the training flow are extracted as follows:
Step 4-1: use a 1 × 1 convolution to reduce the number of channels of the shared features f_i^{Share-RGB} and f_j^{Share-IR} to one quarter of the original, obtaining f_i^{Share-RGB'} and f_j^{Share-IR'};
Step 4-2: divide f_i^{Share-RGB'} and f_j^{Share-IR'} into four parts and apply average pooling to each part, obtaining the local feature groups f_i^{LP-RGB} and f_j^{LP-IR}, each composed of four local feature blocks;
Step 4-3: feed the local feature groups f_i^{LP-RGB} and f_j^{LP-IR} into the corresponding BN layers, then concatenate the local feature blocks of each group after the BN layers to obtain the final local features f_i^{L-RGB} and f_j^{L-IR}.
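The reduce–partition–pool–concatenate pipeline of Steps 4-1 to 4-3 can be sketched as below (the BN step is omitted for brevity; the 1×1 convolution is modeled as a channel matrix multiplication, and horizontal stripes are assumed as the partition direction, which is common for pedestrian images but not stated explicitly in the text):

```python
import numpy as np

def local_features(x, w_reduce, n_parts=4):
    """1x1 conv (channel reduction to C/4), split the height into stripes,
    average-pool each stripe, then concatenate the local blocks."""
    c, h, w = x.shape
    reduced = (w_reduce @ x.reshape(c, -1)).reshape(-1, h, w)  # C/4 x H x W
    stripes = np.array_split(reduced, n_parts, axis=1)
    blocks = [s.mean(axis=(1, 2)) for s in stripes]            # each (C/4,)
    return blocks, np.concatenate(blocks)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 12, 4))        # shared feature map, C=8
w = rng.standard_normal((2, 8))            # reduces channels 8 -> 2
blocks, local = local_features(x, w)
assert len(blocks) == 4 and blocks[0].shape == (2,)
assert local.shape == (8,)                 # 4 blocks x C/4 channels
```

Each block only sees its own stripe, which is what gives the local features the position awareness that pixel-level non-local attention lacks.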
preferably, the global level and local level joint constraints in step 5 of the training flow of the present invention are composed of global level constraint penalties, local level constraint penalties, and local feature constraint global feature penalties.
Preferably, the global-level constraint loss and the local-level constraint loss in Step 5 of the training flow are both based on the hard triplet loss and the identity loss for the VI-ReID task.
Preferably, the hard triplet loss for the VI-ReID task proposed by the invention extends the traditional triplet loss to consider both the color and infrared modalities and adds a hard-sampling process, which not only broadens the applicability of the triplet loss but also improves the training speed of the model and the accuracy of the retrieval task. The hard triplet loss mines hard samples of the two modalities, selecting the positive pair that is hardest to match and the negative pair that is easiest to match, and computes the loss from them. The distance between two feature vectors is measured with the Euclidean distance, as shown in Equation 1, where f_1 and f_2 are feature vectors of pedestrian images.
D(f_1, f_2) = ||f_1 − f_2||_2    (1)
Suppose P pedestrian identities are selected from the training set and K color and K infrared pedestrian images are selected at random for each identity, so that each batch contains 2PK pedestrian images. The hard triplet loss for the VI-ReID task is shown in Equation 2, where f denotes the set of feature vectors and the anchor image is drawn from the pedestrian images of both modalities; the positive is a color- or infrared-modality pedestrian image with the same pedestrian identity as the anchor, and the negative is one with a different identity. When the Euclidean distance of the hard positive pair plus ρ is smaller than the Euclidean distance of the hard negative pair, the anchor image can be correctly matched against all pedestrian images in the batch; ρ is a manually set margin parameter.
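The formula image for Equation 2 did not survive extraction; the description matches the standard batch-hard triplet loss, a sketch of which follows (treating the 2PK features of both modalities as one batch, as the text describes):

```python
import numpy as np

def hard_triplet_loss(feats, labels, rho=0.3):
    """Batch-hard triplet loss: for each anchor take the farthest positive
    and the nearest negative, hinge with margin rho (standard form of Eq. 2)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    n = len(feats)
    losses = []
    for i in range(n):
        hardest_pos = d[i][same[i] & (np.arange(n) != i)].max()
        hardest_neg = d[i][~same[i]].min()
        losses.append(max(0.0, rho + hardest_pos - hardest_neg))
    return float(np.mean(losses))

feats = np.array([[0.0], [0.1], [1.0], [1.1]])   # two tight identity clusters
labels = np.array([0, 0, 1, 1])
# each anchor's farthest positive (0.1) plus rho=0.3 is below its nearest
# negative (>= 0.9), so every hinge term is zero
assert hard_triplet_loss(feats, labels, rho=0.3) == 0.0
```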
Preferably, the invention proposes an identity loss for the VI-ReID task. Compared with the traditional identity loss, it considers both the color and infrared modalities, making it suitable for VI-ReID. As with the hard triplet loss for the cross-modal re-identification task, suppose P identities are selected from the training set and K color and K infrared pedestrian images are selected at random for each identity, so that each batch contains 2PK pedestrian images. The identity loss for the VI-ReID task is shown in Equation 3, where f denotes the set of feature vectors and p(y_i | f_i) is the probability, computed by the softmax function, that the model predicts feature vector f_i as identity y_i.
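The image for Equation 3 was also lost; the description corresponds to the usual softmax cross-entropy over identities, sketched here in NumPy (logits standing in for the classifier output on f_i):

```python
import numpy as np

def identity_loss(logits, labels):
    """Softmax cross-entropy over pedestrian identities (standard form of Eq. 3):
    mean of -log p(y_i | f_i) over the batch, computed in a numerically
    stable way via the log-sum-exp trick."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(labels)), labels].mean())

logits = np.array([[10.0, 0.0], [0.0, 10.0]])    # confident, correct
labels = np.array([0, 1])
assert identity_loss(logits, labels) < 1e-3
# a totally uninformative classifier over 4 identities gives loss = log 4
assert np.isclose(identity_loss(np.zeros((2, 4)), np.array([0, 1])), np.log(4))
```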
Preferably, the global-level constraint loss proposed by the invention applies the VI-ReID hard triplet loss and identity loss to the global features to guarantee their effectiveness. It uses the quasi-global features f_i^{GP-RGB} and f_j^{GP-IR} of Step 3-1 and the global features f_i^{G-RGB} and f_j^{G-IR} of Step 3-2; its expression is shown in Equation 4.
Preferably, the local-level constraint loss proposed by the invention is analogous to the global-level one: it constrains the local features with the VI-ReID hard triplet loss and identity loss to guarantee their effectiveness. It uses the local feature groups f_i^{LP-RGB} and f_j^{LP-IR} of Step 4-2 and the local features f_i^{L-RGB} and f_j^{L-IR} of Step 4-3; its expression is shown in Equation 5.
Preferably, the local-feature-constrains-global-feature loss is designed on the global features f_i^{G-RGB} and f_j^{G-IR} of Step 3-2 and the local features f_i^{L-RGB} and f_j^{L-IR} of Step 4-3. The spatial correlation between the global feature and the local feature is measured by computing the mean square error between them; for two feature vectors f_1 and f_2 its expression is shown in Equation 6.
L_MSE(f_1, f_2) = (||f_1 − f_2||_2)^2    (6)
Constraining the global feature toward the local feature with this mean square error strengthens the spatial correlation between them, making the global feature attend more to the regions covered by the local feature blocks. Since each local feature block contains information from a different region, noise such as background and occlusion is split across regions, and after operations such as average pooling each block carries less noise from its region than the global feature does; this better suppresses the influence of noise and improves the model's robustness to it. Finally, the local-feature-constrains-global-feature loss is shown in Equation 7.
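Given Equation 6, a per-pair sketch of this constraint is trivial; note the assumption here that the global and concatenated local features have matching dimensionality, which the architecture's 4 × (C/4) concatenation provides:

```python
import numpy as np

def lcg_loss(global_feat, local_feat):
    """Local-feature-constrains-global-feature term for one image:
    squared Euclidean distance (Eq. 6) between the global feature and
    the concatenated local feature."""
    return float(np.sum((global_feat - local_feat) ** 2))

g = np.array([1.0, 2.0, 3.0])
l = np.array([1.0, 2.0, 5.0])
assert lcg_loss(g, l) == 4.0   # (3-5)^2
assert lcg_loss(g, g) == 0.0   # identical features incur no penalty
```
In Equation 7 this term would be accumulated over the samples of both modalities in the batch.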
Preferably, the joint global-level and local-level constraint of the invention finally combines the global-level loss, the local-level loss, and the local-feature-constrains-global-feature loss, as shown in Equation 8, where λ is a predefined trade-off parameter that balances the local-feature-constrains-global-feature loss.
L_GLoC = L_Global + L_Local + λ · L_LCG    (8)
The test process of the invention is as follows:
step 1: inputting a query set (query set) and a gallery set (gallery set), and entering step 2;
step 2: performing feature extraction on all pedestrian images of the query set (query set) and the gallery set (galery set) input in the step 1 by using a model obtained in a training process, and entering a step 3;
and step 3: calculating the similarity between the characteristics of the query set and the characteristics of the image library set, and entering step 4;
and 4, step 4: according to the similarity, obtaining a matching result corresponding to each pedestrian image in the query set, and entering step 5;
and 5: and (6) ending.
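The feature-matching part of the test flow (similarity plus ranking) can be sketched as follows. L2-normalizing the features before the dot product is an assumption made for the sketch, not something the text specifies:

```python
import numpy as np

def rank_gallery(query_feats, gallery_feats):
    """Return, per query, gallery indices sorted by descending similarity
    (dot product of L2-normalized features)."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                       # queries x gallery similarity matrix
    return np.argsort(-sim, axis=1)    # best match first

queries = np.array([[1.0, 0.0], [0.0, 1.0]])
gallery = np.array([[0.0, 1.0], [1.0, 0.1]])
ranks = rank_gallery(queries, gallery)
assert list(ranks[0]) == [1, 0]   # first query is closest to gallery image 1
assert list(ranks[1]) == [0, 1]
```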
Preferably, in Step 1 of the test flow the query set is the set of pedestrian images to be queried, and the gallery set is the set of pedestrian images against which the query set is matched.
Preferably, in Step 2 of the test flow the GLoC-Net model performs only global feature extraction and uses the global features as the final representation.
Preferably, the similarity in Step 3 of the test flow is computed as a dot-product similarity.
Preferably, in Step 4 of the test flow each image in the query set is matched against multiple images from the gallery set, and the evaluation indices are the Cumulative Matching Characteristic (CMC) and the mean Average Precision (mAP). The Rank-k accuracy in the CMC measures the probability that a correct cross-modal pedestrian image appears among the first k retrieval results, and the mAP reflects the average retrieval performance of the method.
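The two evaluation indices can be computed from the ranked identity lists as below. This is a simplified sketch: real re-identification evaluation additionally filters same-camera matches, which is omitted here.

```python
import numpy as np

def cmc_rank_k(ranked_ids, query_ids, k):
    """Fraction of queries whose correct identity appears in the top k."""
    hits = [qid in row[:k] for row, qid in zip(ranked_ids, query_ids)]
    return float(np.mean(hits))

def mean_average_precision(ranked_ids, query_ids):
    """mAP over queries; each correct gallery match contributes its
    precision at the position where it is retrieved."""
    aps = []
    for row, qid in zip(ranked_ids, query_ids):
        hits = np.where(np.asarray(row) == qid)[0]
        precisions = [(n + 1) / (pos + 1) for n, pos in enumerate(hits)]
        aps.append(np.mean(precisions))
    return float(np.mean(aps))

ranked = [[3, 1, 1], [2, 5, 2]]   # gallery identities in ranked order
qids = [1, 2]
assert cmc_rank_k(ranked, qids, 1) == 0.5   # only the second query hits at rank 1
assert cmc_rank_k(ranked, qids, 2) == 1.0
# query 1: hits at ranks 2,3 -> AP = (1/2 + 2/3) / 2 = 7/12
# query 2: hits at ranks 1,3 -> AP = (1/1 + 2/3) / 2 = 5/6
assert np.isclose(mean_average_precision(ranked, qids), (7/12 + 5/6) / 2)
```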
Beneficial effects: the invention provides a global-level and local-level constraint network that learns discriminative feature representations. First, a residual network based on a non-local attention mechanism extracts shared features and reduces the cross-modal difference. Second, a joint constraint at the global and local levels improves robustness to noise such as background and occlusion and reduces the intra-modal difference. In particular, the partitioning strategy for local features avoids the non-local attention mechanism's lack of position correlation and further improves the robustness of the model. Under the combined indoor and outdoor scenes of the public SYSU-MM01 dataset, the first-match correct recognition rate (Rank-1) and mean Average Precision (mAP) of the method increase by 3.29% and 2.35% over the previous best scheme; in the visible-to-thermal mode of the public RegDB dataset, Rank-1 and mAP increase by 4.25% and 6.23% over the previous best scheme; in the thermal-to-visible mode of RegDB, Rank-1 and mAP increase by 1.83% and 4.68% over the previous best scheme.
Drawings
FIG. 1 is an image of a pedestrian in a color mode and an infrared mode;
FIG. 2 is an image of a pedestrian taken by different color cameras for the same pedestrian;
FIG. 3 is a GLoC-Net network framework;
FIG. 4 is a flow chart of a training phase of the present invention;
FIG. 5 is a non-local attention block diagram;
FIG. 6 is a result of matching the AGW algorithm with the method of the present invention on the SYSU-MM01 dataset;
FIG. 7 is the results of matching the AGW algorithm with the method of the present invention on the RegDB dataset in different modes, wherein (a) is the results of matching the AGW algorithm with the method of the present invention on the RegDB dataset in Visible to Thermal mode, and (b) is the results of matching the AGW algorithm with the method of the present invention on the RegDB dataset in Thermal to Visible mode.
Detailed Description
The technical solutions of the present invention will be described in further detail with reference to the accompanying drawings, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this embodiment, the model parameters are updated by mini-batch gradient descent: at each gradient step a small batch of samples is randomly selected to update the parameters. The experiment sets the batch size to 8, i.e. 8 pedestrian identities are randomly selected from the whole dataset, and 4 color images and 4 infrared images are selected at random for each identity, so each batch contains 32 color images and 32 infrared images, which facilitates the constraint containing the hard-mining triplet loss. Because an infrared image has only a single channel, the experiment expands it to the same three channels as the color image, i.e. the single channel is replicated into three channels of equal values. All images undergo zero-padded random cropping and random horizontal flipping, further enhancing the generalization ability of the experiment. Images of both modalities are finally cropped to a size of 256 × 128, and image normalization follows the ImageNet normalization standard.
The loss function is set to the joint global-level and local-level constraint L_GLoC described in the summary of the invention, with the local-feature-constrains-global-feature parameter λ set to 5, ρ set to 0.3, P set to 8, and K set to 4. On each dataset the experiment sets the initial learning rate to 0.01. Optimization uses stochastic gradient descent (SGD) with momentum 0.9, a warm-up learning-rate strategy is adopted for the first 10 epochs, and training runs for 80 epochs in total. The learning rate Learning_rate(epoch) varies with the training epoch as shown in Equation 9.
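The image for Equation 9 did not survive extraction. A schedule consistent with the stated facts (initial rate 0.01, 10 warm-up epochs, 80 total epochs) is sketched below; the step-decay epochs 20 and 50 follow the common AGW-style schedule and are assumptions, not the patent's stated values.

```python
def learning_rate(epoch, base_lr=0.01):
    """Warm-up for the first 10 epochs, then step decay (decay points
    at epochs 20 and 50 are assumed, AGW-style)."""
    if epoch < 10:
        return base_lr * (epoch + 1) / 10   # linear warm-up
    if epoch < 20:
        return base_lr
    if epoch < 50:
        return base_lr * 0.1
    return base_lr * 0.01

assert abs(learning_rate(0) - 0.001) < 1e-12   # warm-up start
assert learning_rate(15) == 0.01               # full rate after warm-up
assert abs(learning_rate(79) - 0.0001) < 1e-12 # final decayed rate
```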
Example 1:
This embodiment uses the SYSU-MM01 public dataset to complete the cross-modal pedestrian re-identification task and test the performance of the model.
The SYSU-MM01 dataset is the first standard dataset in the field of cross-modal pedestrian re-identification; it was collected with 4 color cameras and 2 near-infrared cameras. The training data contain 395 pedestrians, with 22258 color images and 11909 infrared images; each pedestrian is captured by at least two cameras at different viewing angles and positions. The test data contain an additional 95 pedestrians.
The test data include two evaluation modes, All-search and Indoor-search. The query sets of the two modes are identical, containing 3803 infrared images captured by the two near-infrared cameras, but their gallery sets differ. In All-search mode, the gallery set contains all color images captured by the 4 color cameras; in Indoor-search mode, it contains only the color images captured by the color cameras in the 2 indoor rooms. Overall, All-search is more challenging.
The test data support two gallery-construction modes, single-shot and multi-shot, in which 1 or 10 images of each pedestrian, respectively, are randomly selected when building the gallery set.
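The single-shot and multi-shot gallery construction can be sketched as follows (names are illustrative):

```python
import random

def build_gallery(images_by_id, shots=1, seed=None):
    """Randomly keep `shots` images per identity when building the gallery
    set: shots=1 gives single-shot, shots=10 gives multi-shot."""
    rng = random.Random(seed)
    gallery = []
    for pid, imgs in sorted(images_by_id.items()):
        n = min(shots, len(imgs))  # identities with fewer images keep all
        gallery += [(pid, im) for im in rng.sample(imgs, n)]
    return gallery
```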
This experiment evaluates the SYSU-MM01 dataset in the most difficult setting, i.e. the single-shot construction mode, under both the All-search and Indoor-search evaluation modes. The performance of each method on the SYSU-MM01 dataset is compared in Table 1.
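Rank-k and mAP, the metrics reported in Table 1, can be computed from a query-to-gallery distance matrix as in this minimal pure-Python sketch (it assumes every query identity appears at least once in the gallery):

```python
def cmc_and_map(dist, q_ids, g_ids, topk=(1, 10, 20)):
    """dist[i][j]: distance from query i to gallery j.
    Returns ({k: Rank-k accuracy}, mAP)."""
    cmc = {k: 0 for k in topk}
    aps = []
    for i, row in enumerate(dist):
        # rank gallery by ascending distance to this query
        order = sorted(range(len(row)), key=lambda j: row[j])
        matches = [g_ids[j] == q_ids[i] for j in order]
        first = matches.index(True)           # rank of first correct match
        for k in topk:
            if first < k:
                cmc[k] += 1
        # average precision: mean of precision at each correct hit
        hits, prec = 0, []
        for rank, m in enumerate(matches, 1):
            if m:
                hits += 1
                prec.append(hits / rank)
        aps.append(sum(prec) / len(prec))
    n = len(dist)
    return {k: cmc[k] / n for k in topk}, sum(aps) / n
```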
TABLE 1 comparison of Performance of the method of the present invention on the SYSU-MM01 dataset with other methods
Compared with BDTR, which uses only distance metric learning, the method of the present invention achieves higher recognition performance: the Rank-1 and mAP values are improved by 23.47% and 22.68%, respectively, showing that the non-local attention mechanism makes the shared features more global.
In the comparative experiments of Table 1, the best method other than the method of the present invention is the AGW algorithm. The AGW algorithm extracts shared features with a non-local attention mechanism and improves model performance with generalized-mean pooling and a weighted-regularization triplet loss. The Rank-1 and mAP values of the method of the present invention are 3.25% and 2.35% higher than those of AGW, respectively.
To compare further with the AGW algorithm, we contrast the retrieval results of AGW and the method of the present invention on the SYSU-MM01 dataset. Since only the Thermal to Visible mode is included in the test phase, we randomly pick 3 examples for comparison, as shown in fig. 6, where a green box denotes a correct match and a red box a wrong match. It can be seen that, owing to noise such as background and occlusion in pedestrian images, the AGW algorithm, which considers only global features, is not robust to image noise. Each part of the local features designed by the present invention contains only the image information of its corresponding region, so the global influence of noise on a sample is reduced. The spatial correlation between local and global features is used to constrain the global features, so the global features inherit the advantages of the local features, their robustness to noise is improved, and the lack of position correlation caused by the non-local attention mechanism is remedied.
Example 2:
This embodiment uses the RegDB data set to complete the cross-modal pedestrian re-identification task and test the performance of the model.
The RegDB data set is a small-scale data set collected by a dual-mode camera system consisting of a color camera and a far-infrared camera. Since the contours of the color and infrared images in the RegDB dataset are very similar, cross-modal matching is less difficult. The data set contains 412 pedestrians, each with 10 color images and 10 infrared images. In the experiment, 206 pedestrians (2060 images) are randomly selected for training, and the remaining 206 pedestrians (2060 images) are used for testing. This experiment evaluates two retrieval modes: color image retrieving infrared image (Visible to Thermal) and infrared image retrieving color image (Thermal to Visible).
In this experiment, the data set was randomly split into training and test sets 10 times, and the average accuracy is reported. The comparison of the methods on the RegDB data set is shown in table 2.
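The repeated random-split protocol above can be sketched generically; `eval_fn` is a placeholder for training and testing the model on one split:

```python
import random

def repeated_split_eval(ids, eval_fn, trials=10, seed=0):
    """Randomly split the identity list in half `trials` times and average
    the accuracy returned by eval_fn(train_ids, test_ids)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2       # 206 of RegDB's 412 identities
        scores.append(eval_fn(shuffled[:half], shuffled[half:]))
    return sum(scores) / trials
```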
TABLE 2 comparison of Performance of the method of the present invention on RegDB data sets with other methods
Compared with AlignGAN, which uses both a generative adversarial network and distance metric learning, the method of the present invention does not adopt a generative adversarial network, which introduces extra noise, but instead adopts the global-level and local-level joint constraint, so the features finally extracted by the model are more robust to noise. The Rank-1 and mAP values in the Visible to Thermal mode are improved by 16.4% and 19%, respectively; the Rank-1 and mAP values in the Thermal to Visible mode are improved by 16.02% and 17.18%, respectively.
In the comparative experiments of Table 2, the best method other than the method of the present invention is again the AGW algorithm. The method of the present invention improves the Rank-1 and mAP values by 4.25% and 6.23% in the color-retrieves-infrared mode and by 1.83% and 4.68% in the infrared-retrieves-color mode, but is slightly lower than AGW at Rank-10. The reason may be that, in the infrared-retrieves-color mode, the infrared image contains less information than the color image, so the shared features extracted from it carry less information.
To compare further with the AGW algorithm, we contrast the retrieval results of AGW and the method of the present invention on the RegDB dataset. The matching modes include the Visible to Thermal mode and the Thermal to Visible mode; we randomly select 3 examples for comparison, as shown in fig. 7.
In fig. 7(a), the query image is a color image, which contains abundant color and texture information but also noise such as background and occlusion, and the matching result of the AGW algorithm, which considers only global features, is poor. In contrast, the method of the present invention matches well, further illustrating that the global-level and local-level joint constraint adopted by the present invention is more robust to noise such as background and occlusion.
In fig. 7(b), the query image is an infrared image. Since the infrared image lacks color information, recognition is much harder. In this mode, the AGW algorithm mainly retrieves pedestrian images with similar poses but different identities. The method of the present invention can match the corresponding color pedestrian image from the limited pedestrian structure information, further showing that the relationship it establishes between the different modalities of the same pedestrian identity is firmer.
It can be seen that the method of the present invention performs better than the AGW algorithm in both matching modes.
Example 3:
This embodiment describes an application scenario of the present invention.
Recently, a theft occurred in an office building of a school, and the police obtained a picture of the suspect from the surveillance video at the scene. The picture is an infrared image. Because it was shot at night, it is not sharp and lacks color information, which makes identifying the suspect harder. The method proposed by the present invention can solve this problem.
First, security personnel obtain the surveillance images of the area to be searched, crop out the pedestrian images appearing in them using pedestrian-detection techniques, and name each image with its recorded time, camera number, and so on.
Then, these images are used as the gallery set and the picture of the suspect as the query set, and both are input into the model proposed by the present invention.
Next, the model outputs a ranked sequence of images; the shooting time and camera number are read from each image name, giving the times and places where the pedestrian appeared.
Finally, the time-and-place records of the suspect are sorted chronologically to obtain the suspect's timeline for the police to use in solving the case.
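The last two steps of this scenario — reading the time and camera from each retrieved image name and sorting chronologically — can be sketched as follows; the filename format is an assumption for illustration:

```python
def track_from_filenames(matches):
    """matches: retrieved image names of the form 'YYYYmmdd-HHMMSS_camNN.jpg'
    (an assumed naming convention). Returns (time, camera) pairs sorted
    chronologically, i.e. the person's timeline."""
    events = []
    for name in matches:
        stamp, cam = name.rsplit(".", 1)[0].split("_")
        events.append((stamp, cam))
    return sorted(events)  # timestamps sort lexicographically in this format
```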
Example 4:
This embodiment describes an application scenario of the present invention.
Many people appear in a given program video. Quickly locating key personnel is an indispensable task for video editors, and screening by eye is time-consuming and laborious. The method proposed by the present invention can solve this problem.
First, using pedestrian-detection techniques, the images of pedestrians appearing in the program video are cropped out, and the time recorded in each image is used as the image name.
Next, these images are used as the gallery set and the image of the target person as the query set, and both are input into the model proposed by the present invention.
Then, the model outputs a ranked sequence of images; the time node in the program video is read from each image name, giving the times at which the person appears.
Finally, the appearances are sorted chronologically to obtain the person's timeline for the video editor to use.
Example 5:
This embodiment describes an application scenario of the present invention.
The outbreak of the COVID-19 epidemic concerns everyone. China now implements a normalized epidemic prevention and control policy to guard against imported cases and domestic resurgence. Recently, a residential community received a notice that one of its residents is a close contact of an asymptomatic case, and the resident's travel track over recent days must be determined. Since the resident cannot provide a specific travel route, the resident's whereabouts must be ascertained from the community's surveillance images. Screening the surveillance images by eye is time-consuming and laborious and would reduce the efficiency of epidemic prevention and control. The method proposed by the present invention can solve this problem.
First, community staff obtain the surveillance images of the community, crop out the pedestrian images appearing in them using pedestrian-detection techniques, and name each image with its recorded time, camera number, and so on.
Next, these images are used as the gallery set and the images of the close-contact resident as the query set, and both are input into the model proposed by the present invention.
Then, the model outputs a ranked sequence of images; the shooting time and camera number are read from each image name, giving the times and places where the resident appeared.
Finally, the time-and-place records of the resident are sorted chronologically to obtain the resident's timeline for the community staff and epidemic-prevention staff to use.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (10)
1. A cross-modal pedestrian re-identification method based on global level and local level combined constraint is characterized by comprising the following steps:
step 1: inputting images of the two modalities into the GLoC-Net model, and proceeding to step 2;
step 2: generating globally rich shared features from the images of the two modalities input in step 1, and proceeding to step 3;
step 3: extracting the corresponding global features from the shared features, and proceeding to step 4;
step 4: extracting the corresponding local features from the shared features, and proceeding to step 5;
step 5: applying the global-level and local-level joint constraint to the global features obtained in step 3 and the local features obtained in step 4, updating the model parameters by back propagation, and proceeding to step 6;
step 6: if the specified number of training epochs has been reached, ending the training; otherwise, returning to step 1.
2. The method for cross-modal pedestrian re-identification based on global-level and local-level joint constraints as claimed in claim 1, wherein in step 2, the images of the two modalities input in step 1 are turned into globally rich shared features using a ResNet50 network and non-local attention blocks; the specific steps are as follows:
step 2-1: inputting the images of the two modalities into the network, and extracting the shallow features f_i^{Ori-RGB} and f_i^{Ori-IR} of the two modalities through one convolutional layer each;
step 2-2: inputting f_i^{Ori-RGB} and f_i^{Ori-IR} simultaneously into a network consisting of the last four convolutional stages of ResNet50 with embedded non-local attention blocks, forming the shared features f_i^{Share-RGB} and f_i^{Share-IR} of the two modalities.
3. the method for cross-modal pedestrian re-identification based on global-level and local-level joint constraints as claimed in claim 2, wherein in step 2-1, the structures of the convolutional layers through which the images of the two modalities pass are the same and the parameters are different.
4. The method of claim 2, wherein two non-local attention blocks are embedded at the second and third convolutional layers of the four convolutional layers of the ResNet50, and are combined into the shared feature extraction network of step 2-2.
5. The method for re-identifying the pedestrian across the modal states based on the global level and the local level joint constraint of claim 1, wherein the step of extracting the global features in the step 3 is as follows:
step 3-1: performing global average pooling on the shared features f_i^{Share-RGB} and f_i^{Share-IR} obtained in step 2 to obtain the globally rich quasi-global features f_i^{GP-RGB} and f_i^{GP-IR};
6. The method for cross-modal pedestrian re-identification based on global-level and local-level joint constraints as claimed in claim 5, wherein in step 3-2, the BN layer used makes the data distribution approximately normal, avoiding the vanishing-gradient problem.
7. The method for re-identifying the pedestrian across the modal states based on the global level and the local level joint constraint of claim 1, wherein the step of extracting the local features in the step 4 is as follows:
step 4-1: using a 1 × 1 convolution to reduce the number of channels of the shared features f_i^{Share-RGB} and f_i^{Share-IR} obtained in step 2 to one quarter of the original, obtaining f_i^{Share-RGB'} and f_i^{Share-IR'};
step 4-2: dividing f_i^{Share-RGB'} and f_i^{Share-IR'} into four equal parts and performing average pooling on each part, obtaining the local feature groups f_i^{LP-RGB} and f_i^{LP-IR}, each composed of four local feature blocks;
step 4-3: inputting the local feature groups f_i^{LP-RGB} and f_i^{LP-IR} into the corresponding BN layers, and splicing the local feature blocks of each group after the BN layers to obtain the final local features of the two modalities.
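As a rough illustration of the local branch in claim 7 (step 4-2), the following sketch splits a feature map into four horizontal stripes and average-pools each; the 1 × 1 channel-reduction convolution and BN layers are omitted, and the nested-list representation is purely illustrative:

```python
def local_branch(feat):
    """feat: feature map as nested list [C][H][W], H divisible by 4.
    Splits the height into four equal stripes and average-pools each,
    returning four local feature blocks of C values each."""
    h = len(feat[0])
    stripe = h // 4
    blocks = []
    for s in range(4):
        rows = range(s * stripe, (s + 1) * stripe)
        block = []
        for ch in feat:
            vals = [v for r in rows for v in ch[r]]
            block.append(sum(vals) / len(vals))  # mean over the stripe
        blocks.append(block)
    return blocks
```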
8. The method for cross-modal pedestrian re-identification based on global-level and local-level joint constraints as claimed in claim 1, wherein the global-level and local-level joint constraint in step 5 consists of a global-level constraint loss, a local-level constraint loss, and a local-feature-constraining-global-feature loss.
9. The method for cross-modal pedestrian re-identification based on global-level and local-level joint constraints as claimed in claim 1, wherein the global-level constraint loss and the local-level constraint loss in step 5 are both based on the hard triplet loss and identity loss of the VI-ReID task.
10. The method for cross-modal pedestrian re-identification based on global-level and local-level joint constraints as claimed in claim 9, wherein the VI-ReID-task-based hard triplet loss and identity loss account for the two different modalities on the basis of the conventional triplet loss and identity loss, adding a hard-sample-mining process; hard mining is performed on the hard samples of the two modalities, selecting the positive sample pair that is hardest to match and the negative sample pair that is easiest to mismatch, and the loss is computed from them.
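A minimal sketch of the batch-hard mining described in claim 10: for each anchor, the hardest positive (farthest same-identity sample) and hardest negative (closest different-identity sample) are selected and fed into a hinge loss with margin ρ (0.3 in the experiments). This is a generic illustration of the technique, not the patent's exact formulation:

```python
def hard_triplets(dist, labels):
    """dist[i][j]: pairwise distance; labels[i]: identity. Every anchor is
    assumed to have at least one positive. Returns per-anchor
    (hardest_positive_distance, hardest_negative_distance)."""
    out = []
    for i, li in enumerate(labels):
        pos = [dist[i][j] for j, lj in enumerate(labels) if lj == li and j != i]
        neg = [dist[i][j] for j, lj in enumerate(labels) if lj != li]
        out.append((max(pos), min(neg)))  # farthest positive, closest negative
    return out

def hard_triplet_loss(dist, labels, margin=0.3):
    """Mean hinge over anchors: max(0, margin + d_ap_hard - d_an_hard)."""
    pairs = hard_triplets(dist, labels)
    return sum(max(0.0, margin + ap - an) for ap, an in pairs) / len(pairs)
```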
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210123546.0A CN114550208A (en) | 2022-02-10 | 2022-02-10 | Cross-modal pedestrian re-identification method based on global level and local level combined constraint |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114550208A true CN114550208A (en) | 2022-05-27 |
Family
ID=81672756
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114550208A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115565204A (en) * | 2022-10-08 | 2023-01-03 | 南通大学 | Cross-modal pedestrian re-identification method by utilizing local supervision |
CN116580287A (en) * | 2023-04-13 | 2023-08-11 | 南通大学 | Cross-modal place recognition method based on global and local feature joint constraint |
CN117422963A (en) * | 2023-09-11 | 2024-01-19 | 南通大学 | Cross-modal place recognition method based on high-dimension feature mapping and feature aggregation |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635728A (en) * | 2018-12-12 | 2019-04-16 | 中山大学 | A kind of isomery pedestrian recognition methods again based on asymmetric metric learning |
CN112434796A (en) * | 2020-12-09 | 2021-03-02 | 同济大学 | Cross-modal pedestrian re-identification method based on local information learning |
Non-Patent Citations (1)
Title |
---|
TIANQI ZHANG ET AL: "Visible Infrared Person Re-Identification via Global-Level and Local-Level Constraints", 《IEEE ACCESS》, vol. 9, 24 December 2021 (2021-12-24), pages 166339 - 166350 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||