CN115497121A

CN115497121A - Cross-modal pedestrian re-identification method based on space-time characteristics and different center loss

Info

Publication number: CN115497121A
Application number: CN202211169495.1A
Authority: CN
Inventors: 张强; 苏鹏; 刘瑞; 周东生
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2022-12-20

Abstract

The invention provides a cross-modal pedestrian re-recognition method based on space-time characteristics and different center losses. The invention simultaneously extracts the channel characteristics and the spatial characteristics, so that the representation of the pedestrian is more identifiable, and the characteristic distribution of the same pedestrian characteristic is more compact by using the different-center sample loss.

Description

Cross-modal pedestrian re-identification method based on space-time characteristics and different center loss

Technical Field

The invention belongs to the field of image retrieval algorithms, relates to a pedestrian re-identification technology based on cross-modal images, and particularly relates to a cross-modal pedestrian re-identification method based on space-time characteristics and different center loss.

Background

The pedestrian re-identification is an important branch in the field of image retrieval, and has wide application in the aspects of ensuring social public safety and the like. With the appearance of more and more monitoring devices integrated with infrared acquisition devices, cross-mode pedestrian re-identification by using visible light images and infrared images becomes an important research direction.

Compared with the traditional single-mode pedestrian re-identification, the cross-mode pedestrian re-identification is more challenging. In addition to factors such as view angle change, occlusion, posture change, etc., cross-modal pedestrian re-identification also faces the challenge of huge modal differences. In order to reduce the influence caused by the mode difference, some methods adopt a mode conversion mode to convert the infrared image and the visible light image into the same mode. Other methods use a feature learning approach to align the image features of two modalities in a uniform feature space.

The existing cross-modal pedestrian re-identification method utilizes a single-flow network to connect a mode alignment module to extract image detail characteristics, and then utilizes cross entropy loss and central cluster loss to train network parameters, so as to achieve the purpose of cross-modal pedestrian re-identification. However, the center cluster loss, while better handling the differences between modalities, does not constrain the intra-modality differences tightly; the modality alignment module extracts detailed features from the channel level, but there are some key detailed features available at the spatial level. If this information can be utilized simultaneously, the performance of the cross-modal pedestrian re-identification model can be further improved.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention provides a cross-modal pedestrian re-identification method based on space-time characteristics and different center loss, channel characteristics and space characteristics are extracted simultaneously, so that the representation of pedestrians is more identifiability, and different center sample loss is used, so that the characteristic distribution of the same pedestrian characteristic is more compact.

The technical scheme adopted by the invention for solving the technical problem is as follows: a cross-modal pedestrian re-recognition method based on space-time characteristics and different center losses is characterized in that a cross-modal pedestrian re-recognition model extracts pedestrian image characteristics from space, channels and overall dimensions, and a cross-modal pedestrian re-recognition model is trained by adopting a loss function consisting of identity loss, different center sample loss, center cluster loss and total characteristic loss, so that cross-modal pedestrian re-recognition of a pedestrian image is realized.

As a further embodiment of the invention, the method specifically comprises the following steps:

s1: constructing a cross-modal pedestrian re-identification model, wherein the cross-modal pedestrian re-identification model comprises a modal alignment module and a spatial feature extraction module which are connected in parallel;

s2: performing data enhancement on the pedestrian image input into the cross-mode pedestrian re-identification model;

s3: extracting global features, spatial local features and channel local features of the pedestrian image by using the cross-modal pedestrian re-identification model;

s4: calculating the different-center sample loss and the identity loss of the extracted spatial local features, and calculating the center cluster loss and the identity loss of the extracted channel local features;

s5: splicing the global features, the spatial local features and the channel local features to be used as total features, and calculating total feature loss;

s6: adding all losses in the steps S4 and S5 to form a loss function of the whole cross-modal pedestrian re-recognition model, and training and optimizing parameters in the cross-modal pedestrian re-recognition model according to the loss function;

s7: after the training of the cross-modal pedestrian re-recognition model is completed, inputting the pedestrian image to be inquired and the test concentrated image into the cross-modal pedestrian re-recognition model, calculating the similarity between the pedestrian image and the test concentrated image, and returning the M values with the highest similarity, namely the result of the cross-modal pedestrian re-recognition.

As a further embodiment of the present invention, the data enhancement on the pedestrian image in step S2 specifically includes: the visible light images and the infrared images are mixed to form a plurality of batches.

As a further embodiment of the present invention, the off-center sample loss of the extracted spatial local feature is calculated in step S4, and the calculation formula is as follows:

wherein L is _HCS Represents the off-center sample loss, ρ is the margin parameter, δ is a balance coefficient, [ x ]] ₊ = max (x, 0) standard hinge loss, | | x _a -x _b || ₂ Denotes x _a And x _b A second norm therebetween; p represents the total number of different classes in a mini-batch,

and

respectively representing the characteristic representation of the jth visible light image and the jth infrared image of the class i,

and

respectively representing the central characteristics of a visible light mode and an infrared mode class i in a mini-batch;

and

the average value of all samples of the class i in each mode is obtained, and the calculation formula is as follows:

wherein K represents that the number of visible light images and infrared images in one mini-batch is K.

The beneficial effects of the invention include: the method is characterized in that a spatial feature extraction module is connected in parallel on the basis of the original mode alignment module, and the channel features and the spatial features are extracted at the same time, so that the representation of pedestrians is more identifiable; meanwhile, the different-center sample loss is used for simultaneously processing the cross-modal and intra-modal changes, so that the feature distribution of the same pedestrian feature is more compact. The method has reference significance for exploring the feature representation of the pedestrian and improving the accuracy of pedestrian re-identification.

Drawings

FIG. 1 is a diagram of a model framework of the present invention;

FIG. 2 is an illustration of off-center sample loss according to the present invention;

FIG. 3 is a graph of the effect of off-center sample loss versus t-SNE.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Furthermore, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Example 1

A cross-modal pedestrian re-identification method based on space-time characteristics and different center losses is characterized in that a space characteristic extraction structure is connected in parallel to an original cross-modal pedestrian re-identification model to construct a new cross-modal pedestrian re-identification model, and FIG. 1 is a model framework diagram of the invention.

First, data enhancement is performed on the infrared image and the visible light image in the data set. They are then fed into a single stream network consisting of Resnet50 and MPM, resulting in an embedded profile X. Next, using the three branches, spatial features, channel features, and global feature representations of the embedded feature map X are extracted, respectively. And finally, respectively using the identity loss and the related cross-mode triple loss for the three feature representations, and optimizing the parameters of the network to obtain a cross-mode pedestrian re-identification model.

The method specifically comprises the following technical steps:

1. and constructing a network model and initializing network parameters.

The main network of the invention is a single-flow network composed of Resnet50 and MAM, and the main purpose of the network is to extract pedestrian characteristics of a coarse granularity

Wherein, C, H, W represent the channel number, length and width of the characteristic diagram respectively. Resnet50 initializes the network parameters with the pre-training parameters on ImageNet and changes stride to 1 to remove the last volume block of the last downsampled layer, i.e., layer 4. MAM is essentially an example normalized structure (specifically, see: qiang Wu, pingyang Dai, jie Chen, chia-Wen Lin, yongjian Wu, feiyue Huang, bineng Zhong, and Rongrong Ji. Discovery cross-modulation numbers for visual-induced person-identification. In Proceedings of the IEEE/CVF Conference company Vision and Pattern Recognition, pages4330-4339,2021. Nether) that is embedded behind the layer3 and layer4 layers of Res 50. Then, there are three branches behind the backbone network, one is a spatial feature extraction module branch, one is a channel feature extraction module branch, and the last is a global feature extraction branch.

And (3) spatial feature branching, namely averagely dividing the pedestrian features X with coarse granularity into p small blocks, and extracting spatial local features from each small block by operations of pooling, 1 × 1 convolution, remodeling and the like. Specifically, it can be expressed as:

S _i ＝Reshape(Conv(pool(X _i )))

wherein,

a feature map for each patch is shown. And finally, connecting all the spatial local features to be used as the final feature representation of the spatial feature extraction module.

The channel feature branch is mainly composed of a mode alignment module and a 1 × 1 convolution down-sampling layer.

And the global feature extraction branch performs pooling and reshaping operation on the coarse-grained pedestrian features X. And finally, connecting the pedestrian characteristics of the three branches to be used as a representation of the final pedestrian characteristic.

2. Image pre-processing

Images in the data set were randomly cropped to 288 × 144 images, and then the diversity of the data was increased using random horizontal inversion and random graying. For the visible light image, a local random channel enhancement is additionally used, namely a region in the visible light image is randomly selected, and the pixel value of the region is randomly replaced by one of a gray value, a pixel value of an R channel, a pixel value of a G channel and a pixel value of a B channel. This is done to force the model to be less sensitive to color changes, facilitating its learning of color-independent features.

3. Loss function

The invention uses various loss functions in the process of training the network model. Specifically including off-center sample loss, center cluster loss, identity loss, and other losses.

The off-center sample loss is an improvement over the conventional triple loss, which optimizes both the inter-class center distance and the sample-to-class center distance. As shown in fig. 2, the boxes and circles represent different classes, the open represents each sample feature, and the filled represents the center feature calculated from the sample. In a mini-batch, on one hand, the different-center sample loss enables the center features of the same type and different modes to be close to each other, and the center features of different types to be far away from each other; on the other hand, each sample is made as close as possible to the center of the class. Its formula is as follows:

where ρ is the margin parameter and δ is a balance coefficient. [ x ]] ₊ = max (X, 0) standard hinge loss, | X _a -x _b || ₂ Represents x _a And x _b Two norms in between. P represents the total number of different classes in a mini-batch.

And

and respectively representing the characteristics of the jth visible light image and the jth infrared image of the class i.

And

respectively representing the central characteristics of the visible light mode and the infrared mode class i in a mini-batch, and the central characteristics are obtained by averaging samples of all classes i in respective modes, and the calculation formula is as follows:

and K represents that the number of the visible light pictures and the number of the infrared pictures in one mini-batch are both K. The off-center sample loss makes the sample distribution more compact, and its effect compared to the conventional triple loss is shown in fig. 3.

The identity loss is essentially cross entropy loss, cross-modal pedestrian re-identification is regarded as a multi-classification problem, and each identity of a pedestrian is regarded as a class.

The central cluster losses and other losses are those used in the literature referred to above. The total loss used to remove the center cluster loss is denoted as the other loss.

For spatial feature extraction module branches, the invention uses identity loss and off-center sample loss; for the channel feature extraction module branches, identity loss and central cluster loss are used; for the global feature extraction branch, other penalties are used.

According to the method, the local features of the pedestrians are extracted from the space angle, the channel angle and the global angle according to the requirements of the cross-modal pedestrian re-identification task, the feature representation of the pedestrians is enhanced, the distances among different modal features are shortened, and the matching accuracy of the cross-modal pedestrian re-identification task is improved.

It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications derived therefrom are intended to be within the scope of the invention.

Claims

1. A cross-modal pedestrian re-recognition method based on space-time characteristics and different center losses is characterized in that a cross-modal pedestrian re-recognition model extracts pedestrian image characteristics from space, channels and overall dimensions, and a loss function consisting of identity loss, different center sample loss, center cluster loss and total characteristic loss is adopted to train the cross-modal pedestrian re-recognition model, so that cross-modal pedestrian re-recognition of a pedestrian image is achieved.

2. The cross-modal pedestrian re-identification method based on spatiotemporal features and different center losses according to claim 1, characterized by comprising the following steps:

s2: performing data enhancement on the pedestrian image input with the cross-mode pedestrian re-identification model;

s4: calculating the heterocentric sample loss and the identity loss of the extracted spatial local features, and calculating the central cluster loss and the identity loss of the extracted channel local features;

3. The cross-modal pedestrian re-identification method based on spatio-temporal features and different center loss according to claim 2, wherein the data enhancement is performed on the pedestrian image in step S2, specifically: the visible light images and the infrared images are mixed to form a plurality of batches.

4. The cross-modal pedestrian re-identification method based on spatio-temporal features and different-center loss according to claim 3, wherein the different-center sample loss of the extracted spatial local features is calculated in step S4, and the calculation formula is as follows:

wherein L is _HCS Representing iso-centric samplesLoss, ρ is the margin parameter, δ is a balance coefficient, [ x ]] ₊ = max (x, 0) standard hinge loss, | x _a -x _b || ₂ Denotes x _a And x _b A second norm therebetween; p represents the total number of different classes in a mini-batch,

and

respectively representing the characteristics of the jth visible light image and the jth infrared image of the class i,

and

and