CN114550210A - Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition - Google Patents

Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition

Info

Publication number
CN114550210A
Authority
CN
China
Prior art keywords
pedestrian
image
infrared
network
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210155715.9A
Other languages
Chinese (zh)
Other versions
CN114550210B (en)
Inventor
查正军
刘嘉威
黄志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210155715.9A priority Critical patent/CN114550210B/en
Publication of CN114550210A publication Critical patent/CN114550210A/en
Application granted granted Critical
Publication of CN114550210B publication Critical patent/CN114550210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition, which comprises the following steps: 1. constructing an adaptive mixing module that outputs mixing ratios used to blend the visible-light image and the infrared image into a mixed-modality image; 2. inputting the visible-light image, the infrared image and the mixed-modality image into a feature extraction network, and computing a classification loss and a triplet loss to update the feature extraction network; 3. constructing a reward R and building the losses of the actor network and the critic network with a reinforcement learning rule so as to update these networks; 4. computing a similarity matrix from the pedestrian features of the query library and the queried library to obtain the retrieval result. The method addresses the optimization difficulty and computational cost of traditional generative adversarial models for infrared-visible image conversion, as well as the information loss of traditional single-stream networks and the fitting difficulty of dual-stream networks, so that pedestrian images in the visible-light and infrared modalities can be matched more efficiently and accurately.

Description

Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition
Technical Field
The invention belongs to the field of pedestrian re-identification, and particularly relates to a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition.
Background
Pedestrian Re-identification (Re-ID) has recently attracted increasing attention due to its wide application in automated tracking and activity analysis. It aims to capture and identify a target pedestrian across multiple different camera views. Pedestrian re-identification is very challenging due to background clutter, occlusion, drastic changes in lighting, differences in body posture, and the like. Most existing pedestrian re-identification methods focus primarily on pedestrian visible-light images from visible-light cameras and formulate the task as a single-modality (visible-visible) matching problem. In recent years, these methods have made remarkable progress. However, visible-light cameras cannot provide useful appearance information in poor lighting environments (such as at night), which limits the applicability of pedestrian re-identification in practical scenes. To address this problem, recent surveillance systems have begun to incorporate infrared cameras to facilitate night-time surveillance, which gives rise to a new cross-modality matching task called visible-infrared pedestrian re-identification. Given a visible-light (or infrared) image of the target person, visible-infrared pedestrian re-identification aims to find the corresponding infrared (or visible-light) images of the same person captured by cameras of the other spectrum. Compared with traditional single-modality pedestrian re-identification, visible-infrared pedestrian re-identification suffers, in addition to appearance differences, from significant modality differences, which result from the different imaging processes of different spectral cameras (visible-light and infrared images are heterogeneous in nature, with different wavelength ranges). The key to visible-infrared pedestrian re-identification is to bridge the large modality gap and to learn modality-independent discriminative features from visible-light and infrared images.
Existing visible-infrared pedestrian re-identification methods mainly focus on mitigating the inherent modality differences at the pixel level or at the feature level in order to extract cross-modality shared features. To mitigate modality differences at the pixel level, these approaches typically design complex generative adversarial models to perform image-to-image conversion, which are difficult to optimize and produce noisy generated samples. On the other hand, to reduce modality differences at the feature level, these methods use single-stream or dual-stream networks to extract modality-invariant features through several custom loss functions. However, single-stream approaches learn a generic network model that lacks the ability to explicitly model each individual modality and ignores modality-specific features, resulting in the loss of critical information. Dual-stream approaches first extract modality-specific information using a separate branch for each modality and then project the modality-specific features into a unified feature space using a shared network; they completely separate the modeling of modality-specific and modality-shared information, and may break important cross-modality shared semantics when extracting modality-specific features. Furthermore, all of the above methods attempt to handle the large modality difference directly and align the two modalities in one step, which is parameter-sensitive and difficult to converge.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition, so as to address the optimization difficulty and computational cost of traditional generative adversarial models for infrared-visible image conversion, as well as the information loss of traditional single-stream networks and the fitting difficulty of dual-stream networks, thereby matching pedestrian images in the visible-light and infrared modalities more efficiently and accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition, which is characterized by comprising the following steps of:
step one, pedestrian data collection and preprocessing:
infrared and visible-light surveillance videos of pedestrians are collected by using an infrared camera and a visible-light camera respectively, and pedestrian detection and size normalization preprocessing are performed on the videos frame by frame, so as to obtain an infrared pedestrian image set X^ir = { x_i^ir | i = 1, ..., N } and a visible-light pedestrian image set X^rgb = { x_i^rgb | i = 1, ..., N }, where x_i^ir denotes the i-th infrared pedestrian image and x_i^rgb denotes the i-th visible-light pedestrian image;
the i-th infrared pedestrian image x_i^ir and the i-th visible-light pedestrian image x_i^rgb are assigned the same pedestrian identity ID, denoted y_i, with y_i ∈ { 1, 2, ..., M }, where M is the number of pedestrian identities in the training set and m_p denotes any pedestrian identity ID; an infrared-visible matched training data set D = { (x_i^rgb, x_i^ir, y_i) | i = 1, ..., N } is thereby constructed, where N denotes the number of images in the training data set;
step two, self-adaptive image mixing:
2.1, infrared pedestrian images and visible-light pedestrian images of p pedestrian identity IDs are acquired from the training data set each time, with k infrared pedestrian images and k visible-light pedestrian images per pedestrian identity ID, so as to obtain batch data composed of 2×p×k images, B = { (x_j^rgb, x_j^ir, y_j) | j = 1, ..., p×k }, where x_j^rgb denotes the j-th visible-light pedestrian image in the batch data, x_j^ir denotes the j-th infrared pedestrian image in the batch data, and y_j denotes the pedestrian identity ID of the j-th image pair in the batch data;
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages, where the 1st stage, Stage 0, consists of one convolution layer with an n1×n1 convolution kernel, one batch normalization layer and one ReLU activation layer, and the remaining 4 stages are all composed of Bottleneck modules; Stage 1 contains 3 Bottleneck modules, and the remaining 3 stages contain 4, 6 and 3 Bottleneck modules respectively, where each Bottleneck module consists of one convolution layer with an n2×n2 kernel, one convolution layer with an n3×n3 kernel and one convolution layer with an n2×n2 kernel;
modality-adaptive decomposition is performed on all convolution kernels in the first three stages of the ResNet-50 deep learning network, so that each convolution kernel is decomposed into three modality base layers α_rgb, α_ir, α_mix and one modality-shared coefficient layer ψ; the decomposed layers together with the remaining two stages form the feature extraction network;
step 2.3, the batch data B = { (x_j^rgb, x_j^ir, y_j) | j = 1, ..., p×k } is input into the feature extraction network; in the first three stages, each image is first processed by the convolution of the two corresponding modality base layers α_rgb and α_ir of each convolution kernel and then by the convolution of the shared coefficient layer ψ of the corresponding convolution kernel; after all convolution kernels have been processed, the third stage outputs the intermediate feature set F = { (f_j^rgb, f_j^ir) | j = 1, ..., p×k }, where f_j^rgb denotes the intermediate feature of the j-th visible-light image and f_j^ir denotes the intermediate feature of the j-th infrared image;
step 2.4, an adaptive mixing module consisting of an actor network and a critic network is constructed, where the actor network and the critic network each comprise one convolution layer, one pooling layer and two fully connected layers;
the intermediate feature set F is input into the actor network for processing, and the actor network outputs the mixing ratios { λ_j | j = 1, ..., p×k }, where λ_j = (λ_j^1, ..., λ_j^6) denotes the six mixing ratios generated for the j-th sample of the batch data;
the j-th visible-light image x_j^rgb and infrared image x_j^ir are each divided evenly into 6 blocks in the vertical direction, and the corresponding blocks of the visible-light pedestrian image and the infrared pedestrian image are mixed according to the mixing ratios, so as to obtain p×k mixed-modality images { (x_j^mix, y_j) | j = 1, ..., p×k }, where x_j^mix denotes the j-th mixed-modality image and y_j denotes its pedestrian identity ID;
step three, updating the feature extraction network for pedestrian re-identification loss:
step 3.1, the three-modality data { (x_j^rgb, x_j^ir, x_j^mix, y_j) | j = 1, ..., p×k } is input into the feature extraction network; after processing by the first three stages, the intermediate features { (f_j^rgb, f_j^ir, f_j^mix) | j = 1, ..., p×k } are obtained, where f_j^mix denotes the intermediate feature of the j-th mixed-modality image, and after processing by the last two stages the network finally outputs the pedestrian features { (v_j^rgb, v_j^ir, v_j^mix) | j = 1, ..., p×k }, where v_j^rgb denotes the pedestrian feature of the j-th visible-light image, v_j^ir denotes the pedestrian feature of the j-th infrared image, and v_j^mix denotes the pedestrian feature of the j-th mixed-modality image;
the pedestrian features are classified by a fully connected layer, and the output is passed through a softmax function to obtain the classification probabilities of the corresponding pedestrian identities p_j^rgb(m_p), p_j^ir(m_p) and p_j^mix(m_p), which denote the probabilities that the j-th visible-light image, the j-th infrared image and the j-th mixed-modality image in the batch data are classified as pedestrian identity ID m_p, respectively;
step 3.2, the identity loss function L_id is constructed using formula (1):

L_id = -(1 / (3·p·k)) · Σ_{j=1}^{p×k} [ log p_j^rgb(y_j) + log p_j^ir(y_j) + log p_j^mix(y_j) ]        (1)

in formula (1), y_j denotes the correct pedestrian identity ID of the j-th visible-light image in the batch data, which is also the correct pedestrian identity ID of the j-th infrared image and of the j-th mixed-modality image; p_j^rgb(y_j), p_j^ir(y_j) and p_j^mix(y_j) denote the probabilities that the j-th visible-light image, the j-th infrared image and the j-th mixed-modality image in the batch data are classified as the correct pedestrian identity ID y_j;
step 3.3, the center triplet loss function of the visible-light and infrared modalities L_tri^{rgb,ir}, the center triplet loss function of the visible-light and mixed modalities L_tri^{rgb,mix} and the center triplet loss function of the infrared and mixed modalities L_tri^{ir,mix} are constructed using formula (3), formula (4) and formula (5):

L_tri^{rgb,ir} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^rgb - c_{m_p}^ir ||_2 - min_{n_p≠m_p} || c_{m_p}^rgb - c_{n_p}^ir ||_2 ]_+        (3)
L_tri^{rgb,mix} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^rgb - c_{m_p}^mix ||_2 - min_{n_p≠m_p} || c_{m_p}^rgb - c_{n_p}^mix ||_2 ]_+        (4)
L_tri^{ir,mix} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^ir - c_{m_p}^mix ||_2 - min_{n_p≠m_p} || c_{m_p}^ir - c_{n_p}^mix ||_2 ]_+        (5)

in formula (3), formula (4) and formula (5), c_{m_p}^rgb, c_{m_p}^ir and c_{m_p}^mix respectively denote the pedestrian feature center of the visible-light images, the pedestrian feature center of the infrared images and the pedestrian feature center of the mixed-modality images of the m_p-th pedestrian in the batch data; ρ is a margin parameter, and [·]_+ = max(·, 0) denotes the max function; c_{n_p}^rgb, c_{n_p}^ir and c_{n_p}^mix denote the corresponding pedestrian feature centers of the n_p-th pedestrian in the batch data, with n_p ≠ m_p;
the network total loss function L_dcn is constructed using formula (6):

L_dcn = L_id + L_tri^{rgb,ir} + L_tri^{rgb,mix} + L_tri^{ir,mix}        (6)
3.4, based on the training data set, the feature extraction network is trained using the Adam optimization strategy until the network total loss function L_dcn converges, so as to obtain the optimal feature extraction network;
step four, updating the self-adaptive mixing module by reinforcement learning loss:
step 4.1, the reward R is constructed using formula (7) and formula (8): the comprehensive index of a similarity matrix S is defined as

ε(S) = mAP(S) + rank-k(S)        (7)

and the reward R (formula (8)) is obtained by combining the comprehensive indices ε(S^{rgb,ir}), ε(S^{mix,ir}), ε(S^{ir,rgb}) and ε(S^{mix,rgb}) of the four cross-modality similarity matrices;
here mAP(·) denotes the mean average precision index, rank-k(·) denotes the accuracy index of the top-k ranked retrieval results, S is a similarity matrix calculated from the pedestrian features { (v_j^rgb, v_j^ir, v_j^mix) | j = 1, ..., p×k }, and ε(S) denotes the comprehensive index of the similarity matrix S; S^{rgb,ir} denotes the similarity matrix calculated between { v_j^rgb } and { v_j^ir }, S^{mix,ir} denotes the similarity matrix calculated between { v_j^mix } and { v_j^ir }, S^{ir,rgb} denotes the similarity matrix calculated between { v_j^ir } and { v_j^rgb }, and S^{mix,rgb} denotes the similarity matrix calculated between { v_j^mix } and { v_j^rgb };
step 4.2, the loss function of the actor network L_actor and the loss function of the critic network L_critic are constructed using formula (9) and formula (10) respectively:

L_actor = -Q(F, λ)        (9)
L_critic = || R - Q(F, λ) ||^2        (10)

in formula (9) and formula (10), λ denotes the output of the actor network (the mixing ratios), Q(F, λ) denotes the output of the critic network, i.e. the predicted state-action value for the intermediate features F and the mixing ratios λ, and ||·||^2 denotes the squared error function;
step 4.3, based on the training data set, the actor network and the critic network of the adaptive mixing module are alternately updated and trained using the Adam optimization strategy until the loss functions L_actor and L_critic converge, so as to obtain the optimal adaptive mixing module network;
step five, retrieval process;
step 5.1, the optimal feature extraction network is used to extract the pedestrian features of the query library { v_q | q = 1, ..., N_q } and the pedestrian features of the queried library { v_g | g = 1, ..., N_g } respectively, where v_q denotes the pedestrian feature of the q-th query image, N_q denotes the number of query images, v_g denotes the pedestrian feature of the g-th image in the queried library, and N_g denotes the number of images in the queried library;
step 5.2, under the setting of searching infrared pedestrian images by visible light pedestrian images, making the images of the query library be visible light images and the images of the queried library be infrared images;
a similarity matrix is calculated from the pedestrian features { v_q } and { v_g }, and the rows of the similarity matrix are sorted to obtain the final retrieval result.
Compared with the prior art, the invention has the beneficial effects that:
1. The method uses a mixed modality obtained by adaptive mixing as an auxiliary modality and combines it with the original infrared and visible-light modalities to design a three-modality cross-modality pedestrian re-identification solution; through convolution decomposition, each convolution is decomposed into a modality base part and a shared coefficient part, so that the modality-specific characteristics of infrared, mixed-modality and visible-light pedestrian images and the cross-modality invariant characteristics are extracted more fully, thereby improving the accuracy of infrared-visible cross-modality pedestrian retrieval and identification.
2. The invention uses the adaptive mixing module to mix visible-light and infrared modality images into a mixed modality that serves as an auxiliary modality, which avoids the difficulty and computational cost of image conversion in traditional generative methods, makes the obtained auxiliary modality more reliable and efficient, and improves the accuracy of infrared-visible cross-modality pedestrian retrieval and identification.
3. The invention uses a decomposed convolution network that can both model modality-specific characteristics and fuse cross-modality invariant characteristics with a small number of parameters, which resolves the information loss of traditional single-stream networks and the fitting difficulty of multi-path dual-stream networks, yields more reliable pedestrian features, and improves the accuracy of infrared-visible cross-modality pedestrian retrieval and identification.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Detailed Description
In this embodiment, the flow of a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition is shown in FIG. 1; specifically, the method is carried out according to the following steps:
step one, pedestrian data collection and preprocessing:
infrared and visible-light surveillance videos of pedestrians are collected by using an infrared camera and a visible-light camera respectively, and pedestrian detection and size normalization preprocessing are performed on the videos frame by frame, so as to obtain an infrared pedestrian image set X^ir = { x_i^ir | i = 1, ..., N } and a visible-light pedestrian image set X^rgb = { x_i^rgb | i = 1, ..., N }, where x_i^ir denotes the i-th infrared pedestrian image and x_i^rgb denotes the i-th visible-light pedestrian image;
the i-th infrared pedestrian image x_i^ir and the i-th visible-light pedestrian image x_i^rgb are assigned the same pedestrian identity ID, denoted y_i, with y_i ∈ { 1, 2, ..., M }, where M is the number of pedestrian identities in the training set and m_p denotes any pedestrian identity ID; an infrared-visible matched training data set D = { (x_i^rgb, x_i^ir, y_i) | i = 1, ..., N } is thereby constructed, where N denotes the number of images in the training data set; in this embodiment, N is 2060;
step two, self-adaptive image mixing:
2.1, infrared pedestrian images and visible-light pedestrian images of p pedestrian identity IDs are acquired from the training data set each time, with k infrared pedestrian images and k visible-light pedestrian images per pedestrian identity ID, so as to obtain batch data composed of 2×p×k images, B = { (x_j^rgb, x_j^ir, y_j) | j = 1, ..., p×k }, where x_j^rgb denotes the j-th visible-light pedestrian image in the batch data, x_j^ir denotes the j-th infrared pedestrian image in the batch data, and y_j denotes the pedestrian identity ID of the j-th image pair in the batch data; in this embodiment, p is 8 and k is 4.
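As an illustration of this sampling scheme (p identities, each contributing k visible-light and k infrared images), a minimal Python sketch follows; the dataset structure and the function name are illustrative assumptions rather than part of the patent.

    import random
    from collections import defaultdict

    def sample_pk_batch(dataset, p=8, k=4):
        """Sample a batch of p identities with k visible-light and k infrared images each.

        `dataset` is assumed to be a list of (rgb_path, ir_path, identity) tuples,
        i.e. the matched infrared-visible training set D of step one.  Returns lists
        of visible-light paths, infrared paths and identity labels, each of length
        p * k (2 * p * k images in total).
        """
        by_id = defaultdict(list)
        for rgb_path, ir_path, y in dataset:
            by_id[y].append((rgb_path, ir_path))

        identities = random.sample(list(by_id.keys()), p)
        rgb_batch, ir_batch, labels = [], [], []
        for y in identities:
            pairs = by_id[y]
            # Sample with replacement when an identity has fewer than k pairs.
            chosen = random.choices(pairs, k=k) if len(pairs) < k else random.sample(pairs, k)
            for rgb_path, ir_path in chosen:
                rgb_batch.append(rgb_path)
                ir_batch.append(ir_path)
                labels.append(y)
        return rgb_batch, ir_batch, labels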
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages, where the 1st stage, Stage 0, consists of one convolution layer with an n1×n1 convolution kernel, one batch normalization layer and one ReLU activation layer, and the remaining 4 stages are all composed of Bottleneck modules; Stage 1 contains 3 Bottleneck modules, and the remaining 3 stages contain 4, 6 and 3 Bottleneck modules respectively, where each Bottleneck module consists of one convolution layer with an n2×n2 kernel, one convolution layer with an n3×n3 kernel and one convolution layer with an n2×n2 kernel; in this embodiment, n1 = 7, n2 = 1 and n3 = 3;
modality-adaptive decomposition is performed on all convolution kernels in the first three stages of the ResNet-50 deep learning network, so that each convolution kernel is decomposed into three modality base layers α_rgb, α_ir, α_mix and one modality-shared coefficient layer ψ; the decomposed layers together with the remaining two stages form the feature extraction network. The modality-adaptive convolution decomposition approximates each convolution kernel as the product of modality-specific base layers and a small shared coefficient layer, so that modality differences are countered and cross-modality shared semantics are modelled simultaneously at the feature level. The modality-specific base layers are learned independently from the images of the corresponding modality to model modality variation, and they spatially convolve each individual input feature channel to correct modality differences. The shared coefficient layer is learned from all three modalities and performs a 1×1 convolution that weights and sums the corrected output feature channels, facilitating cross-modality shared semantics. The decomposed convolution network takes the visible-light, infrared and mixed-modality images produced by the modality-adaptive mixing module as input and effectively handles the large modality difference at the feature level, so as to learn modality-invariant features (a minimal code sketch of one such decomposed convolution is given after step 2.3 below);
step 2.3, the batch data B = { (x_j^rgb, x_j^ir, y_j) | j = 1, ..., p×k } is input into the feature extraction network; in the first three stages, each image is first processed by the convolution of the two corresponding modality base layers α_rgb and α_ir of each convolution kernel and then by the convolution of the shared coefficient layer ψ of the corresponding convolution kernel; after all convolution kernels have been processed, the third stage outputs the intermediate feature set F = { (f_j^rgb, f_j^ir) | j = 1, ..., p×k }, where f_j^rgb denotes the intermediate feature of the j-th visible-light image and f_j^ir denotes the intermediate feature of the j-th infrared image;
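To make the decomposition of step 2.2 and the forward pass of step 2.3 concrete, a minimal PyTorch sketch of one decomposed convolution is given below. Treating each modality base layer α_rgb, α_ir, α_mix as a depthwise spatial convolution and the shared coefficient layer ψ as a 1×1 convolution is an assumption consistent with the description above; the module name and sizes are illustrative.

    import torch
    import torch.nn as nn

    class DecomposedConv(nn.Module):
        """One modality-adaptively decomposed convolution: three modality-specific
        depthwise base layers followed by one shared 1x1 coefficient layer."""

        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            # Modality-specific base layers: spatial convolution of each input channel.
            self.base = nn.ModuleDict({
                m: nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                             groups=in_ch, bias=False)
                for m in ("rgb", "ir", "mix")
            })
            # Modality-shared coefficient layer: 1x1 convolution that weights and
            # sums the corrected feature channels across all modalities.
            self.coeff = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

        def forward(self, x, modality):
            # `modality` selects the base layer that corrects the modality difference.
            return self.coeff(self.base[modality](x))

    # Example: a visible-light feature map passes through its own base layer
    # and then through the shared coefficient layer.
    feat = torch.randn(2, 64, 48, 24)
    layer = DecomposedConv(64, 128)
    out = layer(feat, "rgb")   # shape: (2, 128, 48, 24)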
step 2.4, an adaptive mixing module consisting of an actor network and a critic network is constructed, where the actor network and the critic network each comprise one convolution layer, one pooling layer and two fully connected layers;
this dynamic, locally linear interpolation across different regions of the modality images is learned in a data-driven manner; it can be formulated as a single-step Markov decision process and is realized by an actor-critic agent under a deep reinforcement learning (RL) framework.
the intermediate feature set F is input into the actor network for processing, and the actor network outputs the mixing ratios { λ_j | j = 1, ..., p×k }, where λ_j = (λ_j^1, ..., λ_j^6) denotes the six mixing ratios generated for the j-th sample of the batch data;
the j-th visible-light image x_j^rgb and infrared image x_j^ir are each divided evenly into 6 blocks in the vertical direction, and the corresponding blocks of the visible-light pedestrian image and the infrared pedestrian image are mixed according to the mixing ratios, so as to obtain p×k mixed-modality images { (x_j^mix, y_j) | j = 1, ..., p×k }, where x_j^mix denotes the j-th mixed-modality image and y_j denotes its pedestrian identity ID;
the blend ratio is dynamically adjusted according to modal and appearance differences between corresponding local regions of the visible and infrared images, which are output by the operator network. The actor network in the agent is used to estimate the blend ratio and the critic network in the agent predicts the state action value (Q value).
Step three, updating the feature extraction network for pedestrian re-identification loss:
step 3.1, the three-modality data { (x_j^rgb, x_j^ir, x_j^mix, y_j) | j = 1, ..., p×k } is input into the feature extraction network; after processing by the first three stages, the intermediate features { (f_j^rgb, f_j^ir, f_j^mix) | j = 1, ..., p×k } are obtained, where f_j^mix denotes the intermediate feature of the j-th mixed-modality image, and after processing by the last two stages the network finally outputs the pedestrian features { (v_j^rgb, v_j^ir, v_j^mix) | j = 1, ..., p×k }, where v_j^rgb denotes the pedestrian feature of the j-th visible-light image, v_j^ir denotes the pedestrian feature of the j-th infrared image, and v_j^mix denotes the pedestrian feature of the j-th mixed-modality image;
the pedestrian features are classified by a fully connected layer, and the output is passed through a softmax function to obtain the classification probabilities of the corresponding pedestrian identities p_j^rgb(m_p), p_j^ir(m_p) and p_j^mix(m_p), which denote the probabilities that the j-th visible-light image, the j-th infrared image and the j-th mixed-modality image in the batch data are classified as pedestrian identity ID m_p, respectively;
step 3.2, the identity loss function L_id is constructed using formula (1):

L_id = -(1 / (3·p·k)) · Σ_{j=1}^{p×k} [ log p_j^rgb(y_j) + log p_j^ir(y_j) + log p_j^mix(y_j) ]        (1)

in formula (1), y_j denotes the correct pedestrian identity ID of the j-th visible-light image in the batch data, which is also the correct pedestrian identity ID of the j-th infrared image and of the j-th mixed-modality image; p_j^rgb(y_j), p_j^ir(y_j) and p_j^mix(y_j) denote the probabilities that the j-th visible-light image, the j-th infrared image and the j-th mixed-modality image in the batch data are classified as the correct pedestrian identity ID y_j;
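The identity loss of formula (1) as reconstructed above (a shared classifier followed by softmax cross-entropy, averaged over the three modalities) can be sketched as follows; the shared classifier and the variable names are assumptions.

    import torch
    import torch.nn.functional as F

    def identity_loss(v_rgb, v_ir, v_mix, labels, classifier):
        """Cross-entropy identity loss averaged over the three modalities.

        v_rgb, v_ir, v_mix: pedestrian features of shape (B, D) for one batch.
        labels:             pedestrian identity IDs of shape (B,).
        classifier:         shared fully connected layer mapping D -> num_identities.
        """
        loss = 0.0
        for v in (v_rgb, v_ir, v_mix):
            logits = classifier(v)                          # (B, num_identities)
            loss = loss + F.cross_entropy(logits, labels)   # softmax + negative log-likelihood
        return loss / 3.0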
step 3.3, the center triplet loss function of the visible-light and infrared modalities L_tri^{rgb,ir}, the center triplet loss function of the visible-light and mixed modalities L_tri^{rgb,mix} and the center triplet loss function of the infrared and mixed modalities L_tri^{ir,mix} are constructed using formula (3), formula (4) and formula (5):

L_tri^{rgb,ir} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^rgb - c_{m_p}^ir ||_2 - min_{n_p≠m_p} || c_{m_p}^rgb - c_{n_p}^ir ||_2 ]_+        (3)
L_tri^{rgb,mix} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^rgb - c_{m_p}^mix ||_2 - min_{n_p≠m_p} || c_{m_p}^rgb - c_{n_p}^mix ||_2 ]_+        (4)
L_tri^{ir,mix} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^ir - c_{m_p}^mix ||_2 - min_{n_p≠m_p} || c_{m_p}^ir - c_{n_p}^mix ||_2 ]_+        (5)

in formula (3), formula (4) and formula (5), c_{m_p}^rgb, c_{m_p}^ir and c_{m_p}^mix respectively denote the pedestrian feature center of the visible-light images, the pedestrian feature center of the infrared images and the pedestrian feature center of the mixed-modality images of the m_p-th pedestrian in the batch data; ρ is a margin parameter, and [·]_+ = max(·, 0) denotes the max function; c_{n_p}^rgb, c_{n_p}^ir and c_{n_p}^mix denote the corresponding pedestrian feature centers of the n_p-th pedestrian in the batch data, with n_p ≠ m_p;
the network total loss function L_dcn is constructed using formula (6):

L_dcn = L_id + L_tri^{rgb,ir} + L_tri^{rgb,mix} + L_tri^{ir,mix}        (6)
These cross-modality loss functions make better use of the mixed modality as an auxiliary modality and bridge the feature differences between the modalities.
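A sketch of one center triplet term and of the total loss L_dcn, following the forms reconstructed in formulas (3)-(6), is given below; the negative-mining rule is one plausible instantiation and may differ from the original formula images of the patent.

    import torch

    def identity_centers(features, labels):
        """Per-identity feature centers: the mean feature of each identity in the batch."""
        ids = labels.unique()
        return torch.stack([features[labels == y].mean(dim=0) for y in ids]), ids

    def center_triplet_loss(v_a, v_b, labels, margin=0.3):
        """Center triplet loss between modality a and modality b.

        Pulls same-identity centers of the two modalities together and pushes the
        closest different-identity center of modality b at least `margin` away.
        """
        c_a, ids = identity_centers(v_a, labels)
        c_b, _ = identity_centers(v_b, labels)
        dist = torch.cdist(c_a, c_b)                       # (P, P) center-to-center distances
        pos = dist.diag()                                  # same-identity distances
        mask = torch.eye(len(ids), device=dist.device) * 1e6
        neg = (dist + mask).min(dim=1).values              # closest other-identity center
        return torch.clamp(margin + pos - neg, min=0).sum()

    def total_loss(l_id, v_rgb, v_ir, v_mix, labels):
        """Total network loss: identity loss plus the three cross-modality triplet terms."""
        return (l_id
                + center_triplet_loss(v_rgb, v_ir, labels)
                + center_triplet_loss(v_rgb, v_mix, labels)
                + center_triplet_loss(v_ir, v_mix, labels))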
3.4, based on the training data set, the feature extraction network is trained using the Adam optimization strategy until the network total loss function L_dcn converges, so as to obtain the optimal feature extraction network;
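A sketch of this training stage is given below, reusing the loss helpers sketched above; the optimizer hyperparameters and the model interface are illustrative assumptions.

    import torch

    def train_feature_extractor(model, classifier, mixer, loader, epochs=80, lr=3.5e-4):
        """Train the decomposed feature extraction network with Adam on L_dcn.

        `model(images, modality)` returns pedestrian features, `classifier` is the
        shared identity classifier, and `mixer(rgb, ir)` returns the mixed-modality
        images produced by the adaptive mixing module (kept fixed during this stage).
        """
        params = list(model.parameters()) + list(classifier.parameters())
        optimizer = torch.optim.Adam(params, lr=lr, weight_decay=5e-4)
        for _ in range(epochs):
            for rgb, ir, labels in loader:                  # one p x k batch per step
                with torch.no_grad():
                    mix = mixer(rgb, ir)                    # mixed-modality images
                v_rgb = model(rgb, "rgb")
                v_ir = model(ir, "ir")
                v_mix = model(mix, "mix")
                l_id = identity_loss(v_rgb, v_ir, v_mix, labels, classifier)
                loss = total_loss(l_id, v_rgb, v_ir, v_mix, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()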
step four, updating the self-adaptive mixing module by reinforcement learning loss:
step 4.1, the reward R is constructed using formula (7) and formula (8): the comprehensive index of a similarity matrix S is defined as

ε(S) = mAP(S) + rank-k(S)        (7)

and the reward R (formula (8)) is obtained by combining the comprehensive indices ε(S^{rgb,ir}), ε(S^{mix,ir}), ε(S^{ir,rgb}) and ε(S^{mix,rgb}) of the four cross-modality similarity matrices;
here mAP(·) denotes the mean average precision index, rank-k(·) denotes the accuracy index of the top-k ranked retrieval results (in this embodiment, k is 5), S is a similarity matrix calculated from the pedestrian features { (v_j^rgb, v_j^ir, v_j^mix) | j = 1, ..., p×k }, and ε(S) denotes the comprehensive index of the similarity matrix S; S^{rgb,ir} denotes the similarity matrix calculated between { v_j^rgb } and { v_j^ir }, S^{mix,ir} denotes the similarity matrix calculated between { v_j^mix } and { v_j^ir }, S^{ir,rgb} denotes the similarity matrix calculated between { v_j^ir } and { v_j^rgb }, and S^{mix,rgb} denotes the similarity matrix calculated between { v_j^mix } and { v_j^rgb };
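To illustrate the comprehensive index ε(S) = mAP(S) + rank-k(S) of a similarity matrix, a small NumPy sketch follows; deriving gallery relevance from the identity labels and the default k = 5 (as in this embodiment) are assumptions.

    import numpy as np

    def epsilon_index(sim, q_labels, g_labels, k=5):
        """Comprehensive index of a similarity matrix: mAP(S) + rank-k(S).

        sim:      (Nq, Ng) similarity matrix between query and gallery features.
        q_labels: (Nq,) pedestrian identity IDs of the queries.
        g_labels: (Ng,) pedestrian identity IDs of the gallery images.
        """
        order = np.argsort(-sim, axis=1)                       # gallery sorted by similarity
        ap_sum, rank_k_hits = 0.0, 0
        for i in range(sim.shape[0]):
            relevant = g_labels[order[i]] == q_labels[i]       # relevance in ranked order
            if relevant.any():
                hits = np.cumsum(relevant)
                precision = hits / (np.arange(len(relevant)) + 1)
                ap_sum += (precision * relevant).sum() / relevant.sum()   # average precision
                rank_k_hits += int(relevant[:k].any())                    # top-k accuracy
        n_q = sim.shape[0]
        return ap_sum / n_q + rank_k_hits / n_q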
step 4.2, the loss function of the actor network L_actor and the loss function of the critic network L_critic are constructed using formula (9) and formula (10) respectively:

L_actor = -Q(F, λ)        (9)
L_critic = || R - Q(F, λ) ||^2        (10)

in formula (9) and formula (10), λ denotes the output of the actor network (the mixing ratios), Q(F, λ) denotes the output of the critic network, i.e. the predicted state-action value for the intermediate features F and the mixing ratios λ, and ||·||^2 denotes the squared error function;
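A minimal sketch of the single-step actor-critic losses as reconstructed in formulas (9) and (10) follows; the network interfaces are assumptions, and the alternating update is only outlined in the comments.

    import torch

    def critic_loss(q_value, reward):
        """Squared error between the predicted state-action value and the reward."""
        return (reward - q_value).pow(2).mean()

    def actor_loss(q_value):
        """The actor is trained to maximize the critic's value, i.e. to minimize -Q."""
        return -q_value.mean()

    # Alternating update (outline): the critic scores the state (intermediate
    # features F) and the action (mixing ratios); the actor produced the ratios.
    #   q = critic(features, ratios.detach());  update the critic on critic_loss(q, R)
    #   q = critic(features, actor(features));  update the actor  on actor_loss(q)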
4.3, based on the training data set, the actor network and the critic network of the adaptive mixing module are alternately updated and trained using the Adam optimization strategy until the loss functions L_actor and L_critic converge, so as to obtain the optimal adaptive mixing module network;
step five, search process
Step 5.1, respectively extracting pedestrian features of query library by utilizing optimal feature extraction network
{ v_q | q = 1, ..., N_q } and the pedestrian features of the queried library { v_g | g = 1, ..., N_g } by using the optimal feature extraction network, where v_q denotes the pedestrian feature of the q-th query image, N_q denotes the number of query images, v_g denotes the pedestrian feature of the g-th image in the queried library, and N_g denotes the number of images in the queried library; in this embodiment, N_q = N_g = 2060.
Step 5.2, under the setting of searching infrared pedestrian images by visible light pedestrian images, making the images of the query library be visible light images and the images of the queried library be infrared images;
a similarity matrix is calculated from the pedestrian features { v_q } and { v_g }, and the rows of the similarity matrix are sorted to obtain the final retrieval result.
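The retrieval step can be sketched as follows; L2-normalized features and cosine similarity are an assumption, since the similarity measure is not specified above.

    import numpy as np

    def retrieve(query_feats, gallery_feats):
        """Rank gallery images for every query by cosine similarity.

        query_feats:   (Nq, D) visible-light query features.
        gallery_feats: (Ng, D) infrared gallery features.
        Returns the similarity matrix and, for each query row, the gallery indices
        sorted from most to least similar.
        """
        q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
        g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
        sim = q @ g.T                        # (Nq, Ng) similarity matrix
        ranking = np.argsort(-sim, axis=1)   # row-by-row sorting = retrieval result
        return sim, ranking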

Claims (1)

1. A pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition is characterized by comprising the following steps:
step one, pedestrian data collection and preprocessing:
infrared and visible-light surveillance videos of pedestrians are collected by using an infrared camera and a visible-light camera respectively, and pedestrian detection and size normalization preprocessing are performed on the videos frame by frame, so as to obtain an infrared pedestrian image set X^ir = { x_i^ir | i = 1, ..., N } and a visible-light pedestrian image set X^rgb = { x_i^rgb | i = 1, ..., N }, where x_i^ir denotes the i-th infrared pedestrian image and x_i^rgb denotes the i-th visible-light pedestrian image;
the i-th infrared pedestrian image x_i^ir and the i-th visible-light pedestrian image x_i^rgb are assigned the same pedestrian identity ID, denoted y_i, with y_i ∈ { 1, 2, ..., M }, where M is the number of pedestrian identities in the training set and m_p denotes any pedestrian identity ID; an infrared-visible matched training data set D = { (x_i^rgb, x_i^ir, y_i) | i = 1, ..., N } is thereby constructed, where N denotes the number of images in the training data set;
step two, self-adaptive image mixing:
2.1, infrared pedestrian images and visible-light pedestrian images of p pedestrian identity IDs are acquired from the training data set each time, with k infrared pedestrian images and k visible-light pedestrian images per pedestrian identity ID, so as to obtain batch data composed of 2×p×k images, B = { (x_j^rgb, x_j^ir, y_j) | j = 1, ..., p×k }, where x_j^rgb denotes the j-th visible-light pedestrian image in the batch data, x_j^ir denotes the j-th infrared pedestrian image in the batch data, and y_j denotes the pedestrian identity ID of the j-th image pair in the batch data;
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages, where the 1st stage, Stage 0, consists of one convolution layer with an n1×n1 convolution kernel, one batch normalization layer and one ReLU activation layer, and the remaining 4 stages are all composed of Bottleneck modules; Stage 1 contains 3 Bottleneck modules, and the remaining 3 stages contain 4, 6 and 3 Bottleneck modules respectively, where each Bottleneck module consists of one convolution layer with an n2×n2 kernel, one convolution layer with an n3×n3 kernel and one convolution layer with an n2×n2 kernel;
modality-adaptive decomposition is performed on all convolution kernels in the first three stages of the ResNet-50 deep learning network, so that each convolution kernel is decomposed into three modality base layers α_rgb, α_ir, α_mix and one modality-shared coefficient layer ψ; the decomposed layers together with the remaining two stages form the feature extraction network;
step 2.3, the batch data B = { (x_j^rgb, x_j^ir, y_j) | j = 1, ..., p×k } is input into the feature extraction network; in the first three stages, each image is first processed by the convolution of the two corresponding modality base layers α_rgb and α_ir of each convolution kernel and then by the convolution of the shared coefficient layer ψ of the corresponding convolution kernel; after all convolution kernels have been processed, the third stage outputs the intermediate feature set F = { (f_j^rgb, f_j^ir) | j = 1, ..., p×k }, where f_j^rgb denotes the intermediate feature of the j-th visible-light image and f_j^ir denotes the intermediate feature of the j-th infrared image;
step 2.4, an adaptive mixing module consisting of an actor network and a critic network is constructed, where the actor network and the critic network each comprise one convolution layer, one pooling layer and two fully connected layers;
the intermediate feature set F is input into the actor network for processing, and the actor network outputs the mixing ratios { λ_j | j = 1, ..., p×k }, where λ_j = (λ_j^1, ..., λ_j^6) denotes the six mixing ratios generated for the j-th sample of the batch data;
the j-th visible-light image x_j^rgb and infrared image x_j^ir are each divided evenly into 6 blocks in the vertical direction, and the corresponding blocks of the visible-light pedestrian image and the infrared pedestrian image are mixed according to the mixing ratios, so as to obtain p×k mixed-modality images { (x_j^mix, y_j) | j = 1, ..., p×k }, where x_j^mix denotes the j-th mixed-modality image and y_j denotes its pedestrian identity ID;
step three, updating the feature extraction network for pedestrian re-identification loss:
step 3.1, the three-modality data { (x_j^rgb, x_j^ir, x_j^mix, y_j) | j = 1, ..., p×k } is input into the feature extraction network; after processing by the first three stages, the intermediate features { (f_j^rgb, f_j^ir, f_j^mix) | j = 1, ..., p×k } are obtained, where f_j^mix denotes the intermediate feature of the j-th mixed-modality image, and after processing by the last two stages the network finally outputs the pedestrian features { (v_j^rgb, v_j^ir, v_j^mix) | j = 1, ..., p×k }, where v_j^rgb denotes the pedestrian feature of the j-th visible-light image, v_j^ir denotes the pedestrian feature of the j-th infrared image, and v_j^mix denotes the pedestrian feature of the j-th mixed-modality image;
the pedestrian features are classified by a fully connected layer, and the output is passed through a softmax function to obtain the classification probabilities of the corresponding pedestrian identities p_j^rgb(m_p), p_j^ir(m_p) and p_j^mix(m_p), which denote the probabilities that the j-th visible-light image, the j-th infrared image and the j-th mixed-modality image in the batch data are classified as pedestrian identity ID m_p, respectively;
step 3.2, the identity loss function L_id is constructed using formula (1):

L_id = -(1 / (3·p·k)) · Σ_{j=1}^{p×k} [ log p_j^rgb(y_j) + log p_j^ir(y_j) + log p_j^mix(y_j) ]        (1)

in formula (1), y_j denotes the correct pedestrian identity ID of the j-th visible-light image in the batch data, which is also the correct pedestrian identity ID of the j-th infrared image and of the j-th mixed-modality image; p_j^rgb(y_j), p_j^ir(y_j) and p_j^mix(y_j) denote the probabilities that the j-th visible-light image, the j-th infrared image and the j-th mixed-modality image in the batch data are classified as the correct pedestrian identity ID y_j;
step 3.3, the center triplet loss function of the visible-light and infrared modalities L_tri^{rgb,ir}, the center triplet loss function of the visible-light and mixed modalities L_tri^{rgb,mix} and the center triplet loss function of the infrared and mixed modalities L_tri^{ir,mix} are constructed using formula (3), formula (4) and formula (5):

L_tri^{rgb,ir} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^rgb - c_{m_p}^ir ||_2 - min_{n_p≠m_p} || c_{m_p}^rgb - c_{n_p}^ir ||_2 ]_+        (3)
L_tri^{rgb,mix} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^rgb - c_{m_p}^mix ||_2 - min_{n_p≠m_p} || c_{m_p}^rgb - c_{n_p}^mix ||_2 ]_+        (4)
L_tri^{ir,mix} = Σ_{m_p=1}^{p} [ ρ + || c_{m_p}^ir - c_{m_p}^mix ||_2 - min_{n_p≠m_p} || c_{m_p}^ir - c_{n_p}^mix ||_2 ]_+        (5)

in formula (3), formula (4) and formula (5), c_{m_p}^rgb, c_{m_p}^ir and c_{m_p}^mix respectively denote the pedestrian feature center of the visible-light images, the pedestrian feature center of the infrared images and the pedestrian feature center of the mixed-modality images of the m_p-th pedestrian in the batch data; ρ is a margin parameter, and [·]_+ = max(·, 0) denotes the max function; c_{n_p}^rgb, c_{n_p}^ir and c_{n_p}^mix denote the corresponding pedestrian feature centers of the n_p-th pedestrian in the batch data, with n_p ≠ m_p;
the network total loss function L_dcn is constructed using formula (6):

L_dcn = L_id + L_tri^{rgb,ir} + L_tri^{rgb,mix} + L_tri^{ir,mix}        (6)
3.4, based on the training data set, the feature extraction network is trained using the Adam optimization strategy until the network total loss function L_dcn converges, so as to obtain the optimal feature extraction network;
step four, updating the self-adaptive mixing module by reinforcement learning loss:
step 4.1, the reward R is constructed using formula (7) and formula (8): the comprehensive index of a similarity matrix S is defined as

ε(S) = mAP(S) + rank-k(S)        (7)

and the reward R (formula (8)) is obtained by combining the comprehensive indices ε(S^{rgb,ir}), ε(S^{mix,ir}), ε(S^{ir,rgb}) and ε(S^{mix,rgb}) of the four cross-modality similarity matrices;
here mAP(·) denotes the mean average precision index, rank-k(·) denotes the accuracy index of the top-k ranked retrieval results, S is a similarity matrix calculated from the pedestrian features { (v_j^rgb, v_j^ir, v_j^mix) | j = 1, ..., p×k }, and ε(S) denotes the comprehensive index of the similarity matrix S; S^{rgb,ir} denotes the similarity matrix calculated between { v_j^rgb } and { v_j^ir }, S^{mix,ir} denotes the similarity matrix calculated between { v_j^mix } and { v_j^ir }, S^{ir,rgb} denotes the similarity matrix calculated between { v_j^ir } and { v_j^rgb }, and S^{mix,rgb} denotes the similarity matrix calculated between { v_j^mix } and { v_j^rgb };
step 4.2, the loss function of the actor network L_actor and the loss function of the critic network L_critic are constructed using formula (9) and formula (10) respectively:

L_actor = -Q(F, λ)        (9)
L_critic = || R - Q(F, λ) ||^2        (10)

in formula (9) and formula (10), λ denotes the output of the actor network (the mixing ratios), Q(F, λ) denotes the output of the critic network, i.e. the predicted state-action value for the intermediate features F and the mixing ratios λ, and ||·||^2 denotes the squared error function;
4.3, based on the training data set, the actor network and the critic network of the adaptive mixing module are alternately updated and trained using the Adam optimization strategy until the loss functions L_actor and L_critic converge, so as to obtain the optimal adaptive mixing module network;
step five, retrieval process;
step 5.1, the optimal feature extraction network is used to extract the pedestrian features of the query library { v_q | q = 1, ..., N_q } and the pedestrian features of the queried library { v_g | g = 1, ..., N_g } respectively, where v_q denotes the pedestrian feature of the q-th query image, N_q denotes the number of query images, v_g denotes the pedestrian feature of the g-th image in the queried library, and N_g denotes the number of images in the queried library;
step 5.2, under the setting of searching infrared pedestrian images by visible light pedestrian images, making the images of the query library be visible light images and the images of the queried library be infrared images;
a similarity matrix is calculated from the pedestrian features { v_q } and { v_g }, and the rows of the similarity matrix are sorted to obtain the final retrieval result.
CN202210155715.9A 2022-02-21 2022-02-21 Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition Active CN114550210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210155715.9A CN114550210B (en) 2022-02-21 2022-02-21 Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210155715.9A CN114550210B (en) 2022-02-21 2022-02-21 Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition

Publications (2)

Publication Number Publication Date
CN114550210A true CN114550210A (en) 2022-05-27
CN114550210B CN114550210B (en) 2024-04-02

Family

ID=81675054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210155715.9A Active CN114550210B (en) 2022-02-21 2022-02-21 Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition

Country Status (1)

Country Link
CN (1) CN114550210B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542084A (en) * 2023-12-06 2024-02-09 湖南大学 Cross-modal pedestrian re-recognition method based on semantic perception

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434654A (en) * 2020-12-07 2021-03-02 安徽大学 Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN113989851A (en) * 2021-11-10 2022-01-28 合肥工业大学 Cross-modal pedestrian re-identification method based on heterogeneous fusion graph convolution network
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device
CN112434654A (en) * 2020-12-07 2021-03-02 安徽大学 Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN113989851A (en) * 2021-11-10 2022-01-28 合肥工业大学 Cross-modal pedestrian re-identification method based on heterogeneous fusion graph convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯敏; 张智成; 吕进; 余磊; 韩斌: "Research on cross-modal person re-identification based on generative adversarial networks" (基于生成对抗网络的跨模态行人重识别研究), Modern Information Technology (现代信息科技), no. 04, 29 February 2020 (2020-02-29) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542084A (en) * 2023-12-06 2024-02-09 湖南大学 Cross-modal pedestrian re-recognition method based on semantic perception

Also Published As

Publication number Publication date
CN114550210B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN111368896B (en) Hyperspectral remote sensing image classification method based on dense residual three-dimensional convolutional neural network
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN107038448B (en) Target detection model construction method
CN114220124A (en) Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN112347861B (en) Human body posture estimation method based on motion feature constraint
CN112395442B (en) Automatic identification and content filtering method for popular pictures on mobile internet
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
CN111783831A (en) Complex image accurate classification method based on multi-source multi-label shared subspace learning
CN110516533B (en) Pedestrian re-identification method based on depth measurement
CN115410088B (en) Hyperspectral image field self-adaption method based on virtual classifier
CN113283362A (en) Cross-modal pedestrian re-identification method
CN115248876B (en) Remote sensing image overall recommendation method based on content understanding
CN112749675A (en) Potato disease identification method based on convolutional neural network
CN111523586B (en) Noise-aware-based full-network supervision target detection method
CN116740763A (en) Cross-mode pedestrian re-identification method based on dual-attention perception fusion network
CN114550210A (en) Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition
CN113361370B (en) Abnormal behavior detection method based on deep learning
CN110852292A (en) Sketch face recognition method based on cross-modal multi-task depth measurement learning
CN107423771B (en) Two-time-phase remote sensing image change detection method
CN117333948A (en) End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism
CN117218446A (en) Solid waste sorting method and system based on RGB-MSI feature fusion
CN116681742A (en) Visible light and infrared thermal imaging image registration method based on graph neural network
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
CN116597177A (en) Multi-source image block matching method based on dual-branch parallel depth interaction cooperation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant