CN114550210A - Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition - Google Patents

Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition Download PDF

Info

Publication number
CN114550210A
CN114550210A CN202210155715.9A CN202210155715A CN114550210A CN 114550210 A CN114550210 A CN 114550210A CN 202210155715 A CN202210155715 A CN 202210155715A CN 114550210 A CN114550210 A CN 114550210A
Authority
CN
China
Prior art keywords
pedestrian
image
infrared
network
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210155715.9A
Other languages
Chinese (zh)
Other versions
CN114550210B (en
Inventor
查正军
刘嘉威
黄志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210155715.9A priority Critical patent/CN114550210B/en
Publication of CN114550210A publication Critical patent/CN114550210A/en
Application granted granted Critical
Publication of CN114550210B publication Critical patent/CN114550210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition, which comprises the following steps: 1. constructing a self-adaptive mixing module to output a mixing proportion to mix the visible light image and the infrared image to obtain a mixed modal image; 2. inputting the visible light image, the infrared image and the mixed mode image into a feature extraction network, and calculating classification loss and triple loss to update the feature extraction network; 3. constructing an incentive R, and constructing losses of an actor network and a critic network by using a reinforcement learning rule so as to update the network; 4. and calculating a similarity matrix according to the pedestrian characteristics of the search library and the searched library, and obtaining a search result. The method can solve the problems of difficulty and calculation consumption of the traditional generation confrontation model in infrared-visible light image conversion and information loss and difficult fitting of the traditional single-flow double-flow network, so that the pedestrian images in the visible light mode and the infrared mode can be matched more efficiently and accurately.

Description

Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition
Technical Field
The invention belongs to the field of pedestrian re-identification, and particularly relates to a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition.
Background
Pedestrian Re-identification (Re-ID) has recently attracted increasing attention due to its wide application in automated tracking and activity analysis. It is intended to capture and identify a target pedestrian from a plurality of different camera views. Pedestrian re-identification is very challenging due to background clutter, occlusion, drastic changes in lighting, differences in body posture, and the like. Most existing pedestrian re-identification methods focus primarily on pedestrian visible light images from visible light cameras and represent the task as a single-modality (visible-visible) matching problem. In recent years, they have made remarkable progress. However, visible light cameras cannot provide useful appearance information in poor lighting environments (such as at night), which limits the applicability of pedestrian re-identification in practical scenes. To address this problem, recent surveillance systems have begun to incorporate infrared cameras to facilitate night surveillance, which presents a new cross-modal matching task called visible-infrared pedestrian re-identification. Given a visible (or infrared) image of the target person, visible-infrared pedestrian re-identification aims to find a corresponding infrared (or visible) image of the same person captured by other spectral cameras. In addition to appearance differences, visible-infrared pedestrian re-identification has significant modal differences, which result from different imaging processes between different spectral cameras (visible and infrared images are heterogeneous in nature, with different wavelength ranges), compared to traditional single-modal pedestrian re-identification. The key solution of visible light-infrared pedestrian re-identification is to close large modal gaps and learn modal-independent discrimination features from visible light and infrared images.
Existing visible light-infrared pedestrian re-identification methods mainly focus on mitigating inherent modal differences at the pixel level or the feature level to extract cross-modal shared features. To mitigate modal differences at the pixel level, these approaches typically design complex generation countermeasure models to perform image-to-image conversions and generate generation samples that are difficult to optimize and noisy. On the other hand, to reduce the modal differences at the feature level, these methods use single-stream or dual-stream networks to extract modality-invariant features through several custom penalties. However, the single-stream network-based approach learns a general network model that lacks the ability to explicitly model individual modalities and ignores the modality-specific features, resulting in critical information loss. A dual-stream network-based approach first abstracts modality-specific information using a separate branching layer for each modality, and then projects modality-specific features into a uniform feature space using a shared network. They completely separate the modeling processes of modality-specific and modality-shared information, and may break important cross-modality shared semantics when extracting modality-specific features. Furthermore, all of the above methods attempt to directly handle such large modal differences and align the two modalities that are parameter sensitive and difficult to converge.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a pedestrian re-identification method based on modal adaptive mixing and invariant convolution decomposition, so that the problems of difficulty and calculation consumption of a traditional generation confrontation model in infrared-visible light image conversion and information loss and difficulty in fitting of a traditional single-flow double-flow network can be solved, and therefore pedestrian images in a visible light mode and an infrared mode can be matched more efficiently and accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition, which is characterized by comprising the following steps of:
step one, pedestrian data collection and preprocessing:
respectively collecting infrared and visible light monitoring videos of pedestrians by using an infrared camera and an optical camera, and performing pedestrian detection and size normalization pretreatment on the videos frame by frame to obtain an infrared pedestrian image set
Figure BDA0003512450130000021
And visible pedestrian image set
Figure BDA0003512450130000022
Wherein the content of the first and second substances,
Figure BDA0003512450130000023
representing the ith infrared pedestrian image,
Figure BDA0003512450130000024
representing the ith visible pedestrian image;
is the ith infrared pedestrian map
Figure BDA0003512450130000025
And ith visible pedestrian image
Figure BDA0003512450130000026
Respectively setting an ith pedestrian identity ID; is marked as yiAnd is and
Figure BDA0003512450130000027
Figure BDA0003512450130000028
is the number of identities in the training set, mpRepresents any pedestrian identity ID; thereby constructing a training data set matched with infrared and visible light
Figure BDA0003512450130000029
Wherein N represents an image in the training datasetThe number of the particles;
step two, self-adaptive image mixing:
2.1, acquiring infrared pedestrian images and visible light pedestrian images of p pedestrian identities ID from the training data set each time, and acquiring k infrared pedestrian images and k visible light pedestrian images by each pedestrian identity ID, thereby obtaining batch processing data formed by 2 XpXk images
Figure BDA00035124501300000210
Wherein the content of the first and second substances,
Figure BDA00035124501300000211
representing the jth visible pedestrian image in the batch data,
Figure BDA00035124501300000212
representing the j-th infrared pedestrian image, y, in the batch datajThe pedestrian ID representing the jth image in the batch data.
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages, wherein the 1 st Stage 0 is n by one layer of convolution kernel1×n1The convolution layer of (1), a batch normalization processing layer, a ReLU activation function layer, and the rest 4 stages are all composed of Bottleneck modules; stage 2 contains 3 Bottleneck modules, and the rest 3 stages respectively comprise 4, 6 and 3 Bottleneck modules, wherein each Bottleneck module is n by one layer of convolution kernel2×n2One layer of convolution kernel is n3×n3Convolution layer, one layer of convolution kernel is n2×n2The composition of the convolution layer;
performing modal adaptive decomposition on all convolution kernels in the first three stages of the ResNet-50 deep learning network to obtain three modal base layers alpha corresponding to each convolution kernelrgbirmixAnd a mode sharing coefficient layer psi, and then the mode sharing coefficient layer psi and the rest two stages form the feature extraction network together;
step 2.3, the batch processing data
Figure BDA0003512450130000031
Inputting the two modal base layers alpha corresponding to the two modal base layers alpha passing through a convolution kernel in the first three stages in the feature extraction networkrgbirAfter the convolution processing, the intermediate feature set is output in the third stage after all the convolution kernels are processed through the convolution processing of psi of a shared coefficient layer corresponding to the corresponding convolution kernel
Figure BDA0003512450130000032
Wherein the content of the first and second substances,
Figure BDA0003512450130000033
representing an intermediate feature of the jth visible image,
Figure BDA0003512450130000034
representing an intermediate feature of the jth infrared light image;
step 2.4, constructing an adaptive mixing module consisting of an operator network and a critic network, wherein the operator network and the critic network both comprise: a rolling layer, a pooling layer and two full-connecting layers;
the set of intermediate features
Figure BDA0003512450130000035
Inputting into the operator network for processing and outputting the mixing ratio
Figure BDA0003512450130000036
Wherein the content of the first and second substances,
Figure BDA0003512450130000037
representing six mixing ratios correspondingly generated by jth data of batch processing data;
the j th visible light image
Figure BDA0003512450130000038
And infrared image
Figure BDA0003512450130000039
Respectively averagely dividing the images into 6 blocks in the vertical direction, and proportionally mixing the visible light pedestrian images and the infrared pedestrian images after being divided by using the mixing proportion to obtain p multiplied by k mixed modal images
Figure BDA00035124501300000310
Figure BDA00035124501300000311
Representing the jth mixed modality image, yjRepresents its pedestrian identity ID;
step three, updating the feature extraction network for pedestrian re-identification loss:
step 3.1, three-mode data
Figure BDA00035124501300000312
Inputting the data into the feature extraction network, and processing the data in the first three stages to obtain intermediate features
Figure BDA00035124501300000313
Figure BDA00035124501300000314
Representing the intermediate feature of the jth mixed image, and finally outputting the pedestrian feature after the processing of the last two stages
Figure BDA00035124501300000315
Figure BDA00035124501300000316
Representing the pedestrian characteristic of the jth visible light image,
Figure BDA00035124501300000317
representing the pedestrian characteristics of the jth infrared image,
Figure BDA00035124501300000318
denotes the jth mixing diePedestrian features of the state image;
pedestrian features
Figure BDA00035124501300000319
After classification processing of a full connection layer, the output result is subjected to a softmax function to obtain the classification probability of the corresponding pedestrian identity
Figure BDA00035124501300000320
Figure BDA00035124501300000321
Classification of jth visible image output in batch data as pedestrian identity ID of mpThe probability of (a) of (b) being,
Figure BDA00035124501300000322
classification of jth infrared image output in batch data into pedestrian identity ID of mpThe probability of (a) of (b) being,
Figure BDA00035124501300000323
classification of mixed modality image output of jth in batch data into pedestrian identity ID of mpThe probability of (d);
step 3.2, identity loss function L using formula (1)id
Figure BDA0003512450130000041
In the formula (1), yjRepresenting the correct pedestrian identity ID of the jth visible light image in the batch data, which is also the correct ID of the jth infrared image and the correct pedestrian identity ID of the jth mixed mode image;
Figure BDA0003512450130000042
and
Figure BDA0003512450130000043
respectively representing the jth visible light image and the infrared image in the batch processing dataAnd the output of the mixed modality image is classified as the correct pedestrian identity ID of yjThe probability of (d);
step 3.3, constructing a center triplet loss function of the visible light mode and the infrared mode by using the formula (3), the formula (4) and the formula (5)
Figure BDA0003512450130000044
Center triplet loss function for visible light modality and mixed modality
Figure BDA0003512450130000045
Center triplet loss function for infrared and mixed modalities
Figure BDA0003512450130000046
Figure BDA0003512450130000047
Figure BDA0003512450130000048
Figure BDA0003512450130000049
In the formulae (3), (4) and (5),
Figure BDA00035124501300000410
respectively represent m < th > in the batch datapPedestrian feature center, mth of visible light image of individual pedestrianpPedestrian feature center and mth of infrared image of individual pedestrianpPedestrian feature centers of mixed-mode images of individual pedestrians, ρ being an edge distance parameter [. sup. ]]+Max (, 0) denotes a max function;
Figure BDA00035124501300000411
indicating nth in batch datapPedestrian feature center or mixture of infrared images of individual pedestriansThe center of the pedestrian feature of the resultant modal image,
Figure BDA00035124501300000412
indicating nth in batch datapA pedestrian feature center of a visible light image or a pedestrian feature center of an infrared image of the individual pedestrian or a pedestrian feature center of a mixed modality image;
construction of network total loss function L by using formula (4)dcn
Figure BDA0003512450130000051
3.4, training the feature extraction network by using an Adam optimization strategy based on the training data set until a network total loss function LdcnUntil convergence, obtaining an optimal feature extraction network;
step four, updating the self-adaptive mixing module by reinforcement learning loss:
step 4.1, constructing a reward R by using the formula (4) and the formula (5):
Figure BDA0003512450130000052
Figure BDA0003512450130000053
in the formulae (4) and (5),
Figure BDA00035124501300000517
representing reward, mAP (star) representing average accuracy index, rank-k (star) representing accuracy index of search result classification of k before ranking, and S is according to
Figure BDA0003512450130000054
A calculated similarity matrix; epsilon (S) represents the comprehensive index of the similarity matrix S; srgb,irTo represent
Figure BDA0003512450130000055
Similarity matrix calculated therebetween, Smix,irRepresent
Figure BDA0003512450130000056
A similarity matrix calculated therebetween, Sir,rgbTo represent
Figure BDA0003512450130000057
A similarity matrix calculated therebetween, Smix ,rgbTo represent
Figure BDA0003512450130000058
Similarity matrix calculated between them;
step 4.2, constructing loss functions of the updated operator network by using the formula (6) and the formula (7) respectively
Figure BDA0003512450130000059
And loss function of critic network
Figure BDA00035124501300000510
Figure BDA00035124501300000511
Figure BDA00035124501300000512
In the formulae (6) and (7),
Figure BDA00035124501300000513
on behalf of the output of the operator's network,
Figure BDA00035124501300000514
representing a critical network output, | X | | | non-conducting phosphor2Represents a squared error function;
step 4.3, based on the training data set, using Adam optimization strategy to carry out actor network and critic network of the self-adaptive hybrid module networkTraining is alternately updated until the loss function
Figure BDA00035124501300000515
And
Figure BDA00035124501300000516
until convergence, obtaining an optimal self-adaptive mixed module network;
step five, a retrieval process;
step 5.1, respectively extracting pedestrian features of query library by utilizing optimal feature extraction network
Figure BDA0003512450130000061
And pedestrian characteristics of the queried repository
Figure BDA0003512450130000062
Wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003512450130000063
representing the q-th query image, NqIndicating the number of query images to be presented,
Figure BDA0003512450130000064
representing the g-th image, N, in the queried librarygRepresenting the number of images of the queried library;
step 5.2, under the setting of searching infrared pedestrian images by visible light pedestrian images, making the images of the query library be visible light images and the images of the queried library be infrared images;
according to pedestrian characteristics
Figure BDA0003512450130000065
And
Figure BDA0003512450130000066
and calculating a similarity matrix, and sequencing the similarity matrix line by line to obtain a final retrieval result.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the method, a mixed mode obtained by self-adaptive mixing is used as an auxiliary mode, the mixed mode is combined with the original infrared and visible light modes to design a three-mode cross-mode pedestrian re-identification solution, and through a mode of convolution decomposition, each convolution decomposes two parts of a mode base and a shared coefficient, so that the mode characteristics and cross-mode invariant characteristics of infrared and mixed mode and visible light mode pedestrians are more fully extracted, and therefore the accuracy of infrared and visible light cross-mode pedestrian retrieval and identification is improved.
2. The invention uses the self-adaptive mixing module to mix the light mode and the infrared mode images to obtain the mixed mode as the auxiliary mode, thus avoiding the difficulty and the calculation consumption in the image conversion of the traditional generation method, ensuring that the obtained auxiliary mode is more reliable and efficient, and improving the accuracy of infrared and visible light cross-mode pedestrian retrieval and identification.
3. The invention uses the convolution decomposition network, not only can process the modal characteristics, but also can fuse the cross-modal invariant characteristics, and has small parameter quantity, thereby solving the problems of information loss and difficult fitting of a multi-path network of the traditional single-flow double-flow network, obtaining more reliable pedestrian characteristics and improving the accuracy of infrared and visible light cross-modal pedestrian retrieval and identification.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Detailed Description
In this embodiment, a flow of a pedestrian re-identification method based on modal adaptive mixing and invariant convolutional decomposition refers to fig. 1, and specifically, the method is performed according to the following steps:
step one, pedestrian data collection and pretreatment:
respectively collecting infrared and visible light monitoring videos of pedestrians by using an infrared camera and an optical camera, and performing pedestrian detection and size normalization pretreatment on the videos frame by frame to obtain an infrared pedestrian image set
Figure BDA0003512450130000067
And visible pedestrian image set
Figure BDA0003512450130000068
Wherein the content of the first and second substances,
Figure BDA0003512450130000069
representing the ith infrared pedestrian image,
Figure BDA00035124501300000610
representing the ith visible pedestrian image;
is the ith infrared pedestrian map
Figure BDA0003512450130000071
And ith visible pedestrian image
Figure BDA0003512450130000072
Respectively setting an ith pedestrian identity ID; is marked as yiAnd is and
Figure BDA0003512450130000073
Figure BDA0003512450130000074
is the number of identities in the training set, mpRepresents any pedestrian identity ID; thereby constructing a training data set matched with infrared and visible light
Figure BDA0003512450130000075
Wherein N represents the number of images in the training dataset; in this embodiment, N is 2060,
Figure BDA0003512450130000076
step two, self-adaptive image mixing:
2.1, acquiring infrared pedestrian images and visible light pedestrian images of p pedestrian identities ID from the training data set each time, and acquiring k infrared pedestrian images and k visible light pedestrian images by each pedestrian identity ID, thereby obtaining batch processing data formed by 2 XpXk images
Figure BDA0003512450130000077
Wherein the content of the first and second substances,
Figure BDA0003512450130000078
representing the jth visible pedestrian image in the batch data,
Figure BDA0003512450130000079
representing the j-th infrared pedestrian image, y, in the batch datajThe pedestrian ID representing the jth image in the batch data. In this embodiment, p is 8 and k is 4.
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages, wherein the 1 st Stage 0 is n by one layer of convolution kernel1×n1The convolution layer of (1), a batch normalization processing layer, a ReLU activation function layer, and the rest 4 stages are all composed of Bottleneck modules; stage 2 contains 3 Bottleneck modules, and the rest 3 stages respectively comprise 4, 6 and 3 Bottleneck modules, wherein each Bottleneck module is n by one layer of convolution kernel2×n2One layer of convolution kernel is n3×n3Convolution layer, one layer of convolution kernel is n2×n2The composition of the convolution layer; in this example, n1=7,n2=1,n3=3。
Performing modal adaptive decomposition on all convolution kernels in the first three stages of the ResNet-50 deep learning network to obtain three modal base layers alpha corresponding to each convolution kernelrgbirmixAnd a modality-specific coefficient layer psi, which together with the other two stages forms a feature extraction network, the modality-adaptive convolution decomposition approximates the convolution kernel as the product of a small set of coefficient layers for simultaneously countering modality differences and performing cross-modality shared semantics at a feature level, the modality-specific base layer being independently learned from the corresponding modality image to model modality changes. They spatially convolve each individual input feature channel for modal difference correction. The common coefficient layer learns all three modalities and performs 1The x 1 convolution weights and sums the corrected output feature channels, facilitating cross-modal sharing semantics. The decomposed convolution network takes visible light, infrared and mixed mode images in a mode self-adaptive mixed module as input, and effectively processes larger mode difference of a characteristic level so as to learn mode invariant characteristics;
step 2.3, batch processing data
Figure BDA00035124501300000710
In the input feature extraction network, firstly, two modal base layers alpha corresponding to a convolution kernel are firstly passed through in the first three stagesrgbirAfter the convolution processing, the intermediate feature set is output in the third stage after all the convolution kernels are processed through the convolution processing of psi of a shared coefficient layer corresponding to the corresponding convolution kernel
Figure BDA0003512450130000081
Wherein the content of the first and second substances,
Figure BDA0003512450130000082
representing an intermediate feature of the jth visible image,
Figure BDA0003512450130000083
representing an intermediate feature of the jth infrared light image;
step 2.4, constructing an adaptive mixing module consisting of an actor network and a critic network, wherein the actor network and the critic network both comprise: one layer of rolled layer, one layer of pooling layer and two layers of full-connected layers;
dynamic and local linear interpolation across different regions of a modal image is learned in a data-driven manner, and the interpolation can be expressed as a single-step Markov decision process and is realized by an operator-critical agent under a deep Reinforcement Learning (RL) framework.
Intermediate feature set
Figure BDA0003512450130000084
Inputting into operator network for processing and outputting mixing ratio
Figure BDA0003512450130000085
Wherein the content of the first and second substances,
Figure BDA0003512450130000086
representing six mixing ratios correspondingly generated by jth data of batch processing data;
the j th visible light image
Figure BDA0003512450130000087
And infrared image
Figure BDA0003512450130000088
Respectively and averagely divided into 6 blocks in the vertical direction, and the visible light pedestrian images and the infrared pedestrian images after being divided are proportionally mixed by utilizing a mixing proportion, so that p multiplied by k mixed modal images are obtained
Figure BDA0003512450130000089
Figure BDA00035124501300000810
Representing the jth mixed-modality image, yjRepresents its pedestrian identity ID;
the blend ratio is dynamically adjusted according to modal and appearance differences between corresponding local regions of the visible and infrared images, which are output by the operator network. The actor network in the agent is used to estimate the blend ratio and the critic network in the agent predicts the state action value (Q value).
Step three, updating the feature extraction network for pedestrian re-identification loss:
step 3.1, three-mode data
Figure BDA00035124501300000811
Inputting the data into a feature extraction network, and processing the data in the first three stages to obtain intermediate features
Figure BDA00035124501300000812
Figure BDA00035124501300000813
Representing the intermediate characteristic of the jth mixed image, processing the intermediate characteristic in the last two stages, and finally outputting the pedestrian characteristic
Figure BDA00035124501300000814
Figure BDA00035124501300000815
Representing the pedestrian characteristic of the jth visible light image,
Figure BDA00035124501300000816
representing the pedestrian characteristics of the jth infrared image,
Figure BDA00035124501300000817
pedestrian features representing a jth mixed-modality image;
pedestrian features
Figure BDA00035124501300000818
After classification processing of a full connection layer, the output result is subjected to a softmax function to obtain the classification probability of the corresponding pedestrian identity
Figure BDA00035124501300000819
Figure BDA00035124501300000820
Classification of jth visible image output in batch data as pedestrian identity ID of mpThe probability of (a) of (b) being,
Figure BDA00035124501300000821
classification of jth infrared image output in batch data into pedestrian identity ID of mpThe probability of (a) of (b) being,
Figure BDA00035124501300000822
classification of mixed modality image output of jth in batch data into pedestrian identity ID of mpThe probability of (d);
step 3.2,Using the identity loss function L of equation (1)id
Figure BDA0003512450130000091
In the formula (1), yjRepresenting the correct pedestrian identity ID of the jth visible light image in the batch data, which is also the correct ID of the jth infrared image and the correct pedestrian identity ID of the jth mixed mode image;
Figure BDA0003512450130000092
and
Figure BDA0003512450130000093
respectively representing the jth visible image, the infrared image and the mixed-mode image in the batch data are classified as having a correct pedestrian identity ID of yjThe probability of (d);
step 3.3, constructing a center triplet loss function of the visible light mode and the infrared mode by using the formula (3), the formula (4) and the formula (5)
Figure BDA0003512450130000094
Center triplet loss function for visible light modality and mixed modality
Figure BDA0003512450130000095
Center triplet loss function for infrared and mixed modalities
Figure BDA0003512450130000096
Figure BDA0003512450130000097
Figure BDA0003512450130000098
Figure BDA0003512450130000099
In the formulae (3), (4) and (5),
Figure BDA00035124501300000910
respectively represent m < th > in the batch datapPedestrian feature centers of visible light images of individual pedestrians, pedestrian feature centers of infrared images of m _ p pedestrians, and pedestrian feature centers of mixed-mode images of m _ p pedestrians, ρ being an edge distance parameter [. cndot. ]]+Max (, 0) denotes a max function;
Figure BDA00035124501300000911
indicating nth in batch datapA pedestrian feature center of an infrared image of an individual pedestrian or a pedestrian feature center of a mixed-modality image,
Figure BDA00035124501300000912
indicating nth in batch datapA pedestrian feature center of a visible light image of an individual pedestrian or a pedestrian feature center of an infrared image or a pedestrian feature center of a mixed modality image.
Construction of network total loss function L by using formula (4)dcn
Figure BDA0003512450130000101
The cross-modal loss function can better utilize the advantages of a mixed mode as an auxiliary mode and bridge the characteristic difference between the modes.
3.4, training the feature extraction network based on the training data set by using an Adam optimization strategy until the total loss function L of the networkdcnUntil convergence, obtaining an optimal feature extraction network;
step four, updating the self-adaptive mixing module by reinforcement learning loss:
step 4.1, constructing a reward R by using the formula (4) and the formula (5):
Figure BDA0003512450130000102
Figure BDA0003512450130000103
in the formulae (4) and (5),
Figure BDA0003512450130000104
indicates the reward, the mAP indicates the average precision index, and rank-K indicates the accuracy index of the search result classification of K before ranking, in this embodiment, K is 5, and S is based on
Figure BDA0003512450130000105
A calculated similarity matrix; epsilon (S) represents the comprehensive index of the similarity matrix S; srgb,irTo represent
Figure BDA0003512450130000106
A similarity matrix calculated therebetween, Smix,irTo represent
Figure BDA0003512450130000107
A similarity matrix calculated therebetween, Sir,rgbTo represent
Figure BDA0003512450130000108
A similarity matrix calculated therebetween, Smix,rgbTo represent
Figure BDA0003512450130000109
A similarity matrix calculated between;
step 4.2, constructing loss functions of the updated operator network by using the formula (6) and the formula (7) respectively
Figure BDA00035124501300001010
And loss function of critic network
Figure BDA00035124501300001011
Figure BDA00035124501300001012
Figure BDA00035124501300001013
In the formulae (6) and (7),
Figure BDA00035124501300001014
representing the output of the operator network,
Figure BDA00035124501300001015
representing a critical network output, | X | | | non-conducting phosphor2Represents a squared error function;
4.3, alternately updating and training the actor network and the critic network of the self-adaptive mixed module network based on the training data set by using an Adam optimization strategy until the loss function
Figure BDA00035124501300001016
And
Figure BDA00035124501300001017
until convergence, obtaining an optimal self-adaptive mixed module network;
step five, search process
Step 5.1, respectively extracting pedestrian features of query library by utilizing optimal feature extraction network
Figure BDA0003512450130000111
And pedestrian characteristics of the queried repository
Figure BDA0003512450130000112
Wherein the content of the first and second substances,
Figure BDA0003512450130000113
representing the q-th query image, NqIndicates the number of the query images to be displayed,
Figure BDA0003512450130000114
representing the g-th image, N, in the queried repositorygRepresenting the number of images of the queried library; in this example, Nq=Ng=2060。
Step 5.2, under the setting of searching infrared pedestrian images by visible light pedestrian images, making the images of the query library be visible light images and the images of the queried library be infrared images;
according to pedestrian characteristics
Figure BDA0003512450130000115
And
Figure BDA0003512450130000116
and calculating a similarity matrix, and sequencing the similarity matrix line by line to obtain a final retrieval result.

Claims (1)

1. A pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition is characterized by comprising the following steps:
step one, pedestrian data collection and preprocessing:
respectively collecting infrared and visible light monitoring videos of pedestrians by using an infrared camera and an optical camera, and performing pedestrian detection and size normalization pretreatment on the videos frame by frame to obtain an infrared pedestrian image set
Figure FDA0003512450120000011
And visible pedestrian image set
Figure FDA0003512450120000012
Wherein the content of the first and second substances,
Figure FDA0003512450120000013
representing the ith infrared pedestrian image,
Figure FDA0003512450120000014
representing the ith visible pedestrian image;
is the ith infrared pedestrian map
Figure FDA0003512450120000015
And ith visible pedestrian image
Figure FDA0003512450120000016
Respectively setting an ith pedestrian identity ID; is marked as yiAnd is and
Figure FDA0003512450120000017
Figure FDA0003512450120000018
is the number of identities in the training set, mpRepresents any pedestrian identity ID; thereby constructing a training data set matched with infrared and visible light
Figure FDA0003512450120000019
Wherein N represents the number of images in the training dataset;
step two, self-adaptive image mixing:
2.1, acquiring infrared pedestrian images and visible light pedestrian images of p pedestrian identities ID from the training data set each time, and acquiring k infrared pedestrian images and k visible light pedestrian images by each pedestrian identity ID, thereby obtaining batch processing data consisting of 2 Xpxk images
Figure FDA00035124501200000110
Wherein the content of the first and second substances,
Figure FDA00035124501200000111
representing the jth visible pedestrian image in the batch data,
Figure FDA00035124501200000112
representing the j-th infrared pedestrian image, y, in the batch datajThe pedestrian ID representing the jth image in the batch data.
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages, wherein the 1 st Stage 0 is n by one layer of convolution kernel1×n1The convolution layer of (1), a batch normalization processing layer, a ReLU activation function layer, and the rest 4 stages are all composed of Bottleneck modules; stage 2 contains 3 Bottleneck modules, and the rest 3 stages respectively comprise 4, 6 and 3 Bottleneck modules, wherein each Bottleneck module is n by one layer of convolution kernel2×n2One convolution kernel of n3×n3Convolution layer, one layer of convolution kernel is n2×n2The composition of the convolution layer;
performing modal adaptive decomposition on all convolution kernels in the first three stages of the ResNet-50 deep learning network to obtain three modal base layers alpha corresponding to each convolution kernelrgbirmixAnd a mode sharing coefficient layer psi, and then the mode sharing coefficient layer psi and the rest two stages form the feature extraction network together;
step 2.3, the batch processing data
Figure FDA00035124501200000113
Inputting the two modal base layers alpha corresponding to the two modal base layers alpha passing through a convolution kernel in the first three stages in the feature extraction networkrgbirAfter the convolution processing, the intermediate feature set is output in the third stage after all the convolution kernels are processed through the convolution processing of psi of a shared coefficient layer corresponding to the corresponding convolution kernel
Figure FDA0003512450120000021
Wherein the content of the first and second substances,
Figure FDA0003512450120000022
representing an intermediate feature of the jth visible image,
Figure FDA0003512450120000023
representing an intermediate feature of the jth infrared light image;
step 2.4, constructing an adaptive mixing module consisting of an actor network and a critic network, wherein the actor network and the critic network both comprise: a rolling layer, a pooling layer and two full-connecting layers;
the intermediate feature set
Figure FDA0003512450120000024
Inputting into the operator network for processing and outputting the mixing ratio
Figure FDA0003512450120000025
Wherein the content of the first and second substances,
Figure FDA0003512450120000026
representing six mixing ratios correspondingly generated by jth data of batch processing data;
the j th visible light image
Figure FDA0003512450120000027
And infrared image
Figure FDA0003512450120000028
Respectively averagely dividing the images into 6 blocks in the vertical direction, and proportionally mixing the visible light pedestrian images and the infrared pedestrian images after the blocks by using the mixing proportion to obtain p x k mixed modal images
Figure FDA0003512450120000029
Figure FDA00035124501200000210
Representing the jth mixed modality image, yjRepresents its pedestrian identity ID;
step three, updating the feature extraction network for pedestrian re-identification loss:
step 3.1, three-mode data
Figure FDA00035124501200000211
Inputting the data into the feature extraction network, and processing the data in the first three stages to obtain intermediate features
Figure FDA00035124501200000212
Figure FDA00035124501200000213
Representing the intermediate feature of the jth mixed image, and finally outputting the pedestrian feature after the processing of the last two stages
Figure FDA00035124501200000214
Figure FDA00035124501200000215
Representing the pedestrian characteristic of the jth visible light image,
Figure FDA00035124501200000216
representing the pedestrian characteristics of the jth infrared image,
Figure FDA00035124501200000217
pedestrian features representing a jth mixed-modality image;
pedestrian features
Figure FDA00035124501200000218
After classification processing of a full connection layer, the output result is subjected to a softmax function to obtain the classification probability of the corresponding pedestrian identity
Figure FDA00035124501200000219
Figure FDA00035124501200000220
Classification of jth visible image output in batch data as pedestrian identity ID of mpThe probability of (a) of (b) being,
Figure FDA00035124501200000221
classification of jth infrared image output in batch data into pedestrian identity ID of mpThe probability of (a) of (b) being,
Figure FDA00035124501200000222
classification of mixed modality image output of jth in batch data into pedestrian identity ID of mpThe probability of (d);
step 3.2, identity loss function L using formula (1)id
Figure FDA00035124501200000223
In the formula (1), yjRepresenting the correct pedestrian identity ID of the jth visible light image in the batch data, which is also the correct ID of the jth infrared image and the correct pedestrian identity ID of the jth mixed mode image;
Figure FDA00035124501200000224
and
Figure FDA00035124501200000225
respectively representing the jth visible image, the infrared image and the mixed-mode image in the batch data are classified as having a correct pedestrian identity ID of yjThe probability of (d);
step 3.3, constructing a center triplet loss function of the visible light mode and the infrared mode by using the formula (3), the formula (4) and the formula (5)
Figure FDA0003512450120000031
Center triplet loss function for visible light modality and mixed modality
Figure FDA0003512450120000032
Center triplet loss content for infrared and mixed modalitiesNumber of
Figure FDA0003512450120000033
Figure FDA0003512450120000034
Figure FDA0003512450120000035
Figure FDA0003512450120000036
In the formulae (3), (4) and (5),
Figure FDA0003512450120000037
respectively represent m < th > in the batch datapPedestrian feature center, mth of visible light image of individual pedestrianpPedestrian feature center and m-th of infrared image of individual pedestrianpPedestrian feature center of mixed-mode image of individual pedestrian, ρ is edge distance parameter [. dot. ]]+Max (, 0) denotes a max function;
Figure FDA0003512450120000038
indicating mth in batch datapA pedestrian feature center of an infrared image of an individual pedestrian or a pedestrian feature center of a mixed-modality image,
Figure FDA0003512450120000039
indicating nth in batch datapA pedestrian feature center of a visible light image or a pedestrian feature center of an infrared image of the individual pedestrian or a pedestrian feature center of a mixed modality image;
construction of network total loss function L by using formula (4)dcn
Figure FDA00035124501200000310
3.4, training the feature extraction network by using an Adam optimization strategy based on the training data set until a network total loss function LdcnUntil convergence, obtaining an optimal feature extraction network;
step four, updating the self-adaptive mixing module by reinforcement learning loss:
step 4.1, constructing a reward R by using the formula (4) and the formula (5):
Figure FDA0003512450120000041
Figure FDA0003512450120000042
in the formula (4) and the formula (5), R represents reward, mAP (. + -.) represents average accuracy index, rank-k (. + -.) represents accuracy index of search result classification of k before ranking, and S is based on
Figure FDA0003512450120000043
A calculated similarity matrix; epsilon (S) represents the comprehensive index of the similarity matrix S; srgb,irTo represent
Figure FDA0003512450120000044
Similarity matrix calculated therebetween, Smix,irTo represent
Figure FDA0003512450120000045
A similarity matrix calculated therebetween, Sir,rgbTo represent
Figure FDA0003512450120000046
A similarity matrix calculated therebetween, Smix ,rgbRepresent
Figure FDA0003512450120000047
A similarity matrix calculated between;
step 4.2, constructing loss functions of the updated operator network by using the formula (6) and the formula (7) respectively
Figure FDA0003512450120000048
And loss function of critic network
Figure FDA0003512450120000049
Figure FDA00035124501200000410
Figure FDA00035124501200000411
In the formulae (6) and (7),
Figure FDA00035124501200000412
on behalf of the output of the operator's network,
Figure FDA00035124501200000413
representing a critical network output, | X | | | non-conducting phosphor2Represents a squared error function;
and 4.3, alternately updating and training the actor network and the critic network of the self-adaptive hybrid module network by using an Adam optimization strategy based on the training data set until a loss function
Figure FDA00035124501200000414
And
Figure FDA00035124501200000415
until convergence, obtaining an optimal self-adaptive mixed module network;
step five, a retrieval process;
step 5.1, respectively extracting pedestrian features of query library by utilizing optimal feature extraction network
Figure FDA00035124501200000416
And pedestrian characteristics of the queried repository
Figure FDA00035124501200000417
Wherein the content of the first and second substances,
Figure FDA00035124501200000418
representing the q-th query image, NqIndicating the number of query images to be presented,
Figure FDA00035124501200000419
representing the g-th image, N, in the queried repositorygRepresenting the number of images of the queried library;
step 5.2, under the setting of searching infrared pedestrian images by visible light pedestrian images, making the images of the query library be visible light images and the images of the queried library be infrared images;
according to pedestrian characteristics
Figure FDA0003512450120000051
And
Figure FDA0003512450120000052
and calculating a similarity matrix, and sequencing the similarity matrix line by line to obtain a final retrieval result.
CN202210155715.9A 2022-02-21 2022-02-21 Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition Active CN114550210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210155715.9A CN114550210B (en) 2022-02-21 2022-02-21 Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210155715.9A CN114550210B (en) 2022-02-21 2022-02-21 Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition

Publications (2)

Publication Number Publication Date
CN114550210A true CN114550210A (en) 2022-05-27
CN114550210B CN114550210B (en) 2024-04-02

Family

ID=81675054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210155715.9A Active CN114550210B (en) 2022-02-21 2022-02-21 Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition

Country Status (1)

Country Link
CN (1) CN114550210B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542084A (en) * 2023-12-06 2024-02-09 湖南大学 Cross-modal pedestrian re-recognition method based on semantic perception

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434654A (en) * 2020-12-07 2021-03-02 安徽大学 Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN113989851A (en) * 2021-11-10 2022-01-28 合肥工业大学 Cross-modal pedestrian re-identification method based on heterogeneous fusion graph convolution network
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device
CN112434654A (en) * 2020-12-07 2021-03-02 安徽大学 Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN113989851A (en) * 2021-11-10 2022-01-28 合肥工业大学 Cross-modal pedestrian re-identification method based on heterogeneous fusion graph convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯敏;张智成;吕进;余磊;韩斌;: "基于生成对抗网络的跨模态行人重识别研究", 现代信息科技, no. 04, 29 February 2020 (2020-02-29) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542084A (en) * 2023-12-06 2024-02-09 湖南大学 Cross-modal pedestrian re-recognition method based on semantic perception

Also Published As

Publication number Publication date
CN114550210B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN111368896B (en) Hyperspectral remote sensing image classification method based on dense residual three-dimensional convolutional neural network
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN107038448B (en) Target detection model construction method
CN114220124A (en) Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN112347861B (en) Human body posture estimation method based on motion feature constraint
CN112395442B (en) Automatic identification and content filtering method for popular pictures on mobile internet
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
CN111783831A (en) Complex image accurate classification method based on multi-source multi-label shared subspace learning
CN110516533B (en) Pedestrian re-identification method based on depth measurement
CN115410088B (en) Hyperspectral image field self-adaption method based on virtual classifier
CN113283362A (en) Cross-modal pedestrian re-identification method
CN115248876B (en) Remote sensing image overall recommendation method based on content understanding
CN112749675A (en) Potato disease identification method based on convolutional neural network
CN111523586B (en) Noise-aware-based full-network supervision target detection method
CN116740763A (en) Cross-mode pedestrian re-identification method based on dual-attention perception fusion network
CN114550210A (en) Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition
CN113361370B (en) Abnormal behavior detection method based on deep learning
CN110852292A (en) Sketch face recognition method based on cross-modal multi-task depth measurement learning
CN107423771B (en) Two-time-phase remote sensing image change detection method
CN117333948A (en) End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism
CN117218446A (en) Solid waste sorting method and system based on RGB-MSI feature fusion
CN116681742A (en) Visible light and infrared thermal imaging image registration method based on graph neural network
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
CN116597177A (en) Multi-source image block matching method based on dual-branch parallel depth interaction cooperation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant