CN114550210A - Pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition - Google Patents
- Publication number: CN114550210A (application CN202210155715.9A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING
- G06F18/2415—Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition, which comprises the following steps: 1. constructing an adaptive mixing module that outputs mixing ratios for blending the visible light image and the infrared image into a mixed-modality image; 2. inputting the visible light image, the infrared image and the mixed-modality image into a feature extraction network, and computing the classification loss and the triplet loss to update the feature extraction network; 3. constructing a reward R, and constructing the losses of the actor network and the critic network with a reinforcement learning rule so as to update these networks; 4. computing a similarity matrix from the pedestrian features of the query library and the queried library, and obtaining the retrieval result. The method can overcome the optimization difficulty and computational cost of traditional generative adversarial models in infrared-visible image conversion, as well as the information loss and fitting difficulty of traditional single-stream and dual-stream networks, so that pedestrian images in the visible light and infrared modalities can be matched more efficiently and accurately.
Description
Technical Field
The invention belongs to the field of pedestrian re-identification, and particularly relates to a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition.
Background
Pedestrian Re-identification (Re-ID) has recently attracted increasing attention due to its wide application in automated tracking and activity analysis. It aims to capture and identify a target pedestrian across multiple different camera views. Pedestrian re-identification is very challenging due to background clutter, occlusion, drastic changes in lighting, differences in body posture, and the like. Most existing pedestrian re-identification methods focus primarily on visible light images of pedestrians from visible light cameras and formulate the task as a single-modality (visible-visible) matching problem, and in recent years they have made remarkable progress. However, visible light cameras cannot provide useful appearance information in poorly lit environments (such as at night), which limits the applicability of pedestrian re-identification in practical scenes. To address this problem, recent surveillance systems have begun to incorporate infrared cameras to facilitate night surveillance, which gives rise to a new cross-modal matching task called visible-infrared pedestrian re-identification. Given a visible (or infrared) image of the target person, visible-infrared pedestrian re-identification aims to find a corresponding infrared (or visible) image of the same person captured by cameras of the other spectrum. Compared with traditional single-modality pedestrian re-identification, visible-infrared pedestrian re-identification exhibits, in addition to appearance differences, significant modality differences, which result from the different imaging processes of the different spectral cameras (visible and infrared images are heterogeneous in nature, with different wavelength ranges). The key to visible-infrared pedestrian re-identification is therefore to close the large modality gap and learn modality-independent discriminative features from visible light and infrared images.
Existing visible-infrared pedestrian re-identification methods mainly focus on mitigating the inherent modality differences at the pixel level or the feature level in order to extract cross-modal shared features. To mitigate modality differences at the pixel level, these approaches typically design complex generative adversarial models that perform image-to-image conversion but produce generated samples that are noisy and difficult to optimize. On the other hand, to reduce modality differences at the feature level, these methods use single-stream or dual-stream networks to extract modality-invariant features through several custom losses. However, single-stream approaches learn a generic network model that lacks the ability to explicitly model the individual modalities and ignores modality-specific features, resulting in the loss of critical information. Dual-stream approaches first abstract modality-specific information using a separate branch for each modality, and then project the modality-specific features into a unified feature space using a shared network. They completely separate the modeling of modality-specific and modality-shared information, and may break important cross-modal shared semantics when extracting modality-specific features. Furthermore, all of the above methods attempt to directly handle the large modality differences and align the two modalities, which is parameter-sensitive and difficult to converge.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a pedestrian re-identification method based on modal adaptive mixing and invariant convolution decomposition, so as to overcome the optimization difficulty and computational cost of traditional generative adversarial models in infrared-visible image conversion, as well as the information loss and fitting difficulty of traditional single-stream and dual-stream networks, and thereby match pedestrian images in the visible light and infrared modalities more efficiently and accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition, which is characterized by comprising the following steps of:
step one, pedestrian data collection and preprocessing:
respectively collecting infrared and visible light surveillance videos of pedestrians with an infrared camera and an optical camera, and performing pedestrian detection and size normalization preprocessing on the videos frame by frame to obtain an infrared pedestrian image set X^{ir} = {x_i^{ir} | i = 1, ..., N} and a visible pedestrian image set X^{rgb} = {x_i^{rgb} | i = 1, ..., N}, where x_i^{ir} represents the i-th infrared pedestrian image and x_i^{rgb} represents the i-th visible pedestrian image;
assigning the i-th infrared pedestrian image x_i^{ir} and the i-th visible pedestrian image x_i^{rgb} the same i-th pedestrian identity ID, denoted y_i ∈ {1, ..., M}, where M is the number of identities in the training set and m_p represents any pedestrian identity ID; thereby constructing an infrared-visible matched training data set D = {(x_i^{ir}, x_i^{rgb}, y_i) | i = 1, ..., N}, where N represents the number of image pairs in the training data set;
step two, self-adaptive image mixing:
Step 2.1, acquiring the infrared and visible pedestrian images of p pedestrian identity IDs from the training data set each time, with k infrared pedestrian images and k visible pedestrian images per pedestrian identity ID, thereby obtaining a batch B = {(x_j^{rgb}, x_j^{ir}, y_j) | j = 1, ..., p×k} formed by 2×p×k images, where x_j^{rgb} represents the j-th visible pedestrian image in the batch, x_j^{ir} represents the j-th infrared pedestrian image in the batch, and y_j represents the pedestrian identity ID of the j-th image pair in the batch;
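The batch construction of step 2.1 is the familiar P×K identity sampler, extended to draw from both modalities. A minimal sketch, where function and record names are illustrative and not taken from the patent:

```python
import random
from collections import defaultdict

def pk_sample(dataset, p=8, k=4, rng=None):
    """Sample a batch of p identities with k visible + k infrared images each.

    `dataset` is a list of (image, modality, person_id) records, modality being
    "rgb" or "ir" (names are illustrative, not from the patent).
    """
    rng = rng or random.Random(0)
    by_pid = defaultdict(lambda: {"rgb": [], "ir": []})
    for img, mod, pid in dataset:
        by_pid[pid][mod].append(img)
    pids = rng.sample(sorted(by_pid), p)          # p distinct identity IDs
    batch = []
    for pid in pids:
        batch += [(img, "rgb", pid) for img in rng.sample(by_pid[pid]["rgb"], k)]
        batch += [(img, "ir", pid) for img in rng.sample(by_pid[pid]["ir"], k)]
    return batch                                   # 2 * p * k records
```

With p = 8 and k = 4 as in the embodiment, each batch contains 64 images covering both modalities of 8 identities.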
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages: the 1st stage, Stage 0, consists of one n1×n1 convolution layer, one batch normalization layer and one ReLU activation layer, and the remaining 4 stages are all composed of Bottleneck modules; the 2nd stage contains 3 Bottleneck modules, and the remaining 3 stages contain 4, 6 and 3 Bottleneck modules respectively, where each Bottleneck module consists of one n2×n2 convolution layer, one n3×n3 convolution layer and one n2×n2 convolution layer;
performing modal adaptive decomposition on all convolution kernels in the first three stages of the ResNet-50 deep learning network to obtain three modal base layers alpha corresponding to each convolution kernelrgb,αir,αmixAnd a mode sharing coefficient layer psi, and then the mode sharing coefficient layer psi and the rest two stages form the feature extraction network together;
step 2.3, inputting the batch B into the feature extraction network; in the first three stages each input first passes through the convolution of the two modality base layers α_rgb, α_ir corresponding to a convolution kernel, then through the convolution of the shared coefficient layer ψ corresponding to that convolution kernel; after all convolution kernels are processed, the third stage outputs the intermediate feature set F = {(f_j^{rgb}, f_j^{ir}) | j = 1, ..., p×k}, where f_j^{rgb} represents the intermediate feature of the j-th visible image and f_j^{ir} represents the intermediate feature of the j-th infrared image;
step 2.4, constructing an adaptive mixing module consisting of an actor network and a critic network, wherein the actor network and the critic network each comprise one convolution layer, one pooling layer and two fully connected layers;
inputting the intermediate feature set F into the actor network for processing, which outputs the mixing ratios Λ = {λ_j | j = 1, ..., p×k}, where λ_j represents the six mixing ratios generated for the j-th image pair of the batch;
dividing the j-th visible light image x_j^{rgb} and infrared image x_j^{ir} each into 6 equal blocks in the vertical direction, and blending the divided visible and infrared pedestrian images block by block according to the mixing ratios, thereby obtaining p×k mixed-modality images x_j^{mix}, where x_j^{mix} represents the j-th mixed-modality image and y_j represents its pedestrian identity ID;
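The block-wise blending of step 2.4 can be sketched as follows, assuming each image is split into six horizontal stripes along the vertical direction and each stripe is linearly interpolated with its own ratio λ (the patent's exact blending operator may differ):

```python
import numpy as np

def mix_images(rgb, ir, ratios, n_stripes=6):
    """Blend a visible and an infrared image stripe by stripe.

    rgb, ir: (H, W[, C]) arrays of the same shape; ratios: n_stripes blend
    weights in [0, 1] (the actor network's output in the patent).
    """
    assert rgb.shape == ir.shape and len(ratios) == n_stripes
    H = rgb.shape[0]
    bounds = np.linspace(0, H, n_stripes + 1).astype(int)  # stripe boundaries
    mixed = np.empty_like(rgb, dtype=float)
    for s, lam in enumerate(ratios):
        a, b = bounds[s], bounds[s + 1]
        mixed[a:b] = lam * rgb[a:b] + (1 - lam) * ir[a:b]  # per-stripe blend
    return mixed
```

A ratio of 1 keeps the visible stripe, a ratio of 0 keeps the infrared stripe, and intermediate values produce the mixed modality.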
step three, updating the feature extraction network with the pedestrian re-identification loss:
step 3.1, inputting the three-modality data {(x_j^{rgb}, x_j^{ir}, x_j^{mix})} into the feature extraction network; the first three stages produce the intermediate features {(f_j^{rgb}, f_j^{ir}, f_j^{mix})}, where f_j^{mix} represents the intermediate feature of the j-th mixed image; after the last two stages, the network finally outputs the pedestrian features {(v_j^{rgb}, v_j^{ir}, v_j^{mix})}, where v_j^{rgb} represents the pedestrian feature of the j-th visible light image, v_j^{ir} represents the pedestrian feature of the j-th infrared image, and v_j^{mix} represents the pedestrian feature of the j-th mixed-modality image;
after the pedestrian features are classified by a fully connected layer, the output is passed through a softmax function to obtain the classification probabilities of the corresponding pedestrian identities, where p_j^{rgb}(m_p), p_j^{ir}(m_p) and p_j^{mix}(m_p) represent the probabilities that the j-th visible image, the j-th infrared image and the j-th mixed-modality image in the batch are classified as pedestrian identity ID m_p;
step 3.2, constructing the identity loss function L_id of formula (1), the cross-entropy loss over the three modalities:

L_id = -(1/(3pk)) Σ_{j=1}^{pk} [log p_j^{rgb}(y_j) + log p_j^{ir}(y_j) + log p_j^{mix}(y_j)]  (1)

In formula (1), y_j represents the correct pedestrian identity ID of the j-th visible light image in the batch, which is also the correct pedestrian identity ID of the j-th infrared image and of the j-th mixed-modality image; p_j^{rgb}(y_j), p_j^{ir}(y_j) and p_j^{mix}(y_j) respectively represent the probabilities that the outputs for the j-th visible, infrared and mixed-modality images in the batch are classified as the correct pedestrian identity ID y_j;
step 3.3, constructing by formula (3), formula (4) and formula (5) the center triplet loss function L_tri^{rgb,ir} of the visible light and infrared modalities, the center triplet loss function L_tri^{rgb,mix} of the visible light and mixed modalities, and the center triplet loss function L_tri^{ir,mix} of the infrared and mixed modalities;

In formulas (3), (4) and (5), c_{m_p}^{rgb}, c_{m_p}^{ir} and c_{m_p}^{mix} respectively represent the pedestrian feature center of the visible light images, the pedestrian feature center of the infrared images, and the pedestrian feature center of the mixed-modality images of the m_p-th pedestrian in the batch; ρ is a margin parameter; [·]_+ = max(·, 0) denotes the max function; c_{n_p}^{ir} and c_{n_p}^{mix} represent the pedestrian feature center of the infrared images or of the mixed-modality images of the n_p-th pedestrian in the batch, and c_{n_p}^{rgb}, c_{n_p}^{ir} and c_{n_p}^{mix} represent the pedestrian feature center of the visible light images, the infrared images or the mixed-modality images of the n_p-th pedestrian in the batch;
construction of network total loss function L by using formula (4)dcn:
Step 3.4, training the feature extraction network on the training data set with the Adam optimization strategy until the network total loss function L_dcn converges, thereby obtaining the optimal feature extraction network;
step four, updating the adaptive mixing module with the reinforcement learning loss:
step 4.1, constructing the reward R from the composite retrieval index ε(S):

In the corresponding formulas, R represents the reward; mAP(·) represents the mean average precision index; rank-k(·) represents the accuracy of the top-k ranked retrieval results; S is the similarity matrix computed from the pedestrian features; ε(S) represents the composite index of the similarity matrix S; S^{rgb,ir} represents the similarity matrix computed between the visible and infrared pedestrian features, S^{mix,ir} the similarity matrix computed between the mixed-modality and infrared pedestrian features, S^{ir,rgb} the similarity matrix computed between the infrared and visible pedestrian features, and S^{mix,rgb} the similarity matrix computed between the mixed-modality and visible pedestrian features;
step 4.2, constructing the loss function L_actor of the actor network and the loss function L_critic of the critic network by formula (6) and formula (7) respectively:

In formulas (6) and (7), λ_j represents the output of the actor network, Q represents the output of the critic network, and ||·||^2 represents the squared error function;
step 4.3, alternately updating and training the actor network and the critic network of the adaptive mixing module on the training data set with the Adam optimization strategy until the loss functions L_actor and L_critic converge, thereby obtaining the optimal adaptive mixing module network;
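Under the single-step formulation of step four, the critic regresses the reward with a squared error while the actor is driven by the critic's value estimate. The concrete forms below are assumptions consistent with the description of formulas (6) and (7), not the patent's exact losses:

```python
def critic_loss(q_value, reward):
    """Squared error between the critic's state-action value Q and the
    reward R -- the ||.||^2 term described for formula (7)."""
    return float((reward - q_value) ** 2)

def actor_loss(q_value):
    """Assumed actor objective: maximize the critic's value estimate,
    i.e. minimize -Q (a common single-step actor-critic form)."""
    return float(-q_value)
```

Alternating gradient steps on these two losses correspond to the alternate updating of step 4.3.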
step five, retrieval process:
step 5.1, using the optimal feature extraction network to respectively extract the pedestrian features Q = {q_q | q = 1, ..., N_q} of the query library and the pedestrian features G = {g_g | g = 1, ..., N_g} of the queried library, where q_q represents the feature of the q-th query image, N_q represents the number of query images, g_g represents the feature of the g-th image in the queried library, and N_g represents the number of images in the queried library;
step 5.2, under the setting of retrieving infrared pedestrian images from visible light pedestrian images, letting the images of the query library be visible light images and the images of the queried library be infrared images;
computing the similarity matrix according to the pedestrian features of the query library and the queried library, and sorting each row of the similarity matrix to obtain the final retrieval result.
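The retrieval of step five reduces to a similarity matrix between L2-normalized query and gallery features and a row-wise descending sort. A minimal sketch, with cosine similarity assumed since the patent does not specify the metric:

```python
import numpy as np

def retrieve(q_feats, g_feats):
    """Cosine-similarity matrix between query and gallery features, plus the
    per-query gallery ranking from sorting each row in descending order."""
    qn = q_feats / np.linalg.norm(q_feats, axis=1, keepdims=True)
    gn = g_feats / np.linalg.norm(g_feats, axis=1, keepdims=True)
    S = qn @ gn.T                        # (N_q, N_g) similarity matrix
    ranking = np.argsort(-S, axis=1)     # row-wise sort: best match first
    return S, ranking
```

`ranking[q, 0]` is then the gallery index returned as the top retrieval result for query q.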
Compared with the prior art, the invention has the beneficial effects that:
1. The method takes the mixed modality obtained by adaptive mixing as an auxiliary modality and combines it with the original infrared and visible light modalities to design a three-modality cross-modal pedestrian re-identification solution; through convolution decomposition, each convolution is decomposed into a modality base part and a shared coefficient part, so that the modality-specific features and the cross-modal invariant features of pedestrians in the infrared, mixed and visible light modalities are extracted more fully, thereby improving the accuracy of infrared-visible cross-modal pedestrian retrieval and identification.
2. The invention uses the adaptive mixing module to blend the visible light and infrared modality images to obtain the mixed modality as an auxiliary modality, thereby avoiding the difficulty and computational cost of traditional generative methods in image conversion, making the obtained auxiliary modality more reliable and efficient, and improving the accuracy of infrared-visible cross-modal pedestrian retrieval and identification.
3. The invention uses the decomposed convolution network, which can process the modality-specific features while fusing the cross-modal invariant features with a small number of parameters, thereby solving the information-loss problem of traditional single-stream networks and the fitting difficulty of multi-branch dual-stream networks, obtaining more reliable pedestrian features and improving the accuracy of infrared-visible cross-modal pedestrian retrieval and identification.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Detailed Description
In this embodiment, the flow of the pedestrian re-identification method based on modal adaptive mixing and invariant convolution decomposition is shown in FIG. 1; specifically, the method proceeds according to the following steps:
step one, pedestrian data collection and preprocessing:
respectively collecting infrared and visible light surveillance videos of pedestrians with an infrared camera and an optical camera, and performing pedestrian detection and size normalization preprocessing on the videos frame by frame to obtain an infrared pedestrian image set X^{ir} = {x_i^{ir} | i = 1, ..., N} and a visible pedestrian image set X^{rgb} = {x_i^{rgb} | i = 1, ..., N}, where x_i^{ir} represents the i-th infrared pedestrian image and x_i^{rgb} represents the i-th visible pedestrian image;
assigning the i-th infrared pedestrian image x_i^{ir} and the i-th visible pedestrian image x_i^{rgb} the same i-th pedestrian identity ID, denoted y_i ∈ {1, ..., M}, where M is the number of identities in the training set and m_p represents any pedestrian identity ID; thereby constructing an infrared-visible matched training data set D = {(x_i^{ir}, x_i^{rgb}, y_i) | i = 1, ..., N}, where N represents the number of image pairs in the training data set; in this embodiment, N = 2060;
step two, self-adaptive image mixing:
Step 2.1, acquiring the infrared and visible pedestrian images of p pedestrian identity IDs from the training data set each time, with k infrared pedestrian images and k visible pedestrian images per pedestrian identity ID, thereby obtaining a batch B = {(x_j^{rgb}, x_j^{ir}, y_j) | j = 1, ..., p×k} formed by 2×p×k images, where x_j^{rgb} represents the j-th visible pedestrian image in the batch, x_j^{ir} represents the j-th infrared pedestrian image in the batch, and y_j represents the pedestrian identity ID of the j-th image pair in the batch; in this embodiment, p = 8 and k = 4.
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages: the 1st stage, Stage 0, consists of one n1×n1 convolution layer, one batch normalization layer and one ReLU activation layer, and the remaining 4 stages are all composed of Bottleneck modules; the 2nd stage contains 3 Bottleneck modules, and the remaining 3 stages contain 4, 6 and 3 Bottleneck modules respectively, where each Bottleneck module consists of one n2×n2 convolution layer, one n3×n3 convolution layer and one n2×n2 convolution layer; in this embodiment, n1 = 7, n2 = 1, n3 = 3.
Performing modality-adaptive decomposition on all convolution kernels in the first three stages of the ResNet-50 deep learning network to obtain, for each convolution kernel, three modality base layers α_rgb, α_ir, α_mix and one shared coefficient layer ψ, which together with the remaining two stages form the feature extraction network. The modality-adaptive convolution decomposition approximates each convolution kernel as the product of a small set of modality base layers and a shared coefficient layer, so as to simultaneously counter modality differences and perform cross-modal semantic sharing at the feature level. The modality-specific base layers are learned independently from the images of the corresponding modality to model modality variation; they spatially convolve each individual input feature channel to correct modality differences. The shared coefficient layer is learned from all three modalities and performs a 1×1 convolution that weights and sums the corrected output feature channels, facilitating cross-modal shared semantics. The decomposed convolution network takes the visible light, infrared and mixed-modality images from the modality-adaptive mixing module as input, and effectively handles the large feature-level modality differences so as to learn modality-invariant features;
step 2.3, inputting the batch B into the feature extraction network; in the first three stages each input first passes through the convolution of the two modality base layers α_rgb, α_ir corresponding to a convolution kernel, then through the convolution of the shared coefficient layer ψ corresponding to that convolution kernel; after all convolution kernels are processed, the third stage outputs the intermediate feature set F = {(f_j^{rgb}, f_j^{ir}) | j = 1, ..., p×k}, where f_j^{rgb} represents the intermediate feature of the j-th visible image and f_j^{ir} represents the intermediate feature of the j-th infrared image;
step 2.4, constructing an adaptive mixing module consisting of an actor network and a critic network, wherein the actor network and the critic network each comprise one convolution layer, one pooling layer and two fully connected layers;
The dynamic, locally linear interpolation across different regions of the modality images is learned in a data-driven manner; it can be formulated as a single-step Markov decision process and realized by an actor-critic agent under a deep Reinforcement Learning (RL) framework.
Inputting the intermediate feature set F into the actor network for processing, which outputs the mixing ratios Λ = {λ_j | j = 1, ..., p×k}, where λ_j represents the six mixing ratios generated for the j-th image pair of the batch;
dividing the j-th visible light image x_j^{rgb} and infrared image x_j^{ir} each into 6 equal blocks in the vertical direction, and blending the divided visible and infrared pedestrian images block by block according to the mixing ratios, thereby obtaining p×k mixed-modality images x_j^{mix}, where x_j^{mix} represents the j-th mixed-modality image and y_j represents its pedestrian identity ID;
The mixing ratios, which are output by the actor network, are dynamically adjusted according to the modality and appearance differences between the corresponding local regions of the visible and infrared images. The actor network in the agent is used to estimate the mixing ratios, and the critic network in the agent predicts the state-action value (Q value).
Step three, updating the feature extraction network with the pedestrian re-identification loss:
step 3.1, inputting the three-modality data {(x_j^{rgb}, x_j^{ir}, x_j^{mix})} into the feature extraction network; the first three stages produce the intermediate features {(f_j^{rgb}, f_j^{ir}, f_j^{mix})}, where f_j^{mix} represents the intermediate feature of the j-th mixed image; after the last two stages, the network finally outputs the pedestrian features {(v_j^{rgb}, v_j^{ir}, v_j^{mix})}, where v_j^{rgb} represents the pedestrian feature of the j-th visible light image, v_j^{ir} represents the pedestrian feature of the j-th infrared image, and v_j^{mix} represents the pedestrian feature of the j-th mixed-modality image;
after the pedestrian features are classified by a fully connected layer, the output is passed through a softmax function to obtain the classification probabilities of the corresponding pedestrian identities, where p_j^{rgb}(m_p), p_j^{ir}(m_p) and p_j^{mix}(m_p) represent the probabilities that the j-th visible image, the j-th infrared image and the j-th mixed-modality image in the batch are classified as pedestrian identity ID m_p;
step 3.2, constructing the identity loss function L_id of formula (1), the cross-entropy loss over the three modalities:

L_id = -(1/(3pk)) Σ_{j=1}^{pk} [log p_j^{rgb}(y_j) + log p_j^{ir}(y_j) + log p_j^{mix}(y_j)]  (1)

In formula (1), y_j represents the correct pedestrian identity ID of the j-th visible light image in the batch, which is also the correct pedestrian identity ID of the j-th infrared image and of the j-th mixed-modality image; p_j^{rgb}(y_j), p_j^{ir}(y_j) and p_j^{mix}(y_j) respectively represent the probabilities that the outputs for the j-th visible, infrared and mixed-modality images in the batch are classified as the correct pedestrian identity ID y_j;
step 3.3, constructing by formula (3), formula (4) and formula (5) the center triplet loss function L_tri^{rgb,ir} of the visible light and infrared modalities, the center triplet loss function L_tri^{rgb,mix} of the visible light and mixed modalities, and the center triplet loss function L_tri^{ir,mix} of the infrared and mixed modalities;

In formulas (3), (4) and (5), c_{m_p}^{rgb}, c_{m_p}^{ir} and c_{m_p}^{mix} respectively represent the pedestrian feature center of the visible light images, the pedestrian feature center of the infrared images, and the pedestrian feature center of the mixed-modality images of the m_p-th pedestrian in the batch; ρ is a margin parameter; [·]_+ = max(·, 0) denotes the max function; c_{n_p}^{ir} and c_{n_p}^{mix} represent the pedestrian feature center of the infrared images or of the mixed-modality images of the n_p-th pedestrian in the batch, and c_{n_p}^{rgb}, c_{n_p}^{ir} and c_{n_p}^{mix} represent the pedestrian feature center of the visible light images, the infrared images or the mixed-modality images of the n_p-th pedestrian in the batch.
Constructing the network total loss function L_dcn from the identity loss function and the three center triplet loss functions:

L_dcn = L_id + L_tri^{rgb,ir} + L_tri^{rgb,mix} + L_tri^{ir,mix}

This cross-modal loss function better exploits the advantage of the mixed modality as an auxiliary modality and bridges the feature differences between the modalities.
Step 3.4, training the feature extraction network on the training data set with the Adam optimization strategy until the network total loss function L_dcn converges, thereby obtaining the optimal feature extraction network;
Step four: update the adaptive mixing module with the reinforcement-learning loss:
Step 4.1: construct the reward R using equations (4) and (5):
In equations (4) and (5), R denotes the reward, mAP(·) denotes the mean average precision index, and rank-K(·) denotes the top-K retrieval accuracy index; in this embodiment, K = 5. S is the similarity matrix computed between two feature sets, and ε(S) denotes the comprehensive index of the similarity matrix S; S_rgb,ir denotes the similarity matrix computed between the visible-light and infrared features, S_mix,ir the similarity matrix computed between the mixed-modality and infrared features, S_ir,rgb the similarity matrix computed between the infrared and visible-light features, and S_mix,rgb the similarity matrix computed between the mixed-modality and visible-light features;
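The rank-K part of the reward can be sketched as follows; the exact comprehensive index ε(S) and its combination with mAP are not given in the extracted text, so only the top-K accuracy over a similarity matrix S is shown, with illustrative names:

```python
import numpy as np

def rank_k_accuracy(sim, labels_q, labels_g, k=5):
    """Fraction of query rows of the similarity matrix whose top-k gallery
    matches contain the correct identity (the rank-K term of the reward)."""
    order = np.argsort(-sim, axis=1)        # gallery indices, most similar first
    hits = 0
    for q in range(sim.shape[0]):
        topk = labels_g[order[q, :k]]
        hits += int(labels_q[q] in topk)
    return hits / sim.shape[0]
```

In the patent's setting this would be evaluated on the four similarity matrices S_rgb,ir, S_mix,ir, S_ir,rgb and S_mix,rgb to assemble the reward.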
Step 4.2: construct the loss function of the actor network using equation (6) and the loss function of the critic network using equation (7):
In equations (6) and (7), the first term represents the output of the actor network, the second represents the output of the critic network, and ||·||^2 denotes the squared-error function;
Step 4.3: alternately update and train the actor network and the critic network of the adaptive mixing module on the training data set with the Adam optimization strategy until the loss functions converge, obtaining the optimal adaptive mixing module network;
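A toy sketch of the actor-critic pair and its two losses, under the assumptions that equation (7) is the squared error between the critic's value estimate and the reward and that equation (6) trains the actor to maximize that estimate (the extracted text omits both formulas); tiny linear maps stand in for the convolution/pooling/fully-connected networks of step 2.4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a linear "actor" mapping a feature vector to six mixing
# ratios, and a linear "critic" scoring the (feature, ratios) pair.
W_actor = rng.normal(size=(4, 6)) * 0.1
W_critic = rng.normal(size=(10,)) * 0.1

def actor(feat):
    # sigmoid keeps the six outputs in (0, 1) so they can serve as ratios
    return 1.0 / (1.0 + np.exp(-(feat @ W_actor)))

def critic(feat, ratios):
    return float(np.concatenate([feat, ratios]) @ W_critic)

def critic_loss(feat, ratios, reward):
    # assumed equation (7): squared error between value estimate and reward
    return (critic(feat, ratios) - reward) ** 2

def actor_loss(feat):
    # assumed equation (6): maximize the critic's estimate of the actor's
    # own action, i.e. minimize its negative
    return -critic(feat, actor(feat))
```

In the alternating scheme of step 4.3, one Adam step would minimize critic_loss, then one would minimize actor_loss, and so on until both converge.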
Step five: the retrieval process.
Step 5.1: use the optimal feature extraction network to extract the pedestrian features of the query set and the pedestrian features of the gallery set, where the q-th query image runs over q = 1, …, N_q and the g-th gallery image over g = 1, …, N_g, N_q denoting the number of query images and N_g the number of gallery images; in this embodiment, N_q = N_g = 2060.
Step 5.2: under the setting of retrieving infrared pedestrian images with visible-light pedestrian images, let the query-set images be visible-light images and the gallery images be infrared images;
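Steps 5.1 and 5.2 amount to ranking an infrared gallery against visible-light queries by feature similarity; a sketch assuming cosine similarity (the patent does not state the metric), with feature extraction assumed done upstream:

```python
import numpy as np

def retrieve(query_feats, gallery_feats):
    """Rank the gallery for every query by cosine similarity of the
    extracted pedestrian features."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                       # (N_q, N_g) similarity matrix
    return np.argsort(-sim, axis=1)     # gallery indices, best match first
```

Sorting the negated similarities yields, for each query row, the gallery indices from best to worst match; the same similarity matrix feeds the mAP and rank-K indices of step four.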
Claims (1)
1. A pedestrian re-identification method based on modal adaptive mixing and invariance convolution decomposition is characterized by comprising the following steps:
step one, pedestrian data collection and preprocessing:
use an infrared camera and an optical camera to respectively collect infrared and visible-light surveillance videos of pedestrians, and perform pedestrian detection and size-normalization preprocessing on the videos frame by frame to obtain the infrared pedestrian image set and the visible-light pedestrian image set, indexed by the i-th infrared pedestrian image and the i-th visible-light pedestrian image;
assign the i-th infrared pedestrian image and the i-th visible-light pedestrian image the same i-th pedestrian identity ID, denoted y_i, where the identity labels range over the set of identities in the training set and m_p denotes any pedestrian identity ID; thereby construct the infrared/visible-light paired training data set, where N denotes the number of images in the training data set;
step two, self-adaptive image mixing:
Step 2.1: each time, acquire the infrared and visible-light pedestrian images of p pedestrian identity IDs from the training data set, with each identity contributing k infrared pedestrian images and k visible-light pedestrian images, thereby obtaining batch data consisting of 2×p×k images; y_j denotes the pedestrian identity ID shared by the j-th visible-light pedestrian image and the j-th infrared pedestrian image in the batch data.
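The P×K batch construction of step 2.1 can be sketched as follows, assuming the images are organized as dictionaries from identity ID to image lists (an illustrative data layout, not the patent's):

```python
import random

def pk_batch(ids_to_rgb, ids_to_ir, p=4, k=4, seed=0):
    """Sample p identities, then k visible-light and k infrared images per
    identity, giving a batch of 2*p*k (image, identity) pairs."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(ids_to_rgb), p)          # p distinct identity IDs
    batch = []
    for pid in chosen:
        batch += [(img, pid) for img in rng.sample(ids_to_rgb[pid], k)]
        batch += [(img, pid) for img in rng.sample(ids_to_ir[pid], k)]
    return batch
```

This sampling guarantees every identity in the batch appears in both modalities, which the center triplet losses of step 3.3 rely on.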
2.2, constructing a feature extraction network based on the ResNet-50 deep learning network;
the ResNet-50 deep learning network comprises 5 stages: the first stage, Stage 0, consists of one n1×n1 convolution layer, one batch-normalization layer, and one ReLU activation layer, while the remaining 4 stages are all composed of Bottleneck modules; Stage 1 contains 3 Bottleneck modules and the remaining 3 stages contain 4, 6 and 3 Bottleneck modules respectively, where each Bottleneck module consists of one n2×n2 convolution layer, one n3×n3 convolution layer, and one n2×n2 convolution layer;
perform modality-adaptive decomposition on all convolution kernels in the first three stages of the ResNet-50 deep learning network to obtain, for each convolution kernel, three modality base layers α_rgb, α_ir, α_mix and one modality-shared coefficient layer ψ; the decomposed stages then form the feature extraction network together with the remaining two stages;
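One plausible reading of the modality-adaptive decomposition, sketched as an element-wise (broadcast) product of a per-modality base layer and the shared coefficient layer; the composition rule is an assumption, since the claim does not spell it out:

```python
import numpy as np

def decomposed_kernel(alpha_m, psi):
    # modality kernel = modality base layer (broadcast) * shared coefficient layer
    return alpha_m * psi

rng = np.random.default_rng(0)
# one decomposed 3x3 kernel with 8 output and 8 input channels
psi = rng.normal(size=(8, 8, 3, 3))                               # modality-shared coefficients
alpha = {m: np.ones((8, 1, 1, 1)) for m in ("rgb", "ir", "mix")}  # per-modality bases

w_rgb = decomposed_kernel(alpha["rgb"], psi)    # kernel applied to visible-light images
```

Keeping ψ shared across modalities while only the small α layers differ is what lets the three modalities reuse most convolution parameters.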
Step 2.3: input the batch data into the feature extraction network; in the first three stages, each image first passes through the convolution of the modality base layer (α_rgb or α_ir) corresponding to each convolution kernel and then through the convolution of that kernel's shared coefficient layer ψ; after all convolution kernels are processed, the third stage outputs the intermediate feature set, comprising an intermediate feature for the j-th visible-light image and an intermediate feature for the j-th infrared image;
Step 2.4: construct an adaptive mixing module consisting of an actor network and a critic network, each of which comprises one convolution layer, one pooling layer, and two fully connected layers;
input the intermediate feature set into the actor network for processing and output the mixing ratios, where six mixing ratios are generated for the j-th item of the batch data;
divide the j-th visible-light image and the j-th infrared image each into 6 equal blocks in the vertical direction, and mix the corresponding blocks of the visible-light and infrared pedestrian images according to the mixing ratios, obtaining p×k mixed-modality images, where y_j denotes the pedestrian identity ID of the j-th mixed-modality image;
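The block-wise mixing can be sketched as a per-stripe linear blend of the visible-light and infrared images using the six actor-produced ratios (linear blending is an assumption; single-channel images are used for brevity):

```python
import numpy as np

def mix_stripes(img_rgb, img_ir, ratios):
    """Split both images into 6 equal horizontal stripes and blend each
    stripe with its own ratio lambda from the actor network."""
    h = img_rgb.shape[0]
    assert h % 6 == 0 and len(ratios) == 6
    out = img_rgb.astype(float).copy()
    s = h // 6
    for b, lam in enumerate(ratios):
        out[b*s:(b+1)*s] = (lam * img_rgb[b*s:(b+1)*s]
                            + (1 - lam) * img_ir[b*s:(b+1)*s])
    return out
```

A ratio of 1 keeps the visible-light stripe unchanged and 0 substitutes the infrared stripe, so the actor can interpolate per body part.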
Step three: update the feature extraction network with the pedestrian re-identification loss:
Step 3.1: input the three-modality data into the feature extraction network; after the processing of the first three stages, obtain the intermediate features, including the intermediate feature of the j-th mixed image; after the processing of the last two stages, output the pedestrian features, namely the pedestrian feature of the j-th visible-light image, the pedestrian feature of the j-th infrared image, and the pedestrian feature of the j-th mixed-modality image;
after the classification processing of a fully connected layer, the pedestrian features are passed through a softmax function to obtain the classification probabilities of the corresponding pedestrian identities: the probability that the j-th visible-light image in the batch data is classified as pedestrian identity ID m_p, the probability that the j-th infrared image in the batch data is classified as pedestrian identity ID m_p, and the probability that the j-th mixed-modality image in the batch data is classified as pedestrian identity ID m_p;
Step 3.2: construct the identity loss function L_id using equation (1):
In equation (1), y_j denotes the correct pedestrian identity ID of the j-th visible-light image in the batch data, which is also the correct ID of the j-th infrared image and of the j-th mixed-modality image; the three probability terms respectively denote the probability that the j-th visible-light, infrared, and mixed-modality image in the batch data is classified as the correct pedestrian identity y_j;
Step 3.3: construct the center triplet loss function of the visible-light and infrared modalities, the center triplet loss function of the visible-light and mixed modalities, and the center triplet loss function of the infrared and mixed modalities, using equations (3), (4) and (5):
In equations (3), (4) and (5), the three centers respectively denote the pedestrian feature center of the visible-light images of the m_p-th pedestrian in the batch data, the pedestrian feature center of the infrared images of the m_p-th pedestrian, and the pedestrian feature center of the mixed-modality images of the m_p-th pedestrian; ρ is the margin parameter, and [·]_+ = max(·, 0) denotes the max function; the remaining center terms denote the pedestrian feature center of the infrared images or of the mixed-modality images of the m_p-th pedestrian in the batch data, and the pedestrian feature center of the visible-light images or of the infrared images or of the mixed-modality images of the n_p-th pedestrian in the batch data;
construct the network total loss function L_dcn using equation (4):
Step 3.4: train the feature extraction network on the training data set with the Adam optimization strategy until the network total loss function L_dcn converges, obtaining the optimal feature extraction network;
Step four: update the adaptive mixing module with the reinforcement-learning loss:
Step 4.1: construct the reward R using equations (4) and (5):
In equations (4) and (5), R denotes the reward, mAP(·) denotes the mean average precision index, and rank-k(·) denotes the top-k retrieval accuracy index; S is the similarity matrix computed between two feature sets, and ε(S) denotes the comprehensive index of the similarity matrix S; S_rgb,ir denotes the similarity matrix computed between the visible-light and infrared features, S_mix,ir the similarity matrix computed between the mixed-modality and infrared features, S_ir,rgb the similarity matrix computed between the infrared and visible-light features, and S_mix,rgb the similarity matrix computed between the mixed-modality and visible-light features;
Step 4.2: construct the loss function of the actor network using equation (6) and the loss function of the critic network using equation (7):
In equations (6) and (7), the first term represents the output of the actor network, the second represents the output of the critic network, and ||·||^2 denotes the squared-error function;
Step 4.3: alternately update and train the actor network and the critic network of the adaptive mixing module network on the training data set with the Adam optimization strategy until the loss functions converge, obtaining the optimal adaptive mixing module network;
step five, a retrieval process;
Step 5.1: use the optimal feature extraction network to extract the pedestrian features of the query set and the pedestrian features of the gallery set, where the q-th query image runs over q = 1, …, N_q and the g-th gallery image over g = 1, …, N_g, N_q denoting the number of query images and N_g the number of gallery images;
Step 5.2: under the setting of retrieving infrared pedestrian images with visible-light pedestrian images, let the query-set images be visible-light images and the gallery images be infrared images.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210155715.9A CN114550210B (en) | 2022-02-21 | 2022-02-21 | Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114550210A true CN114550210A (en) | 2022-05-27 |
CN114550210B CN114550210B (en) | 2024-04-02 |
Family
ID=81675054
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210155715.9A Active CN114550210B (en) | 2022-02-21 | 2022-02-21 | Pedestrian re-identification method based on modal self-adaptive mixing and invariance convolution decomposition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114550210B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117542084A (en) * | 2023-12-06 | 2024-02-09 | 湖南大学 | Cross-modal pedestrian re-recognition method based on semantic perception |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112434654A (en) * | 2020-12-07 | 2021-03-02 | 安徽大学 | Cross-modal pedestrian re-identification method based on symmetric convolutional neural network |
CN113989851A (en) * | 2021-11-10 | 2022-01-28 | 合肥工业大学 | Cross-modal pedestrian re-identification method based on heterogeneous fusion graph convolution network |
WO2022027986A1 (en) * | 2020-08-04 | 2022-02-10 | 杰创智能科技股份有限公司 | Cross-modal person re-identification method and device |
Non-Patent Citations (1)
Title |
---|
冯敏; 张智成; 吕进; 余磊; 韩斌: "Research on Cross-Modal Person Re-Identification Based on Generative Adversarial Networks" (基于生成对抗网络的跨模态行人重识别研究), 现代信息科技 (Modern Information Technology), no. 04, 29 February 2020 (2020-02-29) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||