CN117635973B - Clothing changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation - Google Patents
- Legal status: Active
Abstract
The invention discloses a clothing-changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation, which comprises the following steps: (1) adding a wind-and-rain scene to the image data set and performing standardized preprocessing and data enhancement operations; (2) constructing a sequence input to a Transformer model; (3) constructing a pedestrian feature extraction network based on a standard Transformer architecture; (4) carrying out dynamic weight adjustment and fusion processing on the features obtained from each Transformer layer by means of a multi-layer dynamic focusing module; (5) selectively extracting and fusing specific layer features in the Transformer network through a local pyramid aggregation module to obtain multi-scale feature information; (6) applying the feature outputs obtained in steps (4)-(5) to a loss function to verify whether the query image and the test image belong to the same category, thereby completing the training and optimization of the model. The method can remarkably improve the recognition accuracy and robustness of the algorithm in complex scenes, especially for the clothing-changing pedestrian re-identification task.
Description
Technical Field
The invention relates to the technical field of computer vision image recognition, and in particular to a clothing-changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation.
Background
Pedestrian Re-identification (ReID) is a key problem in computer vision and public safety research, aiming to confirm and track the identity of individuals across different surveillance cameras. Existing ReID algorithms focus mainly on efficient short-term identification strategies, but these often fail to account for changes in pedestrian clothing, limiting their application over long time spans. In practical applications, especially law enforcement and criminal investigation scenarios, persons of interest may evade recognition by changing their apparel, which places higher demands on ReID systems. Therefore, developing a robust long-term, cloth-changing ReID technology (CC-ReID) is a necessary path toward solving the identification problems caused by garment changes.
Current research on CC-ReID falls largely into two categories. The first introduces auxiliary modules (e.g., generating body contour sketches, extracting pose key points, gait analysis) to identify clothing-independent biometric features. For example, Yang et al. [1] overcome the effects of garment changes by constructing a body-contour-based network model. Nevertheless, this approach is susceptible to external conditions (e.g., lighting and occlusion) and may ignore other important biometric cues such as facial features and gait patterns. The second category aims to disentangle identity features from clothing features. For example, the adversarial feature disentanglement network (AFD-Net) proposed by Xu et al. uses intra-class reconstruction and inter-class adversarial mechanisms to separate identity-related from identity-irrelevant (e.g., clothing) features. However, this approach may face high computational cost, model stability issues, and data dependency.
In recent years, models based on the Transformer architecture have benefited from advanced multi-head attention mechanisms and achieved breakthrough results on identification tasks that require jointly analyzing multiple key features of an image. Through parallel processing, multi-head attention can effectively focus on key features in different regions of the image, strengthening the model's adaptability and discriminative power under viewpoint transformations and pedestrian clothing changes. However, existing methods mainly use the high-level information of the top Transformer layer to extract discriminative features and fail to fully exploit the detailed information in the lower layers of the network, which may limit the model's ability to capture fine-grained features in complex scenes. To solve this problem, we propose an innovative adaptive perceptual attention mechanism and a pyramid-level feature fusion network. The design aims to integrate multi-scale information efficiently so as to enhance the accuracy and robustness of the clothing-changing pedestrian re-identification algorithm in complex scenes.
Disclosure of Invention
The invention aims to: provide a clothing-changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation.
The technical scheme is as follows: the invention discloses a re-identification method for a clothing changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation, which comprises the following steps:
(1) Adding a wind and rain scene to the image data set and performing standardized preprocessing and data enhancement operations;
(2) Dividing the preprocessed image into N blocks of consistent size without overlap, introducing an additional learnable embedding [CLS_TOKEN] as the global feature of the sequence input, and at the same time assigning each block a position code [POS_TOKEN], to form the sequence input to the Transformer model;
(3) Constructing a pedestrian feature extraction network based on a standard Transformer architecture, inputting the sequence generated in step (2), extracting pedestrian features and recording the features of each Transformer layer;
(4) Carrying out dynamic weight adjustment and fusion processing on the per-layer Transformer features obtained in step (3) by means of a multi-layer dynamic focusing module;
(5) Selectively extracting and fusing specific layer features in the Transformer network through a local pyramid aggregation module to obtain multi-scale feature information, with a fast Fourier transform embedded into the self-attention mechanism;
(6) Applying the feature outputs obtained in steps (4)-(5) to a loss function to verify whether the query image and the test image belong to the same category, thereby completing the training and optimization of the model.
Further, the step (1) of adding a weather scene to the image data set includes the steps of:
(11) Generating a noise matrix N obeying a uniform distribution over the image width w and height h using the formula N ~ Uniform(0, 255), to simulate the random scattering of raindrops at different positions;
(12) Applying blurring to the noise matrix via the formula N′ = N * K to generate a raindrop effect without a specific direction;
wherein K denotes a predefined blur kernel and (*) denotes a two-dimensional convolution operation;
(13) Constructing a diagonal matrix D to represent the straight falling path of raindrops; simulating the inclination of raindrops by rotating the diagonal matrix D, and then reproducing the falling speed and direction of raindrops in the air with Gaussian blurring, finally obtaining a blur kernel M that simulates rain streaks;
(14) Fusing the simulated weather effect with the original image by a per-channel blending formula;
wherein I_C denotes channel C of the original image, β is the blending weight, and N″ is the noise matrix after convolution with the blur kernel M.
Further, the standardized preprocessing and data enhancement operations in step (1) include: horizontal flipping, random cropping and random erasing.
Further, the step (2) specifically includes the following steps:
Let image x ∈ R^{H×W×C}, where H, W and C denote the height, width and number of channels of image x, respectively;
First, the image is divided into N non-overlapping blocks, denoted {x_1, x_2, ..., x_N}; second, an additional learnable embedding x_cls is introduced at the beginning of the input sequence as the aggregated feature representation; then, a position code P is added to the feature vector of each image block; finally, the input sequence passed to the Transformer layers is formulated as:
Z_0 = [x_cls; F(x_1); F(x_2); ...; F(x_N)] + P
where Z_0 denotes the input sequence embedding; P ∈ R^{(N+1)×D} denotes the position embedding; F is a linear projection function that maps each image block to a D-dimensional space.
Further, the step (3) specifically includes the following steps:
The input sequence Z_0 is fed into the Transformer network for processing; each layer refines the features and integrates context information through a multi-head self-attention mechanism, and the output Z_l of the l-th layer is computed as:
Z_l = TransformerLayer(Z_{l-1}), l = 1, 2, ..., L
where TransformerLayer denotes a layer of the standard Transformer and L denotes the total number of layers;
The output of every Transformer layer, Z_1, Z_2, ..., Z_L, is recorded.
Further, the step (4) includes the following steps:
(41) Constructing a weight vector W = {w_1, w_2, ..., w_L}, where w_i is the importance of the features extracted by the i-th layer in the model hierarchy; each layer is weighted using an orthogonality-constrained weighting; the specific weighting formula is as follows:
where f_i denotes the feature importance of the i-th layer, initialized to a uniform value across all layers; β and γ are learnable parameters; ⟨F_i, F_j⟩ denotes the inner product between the feature sets of the i-th and j-th layers, used as a measure of their feature correlation; α is a regularization coefficient; L is the total number of layers.
(42) An L2 regularization term is introduced when computing the fused features, with the following formula:
where λ is a non-negative regularization parameter used to mitigate overfitting by limiting the magnitude of the weights within the model, and ‖W‖_F² is the squared Frobenius norm of the weight matrix W, i.e., the sum of squares of all layer weights.
Further, the step (5) specifically comprises the following steps:
In the local pyramid aggregation module, the output features f_1, f_2, f_3, f_4 of four different Transformer layers are selected as input, and a convolution block operation is applied to each:
First, a 1×1 convolution layer is used; second, BatchNorm1D and ReLU functions adjust the feature dimensions and introduce nonlinearity; then, a fast-Fourier-transform self-attention mechanism is added, optimizing the features with global information from all elements of the sequence; finally, all the features are concatenated and fed into the same convolution block to obtain the fused features. The formula is as follows:
where the operator in the formula denotes the entire convolution block operation, and f_t denotes the feature obtained by fusing f_m and f_{m+1}. As shown in FIG. 2, three outputs are finally obtained from the local pyramid aggregation module.
Further, the loss function of step (6) includes: an ID loss and a triplet loss; the ID loss adopts the traditional cross-entropy loss function, without label smoothing; the formula is as follows:
where C is the number of classes, y_i is the one-hot encoding of the true label, and p_i is the predicted probability that the sample belongs to the i-th class.
The triplet loss formula is as follows:
where d(a,p) and d(a,n) denote the distance between the anchor sample x_a and the positive sample x_p, and between the anchor sample and the negative sample x_n, respectively; the hyperparameter M is the margin, which enforces a lower bound on the gap between the negative-pair and positive-pair distances;
where the function f(·) denotes the feature extraction operator mapping an input image into the embedding space; ‖·‖_2 denotes the L2 norm used to compute the Euclidean distance between two feature vectors; [·]_+ is the hinge function: a loss is incurred only when the bracketed value is positive, and is 0 otherwise;
The total loss function formula L is as follows:
where N denotes the number of outputs produced by the entire training architecture; initially the loss of each output is given equal weight, denoted w_i (i = 0, 1, 2, 3); the weights of the parts are then dynamically adjusted during training by the back-propagation algorithm.
Judging whether the maximum iteration times are reached, if so, outputting the final model precision, and if not, repeating the steps (2) - (5).
Further, the method also comprises the following step: (0) constructing a surveillance network to obtain pedestrian video data; detecting pedestrians with a target detection algorithm, then obtaining pedestrian detection boxes with a target tracking algorithm; the pedestrian video sequences are cropped to a 258 × 128 pixel specification to form an image gallery.
An electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements any one of the above clothing-changing pedestrian re-identification methods based on multilayer dynamic concentration and local pyramid aggregation.
Beneficial effects: compared with the prior art, the invention has the following remarkable advantages: by incorporating the detailed information of the lower network layers, fine-grained features in complex scenes can be captured and processed more effectively; the pyramid-level feature fusion network can integrate information from different levels, providing more comprehensive data analysis and processing; in complex scenes, especially the clothing-changing pedestrian re-identification task, the method remarkably improves the accuracy and robustness of the algorithm; and every level of the Transformer network is exploited more comprehensively, overcoming its limitations in handling complex scenes.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a network structure diagram of the multi-layer dynamic concentration and local pyramid aggregation framework;
FIG. 3 is a block structure diagram of the local pyramid aggregation module within the multi-layer dynamic concentration and local pyramid aggregation framework of the present invention;
FIG. 4 is a schematic illustration of self-attention combined with the fast Fourier transform in the framework of the present invention;
FIG. 5 is a schematic view of a pedestrian image with an added weather scene according to the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1-5, an embodiment of the present invention provides a clothing-changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation, comprising the following steps:
(0) Constructing a surveillance network to acquire pedestrian video data; detecting pedestrians with a target detection algorithm, then obtaining pedestrian detection boxes with a target tracking algorithm; the pedestrian video sequences are cropped to a 258 × 128 pixel specification to form an image gallery;
(1) Adding a wind and rain scene to the image data set and performing standardized preprocessing and data enhancement operations; adding a weather scene to an image dataset comprises the steps of:
(11) Generating a noise matrix N obeying a uniform distribution over the image width w and height h using the formula N ~ Uniform(0, 255), to simulate the random scattering of raindrops at different positions;
(12) Applying blurring to the noise matrix via the formula N′ = N * K to generate a raindrop effect without a specific direction;
wherein K denotes a predefined blur kernel and (*) denotes a two-dimensional convolution operation;
(13) Constructing a diagonal matrix D to represent the straight falling path of raindrops; simulating the inclination of raindrops by rotating the diagonal matrix D, and then reproducing the falling speed and direction of raindrops in the air with Gaussian blurring, finally obtaining a blur kernel M that simulates rain streaks;
(14) Fusing the simulated weather effect with the original image by a per-channel blending formula;
wherein I_C denotes channel C of the original image, β is the blending weight, and N″ is the noise matrix after convolution with the blur kernel M;
the standardized preprocessing and data enhancement operations include: horizontal flipping, random cropping and random erasing.
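The rain-scene steps (11)-(14) above can be sketched in NumPy as follows. The kernel sizes, streak length and blending rule are illustrative assumptions, and the function names (`add_rain`, `conv2d`) do not appear in the patent:

```python
import numpy as np

def conv2d(img, kernel):
    """Naive same-size 2D sliding-window filter with zero padding
    (cross-correlation, used here as a stand-in for convolution)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def add_rain(image, beta=0.2, streak_len=9, rng=None):
    """Sketch of steps (11)-(14) on a single-channel image."""
    rng = np.random.default_rng(rng)
    h, w = image.shape
    # (11) uniform noise matrix N ~ Uniform(0, 255)
    noise = rng.uniform(0, 255, size=(h, w))
    # (12) isotropic blur with a predefined kernel K (3x3 box blur here)
    k = np.ones((3, 3)) / 9.0
    blurred = conv2d(noise, k)
    # (13) diagonal matrix D as a straight falling path -> streak kernel M
    M = np.eye(streak_len) / streak_len
    streaks = conv2d(blurred, M)
    # (14) blend the simulated rain with the original image channel
    out = (1 - beta) * image + beta * streaks
    return np.clip(out, 0, 255)
```

A color image would apply `add_rain` to each channel I_C with the same streak kernel so the rain pattern is consistent across channels.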
(2) Dividing the preprocessed image into N blocks of consistent size without overlap, introducing an additional learnable embedding [CLS_TOKEN] as the global feature of the sequence input, and assigning each block a position code [POS_TOKEN], to form the sequence input to the Transformer model; specifically:
Let image x ∈ R^{H×W×C}, where H, W and C denote the height, width and number of channels of image x, respectively;
First, the image is divided into N non-overlapping blocks, denoted {x_1, x_2, ..., x_N}; second, an additional learnable embedding x_cls is introduced at the beginning of the input sequence as the aggregated feature representation; then, a position code P is added to the feature vector of each image block; finally, the input sequence passed to the Transformer layers is formulated as:
Z_0 = [x_cls; F(x_1); F(x_2); ...; F(x_N)] + P
where Z_0 denotes the input sequence embedding; P ∈ R^{(N+1)×D} denotes the position embedding; F is a linear projection function that maps each image block to a D-dimensional space.
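The patch-splitting and sequence construction of step (2) can be sketched as follows; the random projection F and the randomly initialized x_cls and P stand in for parameters that a real model would learn:

```python
import numpy as np

def build_input_sequence(image, patch, dim, rng=None):
    """Split an HxWxC image into N non-overlapping patches, project each to
    D dimensions, prepend a class embedding x_cls and add position codes P,
    yielding Z_0 of shape (N+1, D)."""
    rng = np.random.default_rng(rng)
    h, w, c = image.shape
    n = (h // patch) * (w // patch)
    # rearrange into N patch vectors of length patch*patch*c
    patches = (image.reshape(h // patch, patch, w // patch, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(n, patch * patch * c))
    F = rng.normal(size=(patch * patch * c, dim))   # linear projection F
    x_cls = rng.normal(size=(1, dim))               # [CLS_TOKEN]
    P = rng.normal(size=(n + 1, dim))               # [POS_TOKEN]
    z0 = np.concatenate([x_cls, patches @ F], axis=0) + P
    return z0
```

For a 258 × 128 pedestrian crop with 16-pixel patches, N would be 16 × 8 = 128 tokens plus the class token.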
(3) Constructing a pedestrian feature extraction network based on a standard Transformer architecture, inputting the sequence generated in step (2), extracting pedestrian features and recording the features of each Transformer layer; specifically:
The input sequence Z_0 is fed into the Transformer network for processing; each layer refines the features and integrates context information through a multi-head self-attention mechanism, and the output Z_l of the l-th layer is computed as:
Z_l = TransformerLayer(Z_{l-1}), l = 1, 2, ..., L
where TransformerLayer denotes a layer of the standard Transformer and L denotes the total number of layers;
The output of every Transformer layer, Z_1, Z_2, ..., Z_L, is recorded.
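Recording every layer output Z_1 ... Z_L, as step (3) requires, can be sketched as follows; each layer here is any callable mapping (N+1, D) to (N+1, D), whereas a real backbone would use multi-head self-attention blocks:

```python
import numpy as np

def run_backbone(z0, layers):
    """Pass Z_0 through L layers and record every intermediate output
    Z_1 ... Z_L for the later fusion modules."""
    outputs = []
    z = z0
    for layer in layers:
        z = layer(z)          # Z_l = TransformerLayer(Z_{l-1})
        outputs.append(z)
    return outputs            # [Z_1, Z_2, ..., Z_L]
```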
(4) The per-layer Transformer features obtained in step (3) are subjected to dynamic weight adjustment and fusion processing by the multi-layer dynamic focusing module; specifically:
(41) A weight vector W = {w_1, w_2, ..., w_L} is constructed, where w_i is the importance of the features extracted by the i-th layer in the model hierarchy; each layer is weighted using an orthogonality-constrained weighting; the specific weighting formula is as follows:
where f_i denotes the feature importance of the i-th layer, initialized to a uniform value across all layers; β and γ are learnable parameters; ⟨F_i, F_j⟩ denotes the inner product between the feature sets of the i-th and j-th layers, used as a measure of their feature correlation; α is a regularization coefficient; L is the total number of layers.
(42) An L2 regularization term is introduced when computing the fused features, with the following formula:
where λ is a non-negative regularization parameter used to mitigate overfitting by limiting the magnitude of the weights within the model, and ‖W‖_F² is the squared Frobenius norm of the weight matrix W, i.e., the sum of squares of all layer weights.
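Since the exact weighting formula appears only in the original drawings, the following sketch reproduces the stated ingredients of step (4): per-layer importance f_i, an orthogonality penalty built from inter-layer inner products ⟨F_i, F_j⟩, normalized weights, and an L2 term. It does not claim to be the patented formula, and all coefficient values are illustrative:

```python
import numpy as np

def dynamic_fuse(features, f, alpha=0.1, lam=0.01):
    """Weight per-layer features by importance minus an orthogonality
    penalty, normalize with a softmax, return the weighted-sum fusion and
    the L2 (squared Frobenius) regularization term lam * sum(w_i^2)."""
    L = len(features)
    flat = [z.ravel() for z in features]
    # penalty: correlation of layer i with every other layer (inner products)
    penalty = np.array([
        sum(abs(np.dot(flat[i], flat[j])) for j in range(L) if j != i)
        for i in range(L)
    ])
    scores = np.asarray(f, dtype=float) - alpha * penalty
    w = np.exp(scores - scores.max())
    w = w / w.sum()                       # weights sum to 1
    fused = sum(wi * zi for wi, zi in zip(w, features))
    l2_term = lam * np.sum(w ** 2)        # regularizer on the weight vector
    return fused, w, l2_term
```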
(5) Specific layer features in the Transformer network are selectively extracted and fused through the local pyramid aggregation module to obtain multi-scale feature information, with a fast Fourier transform embedded into the self-attention mechanism; specifically:
In the local pyramid aggregation module, the output features f_1, f_2, f_3, f_4 of four different Transformer layers are selected as input, and a convolution block operation is applied to each:
First, a 1×1 convolution layer is used; second, BatchNorm1D and ReLU functions adjust the feature dimensions and introduce nonlinearity; then, a fast-Fourier-transform self-attention mechanism is added, optimizing the features with global information from all elements of the sequence; finally, all the features are concatenated and fed into the same convolution block to obtain the fused features. The formula is as follows:
where the operator in the formula denotes the entire convolution block operation, and f_t denotes the feature obtained by fusing f_m and f_{m+1}. As shown in FIG. 2, three outputs are finally obtained from the local pyramid aggregation module.
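The pairwise aggregation wiring described above can be sketched as follows; the `conv_block` stand-in replaces the 1×1 convolution, BatchNorm1D, ReLU and FFT self-attention with a plain ReLU, so only the four-input, three-output wiring is demonstrated:

```python
import numpy as np

def conv_block(x):
    """Stand-in for the patented convolution block (1x1 conv + BatchNorm1D
    + ReLU + FFT self-attention); a ReLU suffices to show the wiring."""
    return np.maximum(x, 0.0)

def pyramid_aggregate(f1, f2, f3, f4):
    """Aggregate four layer outputs pairwise: each level passes two inputs
    through the conv block, concatenates them, and feeds the result through
    the same block, yielding three fused outputs."""
    outs = []
    feats = [f1, f2, f3, f4]
    for fm, fm1 in zip(feats[:-1], feats[1:]):
        merged = np.concatenate([conv_block(fm), conv_block(fm1)], axis=-1)
        outs.append(conv_block(merged))   # fused feature f_t
    return outs
```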
The fast-Fourier-transform self-attention mechanism works as follows:
First, the self-attention module receives an input X ∈ R^{B×N×C}, where B is the batch size, N is the sequence length, and C is the feature dimension. Second, through three linear layers, the input X is converted into query Q, key K and value V: Q = XW_Q, K = XW_K, V = XW_V, where W_Q, W_K and W_V are learnable weight matrices; the query, key and value are then split into multiple heads. Since the fast Fourier transform (FFT) algorithm is most efficient when the input size is an integer power of 2, appropriate padding is applied to Q and K; the FFT is then applied to the padded Q_padded and K_padded, and their correlation is estimated in the frequency domain. The output formula is as follows:
Attn = Softmax(F^{-1}(F(Q_padded) ⊙ F(K_padded))[:, :, :, :Q.size(1)])
where F(·) and F^{-1}(·) denote the FFT and the inverse FFT, respectively. The element-wise product of the FFT results is computed first; the product is then processed with the inverse FFT and truncated to the original sequence length. The truncated result is normalized with a softmax function to obtain the attention weights Attn. Finally, the attention weights and the corresponding value vectors are aggregated through an element-wise product, and the result is added to the input X to obtain the feature-enhanced self-attention output:
Out = Attn ⊙ V + X
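A single-head NumPy sketch of the FFT-based self-attention above: pad to a power of two, correlate Q and K in the frequency domain, truncate the inverse transform to the original length, softmax-normalize, then apply Out = Attn ⊙ V + X. The multi-head split and batch dimension are omitted for brevity, and the weight matrices are passed in explicitly:

```python
import numpy as np

def fft_attention(x, wq, wk, wv):
    """x: (N, C) sequence; wq/wk/wv: (C, C) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    n = q.shape[0]
    pad = 1 << (n - 1).bit_length()                # pad length: power of two
    qf = np.fft.fft(q, n=pad, axis=0)              # F(Q_padded)
    kf = np.fft.fft(k, n=pad, axis=0)              # F(K_padded)
    corr = np.fft.ifft(qf * kf, axis=0).real[:n]   # truncate to Q.size(1)
    attn = np.exp(corr - corr.max(axis=0))         # softmax over sequence
    attn = attn / attn.sum(axis=0)
    return attn * v + x                            # Out = Attn ⊙ V + X
```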
(6) The feature outputs obtained in steps (4)-(5) are applied to a loss function to verify whether the query image and the test image belong to the same category, thereby completing the training and optimization of the model. The loss function includes: an ID loss and a triplet loss; the ID loss adopts the traditional cross-entropy loss function, without label smoothing; the formula is as follows:
where C is the number of classes, y_i is the one-hot encoding of the true label, and p_i is the predicted probability that the sample belongs to the i-th class.
The triplet loss formula is as follows:
where d(a,p) and d(a,n) denote the distance between the anchor sample x_a and the positive sample x_p, and between the anchor sample and the negative sample x_n, respectively; the hyperparameter M is the margin, which enforces a lower bound on the gap between the negative-pair and positive-pair distances;
where the function f(·) denotes the feature extraction operator mapping an input image into the embedding space; ‖·‖_2 denotes the L2 norm used to compute the Euclidean distance between two feature vectors; [·]_+ is the hinge function: a loss is incurred only when the bracketed value is positive, and is 0 otherwise;
The total loss function L is as follows:
where N denotes the number of outputs produced by the entire training architecture; initially the loss of each output is given equal weight, denoted w_i (i = 0, 1, 2, 3); the weights of the parts are then dynamically adjusted during training by the back-propagation algorithm.
Judging whether the maximum iteration times are reached, if so, outputting the final model precision, and if not, repeating the steps (2) - (5).
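The two loss terms of step (6) can be sketched as follows for a single sample and a single triplet; the margin value 0.3 is an illustrative choice, not a value from the patent:

```python
import numpy as np

def id_loss(logits, label):
    """Cross-entropy ID loss without label smoothing: -log p_label."""
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return -np.log(p[label])

def triplet_loss(fa, fp, fn, margin=0.3):
    """Hinge triplet loss [d(a,p) - d(a,n) + M]_+ on embedded features,
    with Euclidean (L2) distances and margin M."""
    d_ap = np.linalg.norm(fa - fp)   # anchor-positive distance
    d_an = np.linalg.norm(fa - fn)   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```

In training, the total loss would be the w_i-weighted sum of these terms over all N outputs of the architecture.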
An embodiment of the invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements any one of the above clothing-changing pedestrian re-identification methods based on multilayer dynamic concentration and local pyramid aggregation.
Claims (8)
1. The clothing changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation is characterized by comprising the following steps of:
(1) Adding a wind and rain scene to the image data set and performing standardized preprocessing and data enhancement operations;
(2) Dividing the preprocessed image into Q blocks of consistent size without overlap, introducing an additional learnable embedding [CLS_TOKEN] as the global feature of the sequence input, and assigning each block a position code [POS_TOKEN], to form the sequence Z_0 input to the pedestrian feature extraction network;
(3) Constructing a pedestrian feature extraction network based on a standard Transformer architecture, inputting the sequence generated in step (2), extracting pedestrian features, and recording the output features Z_l, l = 1, 2, ..., L, of each Transformer layer; L is the number of Transformer layers included in the pedestrian feature extraction network;
(4) Carrying out dynamic weight adjustment and fusion processing on the output features of each Transformer layer obtained in step (3) by means of a multi-layer dynamic focusing module;
The step (4) comprises the following steps:
(41) Constructing a weight vector W = {w_1, w_2, ..., w_L}, where w_i is the weight of the feature output by the i-th Transformer layer in the pedestrian feature extraction network; each Transformer layer is weighted using an orthogonality-constrained weighting; the specific weighting formula is as follows:
where g_i denotes the feature importance of the i-th layer, initialized to a uniform value across all layers; β and γ are learnable parameters; ⟨Z_i, Z_j⟩ denotes the inner product between the output features of the i-th and j-th Transformer layers, used as a measure of their correlation; α is a regularization coefficient;
(42) An L2 regularization term is introduced when computing the fused features, with the following formula:
where λ is a non-negative regularization parameter used to mitigate overfitting by limiting the magnitude of the weights within the model, and ‖W‖_F² is the squared Frobenius norm of the weight vector W, i.e., the sum of squares of all Transformer layer weights;
(5) Selectively extracting and fusing the output characteristics of a specific Transformer layer in the pedestrian characteristic extraction network through a local pyramid aggregation module to obtain multi-scale characteristic information, and embedding the multi-scale characteristic information into a self-attention mechanism by adopting fast Fourier transformation;
The step (5) is specifically as follows:
In the local pyramid aggregation module, selecting output features f 1,f2,f3,f4 of four different convertors layers as input, and performing three-layer pyramid feature aggregation operation, wherein each feature aggregation operation comprises the steps of connecting self-attention outputs obtained by respectively performing convolution block calculation on two inputs, and inputting the self-attention outputs into the same convolution block to obtain fused features; three outputs are finally obtained through the local pyramid aggregation module;
The convolution block is computed as follows: first, a 1×1 convolution layer is applied; next, BatchNorm2d and a ReLU function adjust the feature dimensions and introduce nonlinearity; then, a fast-Fourier-transform self-attention mechanism is added to obtain a feature-enhanced self-attention output;
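A minimal sketch of one aggregation step and its convolution block: the 1×1 convolution is reduced to a per-token linear projection, BatchNorm2d/ReLU to standardization plus clamping, and the FFT self-attention is modeled as FNet-style Fourier token mixing; all shapes and names are assumptions for illustration:

```python
import numpy as np

def fft_attention(x):
    # FFT-based mixing in place of dot-product attention: 2-D FFT over
    # (tokens, channels), keeping the real part (FNet-style).
    return np.fft.fft2(x).real

def conv_block(x, w):
    # 1x1 conv on a token sequence == per-token linear projection.
    y = x @ w
    y = (y - y.mean(axis=0)) / (y.std(axis=0) + 1e-5)  # stand-in for BatchNorm2d
    y = np.maximum(y, 0.0)                             # ReLU
    return y + fft_attention(y)                        # add FFT self-attention

def aggregate(fa, fb, wa, wb, wf):
    # one pyramid step: conv-block both inputs, concatenate the two
    # self-attention outputs, and fuse them with a shared conv block.
    cat = np.concatenate([conv_block(fa, wa), conv_block(fb, wb)], axis=1)
    return conv_block(cat, wf)

# Applying aggregate to adjacent pairs of f1..f4 yields three outputs,
# as in the claim; weight shapes: wa, wb: (D, D), wf: (2D, D).
```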
(6) Applying the feature outputs obtained in steps (4)-(5) to a loss function to verify whether the query image and the test image belong to the same class, thereby completing the training and optimization of the model.
2. The method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein adding a weather scene to the image dataset in step (1) comprises the following steps:
(11) Generating a noise matrix N of the image width W and height H obeying a uniform distribution, N ~ Uniform(0, 255), to simulate the random scattering of raindrops at different positions;
(12) Applying blurring to the noise matrix via the formula N′ = N ∗ K to generate a raindrop effect without a specific direction, wherein K denotes a predefined blur kernel and ∗ denotes a two-dimensional convolution operation;
(13) Constructing a diagonal matrix D to represent the straight-line falling path of raindrops; simulating the slant of raindrops by rotating the diagonal matrix D, then reproducing the speed and direction of raindrops falling through the air with Gaussian blurring, finally obtaining a blur kernel M that simulates rain;
(14) Fusing the simulated weather effect with the original image by the formula:

I′ = β·IC + (1 − β)·N″

wherein IC represents the original image, β is the mixing weight, and N″ is the noise matrix after application of the blur kernel M.
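Steps (11)-(14) can be sketched as below; the speck threshold, the streak kernel (a plain diagonal standing in for the rotated, Gaussian-blurred D of step (13)), and the convex blend in add_rain are assumptions, since the exact fusion formula is not reproduced in this text:

```python
import numpy as np

def conv2d(img, k):
    # naive 'same' 2-D convolution used as the blur operator *
    H, W = img.shape
    kh, kw = k.shape
    out = np.zeros_like(img, dtype=float)
    pi = np.pad(img.astype(float),
                ((kh // 2, kh - kh // 2 - 1), (kw // 2, kw - kw // 2 - 1)))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * pi[i:i + H, j:j + W]
    return out

def add_rain(img, beta=0.85, length=9, seed=0):
    rng = np.random.default_rng(seed)
    h, w = img.shape
    # (11) uniform noise N ~ Uniform(0, 255); keep sparse specks as drops
    noise = rng.uniform(0, 255, size=(h, w))
    noise = np.where(noise > 250, 255.0, 0.0)
    # (13) a diagonal matrix as the straight falling path (rotation and
    # Gaussian blur are approximated by the plain diagonal here)
    M = np.eye(length) / length
    streaks = conv2d(noise, M)            # (12)+(13) blurred rain streaks
    # (14) blend with mixing weight beta (assumed convex combination)
    out = beta * img + (1 - beta) * streaks
    return np.clip(out, 0, 255)
```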
3. The method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein the standardized preprocessing and data enhancement operations in step (1) comprise: horizontal flipping, random cropping and random erasing.
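The three enhancement operations can be sketched in NumPy as follows; the padding size, the erase-patch size, and the always-on erasing are illustrative choices (random erasing is normally applied only with some probability):

```python
import numpy as np

def augment(img, rng):
    # horizontal flip with probability 0.5
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # random crop: pad by 4 pixels, then crop back to the original size
    p = 4
    padded = np.pad(img, ((p, p), (p, p), (0, 0)), mode="edge")
    y, x = rng.integers(0, 2 * p + 1, size=2)
    img = padded[y:y + img.shape[0], x:x + img.shape[1]]
    # random erasing: zero out a random rectangle
    h, w = img.shape[:2]
    eh, ew = h // 4, w // 4
    y0 = int(rng.integers(0, h - eh))
    x0 = int(rng.integers(0, w - ew))
    img = img.copy()
    img[y0:y0 + eh, x0:x0 + ew] = 0
    return img
```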
4. The method for re-identifying the clothing changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein the step (2) is specifically as follows:
Setting an image x ∈ R^(W×H×C), wherein H, W and C respectively represent the height, width and number of channels of the image x;
First, the image is partitioned into Q non-overlapping blocks, denoted {xi | i = 1, 2, ..., Q}; secondly, an additional learnable embedding xcls is introduced at the beginning of the input sequence as the aggregated feature representation; then, a position encoding P is added to the feature vector of each image block; finally, the input sequence fed to the Transformer layers is formulated as:
Z0 = [xcls; F(x1); F(x2); ...; F(xQ)] + P
wherein Z0 represents the input sequence; P ∈ R^((Q+1)×D) represents the position embedding; F is a linear projection function that maps each image block to the D-dimensional space.
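A sketch of the patch-embedding pipeline of claim 4; the patch size, dimension D, random initialization, and zero-initialized class token are assumptions made for illustration:

```python
import numpy as np

def patch_embed(x, patch=16, D=64, seed=0):
    """Split image x (H, W, C) into Q non-overlapping patches, project each
    to D dimensions with a linear map F, prepend a class token x_cls, and
    add a position embedding P, yielding Z0 of shape (Q + 1, D)."""
    rng = np.random.default_rng(seed)
    H, W, C = x.shape
    patches = [x[i:i + patch, j:j + patch].reshape(-1)
               for i in range(0, H, patch)
               for j in range(0, W, patch)]
    Q = len(patches)
    F = rng.normal(0.0, 0.02, size=(patch * patch * C, D))  # projection F
    x_cls = np.zeros((1, D))                                # learnable token
    P = rng.normal(0.0, 0.02, size=(Q + 1, D))              # position embedding
    Z0 = np.concatenate([x_cls, np.stack(patches) @ F], axis=0) + P
    return Z0
```

For a 256 × 128 × 3 input with 16 × 16 patches, Q = 128 and Z0 has shape (129, D).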
5. The method for re-identifying the clothing changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein the step (3) is specifically as follows:
The input sequence Z0 is fed into the pedestrian feature extraction network for processing; each layer refines the features and integrates context information through a multi-head self-attention mechanism, and the output feature Zl of the l-th layer is calculated as follows:
Zl = TransformerLayer(Zl−1), l = 1, 2, ..., L
wherein TransformerLayer represents one layer of the standard Transformer architecture;
The output features of all Transformer layers constitute {Z1, Z2, ..., ZL}.
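A single-head sketch of the layer recursion Zl = TransformerLayer(Zl−1); the real network uses multi-head attention with feed-forward and LayerNorm sublayers, all omitted here for brevity:

```python
import numpy as np

def transformer_layer(Z, Wq, Wk, Wv):
    # single-head self-attention with a residual connection
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    A = Q @ K.T / np.sqrt(K.shape[1])            # scaled dot-product scores
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # row-wise softmax
    return Z + A @ V

def run_stack(Z0, layers):
    # iterate Z_l = TransformerLayer(Z_{l-1}) and collect {Z_1, ..., Z_L}
    outs, Z = [], Z0
    for Wq, Wk, Wv in layers:
        Z = transformer_layer(Z, Wq, Wk, Wv)
        outs.append(Z)
    return outs
```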
6. The method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein the loss function of step (6) comprises an ID loss and a triplet loss; the ID loss adopts the conventional cross-entropy loss function without label smoothing; the formula is as follows:

LID = −Σ (i = 1 to B) yi·log(pi)
wherein B is the number of classes, yi is the one-hot encoding of the true label, and pi is the probability predicted by the model that the sample belongs to the i-th class;
the triplet loss formula is as follows:

Ltri = [d(a,p) − d(a,n) + m]+
wherein d(a,p) and d(a,n) denote the distances between the anchor sample xa and the positive sample xp, and between the anchor sample xa and the negative sample xn, respectively, i.e. d(a,p) = ‖f(xa) − f(xp)‖2 and d(a,n) = ‖f(xa) − f(xn)‖2; the hyper-parameter m serves as a lower bound on the margin between positive-pair and negative-pair distances;
wherein the function f(·) denotes the feature extraction operator that maps an input image to the embedding space; ‖·‖2 denotes the L2 norm, used to compute the Euclidean distance between two feature vectors; [·]+ is the hinge function: the loss is counted only when the bracketed value is positive, and is 0 otherwise;
The total loss function L is as follows:

L = Σ (i = 0 to 3) ui·Li

wherein the loss Li of each output is initially assigned an equal weight ui, i = 0, 1, 2, 3; the weight of each part is then dynamically adjusted through the back-propagation algorithm during training;
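The loss terms of claim 6 can be sketched directly; the weights u_i are static in this sketch, whereas the claim adjusts them by back-propagation during training:

```python
import numpy as np

def id_loss(logits, label):
    # cross entropy without label smoothing: -sum_i y_i * log(p_i)
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return -np.log(p[label])

def triplet_loss(fa, fp, fn, m=0.3):
    # [d(a,p) - d(a,n) + m]_+ with Euclidean distances in embedding space
    d_ap = np.linalg.norm(fa - fp)
    d_an = np.linalg.norm(fa - fn)
    return max(d_ap - d_an + m, 0.0)

def total_loss(parts, u):
    # weighted sum over the per-output losses L_i
    return float(np.dot(u, parts))
```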
Judging whether the maximum number of iterations has been reached; if so, outputting the final model precision; if not, repeating steps (2)-(5).
7. The method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, further comprising step (0): constructing a surveillance network to obtain pedestrian video data; detecting pedestrians with a target detection algorithm, then obtaining pedestrian detection boxes with a target tracking algorithm; the pedestrian video sequences, cropped to the 258 × 128 pixel specification, constitute the image dataset.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311661718.0A CN117635973B (en) | 2023-12-06 | 2023-12-06 | Clothing changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117635973A CN117635973A (en) | 2024-03-01 |
CN117635973B true CN117635973B (en) | 2024-05-10 |
Family
ID=90023146
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113627266A (en) * | 2021-07-15 | 2021-11-09 | Wuhan University | Video pedestrian re-identification method based on Transformer space-time modeling
CN115482508A (en) * | 2022-09-26 | 2022-12-16 | Tianjin University of Technology | Clothing-changing pedestrian re-identification method, device, equipment and computer-storable medium
CN115631513A (en) * | 2022-11-10 | 2023-01-20 | Hangzhou Dianzi University | Multi-scale pedestrian re-identification method based on Transformer
JP2023523502A (en) * | 2021-04-07 | 2023-06-06 | Beijing Baidu Netcom Science Technology Co., Ltd. | Model training method, pedestrian re-identification method, apparatus and electronic device
CN116486433A (en) * | 2023-04-10 | 2023-07-25 | Zhejiang University | Re-identification method based on cross self-distillation Transformer re-identification network
CN116977817A (en) * | 2023-04-28 | 2023-10-31 | Zhejiang Gongshang University | Pedestrian re-recognition method based on multi-scale feature learning
CN117011883A (en) * | 2023-05-16 | 2023-11-07 | Shenyang University of Chemical Technology | Pedestrian re-recognition method based on pyramid convolution and Transformer dual branches
Non-Patent Citations (6)
Title |
---|
Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization; Xin Jin et al.; https://arxiv.org/pdf/2103.15537.pdf; 2022-03-31; full text *
Multi-Biometric Unified Network for Cloth-Changing Person Re-Identification; Guoqing Zhang et al.; 2022 IEEE International Conference on Multimedia and Expo (ICME); 2022-08-26; full text *
Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval; Xianghao Zang et al.; https://arxiv.org/pdf/2202.06014.pdf; 2022-04-06; full text *
Specialized Re-Ranking: A Novel Retrieval-Verification Framework for Cloth Changing Person Re-Identification; Renjie Zhang et al.; https://arxiv.org/pdf/2210.03592.pdf; 2022-10-07; full text *
TransReID: Transformer-based Object Re-Identification; Shuting He et al.; https://arxiv.org/pdf/2102.04378.pdf; 2021-03-26; full text *
Pedestrian re-identification method based on CNN and Transformer multi-scale learning; Chen Ying et al.; Journal of Electronics & Information Technology; 2023-06-30; Vol. 45, No. 6; full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||