CN117635973B - Clothing changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation - Google Patents
- Legal status: Active
Abstract
The invention discloses a clothing-changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation, which comprises the following steps: (1) adding a wind-and-rain scene to the image data set and performing standardized preprocessing and data enhancement operations; (2) constructing a sequence input to a Transformer model; (3) constructing a pedestrian feature extraction network based on a standard Transformer architecture; (4) carrying out dynamic weight adjustment and fusion processing on the features obtained from each Transformer layer by means of a multi-layer dynamic focusing module; (5) selectively extracting and fusing specific layer features in the Transformer network through a local pyramid aggregation module to obtain multi-scale feature information; (6) applying the feature outputs obtained in steps (4)-(5) to a loss function to verify whether the query image and the test image belong to the same category, thereby completing the training and optimization of the model. The method can remarkably improve the recognition accuracy and robustness of the algorithm in complex scenes, especially for the clothing-changing pedestrian re-identification task.
Description
Technical Field
The invention relates to the technical field of computer vision image recognition, and in particular to a clothing-changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation.
Background
Pedestrian Re-identification (ReID) is a key problem in computer vision and public safety research, aiming to confirm and track the identity of individuals across different surveillance cameras. Existing ReID algorithms focus mainly on efficient short-term identification strategies, but these often fail to account for changes in pedestrian clothing, limiting their application over long time spans. In practical applications, especially law enforcement and criminal investigation scenarios, persons of interest may evade recognition by changing their apparel, which places higher demands on ReID systems. Therefore, developing a robust long-term, cloth-changing ReID technology (CC-ReID) is a necessary path toward solving the identification problems caused by garment changes.
Current research on CC-ReID falls largely into two categories. The first introduces auxiliary modules (e.g., generating body contour sketches, extracting pose key points, gait analysis) to identify clothing-independent biometric features. For example, Yang et al. [1] overcome the effects of garment changes by constructing a body-contour-based network model. Nevertheless, this approach is susceptible to external conditions (e.g., lighting and occlusion) and may ignore other important biometric cues such as facial features and gait patterns. The second category aims to disentangle identity features from clothing features. For example, the adversarial feature disentanglement network (AFD-Net) proposed by Xu et al. uses intra-class reconstruction and inter-class adversarial mechanisms to separate identity-related from identity-irrelevant (e.g., clothing) features. However, this approach may face high computational cost, model stability issues, and data dependency.
In recent years, models based on the Transformer architecture have benefited from advanced multi-head attention mechanisms and achieved breakthrough results on identification tasks that require jointly analyzing multiple key features of an image. Through parallel processing, multi-head attention can effectively focus on key features in different regions of the image, strengthening the model's adaptability and discriminative power under viewpoint transformations and pedestrian clothing changes. However, existing methods mainly use the high-level information of the top Transformer layer to extract discriminative features and fail to fully exploit the detailed information in the lower layers of the network, which may limit the model's ability to capture fine-grained features in complex scenes. To solve this problem, we propose an innovative adaptive perceptual attention mechanism and a pyramid-level feature fusion network. The design aims to integrate multi-scale information efficiently so as to enhance the accuracy and robustness of the clothing-changing pedestrian re-identification algorithm in complex scenes.
Disclosure of Invention
The invention aims to: provide a clothing-changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation.
The technical scheme is as follows: the invention discloses a re-identification method for a clothing changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation, which comprises the following steps:
(1) Adding a wind and rain scene to the image data set and performing standardized preprocessing and data enhancement operations;
(2) Dividing the preprocessed image into N blocks of consistent size without overlap, introducing an additional learnable embedding [CLS_TOKEN] as the global feature of the sequence input, and at the same time assigning each block a position code [POS_TOKEN], to form the sequence input to the Transformer model;
(3) Constructing a pedestrian feature extraction network based on a standard Transformer architecture, inputting the sequence generated in step (2), extracting pedestrian features and recording the features of each Transformer layer;
(4) Carrying out dynamic weight adjustment and fusion processing on the per-layer Transformer features obtained in step (3) by means of a multi-layer dynamic focusing module;
(5) Selectively extracting and fusing specific layer features in the Transformer network through a local pyramid aggregation module to obtain multi-scale feature information, with a fast Fourier transform embedded into the self-attention mechanism;
(6) Applying the feature outputs obtained in steps (4)-(5) to a loss function to verify whether the query image and the test image belong to the same category, thereby completing the training and optimization of the model.
Further, the step (1) of adding a weather scene to the image data set includes the steps of:
(11) Generating a noise matrix N obeying a uniform distribution over the image width w and height h using the formula N ~ Uniform(0, 255), to simulate the random scattering of raindrops at different positions;
(12) Applying blurring to the noise matrix via the formula N′ = N * K to generate a raindrop effect without a specific direction;
wherein K denotes a predefined blur kernel and (*) denotes a two-dimensional convolution operation;
(13) Constructing a diagonal matrix D to represent the straight falling path of raindrops; simulating the inclination of raindrops by rotating the diagonal matrix D, and then reproducing the falling speed and direction of raindrops in the air with Gaussian blurring, finally obtaining a blur kernel M that simulates rain streaks;
(14) Fusing the simulated weather effect with the original image by a per-channel blending formula;
wherein I_C denotes channel C of the original image, β is the blending weight, and N″ is the noise matrix after convolution with the blur kernel M.
Further, the standardized preprocessing and data enhancement operations in step (1) include: horizontal flipping, random cropping and random erasing.
Further, the step (2) specifically includes the following steps:
Let image x ∈ R^{H×W×C}, where H, W and C denote the height, width and number of channels of image x, respectively;
First, the image is divided into N non-overlapping blocks, denoted {x_1, x_2, ..., x_N}; second, an additional learnable embedding x_cls is introduced at the beginning of the input sequence as the aggregated feature representation; then, a position code P is added to the feature vector of each image block; finally, the input sequence passed to the Transformer layers is formulated as:
Z_0 = [x_cls; F(x_1); F(x_2); ...; F(x_N)] + P
where Z_0 denotes the input sequence embedding; P ∈ R^{(N+1)×D} denotes the position embedding; F is a linear projection function that maps each image block to a D-dimensional space.
Further, the step (3) specifically includes the following steps:
The input sequence Z_0 is fed into the Transformer network for processing; each layer refines the features and integrates context information through a multi-head self-attention mechanism, and the output Z_l of the l-th layer is computed as:
Z_l = TransformerLayer(Z_{l-1}), l = 1, 2, ..., L
where TransformerLayer denotes a layer of the standard Transformer and L denotes the total number of layers;
The output of every Transformer layer, Z_1, Z_2, ..., Z_L, is recorded.
Further, the step (4) includes the following steps:
(41) Constructing a weight vector W = {w_1, w_2, ..., w_L}, where w_i is the importance of the features extracted by the i-th layer in the model hierarchy; each layer is weighted using an orthogonality-constrained weighting; the specific weighting formula is as follows:
where f_i denotes the feature importance of the i-th layer, initialized to a uniform value across all layers; β and γ are learnable parameters; ⟨F_i, F_j⟩ denotes the inner product between the feature sets of the i-th and j-th layers, used as a measure of their feature correlation; α is a regularization coefficient; L is the total number of layers.
(42) An L2 regularization term is introduced when computing the fused features, with the following formula:
where λ is a non-negative regularization parameter used to mitigate overfitting by limiting the magnitude of the weights within the model, and ‖W‖_F² is the squared Frobenius norm of the weight matrix W, i.e., the sum of squares of all layer weights.
Further, the step (5) specifically comprises the following steps:
In the local pyramid aggregation module, the output features f_1, f_2, f_3, f_4 of four different Transformer layers are selected as input, and a convolution block operation is applied to each:
First, a 1×1 convolution layer is used; second, BatchNorm1D and ReLU functions adjust the feature dimensions and introduce nonlinearity; then, a fast-Fourier-transform self-attention mechanism is added, optimizing the features with global information from all elements of the sequence; finally, all the features are concatenated and fed into the same convolution block to obtain the fused features. The formula is as follows:
where the operator in the formula denotes the entire convolution block operation, and f_t denotes the feature obtained by fusing f_m and f_{m+1}. As shown in FIG. 2, three outputs are finally obtained from the local pyramid aggregation module.
Further, the loss function of step (6) includes: an ID loss and a triplet loss; the ID loss adopts the traditional cross-entropy loss function, without label smoothing; the formula is as follows:
where C is the number of classes, y_i is the one-hot encoding of the true label, and p_i is the predicted probability that the sample belongs to the i-th class.
The triplet loss formula is as follows:
where d(a,p) and d(a,n) denote the distance between the anchor sample x_a and the positive sample x_p, and between the anchor sample and the negative sample x_n, respectively; the hyperparameter M is the margin, which enforces a lower bound on the gap between the negative-pair and positive-pair distances;
where the function f(·) denotes the feature extraction operator mapping an input image into the embedding space; ‖·‖_2 denotes the L2 norm used to compute the Euclidean distance between two feature vectors; [·]_+ is the hinge function: a loss is incurred only when the bracketed value is positive, and is 0 otherwise;
The total loss function formula L is as follows:
where N denotes the number of outputs produced by the entire training architecture; initially the loss of each output is given equal weight, denoted w_i (i = 0, 1, 2, 3); the weights of the parts are then dynamically adjusted during training by the back-propagation algorithm.
Judging whether the maximum iteration times are reached, if so, outputting the final model precision, and if not, repeating the steps (2) - (5).
Further, the method also comprises the following step: (0) constructing a surveillance network to obtain pedestrian video data; detecting pedestrians with a target detection algorithm, then obtaining pedestrian detection boxes with a target tracking algorithm; the pedestrian video sequences are cropped to a 258 × 128 pixel specification to form an image gallery.
An electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements any one of the above clothing-changing pedestrian re-identification methods based on multilayer dynamic concentration and local pyramid aggregation.
Beneficial effects: compared with the prior art, the invention has the following remarkable advantages: by incorporating the detailed information of the lower network layers, fine-grained features in complex scenes can be captured and processed more effectively; the pyramid-level feature fusion network can integrate information from different levels, providing more comprehensive data analysis and processing; in complex scenes, especially the clothing-changing pedestrian re-identification task, the method remarkably improves the accuracy and robustness of the algorithm; and every level of the Transformer network is exploited more comprehensively, overcoming its limitations in handling complex scenes.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a network structure diagram of the multi-layer dynamic concentration and local pyramid aggregation framework;
FIG. 3 is a block structure diagram of the local pyramid aggregation module within the multi-layer dynamic concentration and local pyramid aggregation framework of the present invention;
FIG. 4 is a schematic illustration of self-attention combined with the fast Fourier transform in the framework of the present invention;
FIG. 5 is a schematic view of a pedestrian image with an added weather scene according to the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1-5, an embodiment of the present invention provides a clothing-changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation, comprising the following steps:
(0) Constructing a surveillance network to acquire pedestrian video data; detecting pedestrians with a target detection algorithm, then obtaining pedestrian detection boxes with a target tracking algorithm; the pedestrian video sequences are cropped to a 258 × 128 pixel specification to form an image gallery;
(1) Adding a wind and rain scene to the image data set and performing standardized preprocessing and data enhancement operations; adding a weather scene to an image dataset comprises the steps of:
(11) Generating a noise matrix N obeying a uniform distribution over the image width w and height h using the formula N ~ Uniform(0, 255), to simulate the random scattering of raindrops at different positions;
(12) Applying blurring to the noise matrix via the formula N′ = N * K to generate a raindrop effect without a specific direction;
wherein K denotes a predefined blur kernel and (*) denotes a two-dimensional convolution operation;
(13) Constructing a diagonal matrix D to represent the straight falling path of raindrops; simulating the inclination of raindrops by rotating the diagonal matrix D, and then reproducing the falling speed and direction of raindrops in the air with Gaussian blurring, finally obtaining a blur kernel M that simulates rain streaks;
(14) Fusing the simulated weather effect with the original image by a per-channel blending formula;
wherein I_C denotes channel C of the original image, β is the blending weight, and N″ is the noise matrix after convolution with the blur kernel M;
the standardized preprocessing and data enhancement operations include: horizontal flipping, random cropping and random erasing.
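The rain-scene steps (11)-(14) above can be sketched in NumPy as follows. The kernel sizes, streak length and blending rule are illustrative assumptions, and the function names (`add_rain`, `conv2d`) do not appear in the patent:

```python
import numpy as np

def conv2d(img, kernel):
    """Naive same-size 2D sliding-window filter with zero padding
    (cross-correlation, used here as a stand-in for convolution)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def add_rain(image, beta=0.2, streak_len=9, rng=None):
    """Sketch of steps (11)-(14) on a single-channel image."""
    rng = np.random.default_rng(rng)
    h, w = image.shape
    # (11) uniform noise matrix N ~ Uniform(0, 255)
    noise = rng.uniform(0, 255, size=(h, w))
    # (12) isotropic blur with a predefined kernel K (3x3 box blur here)
    k = np.ones((3, 3)) / 9.0
    blurred = conv2d(noise, k)
    # (13) diagonal matrix D as a straight falling path -> streak kernel M
    M = np.eye(streak_len) / streak_len
    streaks = conv2d(blurred, M)
    # (14) blend the simulated rain with the original image channel
    out = (1 - beta) * image + beta * streaks
    return np.clip(out, 0, 255)
```

A color image would apply `add_rain` to each channel I_C with the same streak kernel so the rain pattern is consistent across channels.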
(2) Dividing the preprocessed image into N blocks of consistent size without overlap, introducing an additional learnable embedding [CLS_TOKEN] as the global feature of the sequence input, and assigning each block a position code [POS_TOKEN], to form the sequence input to the Transformer model; specifically:
Let image x ∈ R^{H×W×C}, where H, W and C denote the height, width and number of channels of image x, respectively;
First, the image is divided into N non-overlapping blocks, denoted {x_1, x_2, ..., x_N}; second, an additional learnable embedding x_cls is introduced at the beginning of the input sequence as the aggregated feature representation; then, a position code P is added to the feature vector of each image block; finally, the input sequence passed to the Transformer layers is formulated as:
Z_0 = [x_cls; F(x_1); F(x_2); ...; F(x_N)] + P
where Z_0 denotes the input sequence embedding; P ∈ R^{(N+1)×D} denotes the position embedding; F is a linear projection function that maps each image block to a D-dimensional space.
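The patch-splitting and sequence construction of step (2) can be sketched as follows; the random projection F and the randomly initialized x_cls and P stand in for parameters that a real model would learn:

```python
import numpy as np

def build_input_sequence(image, patch, dim, rng=None):
    """Split an HxWxC image into N non-overlapping patches, project each to
    D dimensions, prepend a class embedding x_cls and add position codes P,
    yielding Z_0 of shape (N+1, D)."""
    rng = np.random.default_rng(rng)
    h, w, c = image.shape
    n = (h // patch) * (w // patch)
    # rearrange into N patch vectors of length patch*patch*c
    patches = (image.reshape(h // patch, patch, w // patch, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(n, patch * patch * c))
    F = rng.normal(size=(patch * patch * c, dim))   # linear projection F
    x_cls = rng.normal(size=(1, dim))               # [CLS_TOKEN]
    P = rng.normal(size=(n + 1, dim))               # [POS_TOKEN]
    z0 = np.concatenate([x_cls, patches @ F], axis=0) + P
    return z0
```

For a 258 × 128 pedestrian crop with 16-pixel patches, N would be 16 × 8 = 128 tokens plus the class token.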
(3) Constructing a pedestrian feature extraction network based on a standard Transformer architecture, inputting the sequence generated in step (2), extracting pedestrian features and recording the features of each Transformer layer; specifically:
The input sequence Z_0 is fed into the Transformer network for processing; each layer refines the features and integrates context information through a multi-head self-attention mechanism, and the output Z_l of the l-th layer is computed as:
Z_l = TransformerLayer(Z_{l-1}), l = 1, 2, ..., L
where TransformerLayer denotes a layer of the standard Transformer and L denotes the total number of layers;
The output of every Transformer layer, Z_1, Z_2, ..., Z_L, is recorded.
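Recording every layer output Z_1 ... Z_L, as step (3) requires, can be sketched as follows; each layer here is any callable mapping (N+1, D) to (N+1, D), whereas a real backbone would use multi-head self-attention blocks:

```python
import numpy as np

def run_backbone(z0, layers):
    """Pass Z_0 through L layers and record every intermediate output
    Z_1 ... Z_L for the later fusion modules."""
    outputs = []
    z = z0
    for layer in layers:
        z = layer(z)          # Z_l = TransformerLayer(Z_{l-1})
        outputs.append(z)
    return outputs            # [Z_1, Z_2, ..., Z_L]
```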
(4) The per-layer Transformer features obtained in step (3) are subjected to dynamic weight adjustment and fusion processing by the multi-layer dynamic focusing module; specifically:
(41) A weight vector W = {w_1, w_2, ..., w_L} is constructed, where w_i is the importance of the features extracted by the i-th layer in the model hierarchy; each layer is weighted using an orthogonality-constrained weighting; the specific weighting formula is as follows:
where f_i denotes the feature importance of the i-th layer, initialized to a uniform value across all layers; β and γ are learnable parameters; ⟨F_i, F_j⟩ denotes the inner product between the feature sets of the i-th and j-th layers, used as a measure of their feature correlation; α is a regularization coefficient; L is the total number of layers.
(42) An L2 regularization term is introduced when computing the fused features, with the following formula:
where λ is a non-negative regularization parameter used to mitigate overfitting by limiting the magnitude of the weights within the model, and ‖W‖_F² is the squared Frobenius norm of the weight matrix W, i.e., the sum of squares of all layer weights.
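Since the exact weighting formula appears only in the original drawings, the following sketch reproduces the stated ingredients of step (4): per-layer importance f_i, an orthogonality penalty built from inter-layer inner products ⟨F_i, F_j⟩, normalized weights, and an L2 term. It does not claim to be the patented formula, and all coefficient values are illustrative:

```python
import numpy as np

def dynamic_fuse(features, f, alpha=0.1, lam=0.01):
    """Weight per-layer features by importance minus an orthogonality
    penalty, normalize with a softmax, return the weighted-sum fusion and
    the L2 (squared Frobenius) regularization term lam * sum(w_i^2)."""
    L = len(features)
    flat = [z.ravel() for z in features]
    # penalty: correlation of layer i with every other layer (inner products)
    penalty = np.array([
        sum(abs(np.dot(flat[i], flat[j])) for j in range(L) if j != i)
        for i in range(L)
    ])
    scores = np.asarray(f, dtype=float) - alpha * penalty
    w = np.exp(scores - scores.max())
    w = w / w.sum()                       # weights sum to 1
    fused = sum(wi * zi for wi, zi in zip(w, features))
    l2_term = lam * np.sum(w ** 2)        # regularizer on the weight vector
    return fused, w, l2_term
```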
(5) Specific layer features in the Transformer network are selectively extracted and fused through the local pyramid aggregation module to obtain multi-scale feature information, with a fast Fourier transform embedded into the self-attention mechanism; specifically:
In the local pyramid aggregation module, the output features f_1, f_2, f_3, f_4 of four different Transformer layers are selected as input, and a convolution block operation is applied to each:
First, a 1×1 convolution layer is used; second, BatchNorm1D and ReLU functions adjust the feature dimensions and introduce nonlinearity; then, a fast-Fourier-transform self-attention mechanism is added, optimizing the features with global information from all elements of the sequence; finally, all the features are concatenated and fed into the same convolution block to obtain the fused features. The formula is as follows:
where the operator in the formula denotes the entire convolution block operation, and f_t denotes the feature obtained by fusing f_m and f_{m+1}. As shown in FIG. 2, three outputs are finally obtained from the local pyramid aggregation module.
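The pairwise aggregation wiring described above can be sketched as follows; the `conv_block` stand-in replaces the 1×1 convolution, BatchNorm1D, ReLU and FFT self-attention with a plain ReLU, so only the four-input, three-output wiring is demonstrated:

```python
import numpy as np

def conv_block(x):
    """Stand-in for the patented convolution block (1x1 conv + BatchNorm1D
    + ReLU + FFT self-attention); a ReLU suffices to show the wiring."""
    return np.maximum(x, 0.0)

def pyramid_aggregate(f1, f2, f3, f4):
    """Aggregate four layer outputs pairwise: each level passes two inputs
    through the conv block, concatenates them, and feeds the result through
    the same block, yielding three fused outputs."""
    outs = []
    feats = [f1, f2, f3, f4]
    for fm, fm1 in zip(feats[:-1], feats[1:]):
        merged = np.concatenate([conv_block(fm), conv_block(fm1)], axis=-1)
        outs.append(conv_block(merged))   # fused feature f_t
    return outs
```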
The fast-Fourier-transform self-attention mechanism works as follows:
First, the self-attention module receives an input X ∈ R^{B×N×C}, where B is the batch size, N is the sequence length, and C is the feature dimension. Second, through three linear layers, the input X is converted into query Q, key K and value V: Q = XW_Q, K = XW_K, V = XW_V, where W_Q, W_K and W_V are learnable weight matrices; the query, key and value are then split into multiple heads. Since the fast Fourier transform (FFT) algorithm is most efficient when the input size is an integer power of 2, appropriate padding is applied to Q and K; the FFT is then applied to the padded Q_padded and K_padded, and their correlation is estimated in the frequency domain. The output formula is as follows:
Attn = Softmax(F^{-1}(F(Q_padded) ⊙ F(K_padded))[:, :, :, :Q.size(1)])
where F(·) and F^{-1}(·) denote the FFT and the inverse FFT, respectively. The element-wise product of the FFT results is computed first; the product is then processed with the inverse FFT and truncated to the original sequence length. The truncated result is normalized with a softmax function to obtain the attention weights Attn. Finally, the attention weights and the corresponding value vectors are aggregated through an element-wise product, and the result is added to the input X to obtain the feature-enhanced self-attention output:
Out = Attn ⊙ V + X
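A single-head NumPy sketch of the FFT-based self-attention above: pad to a power of two, correlate Q and K in the frequency domain, truncate the inverse transform to the original length, softmax-normalize, then apply Out = Attn ⊙ V + X. The multi-head split and batch dimension are omitted for brevity, and the weight matrices are passed in explicitly:

```python
import numpy as np

def fft_attention(x, wq, wk, wv):
    """x: (N, C) sequence; wq/wk/wv: (C, C) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    n = q.shape[0]
    pad = 1 << (n - 1).bit_length()                # pad length: power of two
    qf = np.fft.fft(q, n=pad, axis=0)              # F(Q_padded)
    kf = np.fft.fft(k, n=pad, axis=0)              # F(K_padded)
    corr = np.fft.ifft(qf * kf, axis=0).real[:n]   # truncate to Q.size(1)
    attn = np.exp(corr - corr.max(axis=0))         # softmax over sequence
    attn = attn / attn.sum(axis=0)
    return attn * v + x                            # Out = Attn ⊙ V + X
```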
(6) The feature outputs obtained in steps (4)-(5) are applied to a loss function to verify whether the query image and the test image belong to the same category, thereby completing the training and optimization of the model. The loss function includes: an ID loss and a triplet loss; the ID loss adopts the traditional cross-entropy loss function, without label smoothing; the formula is as follows:
where C is the number of classes, y_i is the one-hot encoding of the true label, and p_i is the predicted probability that the sample belongs to the i-th class.
The triplet loss formula is as follows:
where d(a,p) and d(a,n) denote the distance between the anchor sample x_a and the positive sample x_p, and between the anchor sample and the negative sample x_n, respectively; the hyperparameter M is the margin, which enforces a lower bound on the gap between the negative-pair and positive-pair distances;
where the function f(·) denotes the feature extraction operator mapping an input image into the embedding space; ‖·‖_2 denotes the L2 norm used to compute the Euclidean distance between two feature vectors; [·]_+ is the hinge function: a loss is incurred only when the bracketed value is positive, and is 0 otherwise;
The total loss function L is as follows:
where N denotes the number of outputs produced by the entire training architecture; initially the loss of each output is given equal weight, denoted w_i (i = 0, 1, 2, 3); the weights of the parts are then dynamically adjusted during training by the back-propagation algorithm.
Judging whether the maximum iteration times are reached, if so, outputting the final model precision, and if not, repeating the steps (2) - (5).
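The two loss terms of step (6) can be sketched as follows for a single sample and a single triplet; the margin value 0.3 is an illustrative choice, not a value from the patent:

```python
import numpy as np

def id_loss(logits, label):
    """Cross-entropy ID loss without label smoothing: -log p_label."""
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return -np.log(p[label])

def triplet_loss(fa, fp, fn, margin=0.3):
    """Hinge triplet loss [d(a,p) - d(a,n) + M]_+ on embedded features,
    with Euclidean (L2) distances and margin M."""
    d_ap = np.linalg.norm(fa - fp)   # anchor-positive distance
    d_an = np.linalg.norm(fa - fn)   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```

In training, the total loss would be the w_i-weighted sum of these terms over all N outputs of the architecture.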
An embodiment of the invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements any one of the above clothing-changing pedestrian re-identification methods based on multilayer dynamic concentration and local pyramid aggregation.
Claims (8)
1. The clothing changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation is characterized by comprising the following steps of:
(1) Adding a wind and rain scene to the image data set and performing standardized preprocessing and data enhancement operations;
(2) Dividing the preprocessed image into Q blocks of consistent size without overlap, introducing an additional learnable embedding [CLS_TOKEN] as the global feature of the sequence input, and assigning each block a position code [POS_TOKEN], to form the sequence Z_0 input to the pedestrian feature extraction network;
(3) Constructing a pedestrian feature extraction network based on a standard Transformer architecture, inputting the sequence generated in step (2), extracting pedestrian features, and recording the output features Z_l, l = 1, 2, ..., L, of each Transformer layer; L is the number of Transformer layers included in the pedestrian feature extraction network;
(4) Carrying out dynamic weight adjustment and fusion processing on the output features of each Transformer layer obtained in step (3) by means of a multi-layer dynamic focusing module;
The step (4) comprises the following steps:
(41) Constructing a weight vector W = {w_1, w_2, ..., w_L}, where w_i is the weight of the feature output by the i-th Transformer layer in the pedestrian feature extraction network; each Transformer layer is weighted using an orthogonality-constrained weighting; the specific weighting formula is as follows:
where g_i denotes the feature importance of the i-th layer, initialized to a uniform value across all layers; β and γ are learnable parameters; ⟨Z_i, Z_j⟩ denotes the inner product between the output features of the i-th and j-th Transformer layers, used as a measure of their correlation; α is a regularization coefficient;
(42) An L2 regularization term is introduced when computing the fused features, with the following formula:
where λ is a non-negative regularization parameter used to mitigate overfitting by limiting the magnitude of the weights within the model, and ‖W‖_F² is the squared Frobenius norm of the weight vector W, i.e., the sum of squares of all Transformer layer weights;
(5) Selectively extracting and fusing the output characteristics of a specific Transformer layer in the pedestrian characteristic extraction network through a local pyramid aggregation module to obtain multi-scale characteristic information, and embedding the multi-scale characteristic information into a self-attention mechanism by adopting fast Fourier transformation;
The step (5) is specifically as follows:
In the local pyramid aggregation module, selecting output features f 1,f2,f3,f4 of four different convertors layers as input, and performing three-layer pyramid feature aggregation operation, wherein each feature aggregation operation comprises the steps of connecting self-attention outputs obtained by respectively performing convolution block calculation on two inputs, and inputting the self-attention outputs into the same convolution block to obtain fused features; three outputs are finally obtained through the local pyramid aggregation module;
The convolution block is computed as follows: first, a 1×1 convolution layer is applied; next, BatchNorm2d and a ReLU function adjust the feature dimensions and introduce nonlinearity; then, a fast-Fourier-transform self-attention mechanism is added to obtain a feature-enhanced self-attention output;
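A minimal sketch of one aggregation step and its convolution block: the 1×1 convolution is reduced to a per-token linear projection, BatchNorm2d/ReLU to standardization plus clamping, and the FFT self-attention is modeled as FNet-style Fourier token mixing; all shapes and names are assumptions for illustration:

```python
import numpy as np

def fft_attention(x):
    # FFT-based mixing in place of dot-product attention: 2-D FFT over
    # (tokens, channels), keeping the real part (FNet-style).
    return np.fft.fft2(x).real

def conv_block(x, w):
    # 1x1 conv on a token sequence == per-token linear projection.
    y = x @ w
    y = (y - y.mean(axis=0)) / (y.std(axis=0) + 1e-5)  # stand-in for BatchNorm2d
    y = np.maximum(y, 0.0)                             # ReLU
    return y + fft_attention(y)                        # add FFT self-attention

def aggregate(fa, fb, wa, wb, wf):
    # one pyramid step: conv-block both inputs, concatenate the two
    # self-attention outputs, and fuse them with a shared conv block.
    cat = np.concatenate([conv_block(fa, wa), conv_block(fb, wb)], axis=1)
    return conv_block(cat, wf)

# Applying aggregate to adjacent pairs of f1..f4 yields three outputs,
# as in the claim; weight shapes: wa, wb: (D, D), wf: (2D, D).
```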
(6) Applying the feature outputs obtained in steps (4)-(5) to a loss function to verify whether the query image and the test image belong to the same class, thereby completing the training and optimization of the model.
2. The method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein adding a weather scene to the image dataset in step (1) comprises the following steps:
(11) Generating a noise matrix N of the image width W and height H obeying a uniform distribution, N ~ Uniform(0, 255), to simulate the random scattering of raindrops at different positions;
(12) Applying blurring to the noise matrix via the formula N′ = N ∗ K to generate a raindrop effect without a specific direction, wherein K denotes a predefined blur kernel and ∗ denotes a two-dimensional convolution operation;
(13) Constructing a diagonal matrix D to represent the straight-line falling path of raindrops; simulating the slant of raindrops by rotating the diagonal matrix D, then reproducing the speed and direction of raindrops falling through the air with Gaussian blurring, finally obtaining a blur kernel M that simulates rain;
(14) Fusing the simulated weather effect with the original image by the formula:

I′ = β·IC + (1 − β)·N″

wherein IC represents the original image, β is the mixing weight, and N″ is the noise matrix after application of the blur kernel M.
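Steps (11)-(14) can be sketched as below; the speck threshold, the streak kernel (a plain diagonal standing in for the rotated, Gaussian-blurred D of step (13)), and the convex blend in add_rain are assumptions, since the exact fusion formula is not reproduced in this text:

```python
import numpy as np

def conv2d(img, k):
    # naive 'same' 2-D convolution used as the blur operator *
    H, W = img.shape
    kh, kw = k.shape
    out = np.zeros_like(img, dtype=float)
    pi = np.pad(img.astype(float),
                ((kh // 2, kh - kh // 2 - 1), (kw // 2, kw - kw // 2 - 1)))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * pi[i:i + H, j:j + W]
    return out

def add_rain(img, beta=0.85, length=9, seed=0):
    rng = np.random.default_rng(seed)
    h, w = img.shape
    # (11) uniform noise N ~ Uniform(0, 255); keep sparse specks as drops
    noise = rng.uniform(0, 255, size=(h, w))
    noise = np.where(noise > 250, 255.0, 0.0)
    # (13) a diagonal matrix as the straight falling path (rotation and
    # Gaussian blur are approximated by the plain diagonal here)
    M = np.eye(length) / length
    streaks = conv2d(noise, M)            # (12)+(13) blurred rain streaks
    # (14) blend with mixing weight beta (assumed convex combination)
    out = beta * img + (1 - beta) * streaks
    return np.clip(out, 0, 255)
```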
3. The method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein the standardized preprocessing and data enhancement operations in step (1) comprise: horizontal flipping, random cropping and random erasing.
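The three enhancement operations can be sketched in NumPy as follows; the padding size, the erase-patch size, and the always-on erasing are illustrative choices (random erasing is normally applied only with some probability):

```python
import numpy as np

def augment(img, rng):
    # horizontal flip with probability 0.5
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # random crop: pad by 4 pixels, then crop back to the original size
    p = 4
    padded = np.pad(img, ((p, p), (p, p), (0, 0)), mode="edge")
    y, x = rng.integers(0, 2 * p + 1, size=2)
    img = padded[y:y + img.shape[0], x:x + img.shape[1]]
    # random erasing: zero out a random rectangle
    h, w = img.shape[:2]
    eh, ew = h // 4, w // 4
    y0 = int(rng.integers(0, h - eh))
    x0 = int(rng.integers(0, w - ew))
    img = img.copy()
    img[y0:y0 + eh, x0:x0 + ew] = 0
    return img
```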
4. The method for re-identifying the clothing changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein the step (2) is specifically as follows:
Setting an image x ∈ R^(W×H×C), wherein H, W and C respectively represent the height, width and number of channels of the image x;
First, the image is partitioned into Q non-overlapping blocks, denoted {xi | i = 1, 2, ..., Q}; secondly, an additional learnable embedding xcls is introduced at the beginning of the input sequence as the aggregated feature representation; then, a position encoding P is added to the feature vector of each image block; finally, the input sequence fed to the Transformer layers is formulated as:
Z0 = [xcls; F(x1); F(x2); ...; F(xQ)] + P
wherein Z0 represents the input sequence; P ∈ R^((Q+1)×D) represents the position embedding; F is a linear projection function that maps each image block to the D-dimensional space.
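A sketch of the patch-embedding pipeline of claim 4; the patch size, dimension D, random initialization, and zero-initialized class token are assumptions made for illustration:

```python
import numpy as np

def patch_embed(x, patch=16, D=64, seed=0):
    """Split image x (H, W, C) into Q non-overlapping patches, project each
    to D dimensions with a linear map F, prepend a class token x_cls, and
    add a position embedding P, yielding Z0 of shape (Q + 1, D)."""
    rng = np.random.default_rng(seed)
    H, W, C = x.shape
    patches = [x[i:i + patch, j:j + patch].reshape(-1)
               for i in range(0, H, patch)
               for j in range(0, W, patch)]
    Q = len(patches)
    F = rng.normal(0.0, 0.02, size=(patch * patch * C, D))  # projection F
    x_cls = np.zeros((1, D))                                # learnable token
    P = rng.normal(0.0, 0.02, size=(Q + 1, D))              # position embedding
    Z0 = np.concatenate([x_cls, np.stack(patches) @ F], axis=0) + P
    return Z0
```

For a 256 × 128 × 3 input with 16 × 16 patches, Q = 128 and Z0 has shape (129, D).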
5. The method for re-identifying the clothing changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein the step (3) is specifically as follows:
The input sequence Z0 is fed into the pedestrian feature extraction network for processing; each layer refines the features and integrates context information through a multi-head self-attention mechanism, and the output feature Zl of the l-th layer is calculated as follows:
Zl = TransformerLayer(Zl−1), l = 1, 2, ..., L
wherein TransformerLayer represents one layer of the standard Transformer architecture;
The output features of all Transformer layers constitute {Z1, Z2, ..., ZL}.
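A single-head sketch of the layer recursion Zl = TransformerLayer(Zl−1); the real network uses multi-head attention with feed-forward and LayerNorm sublayers, all omitted here for brevity:

```python
import numpy as np

def transformer_layer(Z, Wq, Wk, Wv):
    # single-head self-attention with a residual connection
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    A = Q @ K.T / np.sqrt(K.shape[1])            # scaled dot-product scores
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # row-wise softmax
    return Z + A @ V

def run_stack(Z0, layers):
    # iterate Z_l = TransformerLayer(Z_{l-1}) and collect {Z_1, ..., Z_L}
    outs, Z = [], Z0
    for Wq, Wk, Wv in layers:
        Z = transformer_layer(Z, Wq, Wk, Wv)
        outs.append(Z)
    return outs
```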
6. The method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, wherein the loss function of step (6) comprises an ID loss and a triplet loss; the ID loss adopts the conventional cross-entropy loss function without label smoothing; the formula is as follows:

LID = −Σ (i = 1 to B) yi·log(pi)
wherein B is the number of classes, yi is the one-hot encoding of the true label, and pi is the probability predicted by the model that the sample belongs to the i-th class;
the triplet loss formula is as follows:

Ltri = [d(a,p) − d(a,n) + m]+
wherein d(a,p) and d(a,n) denote the distances between the anchor sample xa and the positive sample xp, and between the anchor sample xa and the negative sample xn, respectively, i.e. d(a,p) = ‖f(xa) − f(xp)‖2 and d(a,n) = ‖f(xa) − f(xn)‖2; the hyper-parameter m serves as a lower bound on the margin between positive-pair and negative-pair distances;
wherein the function f(·) denotes the feature extraction operator that maps an input image to the embedding space; ‖·‖2 denotes the L2 norm, used to compute the Euclidean distance between two feature vectors; [·]+ is the hinge function: the loss is counted only when the bracketed value is positive, and is 0 otherwise;
The total loss function L is as follows:

L = Σ (i = 0 to 3) ui·Li

wherein the loss Li of each output is initially assigned an equal weight ui, i = 0, 1, 2, 3; the weight of each part is then dynamically adjusted through the back-propagation algorithm during training;
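The loss terms of claim 6 can be sketched directly; the weights u_i are static in this sketch, whereas the claim adjusts them by back-propagation during training:

```python
import numpy as np

def id_loss(logits, label):
    # cross entropy without label smoothing: -sum_i y_i * log(p_i)
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return -np.log(p[label])

def triplet_loss(fa, fp, fn, m=0.3):
    # [d(a,p) - d(a,n) + m]_+ with Euclidean distances in embedding space
    d_ap = np.linalg.norm(fa - fp)
    d_an = np.linalg.norm(fa - fn)
    return max(d_ap - d_an + m, 0.0)

def total_loss(parts, u):
    # weighted sum over the per-output losses L_i
    return float(np.dot(u, parts))
```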
Judging whether the maximum number of iterations has been reached; if so, outputting the final model precision; if not, repeating steps (2)-(5).
7. The method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to claim 1, further comprising step (0): constructing a surveillance network to obtain pedestrian video data; detecting pedestrians with a target detection algorithm, then obtaining pedestrian detection boxes with a target tracking algorithm; the pedestrian video sequences, cropped to the 258 × 128 pixel specification, constitute the image dataset.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the method for re-identifying a clothing-changing pedestrian based on multilayer dynamic concentration and local pyramid aggregation according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311661718.0A CN117635973B (en) | 2023-12-06 | 2023-12-06 | Clothing changing pedestrian re-identification method based on multilayer dynamic concentration and local pyramid aggregation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117635973A CN117635973A (en) | 2024-03-01 |
CN117635973B true CN117635973B (en) | 2024-05-10 |
Family
ID=90023146
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113627266A (en) * | 2021-07-15 | 2021-11-09 | Wuhan University | Video pedestrian re-identification method based on Transformer space-time modeling
CN115482508A (en) * | 2022-09-26 | 2022-12-16 | Tianjin University of Technology | Clothing-changing pedestrian re-identification method, device, equipment and computer-storable medium
CN115631513A (en) * | 2022-11-10 | 2023-01-20 | Hangzhou Dianzi University | Multi-scale pedestrian re-identification method based on Transformer
JP2023523502A (en) * | 2021-04-07 | 2023-06-06 | Beijing Baidu Netcom Science Technology Co., Ltd. | Model training method, pedestrian re-identification method, apparatus and electronic device
CN116486433A (en) * | 2023-04-10 | 2023-07-25 | Zhejiang University | Re-identification method based on cross self-distillation Transformer re-identification network
CN116977817A (en) * | 2023-04-28 | 2023-10-31 | Zhejiang Gongshang University | Pedestrian re-recognition method based on multi-scale feature learning
CN117011883A (en) * | 2023-05-16 | 2023-11-07 | Shenyang University of Chemical Technology | Pedestrian re-recognition method based on pyramid convolution and Transformer dual branches
Non-Patent Citations (6)
Title |
---|
Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization; Xin Jin et al.; https://arxiv.org/pdf/2103.15537.pdf; 2022-03-31; full text *
Multi-Biometric Unified Network for Cloth-Changing Person Re-Identification; Guoqing Zhang et al.; 2022 IEEE International Conference on Multimedia and Expo (ICME); 2022-08-26; full text *
Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval; Xianghao Zang et al.; https://arxiv.org/pdf/2202.06014.pdf; 2022-04-06; full text *
Specialized Re-Ranking: A Novel Retrieval-Verification Framework for Cloth Changing Person Re-Identification; Renjie Zhang et al.; https://arxiv.org/pdf/2210.03592.pdf; 2022-10-07; full text *
TransReID: Transformer-based Object Re-Identification; Shuting He et al.; https://arxiv.org/pdf/2102.04378.pdf; 2021-03-26; full text *
Pedestrian re-identification method based on CNN and Transformer multi-scale learning; Chen Ying et al.; Journal of Electronics & Information Technology; 2023-06-30; Vol. 45, No. 6; full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||