CN117011883A

CN117011883A - Pedestrian re-identification method based on pyramid convolution and Transformer dual branches

Info

Publication number: CN117011883A
Application number: CN202310551328.1A
Authority: CN
Inventors: 陈斌; 陈玉; 王琳泉; 刘浩然; 韩旭彤
Original assignee: Shenyang University of Chemical Technology
Current assignee: Shenyang University of Chemical Technology
Priority date: 2023-05-16
Filing date: 2023-05-16
Publication date: 2023-11-07

Abstract

The invention discloses a pedestrian re-identification method based on pyramid convolution and Transformer dual branches. The network comprises a pyramid convolution module, a Transformer branch consisting of N repeated Transformer blocks, and a bidirectional feature fusion architecture that fuses local features and global features. The pyramid convolution module extracts the local features of the pedestrian image, and the Transformer branch extracts its global features. Before the classifier, the features extracted by the two branches pass through a fully connected layer, a BN layer and a ReLU layer, which completes image feature extraction. During testing, the features fused from the two branches are used for gallery retrieval of pedestrian images, and the images are finally ranked by cosine similarity. The method obtains richer features to represent pedestrians, thereby improving the accuracy of pedestrian re-identification.

Description

Pedestrian re-identification method based on pyramid convolution and Transformer dual branches

Technical Field

The invention relates to a pedestrian re-identification method, in particular to a pedestrian re-identification method based on pyramid convolution and Transformer dual branches.

Background

Pedestrian re-identification uses computer vision to determine whether a particular pedestrian is present in an image or video sequence, i.e., it identifies the same person across multiple surveillance images of pedestrians. With the spread of large-scale high-definition cameras and the development of high-speed communication networks, pedestrian re-identification is widely applied in intelligent person search systems, intelligent security, autonomous driving, public security, social security and similar fields. Illumination differences, pose changes, viewpoint changes, occlusion and background noise in surveillance images all make identification harder. Extracting robust pedestrian features to judge whether two image features belong to the same individual is therefore the key problem of pedestrian re-identification.

Traditional pedestrian re-identification relies on hand-crafted features and distance metrics to judge whether images from different cameras show the same pedestrian. Because hand-crafted feature extraction is complex and its capacity is limited, it struggles with re-identification in complex surveillance scenes. With the development of deep learning in recent years, convolutional neural networks, which have strong feature extraction capability and can automatically extract features from raw image data according to task requirements, have achieved remarkable results in pedestrian re-identification.

At present, research on pedestrian re-identification is mainly based on two approaches: representation learning and metric learning. Representation learning does not directly consider the similarity between images when training the network; instead it treats pedestrian re-identification as a classification or verification problem. Unlike representation learning, metric learning aims to learn the similarity of two images through the network, so that different images of the same pedestrian are more similar than images of different pedestrians. The loss function of the network makes the distance between images of the same pedestrian (positive pairs) as small as possible and the distance between images of different pedestrians (negative pairs) as large as possible. Common metric learning losses include contrastive loss, triplet loss and quadruplet loss. Both representation learning and metric learning operate on the image features extracted by the network model.

Deep convolutional neural networks (CNNs) extract deep image features by stacking multiple layers of nonlinear transformations. In pedestrian re-identification they are mainly used to build complete end-to-end models based on global features: by deepening the network, the shallow, middle and deep features of pedestrian images are extracted in turn, which has improved model accuracy and brought great progress. However, the features extracted by a CNN are constrained by the convolution operation. A CNN is good at extracting local features but has difficulty capturing the global features of an image. Downsampling also loses context information, so the correlations between image features cannot be mined effectively, the spatial resolution of the output feature map is reduced, and performance in complex scenes is poor.

The Transformer network was first applied in natural language processing. Researchers then proposed extending the Transformer to computer vision tasks, which opened new research directions in the CV field. The first pure Transformer vision model introduced the concept of image patches in order to convert images into sequence data that the Transformer structure can process. Researchers have since proposed TransReID, a ReID framework based on a pure Transformer network structure. Following the idea of horizontal slicing, the framework cuts an image into several small patches, and the visual features of each patch are extracted by a feature extraction module. A Jigsaw Patch Module (JPM) is designed to improve the robustness and discriminative ability of the model, and Side Information Embeddings (SIE) are proposed to encode external information. TransReID exceeds CNN performance on several ReID benchmark datasets.

Owing to external factors and other problems, pedestrian re-identification still faces great challenges. In real scenes, occlusion and changes in pose and viewpoint leave pedestrian features insufficiently discriminative, and this problem remains to be solved.

Disclosure of Invention

The invention aims to provide a pedestrian re-identification method based on pyramid convolution and Transformer dual branches. The multi-head attention in the Transformer handles long-range dependencies, so that the model attends to different parts of the human body; at the same time the downsampling operation is removed, so more detail is retained. The CNN branch extracts local features of the pedestrian at a relatively high inference speed, while the cascaded Transformer blocks, with their complex spatial transformations and long-range dependencies, strengthen the representation of global information. The concurrent structure preserves the local features and the global representation of the image to the greatest extent and improves pedestrian feature extraction. The method obtains a network model that captures both the global information and the local fine-grained information of the target pedestrian image, trains the network with a cross-entropy loss function improved by label smoothing regularization, and obtains richer features to represent pedestrians, thereby improving the accuracy of pedestrian re-identification.

The invention aims at realizing the following technical scheme:

The invention has the following advantages and effects:

The invention adopts a pedestrian re-identification method based on a parallel CNN and Transformer dual-branch architecture. The CNN branch extracts local features of the pedestrian at a relatively high inference speed, and the cascaded Transformer blocks, with their complex spatial transformations and long-range dependencies, strengthen the representation of global information, so the concurrent structure preserves the local features and the global representation of the image to the greatest extent. A fully connected layer, a BN layer and a ReLU layer are added before the classifier. During training, the CNN branch and the Transformer branch are supervised with a label-smoothed cross-entropy loss function to complete image feature extraction. During testing, the features fused from the two branches are used for gallery retrieval of pedestrian images, which strengthens the generalization of the model. Training and testing on the Market1501 and DukeMTMC-reID datasets show that the pedestrian re-identification network model designed by the invention achieves high accuracy.

Drawings

FIG. 1 is a block diagram of the pedestrian re-identification functional modules;

FIG. 2 is a diagram of the pyramid convolution and Transformer dual-branch network model architecture designed by the invention;

FIG. 3 is a diagram of the pyramid convolution network;

FIG. 4 is a diagram of the Transformer branch network architecture;

FIG. 5 is a diagram of the CFT feature fusion architecture;

FIG. 6 is a diagram of the TFC feature fusion architecture;

FIG. 7 is a model training flowchart.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

The invention discloses a pedestrian re-identification method based on a pyramid convolution and Transformer dual-branch architecture, which comprises the following steps:

1. The Market1501 and DukeMTMC-reID pedestrian re-identification datasets are used;

2. The experiments run on the Windows 10 operating system with GPU acceleration. The model is built with Python 3.8 and the PyTorch deep learning framework, and the model parameters are optimized with an SGD optimizer whose momentum is set to 0.9;

3. In the CNN branch, ResNet50 serves as the base network. To capture subtle differences between pedestrian images, pyramid convolution (PyConv) replaces standard convolution to extract multi-scale features of the pedestrian image. Each level of a pyramid convolution unit contains convolution kernels of a different size, and the kernel depth decreases as the kernel size increases; it is this variation in size and depth that allows detail at different scales to be extracted. To accommodate kernels of different depths, the input feature maps are divided into several groups by grouped convolution, and each group is processed by kernels of the corresponding depth. Using PyConv does not add extra computation or parameters;

4. The Transformer branch adopts the ViT model architecture and consists of N repeated Transformer blocks. Each block is composed mainly of a multi-head self-attention module (MSA), a multi-layer perceptron (MLP) and two LayerNorm layers, with a LayerNorm layer applied before each MSA and each MLP to improve the convergence speed of the model;

5. To enhance the global perception of the local features extracted in the CNN branch and the local detail of the global representation in the Transformer branch, the feature fusion architecture fuses the features extracted by the two branches interactively;

6. The preprocessed training dataset is fed to the network for training, the pre-trained weights of the network model are loaded, and the network model is optimized with the label-smoothed cross-entropy loss function;

7. To measure the performance of the algorithm, two metrics, the first hit rate (Rank-1) and the mean average precision (mAP), are evaluated on the Market1501 and DukeMTMC-reID datasets to assess the effectiveness of the network model.

The specific implementation of step 1 is as follows:

The Market1501 dataset was collected from 6 cameras on the Tsinghua University campus and contains 32668 pedestrian images of 1501 identities. The training set contains 12936 images of 751 identities and the test set contains 19732 images of 750 identities; the identities in the training set and the test set do not overlap. The images of the DukeMTMC-reID dataset were captured by 8 different cameras on the Duke University campus. Its training set contains 16522 images of 702 identities, 2228 images of another 702 identities serve as query images, and 17661 images of 1110 identities form the gallery to be searched.

The naming format of the images in the dataset is: 0001_c1s1_000151_00.jpg;

The input pedestrian image has a size of 224×224 and 3 channels;

The specific implementation of step 2 is as follows:

The configuration environment required by the model is imported;

The initial learning rate of SGD is set to 0.005 and dropout = 0.5. In each iteration, K images of each of P identities are drawn from the pedestrian dataset and fed to the network model as training samples, with P = 16, K = 4, batch size B = 32 and epochs = 100;

The specific implementation of step 3 is as follows:

The invention designs a pedestrian re-identification network model based on a hybrid pyramid convolution and Transformer dual-branch architecture. Before the feature map is sent to the two branches, it passes through a 7×7 convolution with stride 2 and a 3×3 max pooling layer, which extract initial shallow local features such as edge and texture information. The calculation formula is:

F(x) = MaxPool(Conv(x)) (1)

where x is the feature map input to the network model, Conv is the 7×7 convolution layer, and MaxPool is the 3×3 max pooling layer. The CNN branch takes ResNet50 as the reference network and extracts local features of the pedestrian image. To further obtain rich multi-scale pedestrian features, an improved pyramid convolution unit replaces some of the convolution blocks in ResNet50;
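The effect of formula (1) on spatial resolution can be checked with a short calculation. The following is a minimal sketch; the convolution padding of 3 and the pooling stride of 2 with padding 1 are assumptions taken from the usual ResNet50 stem and are not stated in this text, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_size(size, conv_padding=3, pool_stride=2, pool_padding=1):
    """Spatial size after the 7x7 stride-2 convolution and the 3x3 max pooling.

    The padding values and the pooling stride are assumed defaults,
    not values given in the description.
    """
    after_conv = conv_out(size, kernel=7, stride=2, padding=conv_padding)
    after_pool = conv_out(after_conv, kernel=3, stride=pool_stride,
                          padding=pool_padding)
    return after_conv, after_pool


print(stem_output_size(224))  # (112, 56)
```

Under these assumptions a 224×224 input reaches the two branches as a 56×56 feature map.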

During training, the CNN branch and the Transformer branch are supervised with the label-smoothed cross-entropy loss function to complete image feature extraction. During testing, the features fused from the two branches are used for gallery retrieval of pedestrian images, and the images are finally ranked by cosine similarity.
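Ranking by cosine similarity can be sketched as follows. This is an illustrative example on toy vectors, not the implementation of the invention; the function names are hypothetical and the feature vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery indices ordered from most to least similar."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


query = [1.0, 0.0, 1.0]
gallery = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
print(rank_gallery(query, gallery))  # [1, 2, 0]
```

Cosine similarity ignores vector length, so the gallery entry that is a scaled copy of the query is ranked first.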

The specific implementation of step 4 is as follows:

For an input three-dimensional matrix [H, W, C], the image x is first cut by a convolution kernel into n = H×W/S² non-overlapping image patches and simultaneously encoded into a sequence of vectors x_m = {m_1, m_2, …, m_n} that the Transformer can accept, where S is the size of the convolution kernel. The kernel size is set to 4×4 in this algorithm, i.e., S = 4. To classify the output sequence, a learnable embedding vector, the class token, is introduced into the Transformer. The embedding in the class token is randomly initialized at training time and concatenated with the input sequence of the image, so the sequence finally input to the Transformer is x'_m = {m_class, m_1, m_2, …, m_n}, of length n+1.
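The patch count and sequence length can be illustrated with a small example. This sketch only shows the non-overlapping split; the learned projection performed by the convolution kernel is omitted, the all-zero class token is a placeholder for the learnable embedding, and the function names are hypothetical.

```python
def num_patches(height, width, patch_size):
    """n = H * W / S^2 for non-overlapping S x S patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def patchify(image, patch_size):
    """Split a 2-D grid (list of rows) into flattened non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[r][c]
                            for r in range(top, top + patch_size)
                            for c in range(left, left + patch_size)])
    return patches


image = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8x8 input
tokens = patchify(image, 4)                                # S = 4
class_token = [0.0] * 16        # placeholder for the learnable class token
sequence = [class_token] + tokens
print(len(tokens), len(sequence))        # 4 5
print(num_patches(224, 224, 4) + 1)      # 3137
```

With S = 4, a 224×224 input would give 3136 patches and a sequence of length 3137; a smaller input to the branch gives a proportionally shorter sequence.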

Multi-head self-attention relies on the position information in the sequence to extract the global and local information contained in the input sequence, but self-attention itself carries no position information, which is important for a sequence. Positional encoding is therefore introduced in the Transformer to add position information.
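The flow through the Transformer branch can be summarized in the usual ViT notation. This is a sketch under the assumption that the positional encoding E_pos is added to the token sequence and that each sub-layer has a residual connection, as in the standard ViT design; E_pos, z_l and the residual terms are not named in this text.

```latex
\begin{aligned}
z_0  &= [\,m_{class};\ m_1;\ m_2;\ \dots;\ m_n\,] + E_{pos}, \\
z'_l &= \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \\
z_l  &= \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \qquad l = 1, \dots, N,
\end{aligned}
```

Here LN denotes LayerNorm, applied before each MSA and each MLP as described in step 4, and N is the number of repeated Transformer blocks.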

The specific implementation of step 5 is as follows:

The feature map in the CNN has dimensions C×H×W, while the patch embeddings in the Transformer have shape (m+1)×N, where m is the number of image patches, 1 accounts for the class token, and N is the embedding dimension. Because of this dimension mismatch, the local features collected in the CNN branch and the global features in the Transformer cannot be fused directly, so a bidirectional feature fusion architecture is designed. When a feature map in the CNN branch is passed to the Transformer branch, the CNN-to-Transformer feature fusion architecture (CNN Fuse Transformer, CFT) is used. Because deeper CNN layers yield richer semantic information but sparser spatial information, the important information at each spatial position is first extracted through a spatial attention mechanism. A 1×1 convolution then aligns the number of channels with that of the patch embeddings, an average pooling layer aligns the spatial dimensions, patch embedding converts the 2-dimensional feature map into a series of 1-dimensional patch embeddings, and LayerNorm normalizes the features.

When the Transformer branch is fused into the CNN branch, the Transformer-to-CNN feature fusion architecture (TFC) is used: the patch embeddings are aligned in spatial dimension by upsampling, aligned with the channel dimension of the convolutional feature map by a 1×1 convolution, and then fused with the feature map in the CNN.
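The spatial alignment in both directions can be sketched on a single channel. This toy example covers only the average pooling and flattening used towards the Transformer and the reshaping and upsampling used towards the CNN; the 1×1 convolution, spatial attention and LayerNorm are omitted, nearest-neighbour upsampling is an assumption (the interpolation mode is not stated), and all function names are hypothetical.

```python
def avg_pool2d(grid, k):
    """Average-pool a 2-D grid with a k x k window and stride k."""
    height, width = len(grid), len(grid[0])
    return [[sum(grid[r + i][c + j] for i in range(k) for j in range(k)) / (k * k)
             for c in range(0, width, k)]
            for r in range(0, height, k)]


def upsample_nearest(grid, k):
    """Repeat every cell k times along both axes."""
    return [[v for v in row for _ in range(k)] for row in grid for _ in range(k)]


def to_tokens(grid):
    """Flatten a 2-D grid row by row into a 1-D token list."""
    return [v for row in grid for v in row]


def to_grid(tokens, width):
    """Reshape a 1-D token list back into rows of the given width."""
    return [tokens[i:i + width] for i in range(0, len(tokens), width)]


feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

# CNN -> Transformer: pool to the token grid, then flatten.
tokens = to_tokens(avg_pool2d(feature_map, 2))
print(tokens)  # [2.5, 4.5, 10.5, 12.5]

# Transformer -> CNN: reshape the tokens, then upsample to the map size.
restored = upsample_nearest(to_grid(tokens, 2), 2)
print(len(restored), len(restored[0]))  # 4 4
```

The class token has no spatial position, so a real implementation would have to exclude it before reshaping; that handling is not described in this text and is left out here.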

The specific implementation of step 6 is as follows:

The CNN branch and the Transformer branch are each trained under the supervision of a cross-entropy loss function with label smoothing regularization. The cross-entropy loss is the most common classification loss in the pedestrian re-identification field. Its calculation is given by formulas (2) and (3), where N is the total number of pedestrian identities, p_i is the predicted probability of identity i, and y is the ground-truth identity label of the pedestrian.

The cross-entropy loss in a classification task is prone to overfitting during model training, so it is improved by label smoothing. Formula (4) gives the smoothed one-hot target label:
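The formula bodies are not reproduced in this text. Assuming the standard label-smoothing formulation, which matches the symbols N, p_i, y and ε used here, they would typically read as follows. The symbol q_i for the target distribution and the name L_ce are assumptions, as is the mapping onto the numbers (2) to (4).

```latex
L_{ce} = \sum_{i=1}^{N} -q_i \log(p_i) \qquad (2)

q_i = \begin{cases} 1, & i = y \\ 0, & i \neq y \end{cases} \qquad (3)

q_i = \begin{cases} 1 - \dfrac{N-1}{N}\,\varepsilon, & i = y \\[4pt] \dfrac{\varepsilon}{N}, & i \neq y \end{cases} \qquad (4)
```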

where ε is a small hyperparameter, set here to ε = 0.1. The cross-entropy loss after label smoothing is denoted L_lsr. In summary, the loss function constructed by the invention is shown in (5), where the two terms are the label-smoothed losses of the CNN branch and the Transformer branch respectively:

L_total = L_lsr + L_lsr (5)
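A label-smoothed cross-entropy loss and the two-branch sum of formula (5) can be sketched in plain Python. The smoothing rule used here (ε/N for the other classes and 1 − ε(N−1)/N for the true class) is the common formulation and is assumed rather than taken from this text; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_targets(num_classes, label, epsilon=0.1):
    """Assumed smoothing rule: epsilon/N off-target, the remainder on-target."""
    targets = [epsilon / num_classes] * num_classes
    targets[label] = 1.0 - epsilon * (num_classes - 1) / num_classes
    return targets


def label_smoothing_ce(logits, label, epsilon=0.1):
    """Cross entropy between the smoothed targets and the softmax output."""
    probs = softmax(logits)
    targets = smoothed_targets(len(logits), label, epsilon)
    return -sum(q * math.log(p) for q, p in zip(targets, probs))


def total_loss(cnn_logits, transformer_logits, label, epsilon=0.1):
    """Sum of the label-smoothed losses of the two branches, as in (5)."""
    return (label_smoothing_ce(cnn_logits, label, epsilon)
            + label_smoothing_ce(transformer_logits, label, epsilon))


print(smoothed_targets(4, 1))
print(total_loss([2.0, 0.5, 0.1], [1.5, 0.2, 0.3], label=0))
```

With ε = 0 the loss reduces to the ordinary cross entropy, which is a convenient sanity check.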

The specific implementation of step 7 is as follows:

The preprocessed training set and test set are fed to the network to train and test the model respectively;

The pedestrian re-identification algorithm aims to find, among pedestrian images captured by different cameras, the pedestrians most similar to the query target, and can therefore be regarded as a ranking problem. Rank-1 is the rate at which the first image in the ranked list has the same identity as the query image.

mAP is obtained by summing the average precision values and taking their mean. The calculation formula is:

\mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{AP}_i

where AP_i is the average precision of class i and C is the total number of classes.
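Both metrics can be computed from ranked match lists. The following is a minimal sketch in which the average precision is taken per query, as is usual in re-identification evaluation; that choice, the function names and the toy data are assumptions, and the same-camera filtering that benchmark protocols apply is omitted.

```python
def average_precision(ranked_matches):
    """AP for one query.

    ranked_matches[i] is True when the i-th ranked gallery image
    has the same identity as the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def rank1_and_map(all_ranked_matches):
    """Rank-1 rate and mean average precision over all queries."""
    count = len(all_ranked_matches)
    rank1 = sum(1 for m in all_ranked_matches if m and m[0]) / count
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / count
    return rank1, mean_ap


results = [
    [True, False, True],    # AP = (1/1 + 2/3) / 2
    [False, True, False],   # AP = 1/2
]
rank1, mean_ap = rank1_and_map(results)
print(rank1, round(mean_ap, 4))  # 0.5 0.6667
```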

The invention provides a pedestrian re-identification method based on residual double-channel attention and multi-scale feature fusion, aimed at the problem of global weakening caused by pedestrian pose change and the attention mechanism in pedestrian re-identification. One complete pedestrian re-identification task requires the following steps: (1) input a pedestrian dataset; (2) preprocess the input dataset images; (3) extract image features through the network feature extraction module; (4) fuse the feature maps extracted by the CNN branch and the Transformer branch; (5) optimize the model through the loss function; (6) identify the pedestrian target image and output the result. The pedestrian re-identification functional modules are shown in FIG. 1. The network architecture in each functional module is described in detail below on the basis of one complete identification task.

Dataset input module: in each iteration, 32 pedestrian images are randomly sampled from the pedestrian dataset and fed to the training network as training samples.

Image preprocessing module: before image feature extraction, the images undergo preprocessing, i.e., data augmentation, which includes normalization, random horizontal flipping, random cropping and random erasing. Each image is flipped horizontally with probability 0.5 and decoded to 32-bit floating-point pixel values in [0, 1]. The data is then normalized by subtracting 0.485, 0.456 and 0.406 and dividing by 0.229, 0.224 and 0.225 for the three channels respectively, which improves the convergence speed of the model. After the dataset is preprocessed, the model is built and the runtime environment is configured: PyCharm is used as the integrated development environment of the project, the model is built with the PyTorch deep learning framework, a conda environment is imported into the project, and a GPU is used for acceleration. Once the environment is configured, the parameters are set and the model is constructed.
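The normalization and flipping steps can be sketched on plain Python lists. This example covers only those two operations, with random cropping and random erasing omitted; the function names are hypothetical and an image is assumed to be a list of rows of RGB tuples with values in [0, 1].

```python
import random

MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Per-channel (value - mean) / std for one RGB pixel in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))


def random_horizontal_flip(image, p=0.5, rng=random):
    """Mirror each row of the image with probability p."""
    if rng.random() < p:
        return [list(reversed(row)) for row in image]
    return image


def preprocess(image, p=0.5, rng=random):
    """Flip, then normalize every pixel."""
    flipped = random_horizontal_flip(image, p, rng)
    return [[normalize_pixel(px) for px in row] for row in flipped]


image = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]   # one row, two pixels
print(normalize_pixel(MEAN))                    # (0.0, 0.0, 0.0)
print(random_horizontal_flip(image, p=1.0))     # row order reversed
```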

Feature extraction module: in the pyramid convolution and Transformer dual-branch pedestrian re-identification network model designed by the invention, pyramid convolution blocks replace the conventional ResNet50 residual blocks in the CNN branch to extract multi-scale pedestrian features. The multi-head attention in the Transformer branch handles long-range dependencies, so that the model attends to different parts of the human body; at the same time the downsampling operation is removed, so more detail is retained. The network model architecture is shown in FIG. 2, the pyramid convolution network structure in FIG. 3, the Transformer branch network architecture in FIG. 4, and the bidirectional feature fusion architecture in FIGS. 5 and 6. The pedestrian images output by the image preprocessing module are fed to the network model, and the salient features of the images are extracted through the pyramid convolution layers, the Transformer branch and the feature fusion architecture in turn.

Feature fusion module: considering the respective strengths and weaknesses of the CNN and the Transformer in feature extraction, the feature fusion architecture fuses features interactively, which greatly enhances the global perception of the local features and the local detail of the global representation. Because the features of the CNN and the Transformer differ in shape, a 1×1 convolution aligns the channel size, up-sampling and down-sampling strategies align the spatial size, and LayerNorm and BatchNorm normalize the feature values.

Loss calculation module: in the training phase of the model, a fully connected layer, a BN layer and a ReLU layer are added before the classifier. During training, the CNN branch and the Transformer branch are supervised with the label-smoothed cross-entropy loss function, and the features extracted at the fully connected layer are used to calculate the cross-entropy loss. Whether the feature extraction module has converged is judged from the output of the loss function. If it has converged, the feature vector is sent to the identification module; if not, the gradient of the loss is back-propagated and the network parameters are updated until convergence, yielding model weights that suit the system. The training process of the model is shown in FIG. 7.

Testing module: after the model has been trained through the modules above, the training result must be tested. In the testing module, the Euclidean distance between the designated object in the query set and each object in the candidate set is calculated, and the distances are then sorted in ascending order to obtain the ranking result of pedestrian re-identification. Performance metrics such as the first hit rate (Rank-1) and the mean average precision (mAP) are typically used to evaluate the trained model.
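The distance-based ranking of the testing module can be sketched as follows, on toy two-dimensional features with hypothetical function names.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query, gallery):
    """Gallery indices and distances sorted in ascending order of distance."""
    distances = [euclidean_distance(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: distances[i])
    return order, [distances[i] for i in order]


query = [0.0, 0.0]
gallery = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
order, dists = rank_by_distance(query, gallery)
print(order)  # [1, 2, 0]
print(dists)  # [1.0, 2.0, 5.0]
```

When the feature vectors are L2-normalized, sorting by ascending Euclidean distance gives the same order as sorting by descending cosine similarity, because the squared distance equals 2 minus twice the cosine.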

Claims

1. A pedestrian re-identification method based on pyramid convolution and Transformer dual branches, characterized in that a concurrent dual-branch architecture extracts local features and models global information, and the features extracted by the two branches are fused through a feature fusion architecture, the method comprising the following steps:

Step 1: extracting image features from the input dataset with the constructed pedestrian re-identification network model;

Step 2: extracting discriminative features of the pedestrian images with the pyramid convolution branch and the Transformer branch respectively;

Step 3: fusing the features extracted by the two branches;

Step 4: training the CNN branch and the Transformer branch of the pedestrian re-identification network model of step 1 with the label-smoothed cross-entropy loss function to obtain the optimal parameters of the pedestrian re-identification network model;

Step 5: for the query set and the candidate set contained in a public pedestrian re-identification dataset, calculating the Euclidean distance between the designated object in the query set and each object in the candidate set, and then sorting the calculated distances in ascending order to obtain the ranking result of pedestrian re-identification.

2. The pedestrian re-identification method based on pyramid convolution and Transformer dual branches according to claim 1, wherein the specific implementation of step 1 is as follows:

Step 1.1: to enrich the diversity of the data and improve the generalization ability of the model, the image dataset input to the model is first preprocessed, the preprocessing including normalization, random horizontal flipping, random cropping, random erasing and similar operations; the preprocessed image dataset is then fed to the pedestrian re-identification network model;

Step 1.2: the pedestrian re-identification network model is a pyramid convolution and Transformer dual-branch network architecture composed mainly of a CNN branch and a Transformer branch; the CNN branch takes ResNet50 as the reference network and extracts local features of the pedestrian image, and, to further obtain rich multi-scale pedestrian features, an improved pyramid convolution unit replaces some of the convolution blocks in ResNet50; the Transformer branch adopts a typical ViT structure and consists of N repeated Transformer blocks that extract the global feature representation of the pedestrian.

3. The pedestrian re-identification method based on pyramid convolution and Transformer dual branches according to claim 1, wherein the specific steps of step 2 are as follows:

Step 2.1: to capture subtle differences between pedestrian images, pyramid convolution (PyConv) is used instead of standard convolution to extract multi-scale features of the pedestrian image; PyConv is a pyramid convolution unit in which each level contains convolution kernels of a different size, and the kernel depth decreases as the kernel size increases; it is this variation in size and depth that allows detail at different scales to be extracted; to accommodate kernels of different depths, the input feature maps are divided into several groups by grouped convolution, and each group is processed by kernels of the corresponding depth; before the feature map is sent to the two branches, it passes through a 7×7 convolution with stride 2 and a 3×3 max pooling layer, which extract initial shallow local features; the calculation formula is:

F(x) = MaxPool(Conv(x))

where x is the feature map input to the network model, Conv is the 7×7 convolution layer, and MaxPool is the 3×3 max pooling layer;

Step 2.2: in the pyramid convolution branch, convolution kernels of 9×9, 7×7, 5×5 and 3×3 replace the original 3×3 convolution kernel in the residual structure, and the input feature map is divided into 16 groups, 8 groups, 4 groups and 1 group respectively; the feature extracted from the input by the pyramid convolution branch is F(M), given by:

F(M) = PyConv(F(x))

where PyConv denotes the pyramid convolution;
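The parameter cost of the grouping in step 2.2 can be checked with a short calculation. This is a sketch under assumed channel counts: 64 input channels and 16 output channels per pyramid level are illustrative values that are not stated in this text, and the function names are hypothetical.

```python
def grouped_conv_params(in_channels, out_channels, kernel, groups):
    """Weight count of a grouped 2-D convolution without bias."""
    return (in_channels // groups) * out_channels * kernel * kernel


def pyconv_params(in_channels, levels):
    """Total weights of a pyramid of (kernel, groups, out_channels) levels."""
    return sum(grouped_conv_params(in_channels, out_c, k, g)
               for k, g, out_c in levels)


# Kernel sizes and group counts from step 2.2; the 16 output channels
# per level are an assumption made for this illustration.
LEVELS = [(9, 16, 16), (7, 8, 16), (5, 4, 16), (3, 1, 16)]

pyramid = pyconv_params(64, LEVELS)
standard = grouped_conv_params(64, 64, kernel=3, groups=1)
print(pyramid, standard)  # 27072 36864
```

Under these assumptions the four-level pyramid uses fewer weights than a single standard 3×3 convolution with the same input and output widths, which is consistent with the statement that PyConv adds no extra parameters.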

Step 2.3: in the Transformer branch, for an input three-dimensional matrix [H, W, C], the image x is first cut by a convolution kernel into n = H×W/S² non-overlapping image patches and simultaneously encoded into a sequence of vectors x_m = {m_1, m_2, …, m_n} that the Transformer can accept, where S is the size of the convolution kernel, set to 4×4 in this algorithm, i.e., S = 4; to classify the output sequence, a learnable embedding vector, the class token, is introduced into the Transformer; the embedding in the class token is randomly initialized at training time and concatenated with the input sequence of the image, so the sequence finally input to the Transformer is x'_m = {m_class, m_1, m_2, …, m_n}, of length n+1; the feature extracted from the input by the Transformer branch is F(N), given by:

F(N) = Transformer(F(x)).

4. the pedestrian re-recognition method based on pyramid convolution and transform double branches according to claim 1, wherein the specific implementation steps of the step 3 are as follows:

step 3.1: the dimension of the feature map in CNN is C×H×W, the shape of patch subedding in the transducer is (m+1) ×N, wherein m is the number of image blocks, 1 represents a classken class, and N represents an subedding dimension; the local features collected in CNN branches and global features in a transformer cannot be directly fused under the limitation of feature dimensions, and a bidirectional feature fusion architecture is designed; when a feature map in a CNN branch is transferred to a transducer branch, a transducer feature fusion architecture (CNNFuseTransformer, CFT) extracts important information on a spatial position through a spatial attention mechanism firstly because the deeper the CNN extracted layer is, the more abundant the obtained semantic information is, and the more lean the spatial information is, and adopts 1X1 convolution to align the number of channels of patchebedding, uses an average pooling layer to complete the alignment of spatial dimensions, uses patchEmbedding to convert an original 2-dimensional image into a series of 1-dimensional patchebeddings, and uses LayerNorm to regularize the features; the formula when the CNN branch performs feature fusion to the transducer branch is as follows:

F(y)＝Concat(F(m)，F(n))

wherein F(y) is the feature vector after the CNN branch is fused into the Transformer branch, and F(m) and F(n) are the features extracted by the CNN branch and by the Transformer branch, respectively;
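The dimension alignment performed on the CNN-to-Transformer path can be sketched as follows. This is only an illustration under stated assumptions: the spatial attention step and the final Concat are omitted (the text does not specify the attention design or the concatenation axis), the 1×1 convolution has no bias, pooling is non-overlapping, and all function names are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def conv1x1(fmap: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """1x1 convolution: a per-pixel linear map from C_in to C_out channels."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(wrow[c] * fmap[c][i][j] for c in range(c_in))
              for j in range(w)] for i in range(h)] for wrow in weights]


def avg_pool(fmap: FeatureMap, k: int) -> FeatureMap:
    """Non-overlapping k x k average pooling (H and W assumed divisible by k)."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[sum(ch[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
              for j in range(0, w, k)] for i in range(0, h, k)] for ch in fmap]


def flatten_to_tokens(fmap: FeatureMap) -> List[List[float]]:
    """Reshape C x H x W into (H*W) tokens, each of dimension C."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[ch][i][j] for ch in range(c)] for i in range(h) for j in range(w)]


def layer_norm(token: List[float], eps: float = 1e-5) -> List[float]:
    """LayerNorm over the embedding dimension, without learnable scale or shift."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


# Toy example: a 2 x 4 x 4 feature map aligned to 4 tokens of dimension N = 3.
fmap = [[[float(c + 4 * i + j) for j in range(4)] for i in range(4)] for c in range(2)]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical 1x1 kernel, C = 2 -> N = 3
aligned = avg_pool(conv1x1(fmap, weights), k=2)
tokens = [layer_norm(t) for t in flatten_to_tokens(aligned)]
```

After these steps the CNN features have the shape m×N and can be combined with the Transformer patch embeddings.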

step 3.2: when the Transformer branch is fused into the CNN branch, the Transformer-to-CNN feature fusion architecture (TFC) is adopted: the patch embeddings are aligned to the spatial scale of the feature map through upsampling, aligned to the channel dimension of the convolutional feature map by a 1×1 convolution, and then fused with the feature map in the CNN; the formula for feature fusion from the Transformer branch to the CNN branch is as follows:

F(y′)＝Concat(F(m)，F(n))

wherein F(y′) is the feature vector after the Transformer branch is fused into the CNN branch, and F(m) and F(n) are the features extracted by the Transformer branch and by the CNN branch, respectively;
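The reverse path can be sketched in the same spirit. The sketch assumes that the class token is dropped before the patch embeddings are reshaped into a grid and that upsampling is nearest-neighbour; neither detail is stated in the text, and the function names are hypothetical. The final fusion with the CNN feature map is not shown.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def tokens_to_map(tokens: List[List[float]], h: int, w: int) -> FeatureMap:
    """Reshape (1 + h*w) tokens of dimension N into an N x h x w map.

    Assumption: the leading class token carries no spatial position and is dropped.
    """
    patch_tokens = tokens[1:]
    n_dim = len(patch_tokens[0])
    return [[[patch_tokens[i * w + j][d] for j in range(w)]
             for i in range(h)] for d in range(n_dim)]


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling: repeat every row and column `factor` times."""
    return [[[row[j // factor] for j in range(len(row) * factor)]
             for row in ch for _ in range(factor)] for ch in fmap]


def conv1x1(fmap: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """1x1 convolution: a per-pixel linear map from C_in to C_out channels."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(wrow[c] * fmap[c][i][j] for c in range(c_in))
              for j in range(w)] for i in range(h)] for wrow in weights]


# Toy example: 1 class token + 4 patch tokens (N = 3) mapped to a 2 x 4 x 4 feature map.
tokens = [[9.0, 9.0, 9.0]] + [[float(k), k + 10.0, k + 20.0] for k in range(4)]
grid = tokens_to_map(tokens, h=2, w=2)
upsampled = upsample_nearest(grid, factor=2)
weights = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]  # hypothetical 1x1 kernel, N = 3 -> C = 2
aligned = conv1x1(upsampled, weights)
```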

step 3.3: the features obtained after fusing the two branches are used to perform gallery retrieval of pedestrian images during testing; the feature vector after the fusion of the CNN branch and the Transformer branch is F(Y), expressed as follows:

F(Y)＝F(M)+F(N)。

5. The pedestrian re-recognition method based on pyramid convolution and Transformer double branches according to claim 1, wherein step 4 is specifically implemented as follows:

in the training stage of the model, the CNN branch and the Transformer branch are each supervised and trained with a cross-entropy loss function under label smoothing regularization; the cross-entropy loss function is the most commonly used classification loss function in the pedestrian re-recognition field, and its calculation formula is shown as follows, wherein N is the total number of pedestrian identities, p_i is the predicted probability of pedestrian identity i, and y is the true identity label of the pedestrian;
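For reference, a standard cross-entropy form consistent with the symbols defined above is given here as an assumption, not as the formula of the claim; the indicator q_i is a symbol introduced only for this illustration:

```latex
L_{ce} = -\sum_{i=1}^{N} q_i \log p_i,
\qquad
q_i =
\begin{cases}
1, & i = y \\
0, & i \neq y
\end{cases}
```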

in classification tasks the cross-entropy loss function is prone to over-fitting during model training, so it is improved through label smoothing; the improved target label that replaces the one-hot label in the cross-entropy loss is calculated as:

wherein ε is a small hyperparameter, set to ε = 0.1 in the experiments, and the cross-entropy loss function after label smoothing is introduced is L_isr; in summary, the loss function constructed by the invention sums the label-smoothed losses of the CNN branch and the Transformer branch, as shown in (5):

L_total ＝ L_isr + L_isr
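A minimal sketch of the label-smoothed loss follows. It assumes the widely used smoothing form in which the target is 1 − ε + ε/N for the true identity and ε/N for every other identity; the claim text does not reproduce its own smoothing formula, so this particular form and all function names are assumptions made for illustration.

```python
import math
from typing import List


def smoothed_targets(label: int, num_classes: int, epsilon: float = 0.1) -> List[float]:
    """Assumed smoothing: 1 - eps + eps/N for the true class, eps/N otherwise."""
    off_value = epsilon / num_classes
    return [1.0 - epsilon + off_value if i == label else off_value
            for i in range(num_classes)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax giving the predicted probabilities p_i."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_smoothing_loss(logits: List[float], label: int, epsilon: float = 0.1) -> float:
    """Cross-entropy between the smoothed targets and the predicted probabilities."""
    probs = softmax(logits)
    targets = smoothed_targets(label, len(logits), epsilon)
    return -sum(q * math.log(p) for q, p in zip(targets, probs))


def total_loss(cnn_logits: List[float], transformer_logits: List[float],
               label: int, epsilon: float = 0.1) -> float:
    """Sum of the per-branch losses, one term for each branch."""
    return (label_smoothing_loss(cnn_logits, label, epsilon)
            + label_smoothing_loss(transformer_logits, label, epsilon))


# Toy example with N = 4 identities and true identity 2.
targets = smoothed_targets(label=2, num_classes=4)
loss = total_loss([0.1, 0.2, 2.0, -1.0], [0.0, 0.5, 1.5, 0.3], label=2)
```

Setting ε = 0 recovers the ordinary one-hot cross-entropy, which makes it easy to compare the two losses on the same logits.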

6. The pedestrian re-recognition method based on pyramid convolution and Transformer double branches according to claim 1, wherein step 5 is specifically implemented as follows:

for the query set and the candidate set contained in a public pedestrian re-recognition data set, the Euclidean distance between the designated object in the query set and each object in the candidate set is calculated, and the calculated distances are then sorted in ascending order to obtain the ranking result of pedestrian re-recognition; the first hit rate (Rank-1) and the mean average precision (mAP) are adopted as performance indexes to evaluate how well the model has been trained.
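The ranking and evaluation step can be sketched as follows. The sketch assumes feature vectors are plain lists of floats and skips queries that have no true match in the candidate set; it does not apply the same-camera filtering that public benchmark protocols usually add. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_gallery(query: List[float], gallery: List[List[float]]) -> List[int]:
    """Candidate indices sorted by ascending distance to the query."""
    return sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))


def rank1_and_map(query_feats: List[List[float]], query_ids: List[int],
                  gallery_feats: List[List[float]], gallery_ids: List[int]
                  ) -> Tuple[float, float]:
    """Return (Rank-1, mAP) over all queries that have at least one true match."""
    hits, average_precisions = 0, []
    for feat, pid in zip(query_feats, query_ids):
        order = rank_gallery(feat, gallery_feats)
        matches = [gallery_ids[i] == pid for i in order]
        if not any(matches):
            continue  # assumption: queries without a true match are skipped
        hits += int(matches[0])
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    if not average_precisions:
        raise ValueError("no query has a matching identity in the candidate set")
    n = len(average_precisions)
    return hits / n, sum(average_precisions) / n


# Toy example: two queries against four candidates.
gallery_feats = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.2, 0.0]]
gallery_ids = [1, 2, 3, 1]
query_feats = [[0.1, 0.0], [4.0, 4.0]]
query_ids = [1, 2]
rank1, mean_ap = rank1_and_map(query_feats, query_ids, gallery_feats, gallery_ids)
```

In this toy case the first query retrieves both of its true matches at the top of the list, while the second finds its only match at rank 2, so Rank-1 is 0.5 and mAP is 0.75.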