CN115909408A - Pedestrian re-identification method and device based on Transformer network


Info

Publication number
CN115909408A
Authority
CN
China
Prior art keywords
feature
sequence
local
global
network
Prior art date
Legal status
Pending
Application number
CN202211535684.6A
Other languages
Chinese (zh)
Inventor
王润民
朱祯琳
朱彦斌
陈华
朱桂林
黑洁蕾
罗雨薇
丁亚军
钱盛友
代建华
Current Assignee
Hunan Normal University
Original Assignee
Hunan Normal University
Priority date
Filing date
Publication date
Application filed by Hunan Normal University
Priority to CN202211535684.6A
Publication of CN115909408A
Legal status: Pending


Abstract

The invention provides a pedestrian re-identification method and device based on a Transformer network. An original image is divided into two branches; the two branches are linearly mapped to obtain a first sequence and a second sequence; new parameters are added to the first sequence and the second sequence to generate a third sequence and a fourth sequence; the third sequence and the fourth sequence are put into corresponding, different numbers of layers of a Transformer network for feature extraction, thereby obtaining a first global feature and a first local feature; feature fusion is performed on the first global feature and the first local feature to obtain a second global feature and a second local feature; and the second local feature is processed, the processed second local feature and the second global feature are respectively put into a specific level of the Transformer network for feature extraction, and the overall loss is calculated from the extracted features with the corresponding loss functions. The method can effectively improve the accuracy and robustness of the pedestrian re-identification task, and the device has the same beneficial effects.

Description

Pedestrian re-identification method and device based on Transformer network
Technical Field
The invention relates to the technical field of image processing, and in particular to a pedestrian re-identification method and device based on a Transformer network.
Background
With the development of the economy and advances in science and technology, pedestrian re-identification has become a research hotspot in the field of intelligent video analysis and has received wide attention from academia. Pedestrian re-identification uses computer vision techniques to judge whether a target person is present in images or videos captured by different cameras or surveillance systems. However, because camera resolutions differ, and because of a series of objective factors such as viewing angle and illumination changes, it is usually difficult to obtain high-quality pictures, so pedestrian re-identification faces great challenges.
Currently, with the development of deep learning, convolutional neural networks (CNNs) are very commonly used to extract representation features for pedestrian re-identification. However, an ordinary CNN focuses mainly on the extraction of local features, whereas global features are particularly important for the pedestrian re-identification task. The Transformer, which originated in NLP, takes a whole sequence as input and pays more attention to global features than a CNN does, so Transformer networks have become a common choice for extracting features from pictures or videos. In practice, however, the information carried by both global and local features is essential to the accuracy and robustness of pedestrian re-identification, and the bias of existing Transformer networks toward global features is not well suited to improving the accuracy and robustness of this task.
Therefore, providing a pedestrian re-identification method and apparatus based on a Transformer network that effectively improves the accuracy and robustness of the pedestrian re-identification task, compared with CNNs and conventional Transformer networks, is a problem that those skilled in the art need to solve.
Disclosure of Invention
The invention aims to provide a pedestrian re-identification method based on a Transformer network that effectively improves the accuracy and robustness of the pedestrian re-identification task, and a pedestrian re-identification device that has the same advantageous effects.
Based on the above purpose, the technical solution provided by the invention is as follows:
a pedestrian re-identification method based on a Transformer network comprises the following steps:
dividing the original image to obtain a first branch and a second branch;
linearly mapping the first branch and the second branch to obtain a first sequence and a second sequence;
adding parameters to the first sequence and the second sequence respectively to obtain a third sequence and a fourth sequence;
putting the third sequence and the fourth sequence into a Transformer network to perform feature extraction so as to obtain a first global feature and a first local feature;
feature fusing the first global feature and the first local feature to obtain a second global feature and a second local feature;
putting the processed second local features and the second global features respectively into a specific level of the Transformer network for feature extraction, and calculating the overall loss according to the corresponding loss functions;
the third sequence is the first sequence after the parameter is added, and the fourth sequence is the second sequence after the parameter is added.
Preferably, before the dividing the original image to obtain the first branch and the second branch, the method further comprises the following steps:
presetting a first patch and a second patch;
inputting the original image;
wherein the first patch size is larger than the second patch size.
Preferably, the parameters are specifically: auxiliary information and position-coding information.
Preferably, the step of putting the third sequence and the fourth sequence into the Transformer network with the corresponding numbers of layers for feature extraction to obtain the first global feature and the first local feature specifically comprises the following steps:
acquiring a first layer number corresponding to the third sequence according to an ablation experiment;
acquiring a second layer number corresponding to the fourth sequence according to the ablation experiment;
putting the third sequence into a Transformer network with the first number of layers and the fourth sequence into a Transformer network with the second number of layers, then performing feature extraction respectively to obtain a first global feature and a first local feature of the third sequence and a first global feature and a first local feature of the fourth sequence;
wherein the feature extraction performs feature interaction based on the encoder and decoder in the Transformer network.
Preferably, the feature fusing the first global feature and the first local feature to obtain a second global feature and a second local feature specifically includes the following steps:
and respectively putting the first global feature of the third sequence and the first local feature of the fourth sequence into a cross attention network for feature fusion to obtain a second global feature and a second local feature.
Preferably, before the feature fusion of the first global feature and the first local feature to obtain a second global feature and a second local feature, the method further comprises the following steps:
in the cross attention network, the feed-forward network (FFN) is replaced with mapping and demapping relationships.
Preferably, before the processed second local feature and the second global feature are respectively placed into a specific level of the Transformer network for feature extraction and the overall loss is calculated according to the corresponding loss functions, the method further comprises the following steps:
performing a shuffle operation on the second local features;
the shuffling operation specifically comprises: scrambling the auxiliary information and the position-coding information in the second local feature.
Preferably, the specific level is embodied as the last level in the Transformer network.
Preferably, the step of respectively putting the processed second local features and the second global features into a specific level of the Transformer network for feature extraction, and calculating the overall loss according to the corresponding loss functions, comprises the following steps:
respectively putting the second local feature and the second global feature after the shuffling operation into the last layer of a Transformer network for feature extraction so as to obtain a third global feature and a third local feature;
acquiring a first loss according to the third global feature and a preset loss function, ID Loss;
acquiring a second loss according to the third local feature and a preset loss function, Triplet Loss;
obtaining the overall loss according to the first loss and the second loss;
wherein the overall loss is specifically an average of the first loss and the second loss.
A pedestrian re-identification device based on a Transformer network comprises:
the dividing module is used for dividing the original image into a first branch and a second branch;
a mapping module for linearly mapping the first branch and the second branch to obtain a first sequence and a second sequence;
a parameter adding module, configured to add parameters to the first sequence and the second sequence, so as to convert the first sequence and the second sequence into a third sequence and a fourth sequence;
a feature extraction module, configured to extract a first global feature and a first local feature of the third sequence and the fourth sequence, respectively;
a feature fusion module for fusing the first global feature and the first local feature to obtain a second global feature and a second local feature;
the processing module is used for processing the second local features;
and the calculation module is used for calculating and acquiring the overall loss according to the processed second local feature, the second global feature and the corresponding loss functions.
The invention provides a pedestrian re-identification method based on a Transformer network, in which an original image is divided into two branches; the two branches are linearly mapped to obtain a first sequence and a second sequence; new parameters are added to the first sequence and the second sequence to generate a third sequence and a fourth sequence; the third sequence and the fourth sequence are put into corresponding, different numbers of layers of a Transformer network for feature extraction, thereby obtaining a first global feature and a first local feature; feature fusion is performed on the first global feature and the first local feature to obtain a second global feature and a second local feature; and the second local feature is processed while the second global feature is not, the processed second local feature and the second global feature are respectively put into a specific level of the Transformer network for feature extraction, and the overall loss is calculated from the extracted features with the corresponding loss functions.
In fact, through work in the field of pedestrian re-identification, the applicant has found that the conventional CNN approach focuses mainly on local, fine-grained feature representation, whereas pedestrian re-identification needs more than fine-grained features: under the influence of objective factors such as illumination, background changes and changes in camera viewing angle, local features alone cannot determine identity accurately. Recent studies have also shown that CNNs do not adapt to massive data as well as expected, whereas Transformer models keep improving as the amount of data grows, so the advantage of the Transformer over the CNN for pedestrian re-identification becomes more obvious on large datasets. In addition, CNNs focus on local feature extraction and are prone to getting trapped in local optima, and the pooling used in ordinary CNNs causes information loss. In the prior art, Transformer networks are used to extract global features, yet fine-grained features are equally important for the pedestrian re-identification task.
On this basis, the original image is divided into two branches, one extracting coarse-grained features and the other extracting fine-grained features; the two-dimensional branches expressing the coarse-grained and fine-grained features are converted into one-dimensional sequences to obtain the first sequence and the second sequence, so that coarse-grained and fine-grained features can be extracted at the same time, which greatly improves the accuracy of the pedestrian re-identification task. To avoid the adverse effects of recognition bias caused by objective factors such as illumination, viewing angle or differing background environments in the pictures captured by the cameras, parameters are added to the first sequence and the second sequence to obtain the third sequence and the fourth sequence. The third sequence and the fourth sequence are then put into the corresponding numbers of layers of the Transformer network for feature extraction. After the features of the two different branches are extracted, feature fusion is performed so that coarse-grained and fine-grained information interacts between the two branches and features of different scales can be fused with each other. After fusion, the local features are processed first, which plays the role of data augmentation and makes the output more robust. Finally, the processed local feature and the global feature are fed into the corresponding loss functions to obtain the overall loss.
The invention also provides a pedestrian re-identification device based on the Transformer network, which comprises modules implementing the above method; since the device adopts corresponding modules, it has the same beneficial effects as the method, and details are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of a pedestrian re-identification method based on a transform network according to an embodiment of the present invention;
fig. 2 is a flowchart before step S1 according to an embodiment of the present invention;
fig. 3 is a flowchart of step S4 according to an embodiment of the present invention;
fig. 4 is a flowchart of step S6 according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a pedestrian re-identification apparatus based on a transform network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiments of the present invention are written in a progressive manner.
The embodiment of the invention provides a pedestrian re-identification method based on a Transformer network. The method mainly addresses the technical problem that, in the prior art, a CNN focuses only on the extraction of local features while the existing Transformer network leans toward global feature extraction, so each has its own shortcomings and neither is conducive to improving the accuracy and robustness of the pedestrian re-identification task.
A pedestrian re-identification method based on a Transformer network comprises the following steps:
s1, dividing an original image to obtain a first branch and a second branch;
s2, linearly mapping the first branch and the second branch to obtain a first sequence and a second sequence;
s3, adding parameters for the first sequence and the second sequence respectively to obtain a third sequence and a fourth sequence;
s4, putting the third sequence and the fourth sequence into a transform network to perform feature extraction so as to obtain a first global feature and a first local feature;
s5, fusing the first global feature and the first local feature by the feature to obtain a second global feature and a second local feature;
s6, respectively putting the processed second local features and the processed second global features into a specific level of a Transformer network for feature extraction, and respectively calculating according to corresponding loss functions to obtain the overall loss;
the third sequence is the first sequence added with the parameter, and the fourth sequence is the second sequence added with the parameter.
In step S1, since an existing Transformer network mainly acquires features over the global sequence, while fine-grained features are especially important in the pedestrian re-identification task, the picture is first divided into many small blocks using two different patch sizes for feature extraction (during learning and training, the network does not process the whole picture at once: a kernel, filter or feature detector looks at one small block of the picture at a time, this small block is called a patch, and the filter then moves to another patch of the picture), thereby obtaining the first branch B-Branch and the second branch S-Branch.
In step S2, each of the two branches passes through one layer of linear mapping to obtain a one-dimensional token sequence (a token is the basic unit processed by a Transformer network); the two branches thus yield the first sequence and the second sequence.
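As an illustration of steps S1 and S2, the following is a minimal PyTorch sketch (not taken from the patent) of dividing one image into patches of two different sizes and linearly mapping each branch to a one-dimensional token sequence; the image size of 256, patch sizes 16 and 8, and embedding dimension 384 are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches and linearly map each patch to a token."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements "divide into patches + linear mapping"
        # in a single operation.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, D): one-dimensional token sequence

img = torch.randn(2, 3, 256, 256)
b_branch = PatchEmbed(patch_size=16)(img)    # coarse-grained B-Branch -> first sequence
s_branch = PatchEmbed(patch_size=8)(img)     # fine-grained S-Branch  -> second sequence
print(b_branch.shape, s_branch.shape)        # (2, 256, 384), (2, 1024, 384)
```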
In step S3, to avoid the adverse effects of recognition bias caused by objective factors such as illumination, viewing angle or differing background environments in the pictures captured by the camera, corresponding parameters are added to the first sequence and the second sequence respectively, giving the first sequence with parameters added and the second sequence with parameters added, i.e. the third sequence and the fourth sequence.
In step S4, the two branches of different sizes are designed according to the number of Transformer layers that suits each of them best, so that the two branches are put into Transformers with different numbers of layers for feature extraction, yielding a first global feature (cls-token) and a first local feature (patch-tokens).
In step S5, during feature fusion, the first global feature (cls-token) of one branch, B-Branch, is fused and interacts with the first local feature (patch-tokens) of the other branch, S-Branch; after fusion, the second global feature and the second local feature are obtained.
In step S6, the processed second local features (patch-tokens') and the second global features (cls-tokens') are respectively put into a specific level of the Transformer network for feature extraction, and the overall loss is calculated according to the corresponding loss functions in the Transformer network.
Preferably, before step S1, the following steps are further included:
A1. presetting a first patch and a second patch;
A2. inputting an original image;
wherein the first patch size is larger than the second patch size.
In step A1, two patches of different sizes are set for feature extraction, producing two branches of different scales: the large branch B-Branch divides the picture with the larger first patch size to obtain coarse-grained features, and the small branch S-Branch divides the picture with the smaller second patch size to obtain fine-grained features.
In step A2, original images obtained in various scenarios such as image retrieval, security monitoring or criminal investigation are input.
Preferably, the parameters in step S3 are specifically: auxiliary information and position-coding information.
In practice, the parameters include auxiliary information and position-coding information. The auxiliary information (such as viewpoint information in a picture, or camera information such as a camera ID) is used to prevent the model from becoming inaccurate due to objective issues such as camera resolution or illumination angle. Position-coding information refers to positional encoding, a key technique contained in every Transformer; its role is to improve the model's perception of positional information and to make up for the lack of positional information in the self-attention mechanism. Adding auxiliary information and position-coding information helps the Transformer network perform feature extraction and feature fusion better.
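The following is a minimal sketch of how step S3 could be realized in PyTorch: a class token is prepended, and learnable positional and auxiliary (camera ID) embeddings are added to the token sequence. The embedding dimension, the number of cameras, and the use of learnable rather than fixed encodings are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TokenAugment(nn.Module):
    """Prepend a cls token and add positional + auxiliary (camera ID) embeddings."""
    def __init__(self, num_patches, embed_dim=384, num_cameras=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, 1, embed_dim))

    def forward(self, tokens, cam_id):         # tokens: (B, N, D), cam_id: (B,)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1)    # (B, N+1, D)
        # position-coding information + per-sample auxiliary information
        return x + self.pos_embed + self.cam_embed[cam_id]

aug = TokenAugment(num_patches=256)
seq3 = aug(torch.randn(2, 256, 384), torch.tensor([0, 3]))  # third sequence, (2, 257, 384)
```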
Preferably, step S4 specifically includes the following steps:
B1. acquiring a first layer number corresponding to the third sequence according to an ablation experiment;
B2. acquiring a second layer number corresponding to the fourth sequence according to the ablation experiment;
B3. putting the third sequence into a Transformer network with the first number of layers and the fourth sequence into a Transformer network with the second number of layers, then performing feature extraction respectively to obtain a first global feature and a first local feature of the third sequence and a first global feature and a first local feature of the fourth sequence;
wherein the feature extraction performs feature interaction based on the encoder and decoder in the Transformer network.
In steps B1 and B2, an ablation experiment means that when the authors propose a new scheme that changes several conditions or parameters at once, the experimenters keep the conditions or parameters fixed one at a time and observe the results, in order to see which condition or parameter has the greater influence on the result; in short, it is the control-variable method. The numbers of layers for the third sequence and the fourth sequence are obtained by designing an ablation experiment. In general, the most suitable number of Transformer layers for the third and fourth sequences lies between 3 and 7, and the first and second numbers of layers can be chosen according to actual needs as long as they differ; in this embodiment, the first number of layers is 4 and the second number of layers is 5.
In step B3, the third sequence is placed in a 4-layer Transformer network and the fourth sequence in a 5-layer Transformer network, and feature extraction is performed on each, obtaining the first global feature (cls-token) and first local feature (patch-tokens) of the third sequence and of the fourth sequence. The specific way features are extracted is to use the encoder-decoder in the Transformer network directly for feature interaction. It should be noted that the encoder-decoder is not a specific model but a general framework; the encoder and decoder parts can handle arbitrary text, speech, image or video data, and the model can be a CNN, RNN, BiRNN, LSTM, GRU, etc. In this embodiment it is the general framework within the Transformer network.
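A hedged sketch of step S4 using the standard PyTorch Transformer encoder: the third sequence goes through a 4-layer stack and the fourth sequence through a 5-layer stack, matching the layer numbers of this embodiment. The head count, embedding dimension, and the use of nn.TransformerEncoder in place of the patent's encoder-decoder framework are assumptions.

```python
import torch
import torch.nn as nn

def make_encoder(depth, embed_dim=384, nhead=6):
    layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

enc_b = make_encoder(depth=4)            # first number of layers (third sequence)
enc_s = make_encoder(depth=5)            # second number of layers (fourth sequence)

seq3 = torch.randn(2, 257, 384)          # third sequence: cls + 256 patch tokens
seq4 = torch.randn(2, 1025, 384)         # fourth sequence: cls + 1024 patch tokens

out_b, out_s = enc_b(seq3), enc_s(seq4)
cls_b, patch_b = out_b[:, :1], out_b[:, 1:]   # first global / local features, B-Branch
cls_s, patch_s = out_s[:, :1], out_s[:, 1:]   # first global / local features, S-Branch
```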
Preferably, step S5 specifically includes the following steps:
and respectively putting the first global feature of the third sequence and the first local feature of the fourth sequence into a cross attention network for feature fusion to obtain a second global feature and a second local feature.
In practice, it should be noted that the role of feature fusion is to take two sets of features as input and output two sets of updated features; the second global feature cls-token' is the updated first global feature cls-token, and the second local feature patch-tokens' is the updated first local feature patch-tokens.
The cross-attention network here refers to the Cross Attention Network (CAN), which mainly comprises an embedding operation and a cross attention module; the embedding operation is mainly used for branch extraction. The CAN ends with a local classifier and a global classifier: the local classifier computes the similarity between support-set features and query-set features via cosine distance to obtain the probability of the query-set features, while the global classifier passes through a fully connected layer and then classifies directly with Softmax.
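The following sketch illustrates the fusion idea of step S5: the global (cls) token of one branch acts as the query and attends to the local (patch) tokens of the other branch. This is one plausible reading of the cross-attention fusion described above; the residual-plus-LayerNorm structure and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """The cls token of one branch queries the patch tokens of the other branch,
    so coarse-grained and fine-grained information interact."""
    def __init__(self, embed_dim=384, nhead=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, nhead, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, cls_token, patch_tokens):
        # cls_token: (B, 1, D) query; patch_tokens: (B, N, D) keys/values
        fused, _ = self.attn(cls_token, patch_tokens, patch_tokens)
        return self.norm(cls_token + fused)        # updated (second) global feature

fusion = CrossAttentionFusion()
cls_b = torch.randn(2, 1, 384)        # first global feature of the third sequence
patch_s = torch.randn(2, 1024, 384)   # first local feature of the fourth sequence
cls_b2 = fusion(cls_b, patch_s)       # second global feature
```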
Preferably, before step S5, the following steps are further included:
in the cross-attention network, the feed-forward network (FFN) is replaced with mapping and demapping relationships.
In practice, to reduce the time and space complexity introduced by the Transformer, part of the feed-forward network (FFN) in the cross-attention module is replaced by mapping and demapping, which greatly improves the speed of the cross-attention network.
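A minimal sketch of one way to read the "mapping and demapping" replacement of the FFN: the global token is projected into the other branch's embedding space before cross attention and projected back afterwards, so no feed-forward sub-layer is needed. The dimensions and module wiring are assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class MapDemap(nn.Module):
    """Project the global token into the other branch's space (mapping),
    run cross attention there, and project back (demapping) -- used here
    in place of a feed-forward sub-layer."""
    def __init__(self, dim_in=384, dim_other=192, nhead=6):
        super().__init__()
        self.map_f = nn.Linear(dim_in, dim_other)      # mapping
        self.attn = nn.MultiheadAttention(dim_other, nhead, batch_first=True)
        self.map_b = nn.Linear(dim_other, dim_in)      # demapping

    def forward(self, cls_token, patch_tokens):
        q = self.map_f(cls_token)                      # into the other branch's space
        fused, _ = self.attn(q, patch_tokens, patch_tokens)
        return cls_token + self.map_b(fused)           # back to the original space

block = MapDemap()
out = block(torch.randn(2, 1, 384), torch.randn(2, 1024, 192))  # (2, 1, 384)
```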
Preferably, before step S6, the following steps are further included:
performing a shuffle operation on the second local features;
the shuffling operation specifically comprises: scrambling the auxiliary information and the position-coding information in the second local feature.
In actual operation, in this embodiment, the shuffling operation is applied when each token contains position-embedding information, auxiliary information, and patch-embedding (cut tile) information: the position information and the auxiliary information in the tokens of the second local feature patch-tokens' are scrambled across different token positions to enhance robustness, while the patch embedding itself is not changed and stays in place.
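A sketch of the shuffling operation as described above, assuming each token is the sum of a patch embedding, a positional embedding, and an auxiliary embedding: the positional and auxiliary parts are permuted across token positions while the patch embeddings stay in place. The additive composition of the embeddings is an assumption.

```python
import torch

def shuffle_embeddings(patch_embed, pos_embed, aux_embed):
    """Permute the positional and auxiliary embeddings across token positions;
    the patch embeddings themselves are left in place."""
    n = patch_embed.size(1)
    perm = torch.randperm(n)
    return patch_embed + pos_embed[:, perm] + aux_embed[:, perm]

tokens = shuffle_embeddings(torch.randn(2, 1024, 384),   # patch embeddings (kept in place)
                            torch.randn(1, 1024, 384),   # position-coding information
                            torch.randn(1, 1024, 384))   # auxiliary information
```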
Preferably, the specific level is embodied as the last level in the Transformer network.
In practice, the second local feature patch-tokens' and the second global feature cls-tokens' are respectively put into the last layer of the Transformer network.
Preferably, step S6 specifically includes the following steps:
C1. respectively placing the second local feature after the shuffling operation and the second global feature into the last layer of the Transformer network for feature extraction to obtain a third global feature and a third local feature;
C2. acquiring a first loss according to the third global feature and a preset loss function, ID Loss;
C3. acquiring a second loss according to the third local feature and a preset loss function, Triplet Loss;
C4. obtaining the overall loss according to the first loss and the second loss;
wherein the overall loss is specifically the average of the first loss and the second loss.
In step C1, after the shuffling operation, the second local features patch-tokens' are passed to the last layer of the Transformer network for feature extraction, giving the third local features; meanwhile, the second global features cls-tokens' are not shuffled but are passed directly into the last layer of the Transformer network for feature extraction, giving the third global features.
In steps C2 and C3, ID Loss and Triplet Loss are loss functions commonly used in the pedestrian re-identification task. The third global feature is fed into the ID Loss function to compute the first loss, and the third local feature is fed into the Triplet Loss function to compute the second loss.
In step C4, the overall loss, i.e. the average of the first loss and the second loss, is calculated from the two losses.
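The overall loss of steps C2 to C4 can be sketched as follows: a cross-entropy ID loss on the classifier output of the third global feature, a triplet loss on the third local feature, and their average. The classifier head, the number of identities, and the triplet margin are assumptions for the example.

```python
import torch
import torch.nn as nn

id_criterion = nn.CrossEntropyLoss()                  # ID Loss on identity logits
triplet_criterion = nn.TripletMarginLoss(margin=0.3)  # Triplet Loss on local features

def overall_loss(global_logits, labels, anchor, positive, negative):
    first_loss = id_criterion(global_logits, labels)             # from third global feature
    second_loss = triplet_criterion(anchor, positive, negative)  # from third local feature
    return (first_loss + second_loss) / 2                        # average of the two losses

# toy usage with an assumed 100-identity classifier head and 384-dim features
logits = torch.randn(4, 100)
labels = torch.randint(0, 100, (4,))
feat = torch.randn(4, 384)
loss = overall_loss(logits, labels, feat, torch.randn(4, 384), torch.randn(4, 384))
```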
A pedestrian re-identification device based on a Transformer network comprises:
the dividing module is used for dividing the original image into a first branch and a second branch;
a mapping module for linearly mapping the first branch and the second branch to obtain a first sequence and a second sequence;
the parameter adding module is used for adding parameters for the first sequence and the second sequence to convert the first sequence and the second sequence into a third sequence and a fourth sequence;
the feature extraction module is used for respectively extracting a first global feature and a first local feature of the third sequence and the fourth sequence;
the feature fusion module is used for fusing the first global feature and the first local feature to obtain a second global feature and a second local feature;
the processing module is used for processing the second local characteristics;
and the calculation module is used for calculating and acquiring the overall loss according to the processed second local feature, the second global feature and the corresponding loss functions.
In practice, the dividing module divides the original image input to it into a first branch and a second branch and passes them to the mapping module; the mapping module linearly maps the first branch and the second branch to obtain a first sequence and a second sequence and passes them to the parameter adding module; the parameter adding module adds parameters to the first sequence and the second sequence to form a third sequence and a fourth sequence and passes them to the feature extraction module; the feature extraction module extracts a first global feature and a first local feature from the third sequence and the fourth sequence and passes them to the feature fusion module; the feature fusion module fuses the first global feature and the first local feature to obtain a second global feature and a second local feature, passes the second local feature to the processing module and the second global feature to the calculation module; the processing module processes the second local feature and passes the processed second local feature to the calculation module; and the calculation module calculates the overall loss according to the processed second local feature, the second global feature and the corresponding loss functions.
In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the modules is only one division of logical functions, and other divisions may be realized in practice, such as: multiple modules or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be electrical, mechanical or in other forms.
In addition, all functional modules in the embodiments of the present invention may be integrated into one processor, or each module may be separately used as one device, or two or more modules may be integrated into one device; each functional module in each embodiment of the present invention may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by program instructions and related hardware, where the program instructions may be stored in a computer-readable storage medium, and when executed, the program instructions perform the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
It should be understood that the use of "system," "device," "unit," and/or "module" herein is merely one way to distinguish between different components, elements, components, parts, or assemblies of different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps or elements do not constitute an exclusive list, and the method or apparatus may also comprise other steps or elements. An element preceded by "comprising a …" does not preclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Where flowcharts are used in this application, they illustrate operations performed by the system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order listed; the various steps may instead be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or several steps may be removed from them.
The pedestrian re-identification method and device based on the transform network provided by the invention are introduced in detail above. The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A pedestrian re-identification method based on a Transformer network is characterized by comprising the following steps:
dividing the original image to obtain a first branch and a second branch;
linearly mapping the first branch and the second branch to obtain a first sequence and a second sequence;
adding parameters to the first sequence and the second sequence respectively to obtain a third sequence and a fourth sequence;
putting the third sequence and the fourth sequence into a Transformer network to perform feature extraction so as to obtain a first global feature and a first local feature;
feature fusing the first global feature and the first local feature to obtain a second global feature and a second local feature;
putting the processed second local features and the second global features respectively into a specific level of the Transformer network for feature extraction, and calculating the overall loss according to the corresponding loss functions;
the third sequence is the first sequence added with the parameter, and the fourth sequence is the second sequence added with the parameter.
2. The method for pedestrian re-identification based on the Transformer network as claimed in claim 1, wherein before the dividing the original image to obtain the first branch and the second branch, further comprising the steps of:
presetting a first patch and a second patch;
inputting the original image;
wherein the first patch size is larger than the second patch size.
3. The pedestrian re-identification method based on the Transformer network as claimed in claim 2, wherein the parameters are specifically: auxiliary information and position-coding information.
4. The method for re-identifying pedestrians based on the Transformer network as claimed in claim 3, wherein the step of putting the third sequence and the fourth sequence into the Transformer network with the corresponding numbers of layers for feature extraction to obtain the first global feature and the first local feature comprises the following steps:
acquiring a first layer number corresponding to the third sequence according to the ablation experiment;
acquiring a second layer number corresponding to the fourth sequence according to the ablation experiment;
putting the third sequence into a Transformer network with the first number of layers and the fourth sequence into a Transformer network with the second number of layers, then performing feature extraction respectively to obtain a first global feature and a first local feature of the third sequence and a first global feature and a first local feature of the fourth sequence;
wherein the feature extraction performs feature interaction based on the encoder and decoder in the Transformer network.
5. The method for pedestrian re-identification based on the Transformer network as claimed in claim 4, wherein the feature fusion of the first global feature and the first local feature to obtain a second global feature and a second local feature specifically comprises the following steps:
and respectively putting the first global feature of the third sequence and the first local feature of the fourth sequence into a cross attention network for feature fusion to obtain a second global feature and a second local feature.
6. The method for pedestrian re-identification based on the Transformer network as claimed in claim 5, wherein before the feature fusion of the first global feature and the first local feature to obtain a second global feature and a second local feature, the method further comprises the following steps:
in the cross attention network, the feed-forward network (FFN) is replaced with mapping and demapping relationships.
7. The method for re-identifying pedestrians based on the Transformer network as claimed in claim 6, wherein before the processed second local feature and the second global feature are respectively placed into a specific level of the Transformer network for feature extraction and the overall loss is calculated according to the corresponding loss functions, the method further comprises the following steps:
performing a shuffle operation on the second local features;
the shuffling operation specifically comprises the following steps: the auxiliary information and the position-coding information in the second local feature are scrambled.
8. The method for pedestrian re-identification based on the Transformer network as claimed in claim 7, wherein the specific level is the last layer in the Transformer network.
9. The method for re-identifying pedestrians based on a Transformer network of claim 8, wherein the step of respectively placing the processed second local features and the second global features into the specific level of the Transformer network for feature extraction and calculating the overall loss according to the corresponding loss functions comprises the following steps:
respectively putting the second local feature and the second global feature after the shuffling operation into the last layer of a Transformer network for feature extraction so as to obtain a third global feature and a third local feature;
acquiring a first loss according to the third global feature and a preset loss function, ID Loss;
acquiring a second loss according to the third local feature and a preset loss function, Triplet Loss;
obtaining the overall loss according to the first loss and the second loss;
wherein the overall loss is specifically an average of the first loss and the second loss.
10. A pedestrian re-identification device based on a Transformer network is characterized by comprising:
the dividing module is used for dividing the original image into a first branch and a second branch;
a mapping module for linearly mapping the first branch and the second branch to obtain a first sequence and a second sequence;
a parameter adding module, configured to add parameters to the first sequence and the second sequence, so as to convert the sequences into a third sequence and a fourth sequence;
a feature extraction module, configured to extract a first global feature and a first local feature of the third sequence and the fourth sequence, respectively;
a feature fusion module for fusing the first global feature and the first local feature to obtain a second global feature and a second local feature;
the processing module is used for processing the second local features;
and the calculating module is used for calculating and obtaining the overall loss according to the processed second local feature, the second global feature and the corresponding loss functions.
CN202211535684.6A 2022-11-30 2022-11-30 Pedestrian re-identification method and device based on Transformer network Pending CN115909408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211535684.6A CN115909408A (en) 2022-11-30 2022-11-30 Pedestrian re-identification method and device based on Transformer network

Publications (1)

Publication Number Publication Date
CN115909408A true CN115909408A (en) 2023-04-04

Family

ID=86478523

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423132A (en) * 2023-10-26 2024-01-19 山东海润数聚科技有限公司 Unsupervised pedestrian re-identification method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination