CN117746462A - Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Info

Publication number
CN117746462A
CN117746462A (application number CN202311762029.9A)
Authority
CN
China
Prior art keywords
vector
pedestrian
feature
module
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311762029.9A
Other languages
Chinese (zh)
Inventor
连国云
杨金锋
连睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vocational And Technical University
Original Assignee
Shenzhen Vocational And Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vocational And Technical University filed Critical Shenzhen Vocational And Technical University
Priority to CN202311762029.9A priority Critical patent/CN117746462A/en
Publication of CN117746462A publication Critical patent/CN117746462A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-recognition method and device based on a complementary feature dynamic fusion network model. The model comprises an HFE main branch and an auxiliary branch, where the HFE main branch comprises a ViT network module and an NFC module. The method comprises: obtaining a result vector of a target pedestrian from an image to be recognized based on the ViT network module, and extracting a pedestrian global feature vector from the result vector; acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and complementary two-dimensional features input by the auxiliary branch; and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector. Feature fusion based on the ViT network module and the auxiliary branch thus yields local feature vectors with richer and more accurate detail, and re-recognition based on both the global and local feature vectors improves the accuracy of the pedestrian re-recognition result.

Description

Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model
Technical Field
The invention relates to the technical field of image processing, in particular to a pedestrian re-identification method and device based on a complementary feature dynamic fusion network model.
Background
Person re-identification (ReID) differs from recognition modes that rely on unique identifiers, such as face, iris and fingerprint recognition: it depends mainly on a person's apparent characteristics, such as clothing and pose, and has broad application prospects in security surveillance, pedestrian behavior analysis and the like. Because pedestrians move freely within a camera's field of view, unconstrained by any condition, the captured pedestrian images are strongly affected by viewpoint differences, illumination changes, object occlusion, cluttered backgrounds and other environmental factors. The appearance (resolution, pose, etc.) of the same pedestrian therefore varies greatly across images, and occlusion may even prevent a complete pedestrian image from being captured. In addition, different pedestrians often share visual similarities in clothing color, pattern and the like, adding further difficulty to recognition.
Many pedestrian re-recognition techniques exist, but they tend to rely on sufficient pose-labeled person images and a powerful pose-estimation network, and are therefore easily constrained by additional auxiliary models. Other pedestrian re-recognition techniques compute attention through convolution operations, ignoring the global information and implicit relationships of images, which limits the correlations the model can learn between features, so that the accuracy of pedestrian re-recognition is not high enough.
Disclosure of Invention
The invention provides a pedestrian re-recognition method and device based on a complementary feature dynamic fusion network model, aiming at improving the accuracy of pedestrian re-recognition.
To achieve the above object, the present invention provides a pedestrian re-recognition method based on a complementary feature dynamic fusion network model, wherein the complementary feature dynamic fusion network model includes a multi-level feature extraction (Hierarchical Feature Extraction, HFE) main branch and an auxiliary branch, and the HFE main branch includes a vision transformer (Vision Transformer, ViT) network module and a neighborhood feature constraint (Neighborhood Feature Constraint, NFC) module;
the pedestrian re-identification method based on the complementary feature dynamic fusion network model comprises the following steps:
acquiring a result vector of a target pedestrian from an image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and complementary two-dimensional features input by the auxiliary branch;
and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
Optionally, the obtaining, based on the ViT network module, a result vector of the target pedestrian from the image to be identified, and extracting a pedestrian global feature vector from the result vector includes:
obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information;
pedestrian global feature vectors are extracted from the result vectors in stages based on a plurality of conversion blocks.
Optionally, the obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information includes:
dividing the image to be identified into a plurality of mutually overlapping image blocks through a convolution operation of the ViT network module, and converting the image blocks into a vector sequence;
and adding the position information codes and the camera information codes into the vector sequence in an element addition mode to obtain the result vector.
Optionally, before the acquiring a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch, the method further includes:
extracting the three-dimensional features of the image to be identified based on the auxiliary branch, and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features.
Optionally, the auxiliary branch includes a convolutional neural network (Convolutional Neural Networks, CNN) module, a feature remodeling (Down Sampling and Flatten, DSF) module, and a shared dynamic aggregation (Shared Dynamic Aggregation, SDA) module;
the extracting the three-dimensional features of the image to be identified based on the auxiliary branch and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features includes:
extracting three-dimensional features of an input image based on a bottleneck residual block of the CNN module;
the dimension of the three-dimensional feature is reduced based on a DSF module, and a two-dimensional hierarchical feature is obtained;
and splicing the two-dimensional hierarchical features onto the inheritance class vector inherited from the main branch to obtain the complementary two-dimensional features.
Optionally, the HFE main branch includes a ViT network module and an NFC module;
the acquiring a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch includes:
extracting a target class vector, based on the NFC module, from the target features output by the penultimate conversion block in the ViT network module;
dividing the vector sequence obtained by the ViT network module into a plurality of sub-vector sequences;
splicing the target class vector with each sub-vector sequence respectively to obtain the spliced vectors;
extracting pedestrian local feature vectors in stages, based on a shared conversion block, from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch.
Optionally, the extracting pedestrian local feature vectors in stages from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch based on the shared conversion block includes:
in a shallow stage, interacting the complementary two-dimensional features with the spliced vectors in one-to-one correspondence, and storing the interaction results as result vectors, wherein the complementary two-dimensional features include pedestrian texture information and edge texture information;
and in a deep stage, aggregating the complementary two-dimensional features with the spliced vectors by aggregating adjacent convolution layers to obtain result vectors, wherein the complementary two-dimensional features include discriminative features.
Optionally, the determining the pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector includes:
splicing the pedestrian global feature vector and the pedestrian local feature vector to obtain a pedestrian re-recognition feature vector;
and determining a pedestrian re-recognition result based on the similarity of the pedestrian re-recognition feature vector and the target pedestrian feature vector.
Optionally, before the obtaining, based on the ViT network module, a result vector of the target pedestrian from the image to be identified and extracting a pedestrian global feature vector from the result vector, the method further includes:
supervising the complementary feature dynamic fusion network model based on classification loss and triplet loss in the training stage to obtain a total loss function, so as to optimize the complementary feature dynamic fusion network model based on the total loss function.
In addition, the invention also provides a pedestrian re-identification device based on the complementary feature dynamic fusion network model, which comprises:
the global feature extraction module is used for acquiring a result vector of a target pedestrian from the image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
the local feature extraction module is used for obtaining a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch;
and the recognition module is used for determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
Compared with the prior art, in the pedestrian re-recognition method and device based on the complementary feature dynamic fusion network model provided by the invention, the model comprises a multi-level feature extraction (HFE) main branch and an auxiliary branch, and the HFE main branch comprises a vision transformer network module and a neighborhood feature constraint module. The method comprises: obtaining a result vector of a target pedestrian from an image to be recognized based on the ViT network module, and extracting a pedestrian global feature vector from the result vector; acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch; and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector. Feature fusion based on the ViT network module and the auxiliary branch thus yields local feature vectors with richer and more accurate detail, and re-recognition based on both the global and local feature vectors improves the accuracy of the pedestrian re-recognition result.
Drawings
FIG. 1 is a schematic hardware architecture of a pedestrian re-recognition device based on a complementary feature dynamic fusion network model according to various embodiments of the present invention;
FIG. 2 is a flow chart of a first embodiment of a pedestrian re-recognition method based on a complementary feature dynamic fusion network model of the present invention;
FIG. 3 is a structural diagram of the complementary feature dynamic fusion network model involved in the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention;
FIG. 4 is a schematic diagram of a first refinement flow of a first embodiment of a pedestrian re-recognition method based on a complementary feature dynamic fusion network model of the present invention;
FIG. 5 is a schematic flow chart of a first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention;
FIG. 6 is a schematic diagram of a feature remodeling module involved in the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the invention;
FIG. 7 is a schematic flow chart of a second refinement of the first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention;
FIG. 8 is a pedestrian re-recognition result image related to a first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention;
fig. 9 is a schematic functional block diagram of a first embodiment of the pedestrian re-recognition device based on the complementary feature dynamic fusion network model.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The pedestrian re-identification device based on the complementary feature dynamic fusion network model is mainly a device capable of network connection, such as a server or a cloud platform.
Referring to fig. 1, fig. 1 is a schematic hardware structure of a pedestrian re-recognition device based on a complementary feature dynamic fusion network model according to various embodiments of the present invention. In an embodiment of the present invention, the device may include a processor 1001 (e.g., a Central Processing Unit, CPU), a communication bus 1002, an input port 1003, an output port 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components; the input port 1003 is used for data input; the output port 1004 is used for data output; and the memory 1005 may be a high-speed RAM memory or a stable (non-volatile) memory, such as a disk memory, and may optionally be a storage device independent of the processor 1001. Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 is not limiting of the invention and may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
With continued reference to FIG. 1, the memory 1005 of FIG. 1, as a readable storage medium, may include an operating system, a network communication module, an application module, and a pedestrian re-recognition program based on the complementary feature dynamic fusion network model. In fig. 1, the network communication module is mainly used for connecting to a server and performing data communication with the server, and the processor 1001 is configured to invoke the pedestrian re-recognition program based on the complementary feature dynamic fusion network model stored in the memory 1005 and perform the following operations:
acquiring a result vector of a target pedestrian from an image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch;
and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
As shown in fig. 2, a first embodiment of the present invention proposes a pedestrian re-recognition method based on a complementary feature dynamic fusion network model, where the method includes:
Step S101, obtaining a result vector of a target pedestrian from an image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
in order to improve the accuracy of pedestrian recognition, a complementary feature dynamic fusion network model is proposed that exploits the representation-modeling capabilities of convolutional neural networks (Convolutional Neural Networks, CNN) and the vision transformer (Vision Transformer, ViT). The complementary feature dynamic fusion network model includes a multi-level feature extraction (Hierarchical Feature Extraction, HFE) main branch and an auxiliary branch; the HFE main branch includes a vision transformer (ViT) network module and a neighborhood feature constraint (Neighborhood Feature Constraint, NFC) module, and the auxiliary branch includes a convolutional neural network (CNN) module, a feature remodeling (Down Sampling and Flatten, DSF) module and a shared dynamic aggregation (Shared Dynamic Aggregation, SDA) module. A CNN has feature-learning (representation-learning) capability and can perform translation-invariant classification of an input image through its hierarchical structure; ViT is able to model global information and has excellent performance. The complementary feature dynamic fusion network model is shown in fig. 3, which is a structural diagram of the model involved in the pedestrian re-recognition method of the present invention. The network model consists of two branches whose extracted feature preferences differ: the auxiliary branch provides feature complementation for the main branch, and, to ensure the training effect of the complementary feature dynamic fusion network, a neighborhood feature constraint (NFC) module is additionally introduced into the HFE main branch so that the network concentrates on extracting local key features.
Referring to fig. 4, fig. 4 is a schematic diagram of a first refinement flow of the first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model according to the present invention; as shown in fig. 4, the step S101 includes:
Step S1011, obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information:
dividing the image to be identified into a plurality of mutually overlapping image blocks through the convolution operation of the ViT network module, and converting the image blocks into a vector sequence;
and adding the position-information codes and the camera-information codes to the vector sequence by element-wise addition to obtain the result vector.
In the ViT network of the HFE main branch, for a given image to be identified $x \in \mathbb{R}^{H \times W \times C}$, where H, W and C respectively denote the height, width and number of channels of the image, the image x is cut by a convolution operation into N mutually overlapping image blocks of size P×P, and the image blocks are converted into a vector sequence so that feature correlations can subsequently be computed on it. The length of the vector sequence is

$$N = \left\lfloor \frac{H-P}{S} + 1 \right\rfloor \times \left\lfloor \frac{W-P}{S} + 1 \right\rfloor,$$

where S denotes the stride of the convolution operation, S < P, and $(P^2 \times C)$ is the dimension of each vector in the sequence.
A trainable linear mapping is then used to map the $(P^2 \times C)$-dimensional vector sequence to d dimensions; in this embodiment, the value of d is taken as 2.
After the d-dimensional vector sequence is obtained, a learnable one-dimensional position-information code is added to the vector sequence by element-wise addition, following the standard Vision Transformer, and a camera-information code is added in the same way, giving the result vector

$$Z = [z_{cls};\, z_1;\, z_2;\, \dots;\, z_N] + E_{pos} + E_{cam},$$

where $z_{cls}$ is the prepended class token, the $z_i$ are the mapped block vectors, $E_{pos}$ is the position-information code and $E_{cam}$ is the camera-information code.
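For concreteness, the overlapping patch embedding described above can be sketched as follows (a minimal PyTorch sketch, not the patent's implementation; the class name `OverlapPatchEmbed`, the embedding dimension `embed_dim=768` and the camera count `num_cams` are illustrative assumptions):

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Cut an image into N mutually overlapping P x P blocks (stride S < P)
    and map each block to an embed_dim-dimensional vector."""
    def __init__(self, img_size=(256, 128), patch=16, stride=12,
                 in_ch=3, embed_dim=768, num_cams=6):
        super().__init__()
        h = (img_size[0] - patch) // stride + 1
        w = (img_size[1] - patch) // stride + 1
        self.num_patches = h * w                      # N in the formula above
        # The convolution both cuts the overlapping blocks and applies the
        # trainable linear mapping from (P*P*C) to embed_dim dimensions.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=stride)
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))
        self.cam = nn.Parameter(torch.zeros(num_cams, 1, embed_dim))

    def forward(self, x, cam_id):
        z = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, embed_dim)
        cls = self.cls.expand(z.size(0), -1, -1)
        z = torch.cat([cls, z], dim=1)                # prepend class token
        # Element-wise addition of position- and camera-information codes.
        return z + self.pos + self.cam[cam_id]
```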
Step S1012, extracting pedestrian global feature vectors from the result vectors in stages based on a plurality of conversion blocks.
After the result vector is obtained, pedestrian global features are extracted in stages through the 12 conversion blocks (Transformer Block) of the ViT network. With continued reference to fig. 3, the shallow SDA modules (SDA_{1-1} × 9) and the deep SDA modules (SDA_{2-1} × 2 and SDA_{3-1} × 1) in the ViT network respectively take part in extracting the pedestrian global features. The class token (CLS) of the ViT main branch encodes the statistical properties of the whole token collection and thus serves as an overall characterization of pedestrian identity.
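For orientation only, the staged extraction through the 12 conversion blocks can be pictured as below (a rough sketch under the assumption of standard Transformer encoder layers; the SDA fusion interleaved between stages in fig. 3 is deliberately omitted):

```python
import torch.nn as nn

blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for _ in range(12)   # 12 conversion blocks: 9 shallow, then 2 + 1 deep
])

def global_feature(z):
    """z: (B, N + 1, d) result vector; returns the pedestrian global feature."""
    for blk in blocks:
        z = blk(z)       # SDA fusion with the auxiliary branch would be
                         # interleaved between stages (fig. 3); omitted here.
    return z[:, 0]       # the CLS token serves as the global characterization
```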
Step S102, a spliced vector is obtained through the NFC module, and a pedestrian local feature vector is extracted based on the spliced vector and supplementary two-dimensional features input by auxiliary branches;
in this embodiment, the complementary two-dimensional features need to be obtained in advance from the auxiliary branch. Referring to fig. 5, fig. 5 is a schematic flow diagram of the first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model according to the present invention; as shown in fig. 5, before step S102 the method further includes:
Step S1002, extracting the three-dimensional features of the image to be identified based on the auxiliary branch, and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features.
The three-dimensional characteristics of the input image are extracted based on the bottleneck residual block of the CNN module;
the dimension of the three-dimensional feature is reduced based on a DSF (feature remodeling) module, and a two-dimensional hierarchical feature is obtained;
and splicing the two-dimensional hierarchical features onto the inheritance class vector inherited from the main branch to obtain the complementary two-dimensional features.
The auxiliary branch includes a CNN convolutional neural network (Convolutional Neural Networks, CNN) module, a feature remodeling (Down Sampling and Flatten, DSF) module and a shared dynamic aggregation (Shared Dynamic Aggregation, SDA) module. The auxiliary branch consists of a CNN network stacked from bottleneck residual modules built out of small convolutions. This embodiment denotes the features of different convolution layers as $H_i \in \mathbb{R}^{h \times w \times c}$, where h, w and c respectively represent the height, width and channel number of the feature map. To further improve the ability of the ViT network in the HFE main branch to select complementary features from the auxiliary branch and to activate the features of highly discriminative regions, this embodiment proposes a shared dynamic aggregation module (Shared Dynamic Aggregation, SDA) composed of attention modules, which provides the HFE with detail information at different scales.
In a specific implementation, before heterogeneous features can be dynamically fused, the mismatch between the heterogeneous features (i.e., $H_i$ in the auxiliary branch and the result vector Z in the main branch) must be resolved. In the feature dimension, the features $H_i$ captured in the auxiliary branch are three-dimensional, and their height h, width w and channel number c differ from stage to stage, whereas the features in the HFE are two-dimensional and their shape remains unchanged across the stages of the auxiliary branch. Therefore, before entering any SDA module, the current three-dimensional hierarchical feature $H_i$ is converted into a fixed two-dimensional feature $f_i$ by the feature remodeling module (Down Sampling and Flatten, DSF: down-sampling and flattening). Referring to fig. 6, a schematic diagram of the feature remodeling module involved in the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the invention, the feature remodeling module comprises a convolution layer Conv (followed by normalization), an average pooling layer AvgPool, a Flatten operation and a Transpose operation, as illustrated in fig. 6.
The two-dimensional feature $f_i$ inherits the class vector of the main branch; the inherited class vector is defined as the inheritance class vector (ICLS), and it is spliced onto each two-dimensional feature $f_i$ to obtain the complementary two-dimensional features. This realizes the fusion of the CNN network and the ViT network, and the joint exploitation of local and global features.
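A minimal sketch of the DSF remodeling and ICLS splicing might look as follows (the 1×1 convolution, the layer sizes and the fixed output size are illustrative assumptions; the patent fixes only the Conv → AvgPool → Flatten → Transpose order):

```python
import torch
import torch.nn as nn

class DSF(nn.Module):
    """Down Sampling and Flatten: turn a 3-D CNN feature H_i of shape
    (B, c, h, w) into a fixed 2-D token sequence f_i of shape (B, M, d)."""
    def __init__(self, in_ch, d=768, out_hw=(16, 8)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, d, kernel_size=1)  # align channels to d
        self.norm = nn.BatchNorm2d(d)                   # post-conv normalization
        self.pool = nn.AdaptiveAvgPool2d(out_hw)        # down-sample to fixed h' x w'

    def forward(self, H_i, icls):
        f = self.pool(self.norm(self.conv(H_i)))        # (B, d, h', w')
        f = f.flatten(2).transpose(1, 2)                # Flatten + Transpose -> (B, M, d)
        # Splice the inheritance class vector (ICLS) from the main branch onto
        # the 2-D feature, giving the complementary two-dimensional features.
        return torch.cat([icls, f], dim=1)              # (B, M + 1, d)
```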
After the complementary two-dimensional features are obtained, pedestrian local feature vectors are extracted based on the spliced vectors and the complementary two-dimensional features input by the auxiliary branch. Referring specifically to fig. 7, fig. 7 is a schematic flow chart of the second refinement of the first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention; as shown in fig. 7, the step S102 includes:
Step S1021, acquiring a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch, which includes:
Step S1022, extracting a target class vector, based on the NFC module, from the target features output by the penultimate conversion block in the ViT network module;
Step S1023, dividing the vector sequence obtained by the ViT network module into a plurality of sub-vector sequences;
Step S1024, splicing the target class vector with each sub-vector sequence respectively to obtain the spliced vectors;
Step S1025, extracting pedestrian local feature vectors in stages, based on the shared conversion block, from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch.
In order to extract fine-grained local pedestrian features, this embodiment adds an NFC module to the HFE main branch and extracts, based on the NFC module, the target class vector from the target features output by the penultimate conversion block in the ViT network module.
The vector sequence is divided into S groups along its length direction, giving S sub-vector sequences corresponding to S local neighborhood features, where S refers to the proportion of the neighborhood feature region in the last hidden-layer feature of the ViT network in the HFE main branch. The vector sequence of this embodiment is expressed as

$$Z = [z_{cls};\, Z_1;\, Z_2;\, \dots;\, Z_S],$$

where each $Z_s$ is one sub-vector sequence. Each sub-vector sequence representing a local neighborhood feature is spliced with the target class vector to obtain a spliced vector, expressed as

$$Z'_s = [z^{t}_{cls};\, Z_s], \quad s = 1, 2, \dots, S,$$

where $z^{t}_{cls}$ denotes the target class vector. In this way, local neighborhood features that include the target class vector features are obtained.
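The grouping and splicing can be sketched as follows (assuming, for illustration, that S evenly divides the number of patch tokens; variable names are illustrative):

```python
import torch

def nfc_splice(tokens, target_cls, S):
    """tokens:     (B, N, d) patch tokens of the ViT network (class token removed)
    target_cls: (B, 1, d) target class vector from the penultimate conversion block
    Returns the S spliced vectors, each of shape (B, N // S + 1, d)."""
    B, N, d = tokens.shape
    groups = tokens.split(N // S, dim=1)                 # S sub-vector sequences
    return [torch.cat([target_cls, g], dim=1) for g in groups]
```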
In the shared conversion block, which is shared by the spliced vectors and the complementary two-dimensional features input by the auxiliary branch, the correlations of the features within each local region are computed, and the S pedestrian local feature vectors are extracted in stages.
specifically, in a shallow stage, the complementary two-dimensional features and the spliced vector are interacted in a one-to-one correspondence manner, and an interaction result is stored as a result vector, wherein the complementary two-dimensional features comprise pedestrian texture information and edge texture information;
and in a deep stage, the complementary two-dimensional features are aggregated with the spliced vector by aggregating adjacent convolution layers to obtain a result vector, wherein the complementary two-dimensional features comprise discriminative features.
This embodiment learns, within the SDA module, features that have dependency and complementarity with the class vector CLS. The SDA module consists of a multi-head attention mechanism, a multi-layer perceptron, layer normalization and residual connections.
The need for complementary features differs at different stages of the ViT network in the HFE main branch. In the shallow layers of the HFE, sufficient detail cues help build high-quality body-part relationships; the pedestrian texture information and edge texture information captured by the shallow layers of the auxiliary branch are relatively complete and the pedestrian semantic information is clear, so for the first 9 Transformer Blocks of the HFE the complementary hierarchical features interact with the Transformer Block output features in one-to-one correspondence, as shown by SDA_{1-1} in fig. 3. In the deep layers of the HFE, diversified discriminative features need to be concentrated on coarse pedestrian body parts; the features extracted by the deep layers of the auxiliary branch are abstract, and different channels extract discriminative features of different pedestrian modes. In the deep layers of the ViT network, the auxiliary branch features can activate the discriminative features in the HFE and increase the convergence rate, which is achieved by aggregating the features of adjacent convolution layers, as shown by SDA_{2-1} and SDA_{3-1} in fig. 3.
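The SDA components named above (multi-head attention, multi-layer perceptron, layer normalization, residual connections) could be realized roughly as below, reading the fusion as cross-attention from the main-branch tokens to the complementary features; this is one plausible reading, not the patent's exact design:

```python
import torch.nn as nn

class SDA(nn.Module):
    """Shared Dynamic Aggregation: main-branch tokens attend to the
    complementary two-dimensional features of the auxiliary branch."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.norm_q = nn.LayerNorm(d)
        self.norm_kv = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, z, f):
        """z: (B, N + 1, d) main-branch tokens; f: (B, M, d) complementary features."""
        q, kv = self.norm_q(z), self.norm_kv(f)
        z = z + self.attn(q, kv, kv, need_weights=False)[0]  # residual connection
        return z + self.mlp(self.norm2(z))                   # residual connection
```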
And step S103, determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
Specifically, the pedestrian global feature vector and the pedestrian local feature vector are spliced to obtain a pedestrian re-identification feature vector; and determining a pedestrian re-recognition result based on the similarity of the pedestrian re-recognition feature vector and the target pedestrian feature vector.
This embodiment splices the obtained pedestrian global feature vector and pedestrian local feature vectors to obtain the final pedestrian re-identification feature vector. Each image to be identified is compared for similarity with the target pedestrian image to obtain a recognition distance matrix, and the image to be identified with the smallest distance to the target pedestrian image is regarded as the most probable candidate pedestrian. The Euclidean distance between two samples in the embedding space is typically used as the similarity measure, since it is the simplest and easiest to understand.
This embodiment defines $x_p$ and $x_g$ as the target pedestrian image (probe) and an image to be identified in the candidate set $\Gamma$, respectively, and $f_p$ and $f_g$ as the pedestrian re-recognition feature vectors of the target pedestrian image and of the image to be identified. The Euclidean distance between $f_p$ and $f_g$, denoted $D_{pg}$, is

$$D_{pg} = \lVert f_p - f_g \rVert_2 .$$
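In practice the full recognition distance matrix and ranking follow directly from this distance (a small usage sketch with illustrative names):

```python
import torch

def rank_candidates(f_probe, f_gallery):
    """f_probe: (Q, d) probe features; f_gallery: (G, d) candidate-set features.
    Returns the (Q, G) distance matrix and candidate indices sorted by distance."""
    dist = torch.cdist(f_probe, f_gallery, p=2)   # Euclidean distance D_pg
    return dist, dist.argsort(dim=1)              # smallest distance = best match
```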
the method is used for re-identification, and an accurate identification result can be obtained. Referring to fig. 8, fig. 8 is a pedestrian re-recognition result image related to a first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention; as can be seen from the recognition result based on fig. 8, the number of error images retrieved based on the base line network of Vision Transformer (ViT-BoT) is larger than the model proposed by the present invention. Furthermore, among the images retrieved in the present application, the first few images (especially rank-5) are more accurate than the base line network based on Vision Transformer (ViT-BoT). The average accuracy average value reaches 66.6%, which shows the effectiveness and accuracy of pedestrian re-identification based on the complementary feature dynamic fusion network model.
The pedestrian re-recognition task based on deep learning is mainly divided into two stages: the first is training the network; the second is using the trained model to extract features of the target pedestrian and of the pedestrian images in the candidate set for similarity measurement. In the training phase, this embodiment sets only one classifier, for the HFE main branch of the complementary feature dynamic fusion network model; the loss calculated by this main-branch classifier optimizes the whole network.
Specifically, before step S101, the complementary feature dynamic fusion network model is supervised based on classification loss and triplet loss in a training phase, so as to obtain a total loss function, so as to optimize the complementary feature dynamic fusion network model based on the total loss function.
In the training phase, the complementary feature dynamic fusion network model is supervised using classification (ID) loss and triplet loss. Denoting the total loss function of the model by L, we have

$$L = L_{ID}(CLS) + L_{tri}(CLS) + \alpha \sum_{s=1}^{S}\left( L_{ID}(NF_s) + L_{tri}(NF_s) \right),$$

where α is a balance factor, typically set to 1; $L_{ID}$ denotes the classification loss and $L_{tri}$ the triplet loss, so that $L_{ID}(CLS)$ and $L_{tri}(CLS)$ are the classification and triplet losses of the CLS, and $L_{ID}(NF_s)$ and $L_{tri}(NF_s)$ are the classification and triplet losses of the s-th local neighborhood feature. In the inference phase, the features in the HFE branch and the NFC sub-branch are spliced together as the final representation of the pedestrian, and a candidate list matching the pedestrian to be searched is obtained from the similarity matrix of this final representation.
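A hedged sketch of this training-phase supervision, using the standard cross-entropy and triplet losses from PyTorch (the margin value and the pre-mined triplets are illustrative assumptions):

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()              # classification (ID) loss L_ID
tri = nn.TripletMarginLoss(margin=0.3)  # triplet loss L_tri

def total_loss(cls_logits, cls_triplet, local_logits, local_triplets,
               labels, alpha=1.0):
    """cls_triplet: (anchor, positive, negative) CLS embeddings; local_logits and
    local_triplets hold the corresponding terms for the S neighborhood features."""
    loss = ce(cls_logits, labels) + tri(*cls_triplet)
    for logits, triplet in zip(local_logits, local_triplets):
        loss = loss + alpha * (ce(logits, labels) + tri(*triplet))
    return loss
```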
According to the above scheme, a result vector of the target pedestrian is obtained from the image to be identified based on the ViT network module, and a pedestrian global feature vector is extracted from the result vector; a spliced vector is acquired through the NFC module, and a pedestrian local feature vector is extracted based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch; and a pedestrian re-recognition result is determined based on the pedestrian global feature vector and the pedestrian local feature vector. Feature fusion based on the ViT network module and the auxiliary branch thus yields local feature vectors with richer and more accurate detail, and re-recognition based on both the global and local feature vectors improves the accuracy of the pedestrian re-recognition result.
In addition, the invention also provides a pedestrian re-recognition device based on the complementary feature dynamic fusion network model, referring to fig. 9, fig. 9 is a schematic functional block diagram of a first embodiment of the pedestrian re-recognition device based on the complementary feature dynamic fusion network model of the invention, as shown in fig. 9, the pedestrian re-recognition device based on the complementary feature dynamic fusion network model comprises:
the global feature extraction module 10 is configured to obtain a result vector of a target pedestrian from an image to be identified based on the ViT network module, and to extract a pedestrian global feature vector from the result vector;
the local feature extraction module 20 is configured to obtain a spliced vector through the NFC module, and to extract a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch;
the identifying module 30 is configured to determine a pedestrian re-identifying result based on the pedestrian global feature vector and the pedestrian local feature vector.
Further, the global feature extraction module 10 includes:
the result vector obtaining unit is used for obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information;
and the global feature extraction unit is used for extracting the pedestrian global feature vector from the result vector in stages based on a plurality of conversion blocks.
Further, the result vector obtaining unit includes:
the vector sequence obtaining module is used for dividing the image to be identified into a plurality of mutually overlapping image blocks through the convolution operation of the ViT network module, and converting the image blocks into a vector sequence;
and the adding unit is used for adding the position information codes and the camera information codes into the vector sequence in an element addition mode to obtain the result vector.
Further, the local feature extraction module 20 further includes:
and the integration module is used for extracting the three-dimensional features of the image to be identified based on the auxiliary branch, and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features.
Further, the integration module includes:
the three-dimensional feature extraction unit is used for extracting three-dimensional features of the input image based on the bottleneck residual error block of the CNN module;
the two-dimensional hierarchical feature obtaining unit is used for reducing the dimension of the three-dimensional feature based on the DSF module to obtain a two-dimensional hierarchical feature;
the splicing unit is used for splicing the two-dimensional hierarchical features to inheritance class vectors inherited from the main branches to obtain complementary two-dimensional features;
further, the local feature extraction module 20 includes:
the target class vector extraction unit is used for extracting a target class vector, based on the NFC module, from the target features output by the penultimate conversion block in the ViT network module;
the segmentation unit is used for dividing the vector sequence obtained by the ViT network module into a plurality of sub-vector sequences;
the spliced vector obtaining unit is used for splicing the target class vector with each sub-vector sequence respectively to obtain the spliced vectors;
and the extraction unit is used for extracting pedestrian local feature vectors in stages, based on the shared conversion block, from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch.
Further, the extraction unit includes:
the interaction unit is used for carrying out one-to-one corresponding interaction on the complementary two-dimensional features and the spliced vector in a shallow stage, and storing an interaction result as a result vector, wherein the complementary two-dimensional features comprise pedestrian texture information and edge texture information;
and the aggregation unit is used for aggregating the complementary two-dimensional features with the spliced vector in a deep stage in a mode of aggregating adjacent convolution layers to obtain a result vector, wherein the complementary two-dimensional features comprise discriminant features.
Further, the identification module 30 includes:
the characteristic splicing unit is used for splicing the pedestrian global characteristic vector and the pedestrian local characteristic vector to obtain a pedestrian re-identification characteristic vector;
and the determining unit is used for determining a pedestrian re-recognition result based on the similarity between the pedestrian re-recognition feature vector and the target pedestrian feature vector.
Further, the global feature extraction module 10 further includes:
and the loss module is used for supervising the complementary feature dynamic fusion network model based on classification loss and triplet loss in the training stage to obtain a total loss function, so as to optimize the complementary feature dynamic fusion network model based on the total loss function.
In addition, the invention further provides a computer readable storage medium, the computer readable storage medium stores a pedestrian re-identification program based on the complementary feature dynamic fusion network model, and the steps of the pedestrian re-identification method based on the complementary feature dynamic fusion network model are realized when the pedestrian re-identification program based on the complementary feature dynamic fusion network model is run by a processor, and are not repeated herein.
Compared with the prior art, in the pedestrian re-recognition method and device based on the complementary feature dynamic fusion network model provided herein, the model comprises a multi-level feature extraction main branch and an auxiliary branch, and the HFE main branch comprises a vision transformer network module and a neighborhood feature constraint module. The method comprises: obtaining a result vector of a target pedestrian from an image to be recognized based on the ViT network module, and extracting a pedestrian global feature vector from the result vector; acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch; and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector. Feature fusion based on the ViT network module and the auxiliary branch thus yields local feature vectors with richer and more accurate detail, and re-recognition based on both the global and local feature vectors improves the accuracy of the pedestrian re-recognition result.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or modifications in the structures or processes described in the specification and drawings, or the direct or indirect application of the present invention to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A pedestrian re-recognition method based on a complementary feature dynamic fusion network model, characterized in that the complementary feature dynamic fusion network model comprises a multi-level feature extraction (Hierarchical Feature Extraction, HFE) main branch and an auxiliary branch, wherein the HFE main branch comprises a vision transformer (Vision Transformer, ViT) network module and a neighborhood feature constraint (Neighborhood Feature Constraint, NFC) module;
the pedestrian re-identification method based on the complementary feature dynamic fusion network model comprises the following steps:
acquiring a result vector of a target pedestrian from an image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and complementary two-dimensional features input by the auxiliary branch;
and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
2. The method of claim 1, wherein the obtaining, based on the ViT network module, a result vector of a target pedestrian from an image to be identified, and extracting a pedestrian global feature vector from the result vector comprises:
obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information;
pedestrian global feature vectors are extracted from the result vectors in stages based on a plurality of conversion blocks.
3. The method of claim 2, wherein the obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information comprises:
dividing the image to be identified into a plurality of mutually overlapping image blocks through a convolution operation of the ViT network module, and converting the image blocks into a vector sequence;
and adding the position information codes and the camera information codes into the vector sequence in an element addition mode to obtain the result vector.
4. The method of claim 1, wherein before obtaining a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch, the method further comprises:
extracting the three-dimensional features of the image to be identified based on the auxiliary branch, and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features.
5. The method of claim 4, wherein the auxiliary branch comprises a convolutional neural network (Convolutional Neural Networks, CNN) module, a feature remodeling (Down Sampling and Flatten, DSF) module, and a shared dynamic aggregation (Shared Dynamic Aggregation, SDA) module;
the extracting the three-dimensional features of the image to be identified based on the auxiliary branch and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features comprises:
extracting three-dimensional features of an input image based on a bottleneck residual block of the CNN module;
the dimension of the three-dimensional feature is reduced based on a DSF module, and a two-dimensional hierarchical feature is obtained;
and splicing the two-dimensional hierarchical features to inheritance class vectors inherited from the main branches to obtain complementary two-dimensional features.
6. The method according to claim 1, wherein the HFE main branch comprises a ViT network module and an NFC module;
the acquiring a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch comprises:
extracting a target class vector, based on the NFC module, from the target features output by the penultimate conversion block in the ViT network module;
dividing the vector sequence obtained by the ViT network module into a plurality of sub-vector sequences;
splicing the target class vector with each sub-vector sequence respectively to obtain the spliced vectors;
extracting pedestrian local feature vectors in stages, based on a shared conversion block, from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch.
7. The method of claim 6, wherein the extracting pedestrian local feature vectors in stages from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch based on a shared conversion block comprises:
in a shallow stage, interacting the complementary two-dimensional features with the spliced vectors in one-to-one correspondence, and storing the interaction results as result vectors, wherein the complementary two-dimensional features comprise pedestrian texture information and edge texture information;
and in a deep stage, aggregating the complementary two-dimensional features with the spliced vectors by aggregating adjacent convolution layers to obtain result vectors, wherein the complementary two-dimensional features comprise discriminative features.
8. The method of claim 1, wherein the determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector comprises:
splicing the pedestrian global feature vector and the pedestrian local feature vector to obtain a pedestrian re-recognition feature vector;
and determining a pedestrian re-recognition result based on the similarity of the pedestrian re-recognition feature vector and the target pedestrian feature vector.
9. The method of claim 1, wherein before the obtaining, based on the ViT network module, a result vector of a target pedestrian from an image to be identified and extracting a pedestrian global feature vector from the result vector, the method further comprises:
supervising the complementary feature dynamic fusion network model based on classification loss and triplet loss in the training stage to obtain a total loss function, so as to optimize the complementary feature dynamic fusion network model based on the total loss function.
10. A pedestrian re-identification device based on a complementary feature dynamic fusion network model, characterized by comprising:
the global feature extraction module, used for acquiring a result vector of a target pedestrian from the image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
the local feature extraction module, used for obtaining a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch;
and the recognition module is used for determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
CN202311762029.9A 2023-12-19 2023-12-19 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model Pending CN117746462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311762029.9A CN117746462A (en) 2023-12-19 2023-12-19 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311762029.9A CN117746462A (en) 2023-12-19 2023-12-19 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Publications (1)

Publication Number Publication Date
CN117746462A 2024-03-22

Family

ID=90250513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311762029.9A Pending CN117746462A (en) 2023-12-19 2023-12-19 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Country Status (1)

Country Link
CN (1) CN117746462A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128265A (en) * 2019-12-30 2021-07-16 华为技术有限公司 Figure identification method and device
CN114299542A (en) * 2021-12-29 2022-04-08 北京航空航天大学 Video pedestrian re-identification method based on multi-scale feature fusion
WO2023134071A1 (en) * 2022-01-12 2023-07-20 平安科技(深圳)有限公司 Person re-identification method and apparatus, electronic device and storage medium
CN117095319A (en) * 2022-05-11 2023-11-21 华为技术有限公司 Target positioning method, system and electronic equipment
CN115393953A (en) * 2022-07-28 2022-11-25 深圳职业技术学院 Pedestrian re-identification method, device and equipment based on heterogeneous network feature interaction
CN116110118A (en) * 2022-11-08 2023-05-12 西安电子科技大学 Pedestrian re-recognition and gait recognition method based on space-time feature complementary fusion
CN115909201A (en) * 2022-11-11 2023-04-04 复旦大学 Method and system for re-identifying blocked pedestrians based on multi-branch joint learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JISEOB KIM ET AL.: "Data-Independent Module-Aware Pruning for Hierarchical Vision Transformers", arXiv, 2 February 2023 (2023-02-02), pages 1-21 *
YANCHAO LI ET AL.: "Heterogeneous feature-aware Transformer-CNN coupling network for person re-identification", PeerJ Computer Science, 27 September 2022 (2022-09-27), pages 1-23 *
WU SHAOJUN ET AL. (吴绍君等): "Pedestrian re-identification based on multi-level deep learning networks", Journal of Shandong Normal University (Natural Science Edition), vol. 35, no. 02, 15 June 2020 (2020-06-15), pages 208-206 *
XIONG WEI ET AL. (熊炜等): "Pedestrian re-identification method based on deep feature fusion", Computer Engineering and Science, vol. 42, no. 02, 15 February 2020 (2020-02-15), pages 358-364 *
LIAN GUOYUN ET AL. (连国云等): "Fast pedestrian detection based on saliency region detection and IWLD features", Modern Computer (Professional Edition), no. 20, 15 July 2017 (2017-07-15), pages 59-61 *

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination