CN117746462A - Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Info

Publication number
CN117746462A
CN117746462A (application number CN202311762029.9A)
Authority
CN
China
Prior art keywords
vector
pedestrian
feature
module
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311762029.9A
Other languages
Chinese (zh)
Inventor
连国云
杨金锋
连睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vocational And Technical University
Original Assignee
Shenzhen Vocational And Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vocational And Technical University filed Critical Shenzhen Vocational And Technical University
Priority to CN202311762029.9A priority Critical patent/CN117746462A/en
Publication of CN117746462A publication Critical patent/CN117746462A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-recognition method and device based on a complementary feature dynamic fusion network model. The model comprises an HFE main branch and an auxiliary branch, where the HFE main branch comprises a ViT network module and an NFC module. The method comprises: obtaining a result vector of a target pedestrian from an image to be recognized based on the ViT network module, and extracting a pedestrian global feature vector from the result vector; acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and complementary two-dimensional features input by the auxiliary branch; and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector. Feature fusion based on the ViT network module and the auxiliary branch thus yields local feature vectors with richer and more accurate detail, and re-recognition based on both the global and local feature vectors improves the accuracy of the pedestrian re-recognition result.

Description

Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model
Technical Field
The invention relates to the technical field of image processing, in particular to a pedestrian re-identification method and device based on a complementary feature dynamic fusion network model.
Background
Person re-identification (ReID) differs from recognition modes that rely on unique identifiers, such as face, iris and fingerprint recognition: it depends mainly on a person's apparent characteristics, such as clothing and pose, and has broad application prospects in security surveillance, pedestrian behavior analysis and the like. Because pedestrians move freely within a camera's field of view, unconstrained by any condition, the captured pedestrian images are strongly affected by viewpoint differences, illumination changes, object occlusion, cluttered backgrounds and other environmental factors. The appearance (resolution, pose, etc.) of the same pedestrian therefore varies greatly across images, and occlusion may even prevent a complete pedestrian image from being captured. In addition, different pedestrians often share visual similarities in clothing color, pattern and the like, adding further difficulty to recognition.
Many pedestrian re-recognition techniques exist, but they tend to rely on sufficient pose-labeled person images and a powerful pose-estimation network, and are therefore easily constrained by additional auxiliary models. Other pedestrian re-recognition techniques compute attention through convolution operations, ignoring the global information and implicit relationships of images, which limits the correlations the model can learn between features, so that the accuracy of pedestrian re-recognition is not high enough.
Disclosure of Invention
The invention provides a pedestrian re-recognition method and device based on a complementary feature dynamic fusion network model, aiming at improving the accuracy of pedestrian re-recognition.
To achieve the above object, the present invention provides a pedestrian re-recognition method based on a complementary feature dynamic fusion network model, wherein the complementary feature dynamic fusion network model includes a multi-level feature extraction (Hierarchical Feature Extraction, HFE) main branch and an auxiliary branch, and the HFE main branch includes a vision transformer (Vision Transformer, ViT) network module and a neighborhood feature constraint (Neighborhood Feature Constraint, NFC) module;
the pedestrian re-identification method based on the complementary feature dynamic fusion network model comprises the following steps:
acquiring a result vector of a target pedestrian from an image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and complementary two-dimensional features input by the auxiliary branch;
and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
Optionally, the obtaining, based on the ViT network module, a result vector of the target pedestrian from the image to be identified, and extracting a pedestrian global feature vector from the result vector includes:
obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information;
pedestrian global feature vectors are extracted from the result vectors in stages based on a plurality of conversion blocks.
Optionally, the obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information includes:
dividing the image to be identified into a plurality of mutually overlapping image blocks through a convolution operation of the ViT network module, and converting the image blocks into a vector sequence;
and adding the position information codes and the camera information codes into the vector sequence in an element addition mode to obtain the result vector.
Optionally, before the acquiring a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch, the method further includes:
extracting the three-dimensional features of the image to be identified based on the auxiliary branch, and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features.
Optionally, the auxiliary branch includes a convolutional neural network (Convolutional Neural Networks, CNN) module, a feature remodeling (Down Sampling and Flatten, DSF) module, and a shared dynamic aggregation (Shared Dynamic Aggregation, SDA) module;
the extracting the three-dimensional features of the image to be identified based on the auxiliary branch and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features includes:
extracting three-dimensional features of an input image based on a bottleneck residual block of the CNN module;
the dimension of the three-dimensional feature is reduced based on a DSF module, and a two-dimensional hierarchical feature is obtained;
and splicing the two-dimensional hierarchical features onto the inheritance class vector inherited from the main branch to obtain the complementary two-dimensional features.
Optionally, the HFE main branch includes a ViT network module and an NFC module;
the acquiring a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch includes:
extracting a target class vector, based on the NFC module, from the target features output by the penultimate conversion block in the ViT network module;
dividing the vector sequence obtained by the ViT network module into a plurality of sub-vector sequences;
splicing the target class vector with each sub-vector sequence respectively to obtain the spliced vectors;
extracting pedestrian local feature vectors in stages, based on a shared conversion block, from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch.
Optionally, the extracting pedestrian local feature vectors in stages from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch based on the shared conversion block includes:
in a shallow stage, interacting the complementary two-dimensional features with the spliced vectors in one-to-one correspondence, and storing the interaction results as result vectors, wherein the complementary two-dimensional features include pedestrian texture information and edge texture information;
and in a deep stage, aggregating the complementary two-dimensional features with the spliced vectors by aggregating adjacent convolution layers to obtain result vectors, wherein the complementary two-dimensional features include discriminative features.
Optionally, the determining the pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector includes:
splicing the pedestrian global feature vector and the pedestrian local feature vector to obtain a pedestrian re-recognition feature vector;
and determining a pedestrian re-recognition result based on the similarity of the pedestrian re-recognition feature vector and the target pedestrian feature vector.
Optionally, before the obtaining, based on the ViT network module, a result vector of the target pedestrian from the image to be identified and extracting a pedestrian global feature vector from the result vector, the method further includes:
supervising the complementary feature dynamic fusion network model based on classification loss and triplet loss in the training stage to obtain a total loss function, so as to optimize the complementary feature dynamic fusion network model based on the total loss function.
In addition, the invention also provides a pedestrian re-identification device based on the complementary feature dynamic fusion network model, which comprises:
the global feature extraction module is used for acquiring a result vector of a target pedestrian from the image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
the local feature extraction module is used for obtaining a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch;
and the recognition module is used for determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
Compared with the prior art, in the pedestrian re-recognition method and device based on the complementary feature dynamic fusion network model provided by the invention, the model comprises a multi-level feature extraction (HFE) main branch and an auxiliary branch, and the HFE main branch comprises a vision transformer network module and a neighborhood feature constraint module. The method comprises: obtaining a result vector of a target pedestrian from an image to be recognized based on the ViT network module, and extracting a pedestrian global feature vector from the result vector; acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch; and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector. Feature fusion based on the ViT network module and the auxiliary branch thus yields local feature vectors with richer and more accurate detail, and re-recognition based on both the global and local feature vectors improves the accuracy of the pedestrian re-recognition result.
Drawings
FIG. 1 is a schematic hardware architecture of a pedestrian re-recognition device based on a complementary feature dynamic fusion network model according to various embodiments of the present invention;
FIG. 2 is a flow chart of a first embodiment of a pedestrian re-recognition method based on a complementary feature dynamic fusion network model of the present invention;
FIG. 3 is a structural diagram of the complementary feature dynamic fusion network model involved in the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention;
FIG. 4 is a schematic diagram of a first refinement flow of a first embodiment of a pedestrian re-recognition method based on a complementary feature dynamic fusion network model of the present invention;
FIG. 5 is a schematic flow chart of a first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention;
FIG. 6 is a schematic diagram of a feature remodeling module involved in the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the invention;
FIG. 7 is a schematic flow chart of a second refinement of the first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention;
FIG. 8 is a pedestrian re-recognition result image related to a first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention;
fig. 9 is a schematic functional block diagram of a first embodiment of the pedestrian re-recognition device based on the complementary feature dynamic fusion network model.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The pedestrian re-identification device based on the complementary feature dynamic fusion network model is mainly a device capable of network connection, such as a server or a cloud platform.
Referring to fig. 1, fig. 1 is a schematic hardware structure of a pedestrian re-recognition device based on a complementary feature dynamic fusion network model according to various embodiments of the present invention. In an embodiment of the present invention, the device may include a processor 1001 (e.g., a Central Processing Unit, CPU), a communication bus 1002, an input port 1003, an output port 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components; the input port 1003 is used for data input; the output port 1004 is used for data output; and the memory 1005 may be a high-speed RAM memory or a stable (non-volatile) memory, such as a disk memory, and may optionally be a storage device independent of the processor 1001. Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 is not limiting of the invention and may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
With continued reference to FIG. 1, the memory 1005 of FIG. 1, as a readable storage medium, may include an operating system, a network communication module, an application module, and a pedestrian re-recognition program based on the complementary feature dynamic fusion network model. In fig. 1, the network communication module is mainly used for connecting to a server and performing data communication with the server, and the processor 1001 is configured to invoke the pedestrian re-recognition program based on the complementary feature dynamic fusion network model stored in the memory 1005 and perform the following operations:
acquiring a result vector of a target pedestrian from an image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch;
and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
As shown in fig. 2, a first embodiment of the present invention proposes a pedestrian re-recognition method based on a complementary feature dynamic fusion network model, where the method includes:
Step S101, obtaining a result vector of a target pedestrian from an image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
in order to improve the accuracy of pedestrian recognition, a complementary feature dynamic fusion network model is proposed that exploits the representation-modeling capabilities of convolutional neural networks (Convolutional Neural Networks, CNN) and the vision transformer (Vision Transformer, ViT). The complementary feature dynamic fusion network model includes a multi-level feature extraction (Hierarchical Feature Extraction, HFE) main branch and an auxiliary branch; the HFE main branch includes a vision transformer (ViT) network module and a neighborhood feature constraint (Neighborhood Feature Constraint, NFC) module, and the auxiliary branch includes a convolutional neural network (CNN) module, a feature remodeling (Down Sampling and Flatten, DSF) module and a shared dynamic aggregation (Shared Dynamic Aggregation, SDA) module. A CNN has feature-learning (representation-learning) capability and can perform translation-invariant classification of an input image through its hierarchical structure; ViT is able to model global information and has excellent performance. The complementary feature dynamic fusion network model is shown in fig. 3, which is a structural diagram of the model involved in the pedestrian re-recognition method of the present invention. The network model consists of two branches whose extracted feature preferences differ: the auxiliary branch provides feature complementation for the main branch, and, to ensure the training effect of the complementary feature dynamic fusion network, a neighborhood feature constraint (NFC) module is additionally introduced into the HFE main branch so that the network concentrates on extracting local key features.
Referring to fig. 4, fig. 4 is a schematic diagram of a first refinement flow of the first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model according to the present invention; as shown in fig. 4, the step S101 includes:
Step S1011, obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information:
dividing the image to be identified into a plurality of mutually overlapping image blocks through the convolution operation of the ViT network module, and converting the image blocks into a vector sequence;
and adding the position-information codes and the camera-information codes to the vector sequence by element-wise addition to obtain the result vector.
In the ViT network of the HFE main branch, for a given image to be identified $x \in \mathbb{R}^{H \times W \times C}$, where H, W and C respectively denote the height, width and number of channels of the image, the image x is cut by a convolution operation into N mutually overlapping image blocks of size P×P, and the image blocks are converted into a vector sequence so that feature correlations can subsequently be computed on it. The length of the vector sequence is

$$N = \left\lfloor \frac{H-P}{S} + 1 \right\rfloor \times \left\lfloor \frac{W-P}{S} + 1 \right\rfloor,$$

where S denotes the stride of the convolution operation, S < P, and $(P^2 \times C)$ is the dimension of each vector in the sequence.
A trainable linear mapping is then used to map the $(P^2 \times C)$-dimensional vector sequence to d dimensions; in this embodiment, the value of d is taken as 2.
After the d-dimensional vector sequence is obtained, a learnable one-dimensional position-information code is added to the vector sequence by element-wise addition, following the standard Vision Transformer, and a camera-information code is added in the same way, giving the result vector

$$Z = [z_{cls};\, z_1;\, z_2;\, \dots;\, z_N] + E_{pos} + E_{cam},$$

where $z_{cls}$ is the prepended class token, the $z_i$ are the mapped block vectors, $E_{pos}$ is the position-information code and $E_{cam}$ is the camera-information code.
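For concreteness, the overlapping patch embedding described above can be sketched as follows (a minimal PyTorch sketch, not the patent's implementation; the class name `OverlapPatchEmbed`, the embedding dimension `embed_dim=768` and the camera count `num_cams` are illustrative assumptions):

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Cut an image into N mutually overlapping P x P blocks (stride S < P)
    and map each block to an embed_dim-dimensional vector."""
    def __init__(self, img_size=(256, 128), patch=16, stride=12,
                 in_ch=3, embed_dim=768, num_cams=6):
        super().__init__()
        h = (img_size[0] - patch) // stride + 1
        w = (img_size[1] - patch) // stride + 1
        self.num_patches = h * w                      # N in the formula above
        # The convolution both cuts the overlapping blocks and applies the
        # trainable linear mapping from (P*P*C) to embed_dim dimensions.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=stride)
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))
        self.cam = nn.Parameter(torch.zeros(num_cams, 1, embed_dim))

    def forward(self, x, cam_id):
        z = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, embed_dim)
        cls = self.cls.expand(z.size(0), -1, -1)
        z = torch.cat([cls, z], dim=1)                # prepend class token
        # Element-wise addition of position- and camera-information codes.
        return z + self.pos + self.cam[cam_id]
```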
Step S1012, extracting pedestrian global feature vectors from the result vectors in stages based on a plurality of conversion blocks.
After the result vector is obtained, pedestrian global features are extracted in stages through the 12 conversion blocks (Transformer Block) of the ViT network. With continued reference to fig. 3, the shallow SDA modules (SDA_{1-1} × 9) and the deep SDA modules (SDA_{2-1} × 2 and SDA_{3-1} × 1) in the ViT network respectively take part in extracting the pedestrian global features. The class token (CLS) of the ViT main branch encodes the statistical properties of the whole token collection and thus serves as an overall characterization of pedestrian identity.
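For orientation only, the staged extraction through the 12 conversion blocks can be pictured as below (a rough sketch under the assumption of standard Transformer encoder layers; the SDA fusion interleaved between stages in fig. 3 is deliberately omitted):

```python
import torch.nn as nn

blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for _ in range(12)   # 12 conversion blocks: 9 shallow, then 2 + 1 deep
])

def global_feature(z):
    """z: (B, N + 1, d) result vector; returns the pedestrian global feature."""
    for blk in blocks:
        z = blk(z)       # SDA fusion with the auxiliary branch would be
                         # interleaved between stages (fig. 3); omitted here.
    return z[:, 0]       # the CLS token serves as the global characterization
```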
Step S102, a spliced vector is obtained through the NFC module, and a pedestrian local feature vector is extracted based on the spliced vector and supplementary two-dimensional features input by auxiliary branches;
in this embodiment, the complementary two-dimensional features need to be obtained in advance from the auxiliary branch. Referring to fig. 5, fig. 5 is a schematic flow diagram of the first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model according to the present invention; as shown in fig. 5, before step S102 the method further includes:
Step S1002, extracting the three-dimensional features of the image to be identified based on the auxiliary branch, and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features.
The three-dimensional characteristics of the input image are extracted based on the bottleneck residual block of the CNN module;
the dimension of the three-dimensional feature is reduced based on a DSF (feature remodeling) module, and a two-dimensional hierarchical feature is obtained;
and splicing the two-dimensional hierarchical features onto the inheritance class vector inherited from the main branch to obtain the complementary two-dimensional features.
The auxiliary branch includes a CNN convolutional neural network (Convolutional Neural Networks, CNN) module, a feature remodeling (Down Sampling and Flatten, DSF) module and a shared dynamic aggregation (Shared Dynamic Aggregation, SDA) module. The auxiliary branch consists of a CNN network stacked from bottleneck residual modules built out of small convolutions. This embodiment denotes the features of different convolution layers as $H_i \in \mathbb{R}^{h \times w \times c}$, where h, w and c respectively represent the height, width and channel number of the feature map. To further improve the ability of the ViT network in the HFE main branch to select complementary features from the auxiliary branch and to activate the features of highly discriminative regions, this embodiment proposes a shared dynamic aggregation module (Shared Dynamic Aggregation, SDA) composed of attention modules, which provides the HFE with detail information at different scales.
In a specific implementation, before heterogeneous features can be dynamically fused, the mismatch between the heterogeneous features (i.e., $H_i$ in the auxiliary branch and the result vector Z in the main branch) must be resolved. In the feature dimension, the features $H_i$ captured in the auxiliary branch are three-dimensional, and their height h, width w and channel number c differ from stage to stage, whereas the features in the HFE are two-dimensional and their shape remains unchanged across the stages of the auxiliary branch. Therefore, before entering any SDA module, the current three-dimensional hierarchical feature $H_i$ is converted into a fixed two-dimensional feature $f_i$ by the feature remodeling module (Down Sampling and Flatten, DSF: down-sampling and flattening). Referring to fig. 6, a schematic diagram of the feature remodeling module involved in the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the invention, the feature remodeling module comprises a convolution layer Conv (followed by normalization), an average pooling layer AvgPool, a Flatten operation and a Transpose operation, as illustrated in fig. 6.
The two-dimensional feature $f_i$ inherits the class vector of the main branch; the inherited class vector is defined as the inheritance class vector (ICLS), and it is spliced onto each two-dimensional feature $f_i$ to obtain the complementary two-dimensional features. This realizes the fusion of the CNN network and the ViT network, and the joint exploitation of local and global features.
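A minimal sketch of the DSF remodeling and ICLS splicing might look as follows (the 1×1 convolution, the layer sizes and the fixed output size are illustrative assumptions; the patent fixes only the Conv → AvgPool → Flatten → Transpose order):

```python
import torch
import torch.nn as nn

class DSF(nn.Module):
    """Down Sampling and Flatten: turn a 3-D CNN feature H_i of shape
    (B, c, h, w) into a fixed 2-D token sequence f_i of shape (B, M, d)."""
    def __init__(self, in_ch, d=768, out_hw=(16, 8)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, d, kernel_size=1)  # align channels to d
        self.norm = nn.BatchNorm2d(d)                   # post-conv normalization
        self.pool = nn.AdaptiveAvgPool2d(out_hw)        # down-sample to fixed h' x w'

    def forward(self, H_i, icls):
        f = self.pool(self.norm(self.conv(H_i)))        # (B, d, h', w')
        f = f.flatten(2).transpose(1, 2)                # Flatten + Transpose -> (B, M, d)
        # Splice the inheritance class vector (ICLS) from the main branch onto
        # the 2-D feature, giving the complementary two-dimensional features.
        return torch.cat([icls, f], dim=1)              # (B, M + 1, d)
```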
After the complementary two-dimensional features are obtained, pedestrian local feature vectors are extracted based on the spliced vectors and the complementary two-dimensional features input by the auxiliary branch. Referring specifically to fig. 7, fig. 7 is a schematic flow chart of the second refinement of the first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention; as shown in fig. 7, the step S102 includes:
Step S1021, acquiring a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch, which includes:
Step S1022, extracting a target class vector, based on the NFC module, from the target features output by the penultimate conversion block in the ViT network module;
Step S1023, dividing the vector sequence obtained by the ViT network module into a plurality of sub-vector sequences;
Step S1024, splicing the target class vector with each sub-vector sequence respectively to obtain the spliced vectors;
Step S1025, extracting pedestrian local feature vectors in stages, based on the shared conversion block, from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch.
In order to extract fine-grained local pedestrian features, this embodiment adds an NFC module to the HFE main branch and extracts, based on the NFC module, the target class vector from the target features output by the penultimate conversion block in the ViT network module.
The vector sequence is divided into S groups along its length direction, giving S sub-vector sequences corresponding to S local neighborhood features, where S refers to the proportion of the neighborhood feature region in the last hidden-layer feature of the ViT network in the HFE main branch. The vector sequence of this embodiment is expressed as

$$Z = [z_{cls};\, Z_1;\, Z_2;\, \dots;\, Z_S],$$

where each $Z_s$ is one sub-vector sequence. Each sub-vector sequence representing a local neighborhood feature is spliced with the target class vector to obtain a spliced vector, expressed as

$$Z'_s = [z^{t}_{cls};\, Z_s], \quad s = 1, 2, \dots, S,$$

where $z^{t}_{cls}$ denotes the target class vector. In this way, local neighborhood features that include the target class vector features are obtained.
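The grouping and splicing can be sketched as follows (assuming, for illustration, that S evenly divides the number of patch tokens; variable names are illustrative):

```python
import torch

def nfc_splice(tokens, target_cls, S):
    """tokens:     (B, N, d) patch tokens of the ViT network (class token removed)
    target_cls: (B, 1, d) target class vector from the penultimate conversion block
    Returns the S spliced vectors, each of shape (B, N // S + 1, d)."""
    B, N, d = tokens.shape
    groups = tokens.split(N // S, dim=1)                 # S sub-vector sequences
    return [torch.cat([target_cls, g], dim=1) for g in groups]
```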
In the shared conversion block, which is shared by the spliced vectors and the complementary two-dimensional features input by the auxiliary branch, the correlations of the features within each local region are computed, and the S pedestrian local feature vectors are extracted in stages.
specifically, in a shallow stage, the complementary two-dimensional features and the spliced vector are interacted in a one-to-one correspondence manner, and an interaction result is stored as a result vector, wherein the complementary two-dimensional features comprise pedestrian texture information and edge texture information;
and in a deep stage, the complementary two-dimensional features are aggregated with the spliced vector by aggregating adjacent convolution layers to obtain a result vector, wherein the complementary two-dimensional features comprise discriminative features.
This embodiment learns, within the SDA module, features that have dependency and complementarity with the class vector CLS. The SDA module consists of a multi-head attention mechanism, a multi-layer perceptron, layer normalization and residual connections.
The need for complementary features differs at different stages of the ViT network in the HFE main branch. In the shallow layers of the HFE, sufficient detail cues help build high-quality body-part relationships; the pedestrian texture information and edge texture information captured by the shallow layers of the auxiliary branch are relatively complete and the pedestrian semantic information is clear, so for the first 9 Transformer Blocks of the HFE the complementary hierarchical features interact with the Transformer Block output features in one-to-one correspondence, as shown by SDA_{1-1} in fig. 3. In the deep layers of the HFE, diversified discriminative features need to be concentrated on coarse pedestrian body parts; the features extracted by the deep layers of the auxiliary branch are abstract, and different channels extract discriminative features of different pedestrian modes. In the deep layers of the ViT network, the auxiliary branch features can activate the discriminative features in the HFE and increase the convergence rate, which is achieved by aggregating the features of adjacent convolution layers, as shown by SDA_{2-1} and SDA_{3-1} in fig. 3.
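The SDA components named above (multi-head attention, multi-layer perceptron, layer normalization, residual connections) could be realized roughly as below, reading the fusion as cross-attention from the main-branch tokens to the complementary features; this is one plausible reading, not the patent's exact design:

```python
import torch.nn as nn

class SDA(nn.Module):
    """Shared Dynamic Aggregation: main-branch tokens attend to the
    complementary two-dimensional features of the auxiliary branch."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.norm_q = nn.LayerNorm(d)
        self.norm_kv = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, z, f):
        """z: (B, N + 1, d) main-branch tokens; f: (B, M, d) complementary features."""
        q, kv = self.norm_q(z), self.norm_kv(f)
        z = z + self.attn(q, kv, kv, need_weights=False)[0]  # residual connection
        return z + self.mlp(self.norm2(z))                   # residual connection
```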
And step S103, determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
Specifically, the pedestrian global feature vector and the pedestrian local feature vector are spliced to obtain a pedestrian re-identification feature vector; and determining a pedestrian re-recognition result based on the similarity of the pedestrian re-recognition feature vector and the target pedestrian feature vector.
This embodiment splices the obtained pedestrian global feature vector and pedestrian local feature vectors to obtain the final pedestrian re-identification feature vector. Each image to be identified is compared for similarity with the target pedestrian image to obtain a recognition distance matrix, and the image to be identified with the smallest distance to the target pedestrian image is regarded as the most probable candidate pedestrian. The Euclidean distance between two samples in the embedding space is typically used as the similarity measure, since it is the simplest and easiest to understand.
This embodiment defines $x_p$ and $x_g$ as the target pedestrian image (probe) and an image to be identified in the candidate set $\Gamma$, respectively, and $f_p$ and $f_g$ as the pedestrian re-recognition feature vectors of the target pedestrian image and of the image to be identified. The Euclidean distance between $f_p$ and $f_g$, denoted $D_{pg}$, is

$$D_{pg} = \lVert f_p - f_g \rVert_2 .$$
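In practice the full recognition distance matrix and ranking follow directly from this distance (a small usage sketch with illustrative names):

```python
import torch

def rank_candidates(f_probe, f_gallery):
    """f_probe: (Q, d) probe features; f_gallery: (G, d) candidate-set features.
    Returns the (Q, G) distance matrix and candidate indices sorted by distance."""
    dist = torch.cdist(f_probe, f_gallery, p=2)   # Euclidean distance D_pg
    return dist, dist.argsort(dim=1)              # smallest distance = best match
```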
the method is used for re-identification, and an accurate identification result can be obtained. Referring to fig. 8, fig. 8 is a pedestrian re-recognition result image related to a first embodiment of the pedestrian re-recognition method based on the complementary feature dynamic fusion network model of the present invention; as can be seen from the recognition result based on fig. 8, the number of error images retrieved based on the base line network of Vision Transformer (ViT-BoT) is larger than the model proposed by the present invention. Furthermore, among the images retrieved in the present application, the first few images (especially rank-5) are more accurate than the base line network based on Vision Transformer (ViT-BoT). The average accuracy average value reaches 66.6%, which shows the effectiveness and accuracy of pedestrian re-identification based on the complementary feature dynamic fusion network model.
The pedestrian re-recognition task based on deep learning is mainly divided into two stages: the first is training the network; the second is using the trained model to extract features of the target pedestrian and of the pedestrian images in the candidate set for similarity measurement. In the training phase, this embodiment sets only one classifier, for the HFE main branch of the complementary feature dynamic fusion network model; the loss calculated by this main-branch classifier optimizes the whole network.
Specifically, before step S101, the complementary feature dynamic fusion network model is supervised based on classification loss and triplet loss in a training phase, so as to obtain a total loss function, so as to optimize the complementary feature dynamic fusion network model based on the total loss function.
In the training phase, the complementary feature dynamic fusion network model is supervised using classification (ID) loss and triplet loss. Denoting the total loss function of the model by L, we have

$$L = L_{ID}(CLS) + L_{tri}(CLS) + \alpha \sum_{s=1}^{S}\left( L_{ID}(NF_s) + L_{tri}(NF_s) \right),$$

where α is a balance factor, typically set to 1; $L_{ID}$ denotes the classification loss and $L_{tri}$ the triplet loss, so that $L_{ID}(CLS)$ and $L_{tri}(CLS)$ are the classification and triplet losses of the CLS, and $L_{ID}(NF_s)$ and $L_{tri}(NF_s)$ are the classification and triplet losses of the s-th local neighborhood feature. In the inference phase, the features in the HFE branch and the NFC sub-branch are spliced together as the final representation of the pedestrian, and a candidate list matching the pedestrian to be searched is obtained from the similarity matrix of this final representation.
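A hedged sketch of this training-phase supervision, using the standard cross-entropy and triplet losses from PyTorch (the margin value and the pre-mined triplets are illustrative assumptions):

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()              # classification (ID) loss L_ID
tri = nn.TripletMarginLoss(margin=0.3)  # triplet loss L_tri

def total_loss(cls_logits, cls_triplet, local_logits, local_triplets,
               labels, alpha=1.0):
    """cls_triplet: (anchor, positive, negative) CLS embeddings; local_logits and
    local_triplets hold the corresponding terms for the S neighborhood features."""
    loss = ce(cls_logits, labels) + tri(*cls_triplet)
    for logits, triplet in zip(local_logits, local_triplets):
        loss = loss + alpha * (ce(logits, labels) + tri(*triplet))
    return loss
```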
According to the above scheme, a result vector of the target pedestrian is obtained from the image to be identified based on the ViT network module, and a pedestrian global feature vector is extracted from the result vector; a spliced vector is acquired through the NFC module, and a pedestrian local feature vector is extracted based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch; and a pedestrian re-recognition result is determined based on the pedestrian global feature vector and the pedestrian local feature vector. Feature fusion based on the ViT network module and the auxiliary branch thus yields local feature vectors with richer and more accurate detail, and re-recognition based on both the global and local feature vectors improves the accuracy of the pedestrian re-recognition result.
In addition, the invention also provides a pedestrian re-recognition device based on the complementary feature dynamic fusion network model, referring to fig. 9, fig. 9 is a schematic functional block diagram of a first embodiment of the pedestrian re-recognition device based on the complementary feature dynamic fusion network model of the invention, as shown in fig. 9, the pedestrian re-recognition device based on the complementary feature dynamic fusion network model comprises:
the global feature extraction module 10 is configured to obtain a result vector of a target pedestrian from an image to be identified based on the ViT network module, and to extract a pedestrian global feature vector from the result vector;
the local feature extraction module 20 is configured to obtain a spliced vector through the NFC module, and to extract a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch;
the identifying module 30 is configured to determine a pedestrian re-identifying result based on the pedestrian global feature vector and the pedestrian local feature vector.
Further, the global feature extraction module 10 includes:
the result vector obtaining unit is used for obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information;
and the global feature extraction unit is used for extracting the pedestrian global feature vector from the result vector in stages based on a plurality of conversion blocks.
Further, the result vector obtaining unit includes:
the vector sequence obtaining module is used for dividing the image to be identified into a plurality of mutually overlapping image blocks through the convolution operation of the ViT network module, and converting the image blocks into a vector sequence;
and the adding unit is used for adding the position information codes and the camera information codes into the vector sequence in an element addition mode to obtain the result vector.
Further, the local feature extraction module 20 further includes:
and the integration module is used for extracting the three-dimensional features of the image to be identified based on the auxiliary branch, and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features.
Further, the integration module includes:
the three-dimensional feature extraction unit is used for extracting three-dimensional features of the input image based on the bottleneck residual error block of the CNN module;
the two-dimensional hierarchical feature obtaining unit is used for reducing the dimension of the three-dimensional feature based on the DSF module to obtain a two-dimensional hierarchical feature;
the splicing unit is used for splicing the two-dimensional hierarchical features to inheritance class vectors inherited from the main branches to obtain complementary two-dimensional features;
further, the local feature extraction module 20 includes:
the target class vector extraction unit is used for extracting a target class vector, based on the NFC module, from the target features output by the penultimate conversion block in the ViT network module;
the segmentation unit is used for dividing the vector sequence obtained by the ViT network module into a plurality of sub-vector sequences;
the spliced vector obtaining unit is used for splicing the target class vector with each sub-vector sequence respectively to obtain the spliced vectors;
and the extraction unit is used for extracting pedestrian local feature vectors in stages, based on the shared conversion block, from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch.
Further, the extraction unit includes:
the interaction unit is used for carrying out one-to-one corresponding interaction on the complementary two-dimensional features and the spliced vector in a shallow stage, and storing an interaction result as a result vector, wherein the complementary two-dimensional features comprise pedestrian texture information and edge texture information;
and the aggregation unit is used for aggregating the complementary two-dimensional features with the spliced vector in a deep stage in a mode of aggregating adjacent convolution layers to obtain a result vector, wherein the complementary two-dimensional features comprise discriminant features.
Further, the identification module 30 includes:
the characteristic splicing unit is used for splicing the pedestrian global characteristic vector and the pedestrian local characteristic vector to obtain a pedestrian re-identification characteristic vector;
and the determining unit is used for determining a pedestrian re-recognition result based on the similarity between the pedestrian re-recognition feature vector and the target pedestrian feature vector.
Further, the global feature extraction module 10 further includes:
and the loss module is used for supervising the complementary feature dynamic fusion network model based on classification loss and triplet loss in the training stage to obtain a total loss function, so as to optimize the complementary feature dynamic fusion network model based on the total loss function.
In addition, the invention further provides a computer readable storage medium, the computer readable storage medium stores a pedestrian re-identification program based on the complementary feature dynamic fusion network model, and the steps of the pedestrian re-identification method based on the complementary feature dynamic fusion network model are realized when the pedestrian re-identification program based on the complementary feature dynamic fusion network model is run by a processor, and are not repeated herein.
Compared with the prior art, in the pedestrian re-recognition method and device based on the complementary feature dynamic fusion network model provided herein, the model comprises a multi-level feature extraction main branch and an auxiliary branch, and the HFE main branch comprises a vision transformer network module and a neighborhood feature constraint module. The method comprises: obtaining a result vector of a target pedestrian from an image to be recognized based on the ViT network module, and extracting a pedestrian global feature vector from the result vector; acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch; and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector. Feature fusion based on the ViT network module and the auxiliary branch thus yields local feature vectors with richer and more accurate detail, and re-recognition based on both the global and local feature vectors improves the accuracy of the pedestrian re-recognition result.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or modifications in the structures or processes described in the specification and drawings, or the direct or indirect application of the present invention to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A pedestrian re-recognition method based on a complementary feature dynamic fusion network model, characterized in that the complementary feature dynamic fusion network model comprises a multi-level feature extraction (Hierarchical Feature Extraction, HFE) main branch and an auxiliary branch, wherein the HFE main branch comprises a vision transformer (Vision Transformer, ViT) network module and a neighborhood feature constraint (Neighborhood Feature Constraint, NFC) module;
the pedestrian re-identification method based on the complementary feature dynamic fusion network model comprises the following steps:
acquiring a result vector of a target pedestrian from an image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
acquiring a spliced vector through the NFC module, and extracting a pedestrian local feature vector based on the spliced vector and complementary two-dimensional features input by the auxiliary branch;
and determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
2. The method of claim 1, wherein the obtaining, based on the ViT network module, a result vector of a target pedestrian from an image to be identified, and extracting a pedestrian global feature vector from the result vector comprises:
obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information;
pedestrian global feature vectors are extracted from the result vectors in stages based on a plurality of conversion blocks.
3. The method of claim 2, wherein the obtaining, by the ViT network module in the HFE main branch, a result vector of the pedestrian to be identified based on the image to be identified and basic information comprises:
dividing the image to be identified into a plurality of mutually overlapping image blocks through a convolution operation of the ViT network module, and converting the image blocks into a vector sequence;
and adding the position information codes and the camera information codes into the vector sequence in an element addition mode to obtain the result vector.
4. The method of claim 1, wherein before obtaining a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch, the method further comprises:
extracting the three-dimensional features of the image to be identified based on the auxiliary branch, and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features.
5. The method of claim 4, wherein the auxiliary branch comprises a convolutional neural network (Convolutional Neural Networks, CNN) module, a feature remodeling (Down Sampling and Flatten, DSF) module, and a shared dynamic aggregation (Shared Dynamic Aggregation, SDA) module;
the extracting the three-dimensional features of the image to be identified based on the auxiliary branch and integrating the two-dimensional features corresponding to the three-dimensional features with the inheritance class vector obtained from the main branch to obtain the complementary two-dimensional features comprises:
extracting three-dimensional features of an input image based on a bottleneck residual block of the CNN module;
the dimension of the three-dimensional feature is reduced based on a DSF module, and a two-dimensional hierarchical feature is obtained;
and splicing the two-dimensional hierarchical features to inheritance class vectors inherited from the main branches to obtain complementary two-dimensional features.
6. The method according to claim 1, wherein the HFE main branch comprises a ViT network module and an NFC module;
the acquiring a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch comprises:
extracting a target class vector, based on the NFC module, from the target features output by the penultimate conversion block in the ViT network module;
dividing the vector sequence obtained by the ViT network module into a plurality of sub-vector sequences;
splicing the target class vector with each sub-vector sequence respectively to obtain the spliced vectors;
extracting pedestrian local feature vectors in stages, based on a shared conversion block, from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch.
7. The method of claim 6, wherein the extracting pedestrian local feature vectors in stages from the spliced vectors and the complementary two-dimensional features input by the auxiliary branch based on a shared conversion block comprises:
in a shallow stage, interacting the complementary two-dimensional features with the spliced vectors in one-to-one correspondence, and storing the interaction results as result vectors, wherein the complementary two-dimensional features comprise pedestrian texture information and edge texture information;
and in a deep stage, aggregating the complementary two-dimensional features with the spliced vectors by aggregating adjacent convolution layers to obtain result vectors, wherein the complementary two-dimensional features comprise discriminative features.
8. The method of claim 1, wherein the determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector comprises:
splicing the pedestrian global feature vector and the pedestrian local feature vector to obtain a pedestrian re-recognition feature vector;
and determining a pedestrian re-recognition result based on the similarity of the pedestrian re-recognition feature vector and the target pedestrian feature vector.
9. The method of claim 1, wherein before the obtaining, based on the ViT network module, a result vector of a target pedestrian from an image to be identified and extracting a pedestrian global feature vector from the result vector, the method further comprises:
supervising the complementary feature dynamic fusion network model based on classification loss and triplet loss in the training stage to obtain a total loss function, so as to optimize the complementary feature dynamic fusion network model based on the total loss function.
10. A pedestrian re-identification device based on a complementary feature dynamic fusion network model, characterized by comprising:
the global feature extraction module, used for acquiring a result vector of a target pedestrian from the image to be identified based on the ViT network module, and extracting a pedestrian global feature vector from the result vector;
the local feature extraction module, used for obtaining a spliced vector through the NFC module and extracting a pedestrian local feature vector based on the spliced vector and the complementary two-dimensional features input by the auxiliary branch;
and the recognition module is used for determining a pedestrian re-recognition result based on the pedestrian global feature vector and the pedestrian local feature vector.
CN202311762029.9A 2023-12-19 2023-12-19 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model Pending CN117746462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311762029.9A CN117746462A (en) 2023-12-19 2023-12-19 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311762029.9A CN117746462A (en) 2023-12-19 2023-12-19 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Publications (1)

Publication Number Publication Date
CN117746462A 2024-03-22

Family

ID=90250513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311762029.9A Pending CN117746462A (en) 2023-12-19 2023-12-19 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Country Status (1)

Country Link
CN (1) CN117746462A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128265A (en) * 2019-12-30 2021-07-16 华为技术有限公司 Figure identification method and device
CN114299542A (en) * 2021-12-29 2022-04-08 北京航空航天大学 Video pedestrian re-identification method based on multi-scale feature fusion
WO2023134071A1 (en) * 2022-01-12 2023-07-20 平安科技(深圳)有限公司 Person re-identification method and apparatus, electronic device and storage medium
CN117095319A (en) * 2022-05-11 2023-11-21 华为技术有限公司 Target positioning method, system and electronic equipment
CN115393953A (en) * 2022-07-28 2022-11-25 深圳职业技术学院 Pedestrian re-identification method, device and equipment based on heterogeneous network feature interaction
CN116110118A (en) * 2022-11-08 2023-05-12 西安电子科技大学 Pedestrian re-recognition and gait recognition method based on space-time feature complementary fusion
CN115909201A (en) * 2022-11-11 2023-04-04 复旦大学 Method and system for re-identifying blocked pedestrians based on multi-branch joint learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JISEOB KIM ET AL.: "Data-Independent Module-Aware Pruning for Hierarchical Vision Transformers", arXiv, 2 February 2023 (2023-02-02), pages 1-21 *
YANCHAO LI ET AL.: "Heterogeneous feature-aware Transformer-CNN coupling network for person re-identification", PeerJ Computer Science, 27 September 2022 (2022-09-27), pages 1-23 *
WU SHAOJUN ET AL. (吴绍君等): "Pedestrian re-identification based on multi-level deep learning networks", Journal of Shandong Normal University (Natural Science Edition), vol. 35, no. 02, 15 June 2020 (2020-06-15), pages 208-206 *
XIONG WEI ET AL. (熊炜等): "Pedestrian re-identification method based on deep feature fusion", Computer Engineering and Science, vol. 42, no. 02, 15 February 2020 (2020-02-15), pages 358-364 *
LIAN GUOYUN ET AL. (连国云等): "Fast pedestrian detection based on saliency region detection and IWLD features", Modern Computer (Professional Edition), no. 20, 15 July 2017 (2017-07-15), pages 59-61 *

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination