CN115393953B - Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction - Google Patents


Info

Publication number
CN115393953B
CN115393953B
Authority
CN
China
Prior art keywords
pedestrian
recognition
heterogeneous
branch
model
Prior art date
Legal status
Active
Application number
CN202210897792.1A
Other languages
Chinese (zh)
Other versions
CN115393953A (en)
Inventor
连国云
李焱超
张文宇
杨金锋
Current Assignee
Shenzhen Polytechnic
Original Assignee
Shenzhen Polytechnic
Priority date
Filing date
Publication date
Application filed by Shenzhen Polytechnic
Priority: CN202210897792.1A (published as CN115393953B)
Priority: PCT/CN2022/121269 (published as WO2024021283A1)
Publication of CN115393953A
Application granted
Publication of CN115393953B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06V40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction, belonging to the technical field of image processing. The method comprises the following steps: designing a pedestrian re-recognition initial model based on the heterogeneous network features of a convolutional neural network and a vision transformer; calculating a loss value of the pedestrian re-recognition initial model based on a dual loss, determining from the loss value that the initial model has converged, stopping training, and obtaining a pedestrian re-recognition model; and re-recognizing a target pedestrian image based on the pedestrian re-recognition model. Because the pedestrian re-recognition model is built on the heterogeneous network features of a convolutional neural network and a vision transformer, and shallow features are fused with deep features, both the basic features and the global features of the image can be exploited, a large number of image features can be obtained, and the recognition result is more accurate.

Description

Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction.
Background
As a biometric recognition technology, pedestrian re-identification (ReID) differs from techniques that rely on unique traits, such as face, iris and fingerprint recognition: it depends mainly on the apparent features of pedestrians and is closely tied to appearance attributes such as clothing and posture. It therefore has broad prospects in applications such as security investigation and pedestrian behaviour analysis.
Because pedestrians move freely within a camera's field of view, image capture is not constrained by any condition. Captured pedestrian images are strongly affected by differences in viewing angle, illumination changes, object occlusion, cluttered backgrounds and other environmental factors, so the appearance (resolution, posture, etc.) of the same pedestrian differs greatly across images; under occlusion, a complete pedestrian image may not be captured at all. In addition, different individuals are often visually similar in clothing colour, style and so on, which adds further difficulty to recognition.
Current pedestrian re-identification techniques mainly feed an image directly into a convolutional neural network to obtain global pedestrian features and use those global features for recognition. However, owing to noise, an insufficient number of features and other factors, good recognition results are often not obtained. Although improved methods exist, in practice it is difficult to obtain enough pose-annotated person images and robust pose estimates, and such methods are easily limited by the additional auxiliary model they require. Coarse partitioning schemes cannot guarantee that the resulting blocks are effective, nor can they filter out ineffective blocks, so judgements about outliers are highly error-prone. Moreover, attention obtained through convolution operations ignores the global information and implicit relationships in an image, which limits the model's ability to learn correlations between features.
Disclosure of Invention
The invention provides a pedestrian re-recognition method, device, equipment and storage medium based on heterogeneous network feature interaction, and aims to solve the technical problem of poor effect caused by pedestrian re-recognition by using global features and improve the accuracy of pedestrian re-recognition.
In order to achieve the above purpose, the present invention provides a pedestrian re-recognition method based on heterogeneous network feature interaction, the method comprising:
designing a pedestrian re-identification initial model based on heterogeneous network characteristics of a convolutional neural network and a visual transformer;
calculating a loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stopping training based on the loss value, and obtaining a pedestrian re-recognition model;
re-identifying the target pedestrian image based on the pedestrian re-identification model.
Optionally, the step of designing the pedestrian re-recognition initial model based on the heterogeneous network characteristics of the convolutional neural network and the visual transformer includes:
constructing a convolutional neural network branch of the pedestrian re-recognition initial model; and
constructing a visual transformer branch of the pedestrian re-identification initial model;
and fusing the shallow heterogeneous characteristics of the convolutional neural network branch and the deep heterogeneous characteristics of the visual transformer branch to obtain the pedestrian re-identification initial model.
Optionally, the constructing the visual transformer branch of the pedestrian re-recognition initial model includes:
representing an input pedestrian image as an image block sequence including a plurality of image blocks;
performing linear mapping on the image block sequence to obtain a plurality of D-dimensional embedded representations of the image blocks;
concatenating a class token with a plurality of said D-dimensional embedded representations and adding a position code and a camera code for each of said image blocks to produce a sequence of embedded image blocks;
and sequentially processing the embedded image block sequence through normalization, a multi-head attention mechanism and a multi-layer perceptron to obtain the visual transformer branch.
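The four steps above can be sketched as follows. This is a NumPy stand-in, not the patent's implementation: the function name `embed_patches`, the random initialisation of the learnable parameters and the camera count are illustrative assumptions.

```python
import numpy as np

def embed_patches(image, patch, D, n_cameras=6, cam_id=0, rng=None):
    """Sketch: split an H x W x C image into p x p blocks, linearly map each
    block to a D-dim embedding, prepend a class token, and add a position
    encoding and a camera encoding to form the embedded block sequence."""
    rng = np.random.default_rng(0) if rng is None else rng
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # flatten each p x p x C block into a (p*p*C)-vector
    blocks = [
        image[i:i + patch, j:j + patch].reshape(-1)
        for i in range(0, H, patch)
        for j in range(0, W, patch)
    ]
    X = np.stack(blocks)                               # (N, p*p*C)
    N = X.shape[0]
    E = rng.standard_normal((patch * patch * C, D))    # learnable embedding matrix
    x_cls = rng.standard_normal((1, D))                # learnable class token
    tokens = np.vstack([x_cls, X @ E])                 # (N+1, D)
    pos = rng.standard_normal((N + 1, D))              # position encoding
    cam = rng.standard_normal((n_cameras, D))[cam_id]  # camera encoding (per camera)
    return tokens + pos + cam

# 8x4 toy image with 2x2 patches -> N = 8 blocks, plus one class token
z0 = embed_patches(np.zeros((8, 4, 3)), patch=2, D=16)
```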
Optionally, the fusing the shallow heterogeneous feature of the convolutional neural network branch and the deep heterogeneous feature of the vision transformer branch includes:
and transforming the three-dimensional shallow heterogeneous features of the convolutional neural network branch into two dimensions through a 1×1 convolution, carrying out a global average pooling operation on the shallow heterogeneous features of the convolutional neural network branch to retain the focal features, and flowing the focal features into the vision transformer branch.
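This CNN-to-ViT interaction can be sketched as below. Treating a 1×1 convolution as a per-pixel matrix multiply is standard; the function name and the concrete sizes are illustrative assumptions.

```python
import numpy as np

def cnn_to_vit(feat, W1x1):
    """Sketch of the CNN -> ViT interaction: a 1x1 convolution over an
    H x W x C feature map is a per-pixel linear map that aligns the channel
    dimension with the transformer width D, and global average pooling
    condenses the map into a single focal feature vector."""
    H, W, C = feat.shape
    mapped = feat.reshape(H * W, C) @ W1x1   # 1x1 conv == per-pixel linear map
    pooled = mapped.mean(axis=0)             # global average pooling -> (D,)
    return pooled

rng = np.random.default_rng(0)
shallow = rng.standard_normal((16, 8, 64))   # shallow CNN feature map (H, W, C)
W1x1 = rng.standard_normal((64, 128))        # 1x1 conv kernel, C -> D
token = cnn_to_vit(shallow, W1x1)            # (128,) vector handed to the ViT branch
```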
Optionally, the fusing the shallow heterogeneous feature of the convolutional neural network branch and the deep heterogeneous feature of the vision transformer branch further includes:
carrying out dimension alignment on the deep heterogeneous features of the vision transformer branch through a 1×1 convolution to obtain three-dimensional deep heterogeneous features, normalizing the three-dimensional deep heterogeneous features, carrying out feature-resolution alignment based on interpolation to obtain the features to be exchanged, and flowing the features to be exchanged into the convolutional neural network branch;
and splicing the global feature vectors obtained by the convolutional neural network branch and the visual transformer branch to obtain the pedestrian re-identification feature vector.
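A minimal sketch of this reverse interaction and of the final splicing step, assuming nearest-neighbour upsampling in place of the unspecified interpolation scheme; the names, grid and vector sizes are illustrative.

```python
import numpy as np

def vit_to_cnn(tokens, grid_hw, W1x1, target_hw):
    """Sketch of the ViT -> CNN interaction: drop the class token, reshape the
    patch tokens back to a small feature map, align channels with a 1x1 conv,
    normalize, and upsample (nearest-neighbour here, standing in for
    interpolation) to the CNN branch's feature resolution."""
    h, w = grid_hw
    patches = tokens[1:]                                  # discard the class token
    fmap = patches.reshape(h, w, -1) @ W1x1               # (h, w, C_cnn)
    fmap = (fmap - fmap.mean()) / (fmap.std() + 1e-6)     # normalization
    H, W = target_hw
    fmap = fmap.repeat(H // h, axis=0).repeat(W // w, axis=1)
    return fmap

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1 + 4 * 2, 128))    # class token + 4x2 patch grid
W1x1 = rng.standard_normal((128, 64))             # D -> C_cnn channel alignment
aligned = vit_to_cnn(tokens, (4, 2), W1x1, (16, 8))

# final descriptor: splice the two branches' global feature vectors
f_cnn, f_vit = rng.standard_normal(768), rng.standard_normal(768)
descriptor = np.concatenate([f_cnn, f_vit])       # spliced re-ID feature vector
```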
Optionally, the calculating the loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stops training based on the loss value includes:
setting a first classifier for calculating a branch loss function of the convolutional neural network, and setting a second classifier for calculating a branch loss function of the visual transformer;
and determining that the pedestrian re-recognition initial model converges and stopping training based on the sum of the first loss function calculated by the first classifier and the second loss function obtained by the second classifier.
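The dual-loss objective can be sketched as the sum of the two branch losses. The patent does not fix the per-branch loss function, so a softmax cross-entropy identity loss is assumed here; the function names and toy logits are illustrative.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Numerically stable softmax cross-entropy over a batch of logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def dual_loss(cnn_logits, vit_logits, labels):
    """Each branch has its own classifier head; the training objective is the
    sum of the two branch losses, constraining both branches' features."""
    return cross_entropy(cnn_logits, labels) + cross_entropy(vit_logits, labels)

labels = np.array([0, 1])
cnn_logits = np.array([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]])  # first classifier output
vit_logits = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])  # second classifier output
loss = dual_loss(cnn_logits, vit_logits, labels)
```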
Optionally, the re-identifying the target pedestrian image based on the pedestrian re-identification model includes:
performing similarity measurement on the characteristics of the target pedestrian image and the plurality of candidate pedestrian images based on the pedestrian re-recognition model to obtain a recognition distance matrix;
the candidate pedestrian whose image corresponds to the minimum value in the recognition distance matrix is determined to be the target pedestrian.
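A sketch of this retrieval step, assuming Euclidean distance as the similarity metric (the patent does not name the metric); the function and toy feature vectors are illustrative.

```python
import numpy as np

def reidentify(query, gallery):
    """Measure the distance between the query feature and every candidate
    feature, then return the index of the closest candidate."""
    dists = np.linalg.norm(gallery - query, axis=1)  # recognition distances
    return int(dists.argmin()), dists

query = np.array([1.0, 0.0])                      # target pedestrian feature
gallery = np.array([[0.0, 1.0],                   # candidate pedestrian features
                    [0.9, 0.1],
                    [-1.0, 0.0]])
best, dists = reidentify(query, gallery)          # best is the matched candidate
```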
Optionally, after calculating the loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stops training based on the loss value, and obtaining the pedestrian re-recognition model, the method further includes:
testing the pedestrian re-identification model to obtain an evaluation index;
and carrying out a comparison experiment on the network structure parameters of the pedestrian re-recognition model based on the evaluation index, and determining target network structure parameters so as to optimize the pedestrian re-recognition model based on the target network structure parameters.
The embodiment of the invention also provides a pedestrian re-identification device based on heterogeneous network feature interaction, which comprises:
the model construction module is used for designing a pedestrian re-identification initial model based on the heterogeneous network characteristics of the convolutional neural network and the visual transformer;
the calculation module is used for calculating the loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stopping training based on the loss value, and obtaining a pedestrian re-recognition model;
and the re-recognition module is used for re-recognizing the target pedestrian image based on the pedestrian re-recognition model.
The embodiment of the invention also provides pedestrian re-recognition equipment based on the heterogeneous network characteristic interaction, which comprises a memory, a processor and a pedestrian re-recognition program based on the heterogeneous network characteristic interaction stored in the memory, wherein the pedestrian re-recognition program based on the heterogeneous network characteristic interaction realizes the steps of the pedestrian re-recognition method based on the heterogeneous network characteristic interaction when being run by the processor.
Compared with the prior art, the pedestrian re-recognition method, device, equipment and storage medium based on heterogeneous network feature interaction provided by the invention design a pedestrian re-recognition initial model based on the heterogeneous network features of a convolutional neural network and a vision transformer; calculate a loss value of the initial model based on a dual loss, determine from the loss value that the initial model has converged, stop training, and obtain a pedestrian re-recognition model; and re-recognize the target pedestrian image based on that model. Because the pedestrian re-recognition model is built on the heterogeneous network features of a convolutional neural network and a vision transformer, and shallow features are fused with deep features, both the basic features and the global features of the image can be exploited, a large number of image features can be obtained, and the recognition result is more accurate.
Drawings
FIG. 1 is a schematic hardware architecture of a pedestrian re-recognition device based on heterogeneous network feature interaction according to embodiments of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of a pedestrian re-recognition method based on heterogeneous network feature interaction of the present invention;
FIG. 3 is a flow chart of a second embodiment of the pedestrian re-recognition method based on heterogeneous network feature interaction of the present invention;
FIG. 4 is a schematic diagram of a pedestrian re-recognition model involved in the pedestrian re-recognition method based on heterogeneous network feature interaction of the invention;
FIG. 5 is a schematic diagram of feature interactions involved in the pedestrian re-recognition method based on heterogeneous network feature interactions of the present invention;
FIG. 6 is a flow chart of a third embodiment of a pedestrian re-recognition method based on heterogeneous network feature interaction of the present invention;
FIG. 7 is a flow chart of a fourth embodiment of a pedestrian re-recognition method based on heterogeneous network feature interaction of the present invention;
FIG. 8 is a flowchart of a fifth embodiment of a pedestrian re-recognition method based on heterogeneous network feature interaction of the present invention;
fig. 9 is a schematic functional block diagram of a first embodiment of the pedestrian re-recognition device based on heterogeneous network feature interaction according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The pedestrian re-identification device based on heterogeneous network feature interaction involved in the embodiments of the present invention refers to equipment capable of network connection; it may be a server, a cloud platform, or the like.
Referring to fig. 1, fig. 1 is a schematic hardware structure of a pedestrian re-recognition device based on heterogeneous network feature interaction according to embodiments of the present invention. In an embodiment of the present invention, the pedestrian re-recognition device based on heterogeneous network feature interaction may include a processor 1001 (e.g., a Central Processing Unit, CPU), a communication bus 1002, an input port 1003, an output port 1004, and a memory 1005. The communication bus 1002 enables communication among these components; the input port 1003 is used for data input; the output port 1004 is used for data output; and the memory 1005 may be a high-speed RAM memory or a stable non-volatile memory, such as a disk memory, and may optionally be a storage device independent of the processor 1001. Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 does not limit the invention, which may include more or fewer components than shown, combine certain components, or arrange components differently.
With continued reference to FIG. 1, the memory 1005 of FIG. 1, which is a readable storage medium, may include an operating system, a network communication module, an application module, and a pedestrian re-recognition program based on heterogeneous network feature interactions. In fig. 1, the network communication module is mainly used for connecting with a server and performing data communication with the server; and the processor 1001 is configured to invoke the pedestrian re-recognition program based on heterogeneous network feature interaction stored in the memory 1005, and perform the following operations:
designing a pedestrian re-identification initial model based on heterogeneous network characteristics of a convolutional neural network and a visual transformer;
calculating a loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stopping training based on the loss value, and obtaining a pedestrian re-recognition model;
and re-identifying the target pedestrian image based on the pedestrian re-identification model.
The pedestrian re-recognition device based on the heterogeneous network feature interaction provides a first embodiment of the pedestrian re-recognition method based on the heterogeneous network feature interaction. Referring to fig. 2, fig. 2 is a flowchart of a first embodiment of a pedestrian re-recognition method based on heterogeneous network feature interaction according to the present invention.
As shown in fig. 2, a first embodiment of the present invention proposes a pedestrian re-recognition method based on heterogeneous network feature interaction, where the method is applied to a pedestrian re-recognition device based on heterogeneous network feature interaction, and the method includes:
step S101, designing a pedestrian re-identification initial model based on heterogeneous network characteristics of a convolutional neural network and a visual transformer;
deep learning represented by convolutional neural networks (Convolutional Neural Network, CNN) has been largely successful in the field of computer vision. The shallow layer of the convolutional neural network is good at extracting basic features in the image, such as the edge contour and color features of pedestrians. With the stacking of network layers, the deep layers of the network gradually extract abstract semantic information. Shallow feature maps are critical to the generation of deep feature maps, which rely on shallow feature maps. If some fine granularity information is ignored in the shallow network, the characteristics of the last layer of the network can not fully express pedestrian characteristics, and the model is easy to encounter bottlenecks. Such network models tend to focus only on the portion of the image that contributes most to recognition performance, without considering all the information of pedestrians; importantly, the ignored portions often also have identification value.
The vision transformer (Vision Transformer, ViT) achieves good performance on a variety of vision tasks; by stacking self-attention modules it can build a flexible, dynamic receptive field and focus attention on a target area. Multi-head self-attention can adaptively attend to various kinds of information in different regions from a global perspective, providing richer feature information for the re-identification task. However, the Vision Transformer is prone to losing detail when partitioning the image into patches and cannot capture fine-grained information, which may limit its ability to discriminate pedestrians with similar attributes. In addition, the Vision Transformer lacks inductive biases such as locality and translation equivariance.
Based on the respective advantages and disadvantages of CNN and ViT, the pedestrian re-recognition initial model is designed from the heterogeneous network features of a convolutional neural network and a vision transformer. The initial model comprises a convolutional neural network branch and a vision transformer branch; the two branches are interconnected yet run independently, each extracting the relevant features of the image, which are then fused. Specifically, the training image set is input into the convolutional neural network module and the vision transformer network module respectively, the relevant features are extracted, and training is performed with preset initial parameters to obtain the pedestrian re-recognition initial model.
Step S102, calculating a loss value of the pedestrian re-recognition initial model based on double loss, determining that the pedestrian re-recognition initial model converges and stopping training based on the loss value, and obtaining a pedestrian re-recognition model;
the present embodiment determines whether the model converges based on the loss function. Because the embodiment comprises the convolutional neural network branch and the visual transformer branch, and the characteristic preference extracted by each branch is different, the invention respectively constrains the characteristics of the two branches in order to ensure the training effect of the heterogeneous network.
Specifically, the loss functions of the convolutional neural network branch and the visual transformer branch are calculated respectively, the final loss function is the sum of the loss functions of the two branches, when the final loss function is minimum, model convergence is determined to stop training, corresponding model parameters are stored, and a pedestrian re-identification model is obtained.
And step S103, re-identifying the target pedestrian image based on the pedestrian re-identification model.
After the pedestrian re-recognition model is obtained, the target pedestrian image and the candidate pedestrian image are input into the pedestrian re-recognition model, the feature extraction is carried out by the pedestrian re-recognition model, and then the candidate pedestrian image closest to the target pedestrian feature is determined from the candidate pedestrian images, namely, the pedestrian re-recognition result is obtained.
Through the scheme, the pedestrian re-recognition initial model is designed based on the heterogeneous network characteristics of the convolutional neural network and the visual transformer; calculating a loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stopping training based on the loss value, and obtaining a pedestrian re-recognition model; and re-identifying the target pedestrian image based on the pedestrian re-identification model. The pedestrian re-recognition model is built based on the heterogeneous network characteristics of the convolutional neural network and the visual transformer, and the shallow characteristic characteristics and the deep characteristic characteristics are fused, so that the basic characteristics of the image can be utilized, the global characteristics of the image can be utilized, a large number of image characteristics can be obtained, and the recognition result is more accurate.
As shown in fig. 3, a second embodiment of the present invention provides a pedestrian re-recognition method based on heterogeneous network feature interaction. Based on the first embodiment shown in fig. 2, the step of designing a pedestrian re-recognition initial model based on the heterogeneous network features of the convolutional neural network and the vision transformer includes:
step S1011, constructing a convolution neural network branch of the pedestrian re-recognition initial model; and
the CNN branch adopts a characteristic pyramid structure to give a pedestrian imageWhere W, H and C represent the width, height and number of channels of the image, respectively. Feature resolution decreases with increasing network depth while the number of channels C increases.
Dividing the branch of the convolutional neural network into three stages, and respectively representing the three stages as a first Stage 1 Stage two 2 And third Stage 3 As shown in fig. 4, fig. 4 is a schematic diagram of a pedestrian re-recognition model related to the pedestrian re-recognition method based on heterogeneous network feature interaction. Each Stage is composed of a different number of residual blocks, stage in this embodiment 1 To Stage 3 The residual blocks of (1) are set to 7, 8, respectively, in this order. Except Stage 1 The first residual block in (a), the latter residual blocks are all present in pairs. The feature resolution drops 2-fold from stage to stage while the channel number rises 2-fold from stage to stage. The outputs of the same stage residual blocks have the same feature resolution. Stage branching on CNN 3 The feature map output in the stage is subjected to convolution, normalization and global average pooling operation to generate a group of 768-dimensional global feature vectors.
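The halving/doubling rule can be illustrated with a small helper. The input size 64×32×192 below is an assumption, chosen only so that the third stage ends at 768 channels, matching the 768-dimensional global vector described above.

```python
def stage_shapes(H, W, C, n_stages=3):
    """Per-stage feature shapes in the CNN branch: resolution halves and the
    channel count doubles from one stage to the next (sizes illustrative)."""
    shapes = []
    for s in range(n_stages):
        shapes.append((H >> s, W >> s, C << s))  # (H/2^s, W/2^s, C*2^s)
    return shapes

shapes = stage_shapes(64, 32, 192)
```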
Step S1012, constructing a visual transformer branch of the pedestrian re-identification initial model;
Specifically, an input pedestrian image is represented as an image block sequence including a plurality of image blocks.
The vision transformer branch consists of L Transformer modules, each identical in structure. The input pedestrian image is represented as a sequence of N image blocks {x_i | i = 1, 2, …, N}, with each block x_i ∈ ℝ^{p×p×C}, where p represents the spatial size of each block.
Performing linear mapping on the image block sequence to obtain a plurality of D-dimensional embedded representations of the image blocks;
In this embodiment, a learnable embedding matrix E ∈ ℝ^{(p²·C)×D} is used to project each image block into a D-dimensional embedded representation.
Concatenating a class token with a plurality of said D-dimensional embedded representations and adding a position code and a camera code for each of said image blocks to produce a sequence of embedded image blocks;
specifically, a learnable class token x_cls serves as the discriminative representation and is concatenated with the D-dimensional embedded representations of the N image blocks. Spatial information and camera information are additionally learned by adding a position embedding (Position Embedding, PE) and a camera embedding (Camera Embedding, CE) to each image block, forming the final embedded image block sequence z_0:

z_0 = [x_cls; E x_1; E x_2; …; E x_N] + PE + CE    (1)

where, in formula (1), E is the learnable embedding matrix, PE the position embedding and CE the camera embedding.
And sequentially processing the embedded image block sequence through normalization, a multi-head attention mechanism and a multi-layer perceptron to obtain the visual transformer branch.
The embedded image block sequence z_0 is fed sequentially into the L Transformer modules, each of which processes its input through layer normalization (LayerNorm, LN), a multi-head attention mechanism (MSA) and a multi-layer perceptron (Multilayer Perceptron, MLP), as shown in equations (2) and (3). The input and output of the multi-head attention layer are connected by a residual connection, followed by normalization. The MLP has two layers, with the GELU activation function.

z′_l = MSA(LN(z_{l-1})) + z_{l-1},  l = 1, …, L    (2)

z_l = MLP(LN(z′_l)) + z′_l,  l = 1, …, L    (3)
Multi-headed attention can be expressed as:
MSA(z) = Concat(H_1, H_2, …, H_h) W_O    (4)

In equation (4), W_O is a learnable output projection matrix, h is the number of heads, and Concat(·) denotes stacking over the embedded-representation dimension of the image blocks. Each head H_i (i = 1, 2, …, h) can be expressed as:
H_i = Attention(Q, K, V)    (5)

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (6)

In equation (5), Q, K and V are obtained from the input z through three linear mappings with learnable matrices W_Q, W_K and W_V. These are three entirely different matrices that project the input into different spaces, so the expressive power is much higher. Attention(·) denotes the attention operation, a function used to calculate the relevance and importance of image blocks. In equation (6), d_k = d_v = D/h, and the scaling factor 1/√d_k is a normalization that ensures numerical stability.
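The multi-head computation of equations (4)–(6) can be sketched in NumPy as follows. This is an illustrative implementation, not the patent's code: the weights are random stand-ins and the per-head projections are taken as slices of single Q/K/V projections, a common simplification.

```python
import numpy as np

def softmax(a):
    a = a - a.max(-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

def msa(z, Wq, Wk, Wv, Wo, h):
    """Multi-head self-attention per equations (4)-(6)."""
    N, D = z.shape
    d_k = D // h
    Q, K, V = z @ Wq, z @ Wk, z @ Wv                      # three different linear maps
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)
        A = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_k))  # equation (6)
        heads.append(A @ V[:, sl])                         # H_i, equation (5)
    return np.concatenate(heads, axis=-1) @ Wo             # Concat + W_O, equation (4)

rng = np.random.default_rng(2)
N, D, h = 6, 12, 4                                        # so d_k = D / h = 3
z = rng.normal(size=(N, D))
Wq, Wk, Wv, Wo = [rng.normal(size=(D, D)) * 0.1 for _ in range(4)]
out = msa(z, Wq, Wk, Wv, Wo, h)
```

Each head attends in its own d_k-dimensional subspace; concatenating the heads restores the full embedding dimension D.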
Step S1013, fusing the shallow heterogeneous characteristic of the convolutional neural network branch and the deep heterogeneous characteristic of the vision transformer branch to obtain the pedestrian re-recognition initial model.
Features generated by the CNN branch and the ViT branch are heterogeneous, both in feature dimension and semantically. To fuse the two heterogeneous features, the present invention does not fuse every layer; instead, the heterogeneous feature fusion module shown in fig. 4 is applied only to the shallow features (Stage_1) and the deep features (Stage_3).
Specifically, three-dimensional shallow heterogeneous characteristics of a convolutional neural network branch are transformed into two dimensions through convolution of 1×1, global average pooling operation is carried out on the shallow heterogeneous characteristics of the convolutional neural network branch, focal characteristics are reserved, and the focal characteristics are flowed into the vision transformer branch;
the features extracted by the CNN branch are dense and three-dimensional, whereas the embedded representations of the ViT branch are two-dimensional. When CNN branch features flow into the ViT branch, the channel dimension is first transformed with a 1 × 1 convolution to obtain features of the same dimension. The significance of the middle-layer features of the convolutional neural network along the channel dimension can be understood as the response of the input image to different patterns. Because the learning strategies differ, there is a semantic gap between the features extracted by the ViT and CNN branches: ViT attends globally from the very first layer. A global average pooling operation therefore retains the focal features of the CNN branch that best express the image; after layer normalization, these are merged into the ViT branch to supplement the fine-grained features of key regions.
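A minimal NumPy sketch of this shallow CNN→ViT flow is given below, under stated assumptions: the 1 × 1 convolution is expressed as a per-pixel channel mixing, and the pooled focal feature is appended to the ViT token sequence as one extra token (the patent does not fix the exact merge operation, so this is illustrative).

```python
import numpy as np

def cnn_to_vit(feat, W1x1, eps=1e-6):
    """feat: CNN feature map (C, H, W). A 1x1 conv mixes channels per pixel,
    global average pooling keeps one focal token, and layer normalization
    prepares it to join the ViT token sequence."""
    mixed = np.einsum('chw,cd->dhw', feat, W1x1)     # 1x1 convolution: C -> D channels
    token = mixed.mean(axis=(1, 2))                  # global average pooling -> (D,)
    token = (token - token.mean()) / np.sqrt(token.var() + eps)  # layer norm
    return token

rng = np.random.default_rng(3)
feat = rng.normal(size=(64, 64, 32))    # Stage_1-like shallow feature map (assumed sizes)
W = rng.normal(size=(64, 768)) * 0.05   # 1x1 conv weights, C_in -> D (random stand-in)
tok = cnn_to_vit(feat, W)
vit_seq = rng.normal(size=(129, 768))   # class token + 128 patch tokens
fused = np.vstack([vit_seq, tok])       # focal feature flows into the ViT branch
```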
Carrying out dimension alignment on deep heterogeneous features of the visual transformer branch through 1 multiplied by 1 convolution to obtain three-dimensional deep heterogeneous features, carrying out normalization processing on the three-dimensional deep heterogeneous features, carrying out feature resolution alignment on the basis of interpolation to obtain features to be exchanged, and flowing the features to be exchanged into the convolution neural network branch;
when the representation in the ViT branch is fed back to the CNN branch, channel alignment is first achieved with a 1 × 1 convolution; after batch normalization, the feature spatial resolution is aligned with the CNN features by interpolation, and the result is added to the feature map of the CNN branch. Through the feature interaction modules, the ViT representations and the CNN features interact so that, on the basis of obtaining the complete features of the target pedestrian, the high-response features of the target pedestrian are located more accurately; the final features thus carry more information and generalize better. Through the shallow and deep feature coupling modules, diverse discriminative pedestrian features are aggregated, enriching the final pedestrian feature representation.
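The reverse, deep ViT→CNN flow can be sketched in NumPy as follows. This is an illustrative sketch: the 1 × 1 convolution is a matrix multiply over the token dimension, and nearest-neighbour indexing stands in for the interpolation (the patent does not specify the interpolation mode); all weights and sizes are assumptions.

```python
import numpy as np

def vit_to_cnn(tokens, grid_hw, W1x1, cnn_map, eps=1e-5):
    """tokens: (N, D) patch tokens (class token removed). Channel-align with
    a 1x1 conv, batch-normalize, reshape to the token grid, interpolate up to
    the CNN resolution, then add to the CNN feature map."""
    h, w = grid_hw
    C, Hc, Wc = cnn_map.shape
    x = tokens @ W1x1                               # (N, C): 1x1 conv channel alignment
    x = (x - x.mean(0)) / np.sqrt(x.var(0) + eps)   # batch normalization over tokens
    x = x.T.reshape(C, h, w)                        # back to a 3-D feature map
    rows = np.arange(Hc) * h // Hc                  # nearest-neighbour interpolation
    cols = np.arange(Wc) * w // Wc
    x_up = x[:, rows][:, :, cols]                   # (C, Hc, Wc)
    return cnn_map + x_up                           # added into the CNN feature map

rng = np.random.default_rng(4)
tokens = rng.normal(size=(128, 768))      # ViT patch tokens on a 16 x 8 grid
W = rng.normal(size=(768, 256)) * 0.02    # 1x1 conv: D -> C (random stand-in)
cnn_map = rng.normal(size=(256, 32, 16))  # Stage-like CNN feature map (assumed sizes)
out = vit_to_cnn(tokens, (16, 8), W, cnn_map)
```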
Fig. 5 is a schematic diagram of feature interaction related to the pedestrian re-recognition method based on heterogeneous network feature interaction. Fig. 5 shows the flow of feature interactions in its entirety: the shallow heterogeneous characteristics of the CNN branch are converged into a ViT branch through a shallow characteristic interaction module; and the deep heterogeneous features of the ViT branch are converged into the CNN branch through a deep feature interaction module.
The global feature vectors obtained by the convolutional neural network branch and the visual transformer branch are spliced to obtain the pedestrian re-identification feature vector; since each branch contributes a 768-dimensional vector, a 1536-dimensional pedestrian re-recognition feature vector is finally obtained.
According to the scheme, the pedestrian re-recognition initial model is designed based on the heterogeneous network characteristics of the convolutional neural network and the visual transformer, so that characteristic interaction is realized, more pedestrian characteristic representations are obtained, and the pedestrian re-recognition accuracy is improved.
Fig. 6 is a schematic flow chart of a third embodiment of the pedestrian re-recognition method based on heterogeneous network feature interaction, and as shown in fig. 6, the third embodiment of the invention provides a pedestrian re-recognition method based on heterogeneous network feature interaction, wherein training the initial model of pedestrian re-recognition based on double loss includes:
step S1021, setting a first classifier for calculating a branch loss function of the convolutional neural network, and setting a second classifier for calculating a branch loss function of the visual transformer;
in the training phase, a classifier is set for each branch individually, i.e. the classifier parameters are not shared. The losses calculated by the two classifiers are summed to jointly optimize the entire network.
Step S1022, determining that the pedestrian re-recognition initial model converges and stopping training based on the sum of the first loss function calculated by the first classifier and the second loss function obtained by the second classifier.
For each branch feature, the training loss function is composed of two parts, as shown in equations (7) and (8):

L_ID = −(1/N) Σ_{i=1}^{N} y_i log(p_i)    (7)

L_tri = log[1 + exp(d_pos − d_neg)]    (8)

where L_ID and L_tri denote the cross-entropy loss function and the triplet loss function respectively, y_i denotes the label vector, p_i denotes the probability value output by the fully connected layer of the network, N is the number of images per batch input to the network, and d_pos and d_neg denote the distances of the positive and negative sample pairs respectively.
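The two per-branch losses of equations (7) and (8) can be sketched in NumPy as below; this is a hedged illustration (the small epsilon for log stability and the one-hot label convention are assumptions, not from the patent).

```python
import numpy as np

def id_loss(p, y):
    """Cross-entropy ID loss, equation (7): p holds softmax outputs (N, K),
    y the one-hot label vectors (N, K); averaged over the batch."""
    return float(-(y * np.log(p + 1e-12)).sum(axis=1).mean())

def triplet_loss(d_pos, d_neg):
    """Soft-margin triplet loss, equation (8)."""
    return float(np.log1p(np.exp(d_pos - d_neg)))

y = np.array([[1.0, 0.0], [0.0, 1.0]])  # two samples, two identities
p_good = y                               # perfect predictions -> near-zero ID loss
```

Note that with d_pos = d_neg the triplet loss is log 2 rather than 0, so the loss keeps pushing positive pairs closer than negative pairs even at equality.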
The final loss function is obtained by summing the loss functions of the two branches, as shown in equation (9):

L = L^(1) + λ L^(2),  with  L^(k) = L_ID^(k) + α L_tri^(k)    (9)

where α is used to trade off the two losses within a branch and λ is used to trade off the two branch losses. Here α is set to 1 and, according to experiments, λ is set to 0.5. The network model is optimized by minimizing the combined loss function: when equation (9) reaches its minimum, the pedestrian re-recognition model is determined to have converged and is considered to give the best recognition effect, so training of the model can be stopped.
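The weighting described above can be expressed as a one-line combination. The patent does not state which branch carries the λ weight; this sketch arbitrarily places it on the second branch, so treat the assignment as an assumption.

```python
def combined_loss(losses_branch1, losses_branch2, alpha=1.0, lam=0.5):
    """Per-branch loss L_ID + alpha * L_tri; branches combined with weight
    lambda (alpha = 1 and lambda = 0.5 in the text)."""
    id1, tri1 = losses_branch1
    id2, tri2 = losses_branch2
    return (id1 + alpha * tri1) + lam * (id2 + alpha * tri2)
```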
According to the embodiment, through the scheme, model convergence is determined based on the sum of loss functions of the convolutional neural network branches and the vision transformer branches, and the final pedestrian re-recognition effect is ensured.
As shown in fig. 7, a fourth embodiment of the present invention proposes a pedestrian re-recognition method based on heterogeneous network feature interaction, where the re-recognition of a target pedestrian image based on the pedestrian re-recognition model includes:
step S1031: performing similarity measurement on the characteristics of the target pedestrian image and the plurality of candidate pedestrian images based on the pedestrian re-recognition model to obtain a recognition distance matrix;
and extracting the characteristics of the target pedestrian image and the candidate pedestrian images in the candidate set by using the trained pedestrian re-recognition model to carry out similarity measurement, and obtaining a recognition distance matrix by carrying out similarity comparison.
Step S1032: the candidate pedestrian in the candidate pedestrian image corresponding to the minimum recognition distance matrix is determined as the target pedestrian.
The candidate pedestrian image with the smallest distance to the target pedestrian is determined to be the most likely candidate. The Euclidean distance between two samples in the embedding space is typically used as the similarity; it is the simplest and most easily understood distance.
Let x_p and x_g denote the target pedestrian image (probe) and a candidate pedestrian image in the candidate set Γ respectively, and let f(x_p) and f(x_g) be their output features, cascaded from the two branches of the TCCNet network. The Euclidean distance D_pg between the features of x_p and x_g is calculated as:

D_pg = ‖f(x_p) − f(x_g)‖_2
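The pairwise distance computation behind the recognition distance matrix can be sketched in NumPy as follows; the toy 2-D features are illustrative, not model outputs.

```python
import numpy as np

def distance_matrix(query_feats, gallery_feats):
    """Pairwise Euclidean distances D_pg between query and gallery features."""
    diff = query_feats[:, None, :] - gallery_feats[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

q = np.array([[0.0, 0.0], [1.0, 1.0]])   # two query features (toy values)
g = np.array([[3.0, 4.0], [1.0, 1.0]])   # two gallery features
D = distance_matrix(q, g)                 # (n_query, n_gallery) distance matrix
best = D.argmin(axis=1)                   # most likely candidate per query
```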
thus, the target pedestrian is locked based on the Euclidean distance, and the pedestrian re-recognition result is obtained.
As shown in fig. 8, a fifth embodiment of the present invention proposes a pedestrian re-recognition method based on heterogeneous network feature interaction, where after calculating a loss value of the initial pedestrian re-recognition model based on double loss, determining that the initial pedestrian re-recognition model converges and stops training based on the loss value, and obtaining a pedestrian re-recognition model, the method further includes:
step S104, testing the pedestrian re-identification model to obtain an evaluation index;
step S105, performing a comparison experiment on the network structure parameters of the pedestrian re-recognition model based on the evaluation index, and determining target network structure parameters for optimizing the pedestrian re-recognition model based on the target network structure parameters.
The proposed method is tested on the large public dataset MSMT17. MSMT17 was captured by 12 outdoor cameras and 3 indoor cameras and contains a total of 126,441 images of 4,101 pedestrians. 32,621 images of 1,041 pedestrians are used as the training set and 93,820 images of 3,060 pedestrians as the test set; 11,659 images are randomly selected as queries and the remaining 82,161 images serve as the gallery (candidate set). It is a large dataset close to real scenes and a challenging one for the pedestrian re-recognition task.
In a pedestrian re-recognition task, a target pedestrian image (query) is generally given during testing; the similarity between it and the candidate images in the candidate set (Γ) is then calculated based on the pedestrian re-recognition model, and the candidate images are ranked by similarity from high to low, so that images closer to the query image come first. To evaluate the performance of pedestrian re-recognition algorithms, the current practice is to compute the corresponding indexes on public datasets and then compare them with other models. The CMC curve (Cumulative Matching Characteristics) and mAP (mean Average Precision) are the two most commonly used evaluation criteria.
In this embodiment, the most commonly used rank-1 and rank-5 points of the CMC curve and the mAP index are selected. rank-k refers to the probability that a correct result appears among the top k (highest-confidence) images in the search results. mAP is effectively an average level over the whole ranking list: a higher mAP indicates that images of the same identity as the query are ranked higher overall, which indicates a better model.
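A small NumPy sketch of these two metrics, computed from a distance matrix and identity labels, is given below; this is one common formulation of rank-k and mAP, not code from the patent.

```python
import numpy as np

def rank_k(dist, q_ids, g_ids, k):
    """Fraction of queries with a correct identity among the k nearest gallery images."""
    order = dist.argsort(axis=1)
    hits = [(g_ids[order[i, :k]] == q_ids[i]).any() for i in range(len(q_ids))]
    return float(np.mean(hits))

def mean_ap(dist, q_ids, g_ids):
    """mAP: average precision of each query's ranking list, averaged over queries."""
    order = dist.argsort(axis=1)
    aps = []
    for i in range(len(q_ids)):
        good = g_ids[order[i]] == q_ids[i]
        if not good.any():
            continue
        ranks = np.where(good)[0] + 1                       # 1-based ranks of correct hits
        precision = np.arange(1, len(ranks) + 1) / ranks    # precision at each hit
        aps.append(precision.mean())
    return float(np.mean(aps))

# toy example: one query, three gallery images, correct hits at ranks 1 and 3
dist = np.array([[0.1, 0.2, 0.3]])
q_ids = np.array([0])
g_ids = np.array([0, 1, 0])
```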
The specific network structure parameters are as follows. The resolution of the input image is set to 256 × 128 and the batch size to 64, comprising 16 different pedestrians with 4 images each. Data enhancement is performed with random horizontal flipping, padding, random cropping, random erasing and normalization strategies. The total number of training epochs is set to 150. The model is optimized using SGD, with the weight decay factor set to 1e-4 and the momentum to 0.9. The learning rate is preheated with a warmup strategy, the initial learning rate is 0.009, and cosine decay is used to keep the loss decreasing stably. The optimal target network structure parameters are determined from the experimental results.
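The warmup-then-cosine schedule described above can be sketched as follows. The base learning rate (0.009) and epoch count (150) come from the text; the warmup length of 10 epochs is an assumed value, since the patent does not state it.

```python
import numpy as np

def learning_rate(epoch, base_lr=0.009, warmup_epochs=10, total_epochs=150):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs        # warmup phase
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + np.cos(np.pi * t))          # cosine decay phase
```

Warmup avoids large, destabilizing updates while the randomly initialized heads settle; the cosine tail lets the loss descend smoothly late in training.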
Based on the above evaluation indexes and experimental details, tests were performed on the MSMT17 dataset, giving the experimental results shown in Table 1. As the results in Table 1 show, mAP reaches 56.7% using ResNet50 alone as the baseline network and 61.8% using Vision Transformer alone as the baseline network. Combining the two baselines into a parallel network and adding the shallow and deep feature interaction modules raises mAP to 66.6%, which indicates that the heterogeneous network structure extracts comprehensive and highly salient features and reduces the influence of background noise on the model, thereby improving model performance. Compared with the two baseline models, the invention improves performance significantly and reaches a competitive level.
TABLE 1 Influence of different modules on the MSMT17 dataset
The recognition results show that the ResNet50-based and Vision Transformer-based baseline networks retrieve more erroneous images than the model provided by the invention. Furthermore, among the images retrieved by the invention, the first few images (especially at rank-5) are more accurate than those of the two baseline networks. This means the proposed pedestrian re-recognition model attends to more effective information in the human-body region and therefore retrieves more correct target images, i.e. the heterogeneous network feature interaction model is more effective.
According to the embodiment, the pedestrian re-recognition model is tested, so that the model is further optimized, and a better pedestrian re-recognition effect is achieved.
Further, to achieve the above objective, the present invention further provides a pedestrian re-recognition device based on heterogeneous network feature interaction, specifically referring to fig. 9, fig. 9 is a schematic functional block diagram of a first embodiment of the pedestrian re-recognition device based on heterogeneous network feature interaction, where the device includes:
the model construction module 10 is used for designing a pedestrian re-identification initial model based on the heterogeneous network characteristics of the convolutional neural network and the visual transformer;
the calculation module 20 is configured to calculate a loss value of the pedestrian re-recognition initial model based on the double loss, determine that the pedestrian re-recognition initial model converges and stops training based on the loss value, and obtain a pedestrian re-recognition model;
and the re-recognition module 30 is used for re-recognizing the target pedestrian image based on the pedestrian re-recognition model.
In addition, the invention further provides a computer readable storage medium, the computer readable storage medium stores a pedestrian re-recognition program based on heterogeneous network feature interaction, and the steps of the pedestrian re-recognition method based on heterogeneous network feature interaction are realized when the pedestrian re-recognition program based on heterogeneous network feature interaction is run by a processor, and are not repeated herein.
Compared with the prior art, the pedestrian re-recognition method, device, equipment and storage medium based on heterogeneous network feature interaction provided by the invention design a pedestrian re-recognition initial model based on the heterogeneous network features of the convolutional neural network and the visual transformer; calculate a loss value of the pedestrian re-recognition initial model based on the double loss, determine from the loss value that the initial model has converged, stop training, and obtain the pedestrian re-recognition model; and re-identify the target pedestrian image based on the pedestrian re-recognition model. Because the pedestrian re-recognition model is built on the heterogeneous network features of the convolutional neural network and the visual transformer and fuses shallow and deep features, both the basic features and the global features of the image are exploited; a large number of image features are obtained, and the recognition result is more accurate.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or modifications in the structures or processes described in the specification and drawings, or the direct or indirect application of the present invention to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. The pedestrian re-identification method based on heterogeneous network feature interaction is characterized by comprising the following steps of:
the method comprises the steps of designing a pedestrian re-recognition initial model based on heterogeneous network characteristics of a convolutional neural network and a visual transformer, namely respectively inputting a training image set into the convolutional neural network module and the visual transformer network module, extracting relevant characteristics and training based on preset initial parameters to obtain the pedestrian re-recognition initial model;
calculating a loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stopping training based on the loss value, and obtaining a pedestrian re-recognition model;
re-identifying the target pedestrian image based on the pedestrian re-identification model;
the step of designing the pedestrian re-identification initial model based on the heterogeneous network characteristics of the convolutional neural network and the visual transformer comprises the following steps of: constructing a convolutional neural network branch of the pedestrian re-recognition initial model; and constructing a visual transformer branch of the pedestrian re-identification initial model; fusing the shallow heterogeneous characteristics of the convolutional neural network branch with the deep heterogeneous characteristics of the visual transformer branch to obtain the pedestrian re-recognition initial model;
the fusing the shallow heterogeneous characteristics of the convolutional neural network branch and the deep heterogeneous characteristics of the vision transformer branch comprises the following steps: transforming the three-dimensional shallow heterogeneous characteristics of the convolutional neural network branch into two dimensions through convolution of 1 multiplied by 1, carrying out global average pooling operation on the shallow heterogeneous characteristics of the convolutional neural network branch to reserve focal characteristics, and flowing the focal characteristics into the vision transformer branch;
the fusing the shallow heterogeneous characteristics of the convolutional neural network branch and the deep heterogeneous characteristics of the vision transformer branch further comprises: carrying out dimension alignment on deep heterogeneous features of the visual transformer branch through 1 multiplied by 1 convolution to obtain three-dimensional deep heterogeneous features, carrying out normalization processing on the three-dimensional deep heterogeneous features, carrying out feature resolution alignment on the basis of interpolation to obtain features to be exchanged, and flowing the features to be exchanged into the convolution neural network branch; splicing the global feature vectors obtained by the convolutional neural network branch and the visual transformer branch to obtain a pedestrian re-identification feature vector;
wherein the calculating the loss value of the initial model for pedestrian re-recognition based on the double loss, and the determining the initial model for pedestrian re-recognition based on the loss value to converge and stop training comprises: setting a first classifier for calculating a branch loss function of the convolutional neural network, and setting a second classifier for calculating a branch loss function of the visual transformer; and determining that the pedestrian re-recognition initial model converges and stopping training based on the sum of the first loss function calculated by the first classifier and the second loss function obtained by the second classifier.
2. The method of claim 1, wherein said constructing a visual transformer branch of the pedestrian re-recognition initial model comprises:
representing an input pedestrian image as an image block sequence including a plurality of image blocks;
performing linear mapping on the image block sequence to obtain a plurality of D-dimensional embedded representations of the image blocks;
concatenating a class token with a plurality of said D-dimensional embedded representations and adding a position code and a camera code for each of said image blocks to produce a sequence of embedded image blocks;
and sequentially processing the embedded image block sequence through normalization, a multi-head attention mechanism and a multi-layer perceptron to obtain the visual transformer branch.
3. The method of claim 1, wherein the re-identifying the target pedestrian image based on the pedestrian re-identification model comprises:
performing similarity measurement on the characteristics of the target pedestrian image and the plurality of candidate pedestrian images based on the pedestrian re-recognition model to obtain a recognition distance matrix;
the candidate pedestrian in the candidate pedestrian image corresponding to the minimum recognition distance matrix is determined as the target pedestrian.
4. The method according to claim 1, further comprising, after the calculating of the loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stops training based on the loss value, obtaining a pedestrian re-recognition model:
testing the pedestrian re-identification model to obtain an evaluation index;
and carrying out a comparison experiment on the network structure parameters of the pedestrian re-recognition model based on the evaluation index, and determining target network structure parameters so as to optimize the pedestrian re-recognition model based on the target network structure parameters.
5. The pedestrian re-recognition device based on heterogeneous network feature interaction is characterized by adopting the pedestrian re-recognition method based on heterogeneous network feature interaction as set forth in any one of claims 1-4, and comprising:
the model construction module is used for designing a pedestrian re-identification initial model based on the heterogeneous network characteristics of the convolutional neural network and the visual transformer;
the calculation module is used for calculating the loss value of the pedestrian re-recognition initial model based on the double loss, determining that the pedestrian re-recognition initial model converges and stopping training based on the loss value, and obtaining a pedestrian re-recognition model;
and the re-recognition module is used for re-recognizing the target pedestrian image based on the pedestrian re-recognition model.
6. A pedestrian re-recognition device based on heterogeneous network feature interactions, comprising a memory, a processor, and a pedestrian re-recognition program based on heterogeneous network feature interactions stored on the memory, wherein the pedestrian re-recognition program based on heterogeneous network feature interactions, when executed by the processor, implements the steps of the pedestrian re-recognition method based on heterogeneous network feature interactions of any one of claims 1-4.
CN202210897792.1A 2022-07-28 2022-07-28 Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction Active CN115393953B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210897792.1A CN115393953B (en) 2022-07-28 2022-07-28 Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction
PCT/CN2022/121269 WO2024021283A1 (en) 2022-07-28 2022-09-26 Person re-identification method, apparatus, and device based on heterogeneous network feature interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210897792.1A CN115393953B (en) 2022-07-28 2022-07-28 Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction

Publications (2)

Publication Number Publication Date
CN115393953A CN115393953A (en) 2022-11-25
CN115393953B true CN115393953B (en) 2023-08-08

Family

ID=84116572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210897792.1A Active CN115393953B (en) 2022-07-28 2022-07-28 Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction

Country Status (2)

Country Link
CN (1) CN115393953B (en)
WO (1) WO2024021283A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881780A (en) * 2020-07-08 2020-11-03 上海蠡图信息科技有限公司 Pedestrian re-identification method based on multi-layer fusion and alignment division
CN114202740A (en) * 2021-12-07 2022-03-18 大连理工大学宁波研究院 Pedestrian re-identification method based on multi-scale feature fusion
CN114299542A (en) * 2021-12-29 2022-04-08 北京航空航天大学 Video pedestrian re-identification method based on multi-scale feature fusion
CN114445641A (en) * 2022-01-29 2022-05-06 新疆爱华盈通信息技术有限公司 Training method, training device and training network of image recognition model

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11562147B2 (en) * 2020-01-23 2023-01-24 Salesforce.Com, Inc. Unified vision and dialogue transformer with BERT
CN113591692A (en) * 2021-07-29 2021-11-02 赢识科技(杭州)有限公司 Multi-view identity recognition method
CN113344003B (en) * 2021-08-05 2021-11-02 北京亮亮视野科技有限公司 Target detection method and device, electronic equipment and storage medium
WO2022104293A1 (en) * 2021-10-26 2022-05-19 Innopeak Technology, Inc. Multi-modal video transformer (mm-vit) for compressed video action recognition
CN114445681A (en) * 2022-01-28 2022-05-06 上海商汤智能科技有限公司 Model training and image recognition method and device, equipment and storage medium
CN114663685B (en) * 2022-02-25 2023-07-04 江南大学 Pedestrian re-recognition model training method, device and equipment
CN114677687A (en) * 2022-04-14 2022-06-28 大连大学 ViT and convolutional neural network fused writing brush font type rapid identification method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881780A (en) * 2020-07-08 2020-11-03 上海蠡图信息科技有限公司 Pedestrian re-identification method based on multi-layer fusion and alignment division
CN114202740A (en) * 2021-12-07 2022-03-18 大连理工大学宁波研究院 Pedestrian re-identification method based on multi-scale feature fusion
CN114299542A (en) * 2021-12-29 2022-04-08 北京航空航天大学 Video pedestrian re-identification method based on multi-scale feature fusion
CN114445641A (en) * 2022-01-29 2022-05-06 新疆爱华盈通信息技术有限公司 Training method, training device and training network of image recognition model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research on Person Re-identification Methods with Multi-type Feature Fusion"; Kuang Cheng et al.; CNKI (China National Knowledge Infrastructure); main text pp. 33-42 *

Also Published As

Publication number Publication date
WO2024021283A1 (en) 2024-02-01
CN115393953A (en) 2022-11-25

Similar Documents

Publication Publication Date Title
JP7210085B2 (en) Point cloud segmentation method, computer program and computer equipment
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN107577990B (en) Large-scale face recognition method based on GPU (graphics processing Unit) accelerated retrieval
Yu et al. Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
CN108596329B (en) Three-dimensional model classification method based on end-to-end deep ensemble learning network
CN110969087B (en) Gait recognition method and system
Murillo et al. Localization in urban environments using a panoramic gist descriptor
CN1669052B (en) Image matching system using 3-dimensional object model, image matching method
CN109753875A (en) Face identification method, device and electronic equipment based on face character perception loss
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN110933518B (en) Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
CN110163117B (en) Pedestrian re-identification method based on self-excitation discriminant feature learning
WO2024021394A1 (en) Person re-identification method and apparatus for fusing global features with ladder-shaped local features
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
Wang et al. A novel multiface recognition method with short training time and lightweight based on ABASNet and H-softmax
JP3998628B2 (en) Pattern recognition apparatus and method
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
CN117333908A (en) Cross-modal pedestrian re-recognition method based on attitude feature alignment
CN113033507A (en) Scene recognition method and device, computer equipment and storage medium
CN115393953B (en) Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction
CN116343135A (en) Feature post-fusion vehicle re-identification method based on pure vision
Ma et al. Deep regression forest with soft-attention for head pose estimation
Zhang et al. Deep meta-relation network for visual few-shot learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant