CN114241278B - Multi-branch pedestrian re-identification method and system - Google Patents

Multi-branch pedestrian re-identification method and system

Info

Publication number
CN114241278B
CN114241278B (application CN202111641939.2A; publication CN114241278A)
Authority
CN
China
Prior art keywords
feature
features
branch
convolution
convolution layer
Prior art date
Legal status
Active
Application number
CN202111641939.2A
Other languages
Chinese (zh)
Other versions
CN114241278A (en)
Inventor
何东之
王鹏飞
孙亚茹
张震
郭隆杭
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202111641939.2A priority Critical patent/CN114241278B/en
Publication of CN114241278A publication Critical patent/CN114241278A/en
Application granted granted Critical
Publication of CN114241278B publication Critical patent/CN114241278B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F18/217 — Validation; performance evaluation; active pattern learning techniques
    • G06F18/22 — Matching criteria, e.g. proximity measures
    • G06F18/25 — Fusion techniques
    • G06F18/253 — Fusion techniques of extracted features
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/048 — Activation functions
    • G06N3/08 — Learning methods


Abstract

The invention discloses a multi-branch pedestrian re-identification method and system in the technical field of machine learning. The method comprises: obtaining a multi-branch network based on a residual neural network, the network comprising a plurality of intermediate convolution layers and an end convolution layer connected in sequence, wherein features extracted by one or more intermediate convolution layers enter branches for downstream operations, and the resulting end features are fused with segmentation features; training a recognition model based on the multi-branch network; performing feature recognition on the pedestrian image and the image to be recognized according to the recognition model; and performing pedestrian re-identification according to the similarity of the features. The features extracted by the intermediate convolution layers carry good foreground spatial information and local pedestrian features, and yield segmentation features after entering the branches, while the end features extracted by the end convolution layer carry rich semantic information. Fusing the segmentation features with the end features gives the final features both good spatial information and rich semantic information, avoiding interference from the background in feature matching.

Description

Multi-branch pedestrian re-identification method and system
Technical Field
The invention relates to the technical field of machine learning, in particular to a multi-branch pedestrian re-identification method and system.
Background
With the rapid development of urban electronic information systems, video surveillance analysis has become one of the main means of new-generation investigation and evidence collection. Pedestrian re-identification is a key technology for video analysis. Given a pedestrian image captured by one camera, it determines, on the one hand, whether the pedestrian will reappear in that camera's monitored area and, on the other hand, whether the pedestrian will appear in areas monitored by other cameras; the pedestrian can then be located from the positions of the cameras that capture them. It is widely regarded as a sub-problem of image retrieval: given a pedestrian image, detect that pedestrian and their motion trajectory across devices. Due to the complexity of real-world scenes, pedestrian re-identification faces many challenges, chiefly large changes in pose and clothing, severe occlusion, and background clutter.
The pedestrian re-identification task can be divided into two processes: feature extraction and metric learning. Feature extraction obtains visual features that are robust and highly discriminative; metric learning computes the similarity of the extracted pedestrian feature vectors under different metrics, so that feature vectors of same-identity samples are closer than those of different-identity samples. To extract more distinguishing features, many approaches focus on learning a combination of global and local feature representations. In recent years, convolutional neural networks (CNNs) have shown strong feature representation capabilities in various deep learning tasks, mainly in the following classes:
One class uses prior knowledge of pose or body landmarks to locate body parts. In this case, however, the performance of the convolutional neural network depends heavily on the robustness of the pose or landmark estimation model; unexpected errors such as pose estimation failures can greatly affect the recognition result.
The other class assumes the images are aligned, divides the original picture or the feature map into stripes, performs metric learning on each stripe, and finally concatenates the stripe features as the final feature representation. Although convolutional-network-based methods have achieved great success, continuous convolution and downsampling operations lose detail information, so the final features of the network carry only blurred global information, which is unfavorable for detail matching. In addition, different backgrounds for the same pedestrian introduce large noise into the identification features and interfere with feature matching.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides a multi-branch pedestrian re-identification method and system: high-level features with rich semantic information and low-level features with good spatial information are extracted through different branches and then fused, so that interference from the background in feature matching is avoided.
The invention discloses a multi-branch pedestrian re-identification method, comprising: obtaining a multi-branch network based on a residual neural network, the network comprising a plurality of intermediate convolution layers and an end convolution layer connected in sequence, wherein features extracted by one or more intermediate convolution layers enter branches for downstream operations. The downstream operations include: pooling and segmenting the features that enter the branches, then performing a point convolution operation to obtain a plurality of segmentation features; performing a pooling operation and then a point convolution operation on the features extracted by the end convolution layer to obtain end features; and fusing the end features with the segmentation features to obtain identification features. The method further comprises: training a recognition model based on the multi-branch network; performing feature recognition on the pedestrian image and the image to be recognized according to the recognition model; and performing pedestrian re-identification according to the similarity of the identification features.
Preferably, the intermediate convolution layers include a third convolution layer conv3 and a fourth convolution layer conv4, the end convolution layer includes a fifth convolution layer conv5, and the branches include a second branch;
the third convolution layer, the fourth convolution layer and the end convolution layer are connected in sequence;
a sixth convolution layer conv6, obtained from the fifth convolution layer by deleting its downsampling operation, serves as the second branch;
the features extracted by the fourth convolution layer are sent to the sixth convolution layer for feature extraction, yielding sixth features;
and after the sixth features are segmented through a pooling operation, a point convolution is performed to obtain a plurality of first segmentation features.
Preferably, the branches further comprise a third branch,
In a third branch, after the third feature extracted by the third convolution layer and the fourth feature extracted by the fourth convolution layer are fused, carrying out pooling operation, segmentation and point convolution operation in sequence to obtain a plurality of second segmentation features;
and fusing the first segmentation feature, the second segmentation feature and the tail end feature.
Preferably, the method for feature fusion between branches comprises:
the fourth feature, after an up-sampling operation or an interpolation operation, is fused with the third feature;
the fused feature, after a pooling operation, is fused with the fourth feature again to obtain a third fusion feature entering the third branch;
the third fusion feature, after a pooling operation, is fused with the feature extracted by the sixth convolution layer to obtain a second fusion feature entering the second branch;
and the second fusion feature, after a pooling operation, is fused with the feature extracted by the end convolution layer to obtain a first fusion feature entering the main branch.
Preferably, the semantic supervision method comprises:
the multi-branch network comprises a first convolution layer, a second convolution layer and a third convolution layer connected in sequence;
the features extracted by the first, second and third convolution layers are feat0, feat1 and feat2, respectively;
after a convolution operation and an up-sampling operation on feat2, it is fused with feat1 to obtain a feature de_feat1;
after a convolution operation and an up-sampling operation on de_feat1, it is fused with feat0 to obtain a feature de_feat0;
after a convolution operation and an up-sampling operation on de_feat0, a feature de_feat consistent with the size of the pedestrian image is obtained;
a foreground mask map of the pedestrian image is obtained according to a segmentation model;
the loss between de_feat and the foreground mask map is calculated using a cross-entropy loss function;
and feature extraction is supervised according to this loss.
Preferably, the convolution block of the convolution operation includes a convolution with a 3×3 kernel, a BN layer, and a ReLU activation function;
in the third branch, the third fusion feature is horizontally segmented through a pooling operation with an 8×8 kernel to obtain a plurality of second segmentation blocks;
after each second segmentation block passes through a 1×1 point convolution layer, a plurality of 256-dimensional second segmentation features are obtained;
the second fusion feature is horizontally segmented through a pooling operation with a 12×8 kernel to obtain a plurality of first segmentation blocks;
after each first segmentation block passes through a 1×1 point convolution layer, a plurality of first segmentation features reduced to 256 dimensions are obtained;
and the first fusion feature passes through a pooling layer with a 12×4 kernel and a 1×1 point convolution layer to obtain a 256-dimensional end feature.
Preferably, the third feature extracted by the third convolution layer and the fourth feature extracted by the fourth convolution layer are measured using a self-attention mechanism;
and orthogonal regularization is used to enhance the representativeness of the features measured by the self-attention mechanism.
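As an illustration only — the patent does not give the regularizer's formula — a common orthogonal-regularization penalty that fits this description is the Frobenius distance of the weight Gram matrix from the identity:

```python
import numpy as np

def ortho_penalty(w):
    # ||W W^T - I||_F^2: zero exactly when the rows of W are
    # mutually orthogonal unit vectors, encouraging the measured
    # features to stay independent and representative
    gram = w @ w.T
    return float(((gram - np.eye(w.shape[0])) ** 2).sum())
```

An orthogonal matrix incurs zero penalty, while redundant (parallel) rows are penalized.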
Preferably, the method for calculating the loss comprises the following steps:
After 24×8 pooling and 1×1 point convolution dimension reduction operation are carried out on the third fusion feature, a third measurement feature fg_L3 is obtained;
After 24×8 pooling and 1×1 convolution dimension reduction operation are carried out on the second fusion feature, a second metric feature fg_L2 is obtained;
after the downstream features of each branch sequentially pass through the batch normalization layer and the full connection layer, calculating classification loss through a Softmax loss function;
Calculating a loss of a metric feature from a triplet loss function, the metric feature comprising: an end feature, a second metric feature, and a third metric feature;
The total loss is calculated as:
L_total = Σ_{i=1}^{m} L_softmax^(i) + Σ_{j=1}^{n} L_triplet^(j) + P_orth
where L_total is the total loss, L_softmax^(i) is the classification loss of the i-th downstream feature, m is the number of downstream features, L_triplet^(j) is the loss of the j-th metric feature, n is the number of metric features, and P_orth is the penalty term of the orthogonal regularization.
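A minimal numerical sketch of this total-loss combination (the Softmax and triplet terms below use standard textbook formulations; the patent does not give concrete implementations):

```python
import numpy as np

def softmax_loss(logits, label):
    # classification (Softmax cross-entropy) loss for one downstream feature
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[label]))

def triplet_loss(anchor, positive, negative, margin=0.3):
    # hinge triplet loss on Euclidean distances between metric features
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_ap - d_an + margin))

def total_loss(softmax_losses, triplet_losses, ortho_penalty):
    # L_total = sum of m classification losses + sum of n triplet losses
    #           + the orthogonal-regularization penalty term
    return sum(softmax_losses) + sum(triplet_losses) + ortho_penalty
```

The margin value 0.3 is illustrative, not taken from the patent.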
The invention also provides a system for realizing the above pedestrian re-identification method, comprising a multi-branch network acquisition module, a training module, a feature recognition module and a re-recognition module;
the multi-branch network acquisition module is used for acquiring a multi-branch network based on a residual neural network;
the training module is used for training an identification model based on the multi-branch network;
the feature recognition module is used for carrying out feature recognition on the pedestrian image and the image to be recognized according to the recognition model;
And the re-recognition module is used for carrying out pedestrian re-recognition according to the similarity of the recognition features.
Preferably, the recognition model comprises a main branch, a second branch, a third branch, a self-attention module, an orthogonalization module and a loss calculation module;
The first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer are sequentially connected and serve as main branches;
the downsampling operation of the fifth convolutional layer is deleted, and as a sixth convolutional layer,
A sixth convolution layer connected to the fourth convolution layer as a second branch;
Fusing the features extracted by the third convolution layer and the fourth convolution layer, and enabling the obtained third fused feature to enter a third branch;
the third fusion feature is in a third branch, and a plurality of second segmentation features are obtained after pooling operation, segmentation and dimension reduction operation are sequentially carried out;
After the third fusion feature is fused with the feature extracted by the sixth convolution layer, the third fusion feature is taken as a second fusion feature to enter a second branch;
the second fusion features sequentially undergo pooling operation, segmentation and dimension reduction operation in a second branch to obtain a plurality of first segmentation features;
The second fusion feature is fused with the feature extracted by the fifth convolution layer in the main branch to obtain a first fusion feature;
The first fusion feature is in the main branch, and the terminal feature is obtained after the pooling operation and the dimension reduction operation are sequentially carried out;
after the tail end features are fused with the first segmentation features and the second segmentation features, recognition features of pedestrian images and images to be recognized are obtained;
The self-attention module is used for: optimizing third and fourth features extracted by the third and fourth convolution layers based on a self-attention mechanism;
The orthogonalization module is used for re-optimizing the optimized characteristics of the self-attention module according to a regularization method;
the loss calculation module is used for calculating the loss of the identification model according to the extracted characteristics of each branch.
Compared with the prior art, the invention has the beneficial effects that: the multi-branch network based on the residual neural network has the advantages that the characteristics extracted by the middle convolution layer have good foreground space information and pedestrian local characteristics, the segmentation characteristics with good space information are obtained after the branches are entered, and the tail end characteristics extracted by the tail end convolution layer have rich semantic information; the segmentation features and the terminal features are fused, so that the final features have good space information and rich semantic information at the same time, and interference caused by the background on feature matching is effectively avoided.
Drawings
FIG. 1 is a flow chart of a multi-branched pedestrian re-recognition method of the present invention;
FIG. 2 is a logic block diagram of the multi-branch network;
FIG. 3 is a logic block diagram of the multi-branch network in embodiment 1;
FIG. 4 is a logic block diagram of the multi-branch network in embodiment 2;
FIG. 5 is a schematic diagram of an attention mechanism;
FIG. 6 is a system logic block diagram of the present invention;
fig. 7 is a schematic diagram of semantic supervision.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention is described in further detail below with reference to the attached drawing figures:
A multi-branched pedestrian re-recognition method, as shown in fig. 1, the method comprising:
step 101: a multi-branch network based on a residual neural network (ResNet) is acquired. Wherein the residual neural network comprises ResNet, resNet, resNet, resNet, 101, or ResNet, 152.
Fig. 2 shows a logical structure of a multi-branch network including: a plurality of middle convolution layers conv2, conv3, conv4 and end convolution layers conv5 connected in sequence;
wherein features extracted by one or more intermediate convolution layers enter the branches for downstream operations;
The downstream operations include: pooling and dividing the features entering the branches, and performing point convolution operation to obtain a plurality of divided features;
Performing pooling operation on the features extracted by the tail end convolution layer, and performing point convolution operation to obtain tail end features;
and fusing the tail end features with the segmentation features to obtain identification features.
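The downstream operations above can be sketched numerically. The stripe counts and dimensions here are illustrative, and concatenation stands in for the fusion step, which the patent does not pin down:

```python
import numpy as np

def point_conv(vec, w):
    # 1x1 point convolution on a pooled (C,) vector = linear map to C_out dims
    return w @ vec

def branch_downstream(feat, w, parts=3):
    # branch downstream operation: split the (C, H, W) map into horizontal
    # stripes, average-pool each stripe, reduce each to a unified dimension
    stripes = np.array_split(feat, parts, axis=1)
    return [point_conv(s.mean(axis=(1, 2)), w) for s in stripes]

def end_downstream(feat, w):
    # main-branch downstream operation: global pooling, then point convolution
    return point_conv(feat.mean(axis=(1, 2)), w)

def fuse(end_feat, seg_feats):
    # identification feature = end feature fused with the segmentation features
    return np.concatenate([end_feat] + seg_feats)
```

With a (C, H, W) feature map and a (C_out, C) weight, each stripe and the end feature come out at the unified C_out dimension before fusion.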
Step 102: based on the multi-branch network, a recognition model is trained. Constructing a training set and training on it to obtain the recognition model follows standard practice and is not described further here.
Step 103: and respectively carrying out feature recognition on the pedestrian image and the image to be recognized according to the recognition model to obtain recognition features.
Step 104: and carrying out pedestrian re-recognition according to the similarity of the recognition features.
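Similarity-based re-identification in step 104 can be sketched with cosine similarity over identification features; the patent does not fix a particular metric, so this choice is illustrative:

```python
import numpy as np

def cosine_similarity(a, b):
    # similarity of two identification feature vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def re_identify(query_feat, gallery_feats):
    # rank gallery images by descending similarity to the query pedestrian
    sims = [cosine_similarity(query_feat, g) for g in gallery_feats]
    return sorted(range(len(gallery_feats)), key=lambda i: sims[i], reverse=True)
```

The top-ranked gallery entries are the candidate re-appearances of the query pedestrian across cameras.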
The multi-branch network based on the residual neural network has the advantages that low-level features extracted by the middle convolution layer have good foreground space information and pedestrian local features, segmentation features with good space information are obtained after branches are entered, and terminal features extracted by the terminal convolution layer have rich semantic information; the segmentation features and the terminal features are fused, so that the final features have good space information and rich semantic information at the same time, and interference caused by the background on feature matching is effectively avoided.
In fig. 2, the convolution layer includes: the first convolution layer conv1, the second convolution layer conv2, the third convolution layer conv3 and the fourth convolution layer conv4 are sequentially connected, and the tail convolution layer is a fifth convolution layer conv5.
Example 1
As shown in fig. 3, the features extracted by conv4 enter the second branch for a second downstream operation. The sixth convolution layer conv6 is obtained from the fifth convolution layer conv5 by deleting its downsampling operation; in the second downstream operation, the features extracted by conv4 enter conv6 to extract sixth features. The second downstream operation may further include a pooling operation, a segmentation operation and a point convolution, yielding a plurality of first segmentation features. The operation of conv5 includes a downsampling operation and a convolution operation; downsampling shrinks the feature map and thereby loses part of the spatial detail. Because the downsampling operation is deleted in conv6, the feature size after conv6 is consistent with that of conv4, preserving the spatial detail in conv4.
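The effect of deleting the downsampling can be checked with the standard convolution output-size formula; the 24×8 map size below is taken from the pooling kernels quoted later in this embodiment and is otherwise illustrative:

```python
def conv_out_size(size, kernel=3, stride=1, pad=1):
    # spatial output size of a convolution / downsampling stage
    return (size + 2 * pad - kernel) // stride + 1

# conv5 keeps the standard stride-2 downsampling, halving the map;
# conv6 drops it, retaining conv4's spatial resolution
assert conv_out_size(24, stride=2) == 12   # 24x8 -> 12x4 after conv5
assert conv_out_size(24, stride=1) == 24   # 24x8 -> 24x8 after conv6
```

This is why the main branch pools with a 12×4 kernel while the second and third branches pool with 24×8 kernels.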
Features extracted by the intermediate convolution layers conv4 and conv3 enter a third branch for a third downstream operation. The third downstream operation includes: fusion, pooling, segmentation and point convolution, yielding a plurality of second segmentation features. The spatial structure information of the fourth feature extracted by conv4 and the third feature extracted by conv3 preserves good local pedestrian features, so the third branch carries more spatial information than the second branch.
Downstream operations of the main branch include pooling and point convolution.
In the second and third branches, segmentation divides the feature map horizontally or vertically into several feature blocks, e.g. 2 or 3; metric learning is performed on each block, and finally the segmentation features generated from the blocks are concatenated. The pooling operation may use average pooling or maximum pooling. After the point convolution of each branch, the branch features are reduced to a unified dimension, e.g. 256, which facilitates feature fusion.
The segmentation features comprise the first segmentation features of the second branch and the second segmentation features of the third branch. The first segmentation features, the second segmentation features and the end feature are fused to obtain the identification feature.
Example 2
In this embodiment, a ResNet is used as the base residual neural network; conv1-conv5 are shared, and the final pooling layer and fully connected layer are deleted. The residual neural network is not limited thereto, however.
Inter-branch feature fusion
As shown in fig. 4, the feature fusion method between branches includes:
Step 201: the fourth feature, after an up-sampling operation or an interpolation operation, is fused with the third feature to obtain a process feature. After upsampling, the fourth feature matches the size of the third feature.
Step 202: after a convolution operation on the process feature, it is fused with the fourth feature to obtain a third fusion feature entering the third branch.
In the third downstream operation of the third branch: the third fusion feature is horizontally segmented through a pooling operation with an 8×8 kernel to obtain a plurality of second segmentation blocks f3_p1, f3_p2 and f3_p3; after each second segmentation block passes through a 1×1 point convolution layer, a plurality of 256-dimensional second segmentation features l3_p1, l3_p2 and l3_p3 are obtained; after BN and FC processing of the second segmentation features, the second segmentation loss features l3_p1L, l3_p2L and l3_p3L are obtained. The third fusion feature also undergoes 24×8 pooling and a 1×1 point convolution dimension reduction (Dim Reduction) to obtain a third metric feature fg_L3, which passes through the BN layer and the FC layer to obtain a second loss feature l3_pt.
Step 203: after a pooling operation on the third fusion feature, it is fused with the features extracted by the sixth convolution layer conv6 to obtain a second fusion feature entering the second branch. The pooling operation keeps the third fusion feature the same size as the sixth feature extracted by conv6; if the third fusion feature already matches the sixth feature in size, the pooling operation may be omitted.
In the downstream operation of the second branch, the second fusion feature is horizontally segmented through a pooling operation with a 12×8 kernel to obtain a plurality of first segmentation blocks f2_p1 and f2_p2; after each first segmentation block passes through a 1×1 point convolution layer, a plurality of first segmentation features l2_p1 and l2_p2 reduced to 256 dimensions are obtained; after BN and FC processing of the first segmentation features, the first segmentation loss features l2_p1L and l2_p2L are obtained. The second fusion feature also undergoes 24×8 pooling and a 1×1 point convolution dimension reduction to obtain a second metric feature fg_L2, which passes through the BN layer and the FC layer to obtain a first loss feature l2_pt.
Step 204: after a pooling operation on the second fusion feature, it is fused with the features extracted by the end convolution layer to obtain a first fusion feature entering the main branch.
In a downstream operation of the main branch: the first fusion feature passes through a pooling layer with a convolution kernel of 12×4 and a point convolution layer with a convolution kernel of 1×1 to obtain a 256-dimensional end feature fg_L1; after the end features are processed by the normalization (BN) layer and the Full Connection (FC) layer, a first loss feature l1_pt is obtained.
Feature fusion between branches improves the spatial information in the main branch and the second branch. The multi-branch structure forms a lightweight bidirectional feature pyramid network (LB-FPN): the input feature maps of the branches have different channel counts, so downstream of each branch a point convolution unifies them to 256 channels, and after a batch normalization (Batch Normalization, BN) operation the features of different branches are fused. The final fusion structure is shown in fig. 4.
In the invention, feature maps of different sizes or resolutions are fused in two ways: first, top-down, where a small or low-resolution feature map is enlarged by an interpolation operation before fusion; second, bottom-up, where a large or high-resolution feature map is reduced by a pooling operation before fusion.
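The two fusion modes can be sketched as follows; nearest-neighbour upsampling stands in for the bilinear interpolation used in the patent, and the shapes are illustrative:

```python
import numpy as np

def upsample2x(x):
    # top-down mode: enlarge a low-resolution (C, H, W) map before fusion
    # (nearest-neighbour here; the patent uses bilinear interpolation)
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def pool2x(x):
    # bottom-up mode: 2x2 average pooling shrinks a high-resolution map
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
```

After either resizing step, the two maps share a resolution and can be fused, e.g. by element-wise addition.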
Semantic supervision
This embodiment also introduces a semantic supervision method that strengthens the extraction of the foreground by conv2 and conv3 and weakens the role of the background in the features. As shown in fig. 7, the semantic supervision method comprises the following steps:
step 301: the first, second and third convolution layers extract features feat0, feat1, feat2, respectively.
Step 302: feat2 convolution operations and upsampling operations, and then fused with feat1 to obtain the feature de feat1. The convolution block of the convolution operation comprises a convolution layer with a size of 3x3, a BN layer and a ReLU activation function, the up-sampling operation enables the feature to be consistent with the size of feat, and the feature is changed into the original twice size in a bilinear interpolation mode.
Step 303: after de_feat1 undergoes the convolution operation and the upsampling operation, the result is fused with feat0 to obtain the feature de_feat0. The upsampling operation brings the feature to the size of feat0.
Step 304: after de_feat0 undergoes the convolution operation and the upsampling operation, a feature de_feat consistent with the pedestrian image size is obtained.
Step 305: and obtaining a foreground mask image of the pedestrian image according to the segmentation model.
Step 306: the loss between de_feat and the foreground mask map is calculated using a cross entropy loss function.
Step 307: according to the loss, the feature extraction is supervised.
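Steps 301 through 307 can be sketched in a framework-agnostic way. The map sizes, the identity-like stand-ins for the convolution blocks, and the binary cross entropy form are assumptions for illustration only:

```python
import numpy as np

def upsample2x(x):
    # bilinear interpolation is approximated here by nearest-neighbour doubling
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def conv_block(x):
    # stand-in for the 3x3 convolution + BN + ReLU block; ReLU only, to stay simple
    return np.maximum(x, 0.0)

def binary_cross_entropy(pred, mask, eps=1e-7):
    p = np.clip(pred, eps, 1 - eps)
    return float(-(mask * np.log(p) + (1 - mask) * np.log(1 - p)).mean())

# feature maps from the first three convolution layers at halving resolutions
feat0 = np.full((8, 8), 0.5)    # step 301: from the first convolution layer
feat1 = np.full((4, 4), 0.25)   # from the second convolution layer
feat2 = np.full((2, 2), 0.25)   # from the third convolution layer

de_feat1 = feat1 + upsample2x(conv_block(feat2))      # step 302
de_feat0 = feat0 + upsample2x(conv_block(de_feat1))   # step 303
de_feat = upsample2x(conv_block(de_feat0))            # step 304: image-sized map

mask = np.ones((16, 16))                              # step 305: foreground mask
loss = binary_cross_entropy(np.clip(de_feat, 0.0, 1.0), mask)  # step 306
```

The loss would then be backpropagated into conv1–conv3 (step 307), pulling the shallow features toward the foreground mask.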
The foreground mask map produced by an existing human body segmentation model is used to supervise the feature extraction of the convolution layers, so that the extracted features are closer to foreground features: foreground extraction is enhanced and the effect of the background in the features is weakened. Background information also plays a certain role in identification, so the invention weakens background extraction but does not remove background information completely.
The semantic supervision method is used only in model training, not in feature recognition and inference, which speeds up inference.
Self-attention mechanism and orthogonal regularization
The embodiment also uses a self-attention mechanism to optimize the third feature extracted by the third convolution layer conv3 and the fourth feature extracted by the fourth convolution layer conv4, and uses orthogonal regularization to enhance the representativeness of the optimized features.
The attention module is an effective mechanism in various machine learning scenarios, exploiting correlations between features to help a model focus on the more important ones. In the invention, the self-attention module effectively improves the generalization capability of the model by making it attend to the relations among features. As shown in fig. 4, self-attention modules consisting of spatial attention (PAM) and channel attention (CAM) are inserted into the network backbone.
Spatial attention (PAM): given an input feature map X ∈ R^(C×H×W), where C, H and W are the number of channels, the height and the width respectively, and R denotes the set of real numbers, PAM maps the features of each position onto a lower-dimensional subspace, generating three lower-dimensional feature maps, which are then correlated. The formula is:

Attention_p(X) = A_p ⊗ V = σ(Q^T K) ⊗ V (2)

where Attention_p(X) denotes the spatial attention output, ⊗ denotes corresponding-position multiplication, σ(·) is a Softmax function, Q, K, V ∈ R^(C×S) are feature transformations of the same size as X with S = H×W, Q^T denotes the transposed matrix, and A_p = σ(Q^T K) denotes the spatial attention map. PAM obtains the correlation between different positions of the feature map: each feature is re-weighted by its own value and the features at other positions, as shown in fig. 5(a).
Channel attention mechanism (CAM): similar to spatial attention. As shown in fig. 5(b), CAM does not apply a dimension-reducing mapping to the features; instead it directly transposes the feature map and computes the correlations between channels by matrix multiplication with the original map. The formula is as follows:
Attention_c(X) = (σ(A_c)X)^T = (σ(XX^T)X)^T (3)
where Attention_c(X) denotes the channel attention output, A_c = XX^T denotes the channel attention map, and the superscript T denotes the transposed matrix. Compared with PAM, CAM has few parameters, yet the performance improvement is remarkable.
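Equation (3) can be sketched directly, since CAM needs no learned projections; the toy shapes are arbitrary:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(X):
    # X: (C, S); unlike PAM there are no projection matrices, so CAM is parameter-free
    A = softmax(X @ X.T, axis=-1)   # (C, C): channel-to-channel affinity A_c = X X^T
    return (A @ X).T                # equation (3): (softmax(X X^T) X)^T

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 6))     # 8 channels, 6 flattened spatial positions
out = channel_attention(X)          # transposed result: (S, C) = (6, 8)
```

The absence of any learned weights here is what the text means by CAM having "few parameters compared with PAM".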
Orthogonal regularization aims to enhance the representativeness of features by reducing the correlation between different channels in the feature map; the effect is particularly pronounced on features after the attention module. Given a feature map X ∈ R^(C×S), conventional hard regularization typically relies on singular value decomposition (SVD), which is very expensive to compute, especially for high-dimensional features. Soft orthogonal regularization instead optimizes the condition number of XX^T:

L_or = β(λ_1 − λ_2)^2 (4)

where XX^T is the Gram matrix of the feature map X, β is a small constant, λ_1 and λ_2 are the maximum and minimum eigenvalues of XX^T respectively, and L_or denotes the orthogonal regularization penalty term, which penalizes correlations between feature-map channels; the square is the squared 2-norm, where the 2-norm of a vector X is the square root of the sum of squares of all its elements. With a fast iterative method for computing eigenvalues, this regularization can be evaluated quickly during training. In this embodiment, regularization is applied not on a single feature map alone but after each self-attention module, strengthening the orthogonality of all feature channels at different stages.
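A sketch of the soft penalty. The exact functional form β(λ_1 − λ_2)² is an assumption consistent with the description, and `eigvalsh` stands in for the fast iterative eigenvalue method mentioned above:

```python
import numpy as np

def orthogonal_penalty(X, beta=1e-4):
    # X: (C, S) feature map. Penalize the eigenvalue spread of the Gram matrix X X^T,
    # a soft surrogate for driving its condition number toward 1 (assumed penalty form).
    eig = np.linalg.eigvalsh(X @ X.T)          # symmetric matrix: eigvalsh is cheap and stable
    return beta * (eig[-1] - eig[0]) ** 2      # beta * (lambda_max - lambda_min)^2

# orthogonal rows give equal eigenvalues and hence (near) zero penalty;
# perfectly correlated rows give a large spread and a large penalty
p_ortho = orthogonal_penalty(np.eye(4, 16))
p_corr = orthogonal_penalty(np.ones((4, 16)))
```

Orthogonal channels make XX^T a scaled identity, so λ_1 = λ_2 and the penalty vanishes; fully correlated channels concentrate all energy in one eigenvalue and are penalized heavily.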
Loss calculation
During the training phase of the network, this embodiment uses L_softmax (the Softmax loss function) for classification:

L_softmax = −(1/N) Σ_{i=1}^{N} log( exp(W_{k_i}^T f_i) / Σ_{k=1}^{C} exp(W_k^T f_i) ) (5)

where W_k denotes the weight vector of pedestrian identity k in the dataset, f_i denotes the feature vector of the i-th pedestrian image and k_i its identity label, C denotes the total number of pedestrian identities in the dataset, and N denotes the number of images in one batch during training. The Softmax loss is applied to the downstream features of each branch after they pass through the BN layer and the FC layer; the resulting loss features are {I1_pt, I2_pt, I2_p1L, I2_p2L, I3_pt, I3_p1L, I3_p2L, I3_p3L}, and the corresponding downstream features are {fg_L1, fg_L2, I2_p1, I2_p2, fg_L3, I3_p1, I3_p2, I3_p3}. The downstream features involved in splicing and fusion are {fg_L1, I2_p1, I2_p2, I3_p1, I3_p2, I3_p3}.
In addition, this embodiment uses L_triplet (the batch hard triplet loss) for metric learning:

L_triplet = Σ_{i=1}^{p} Σ_{a=1}^{k} [ α + max_{f_p} ‖f_a^(i) − f_p^(i)‖_2^2 − min_{j≠i, f_n} ‖f_a^(i) − f_n^(j)‖_2^2 ]_+ (6)

where, in mini-batch gradient descent with batch size p×k, p denotes the number of pedestrians selected from the dataset and k the number of photos selected for each pedestrian; f^(i) and f^(j) denote pedestrian features of different identities; f_a is a randomly selected anchor feature, f_p is a feature with the same pedestrian identity as the anchor, and f_n is a feature with a different pedestrian identity; ‖·‖_2^2 denotes the squared Euclidean (2-norm) distance; α is a hyper-parameter controlling the margin between the two Euclidean distances; and [·]_+ = max(·, 0). The triplet loss function is set in each branch on the features after the 1×1 convolution layer, {fg_L1, fg_L2, fg_L3}, i.e. the end feature, the second metric feature and the third metric feature respectively.
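Batch-hard mining can be sketched as follows; the margin value and the toy embeddings are illustrative only:

```python
import numpy as np

def batch_hard_triplet(feats, labels, alpha=0.3):
    # feats: (N, D) embeddings; labels: (N,) identity ids.
    # Batch-hard mining: hardest positive and hardest negative per anchor.
    d = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)   # squared distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, d, -np.inf).max(axis=1)         # farthest same-identity
    hardest_neg = np.where(same, np.inf, d).min(axis=1)          # nearest other-identity
    return float(np.maximum(alpha + hardest_pos - hardest_neg, 0.0).mean())

# two identities, two photos each (p = 2, k = 2), well separated in feature space
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
loss = batch_hard_triplet(feats, labels)   # margin satisfied, so the loss is zero
```

When the identities are well separated the hinge term is clipped to zero; bringing the clusters closer than the margin makes the loss positive, which is what drives metric learning.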
For image pairs embedded in the feature space, the triplet loss mainly optimizes Euclidean distance, while the Softmax loss function mainly optimizes cosine distance. If the network output feature vector is optimized with both loss functions together, one loss may keep decreasing while the other oscillates. To solve this problem, a batch normalization (BN) layer is added before the fully connected (FC) layer. Batch normalization, which normalizes a batch of image features during training (each training step processes batch_size images simultaneously), balances each dimension of the feature vector and keeps the feature distribution of the same identity compact. Since pedestrians may suffer from misalignment, applying a triplet loss function to the partial features of a horizontally segmented pedestrian picture may cause the model to learn spurious features during training; this embodiment therefore does not set the triplet loss function on the segmentation features of the second and third branches.
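The BN-between-losses arrangement can be sketched as follows; the batch-only normalization (no learned affine parameters) and the 10-identity classifier are simplifying assumptions:

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    # per-dimension normalization over the batch (no learned scale/shift, for simplicity)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(3)

# the triplet loss sees the raw embedding f_t (Euclidean geometry preserved),
# while the FC classifier and Softmax loss see the batch-normalized f_i = BN(f_t)
f_t = rng.standard_normal((8, 256))             # embeddings fed to the triplet loss
f_i = batchnorm(f_t)                            # normalized embeddings fed to the FC layer
logits = f_i @ rng.standard_normal((256, 10))   # hypothetical 10-identity classifier
```

Splitting the two losses across the BN layer lets each optimize its preferred geometry without pulling the same vector in conflicting directions.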
The overall loss is the sum of the arithmetic averages of the individual losses; to reduce mutual interference between branches during training, a weight parameter is added before the triplet loss. The total loss is:

L_total = (1/m) Σ_{i=1}^{m} L_softmax^(i) + w·(1/n) Σ_{j=1}^{n} L_triplet^(j) + L_or (7)

where L_total is the total loss, L_softmax is the classification loss of a downstream (process) feature, m is the number of such features, L_triplet is the triplet loss of a metric feature, n is the number of metric features, w is the weight parameter of the triplet loss, and L_or denotes the orthogonal regularization penalty term.
Example 3
The embodiment provides a system for implementing the pedestrian re-recognition method. As shown in fig. 6, the system comprises a multi-branch network acquisition module 1, a training module 2, a feature recognition module 3 and a re-recognition module 4;
the multi-branch network acquisition module 1 is used for acquiring a multi-branch network based on a residual neural network;
The training module 2 is used for training an identification model based on the multi-branch network;
The feature recognition module 3 is used for carrying out feature recognition on the pedestrian image and the image to be recognized according to the recognition model;
The re-recognition module 4 is used for re-recognizing pedestrians according to the similarity of the recognition features.
Wherein, as shown in fig. 4, the recognition model comprises a main branch, a second branch, a third branch, a self-attention module, an orthogonal module and a loss calculation module;
The first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer are sequentially connected and serve as main branches;
a sixth convolution layer is obtained by deleting the downsampling operation of the fifth convolution layer; the sixth convolution layer, connected to the fourth convolution layer, serves as the second branch;
fusing the features extracted by the third convolution layer and the fourth convolution layer, and sending the obtained third fused feature into the third branch;
the third fusion feature undergoes, in the third branch, pooling, segmentation and dimension reduction operations in sequence to obtain a plurality of second segmentation features;
After the third fusion feature is fused with the feature extracted by the sixth convolution layer, the third fusion feature is taken as a second fusion feature to enter a second branch;
the second fusion features sequentially undergo pooling operation, segmentation and dimension reduction operation in a second branch to obtain a plurality of first segmentation features;
The second fusion feature is fused with the feature extracted by the fifth convolution layer in the main branch to obtain a first fusion feature;
The first fusion feature is in the main branch, and the terminal feature is obtained after the pooling operation and the dimension reduction operation are sequentially carried out;
after the tail end features are fused with the first segmentation features and the second segmentation features, recognition features of pedestrian images and images to be recognized are obtained;
The self-attention module is used for: optimizing third and fourth features extracted by the third and fourth convolution layers based on a self-attention mechanism;
The orthogonalization module is used for re-optimizing the optimized characteristics of the self-attention module according to a regularization method;
the loss calculation module is used for calculating the loss of the identification model according to the extracted characteristics of each branch.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A multi-branched pedestrian re-identification method, the method comprising:
A multi-branch network based on a residual neural network is acquired,
The multi-drop network includes: a plurality of middle convolution layers and end convolution layers connected in sequence,
Features extracted by one or more intermediate convolutional layers enter the branches and enter downstream operations including: pooling and dividing the features entering the branches, performing point convolution operation to obtain a plurality of divided features,
Performing pooling operation on the features extracted by the tail end convolution layer, performing point convolution operation to obtain tail end features,
The terminal features and the segmentation features are fused to obtain identification features;
training an identification model based on the multi-branch network;
According to the recognition model, carrying out feature recognition on the pedestrian image and the image to be recognized;
according to the similarity of the identification characteristics, pedestrian re-identification is carried out;
the recognition model comprises a main branch, a second branch, a third branch, a self-attention module, an orthogonal module and a loss calculation module;
The first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer are sequentially connected and serve as main branches;
deleting the downsampling operation of the fifth convolution layer to serve as a sixth convolution layer, wherein the sixth convolution layer connected with the fourth convolution layer serves as a second branch;
Fusing the features extracted by the third convolution layer and the fourth convolution layer, and enabling the obtained third fused feature to enter a third branch;
the third fusion feature is in a third branch, and a plurality of second segmentation features are obtained after pooling operation, segmentation and dimension reduction operation are sequentially carried out;
After the third fusion feature is fused with the feature extracted by the sixth convolution layer, the obtained second fusion feature enters a second branch;
the second fusion features sequentially undergo pooling operation, segmentation and dimension reduction operation in a second branch to obtain a plurality of first segmentation features;
The second fusion feature is fused with the feature extracted by the fifth convolution layer in the main branch to obtain a first fusion feature;
The first fusion feature is in the main branch, and the terminal feature is obtained after the pooling operation and the dimension reduction operation are sequentially carried out;
after the tail end features are fused with the first segmentation features and the second segmentation features, recognition features of pedestrian images and images to be recognized are obtained;
The self-attention module is used for: optimizing third and fourth features extracted by the third and fourth convolution layers based on a self-attention mechanism;
The orthogonalization module is used for re-optimizing the optimized characteristics of the self-attention module according to a regularization method;
the loss calculation module is used for calculating the loss of the identification model according to the extracted characteristics of each branch.
2. The pedestrian re-recognition method of claim 1 wherein the intermediate convolution layers include a third convolution layer and a fourth convolution layer, the end convolution layer includes a fifth convolution layer, the branch includes a second branch,
The third convolution layer, the fourth convolution layer and the tail end convolution layer are sequentially connected;
the fifth convolution layer deletes the downsampling operation, and then is used as a sixth convolution layer of the second branch;
the features extracted by the fourth convolution layer are sent to the sixth convolution layer for feature extraction, and sixth features are obtained;
And after the sixth feature is segmented through pooling operation, carrying out point convolution to obtain a plurality of first segmented features.
3. The pedestrian re-recognition method of claim 2 wherein the branches further include a third branch,
In a third branch, after the third feature extracted by the third convolution layer and the fourth feature extracted by the fourth convolution layer are fused, carrying out pooling operation, segmentation and point convolution operation in sequence to obtain a plurality of second segmentation features;
And fusing the first segmentation feature, the second segmentation feature and the tail end feature to obtain an identification feature.
4. The pedestrian re-recognition method according to claim 3, further comprising a feature fusion method between branches:
the fourth feature is fused with the third feature after up-sampling operation or interpolation operation to obtain a process feature;
After convolution operation, the process features are fused with the fourth features to obtain third fusion features entering a third branch;
after pooling operation, the third fusion feature is fused with the feature extracted by the sixth convolution layer to obtain a second fusion feature entering the second branch;
And after pooling operation, the second fusion features are fused with features extracted by the tail end convolution layer to obtain first fusion features entering the main branch.
5. The pedestrian re-recognition method of claim 4 further comprising a semantic supervision method of:
The multi-branch network comprises a first convolution layer, a second convolution layer and a third convolution layer which are sequentially connected;
the extracted characteristics of the first convolution layer, the second convolution layer and the third convolution layer are feat0, feat1 and feat2 respectively;
after feat2 undergoes a convolution operation and an up-sampling operation, it is fused with feat1 to obtain a feature de_feat1;
after de_feat1 undergoes the convolution operation and the up-sampling operation, the result is fused with feat0 to obtain a feature de_feat0;
after de_feat0 undergoes the convolution operation and the up-sampling operation, a feature de_feat consistent with the size of the pedestrian image is obtained;
Obtaining a foreground mask image of the pedestrian image according to the segmentation model;
calculating a loss between de_feat and the foreground mask map using the cross entropy loss function;
and according to the loss, supervising the extraction of the characteristics.
6. The pedestrian re-recognition method of claim 5 wherein the convolution block of the convolution operation includes a convolution with a kernel of 3x3 size, a BN layer, and a ReLU activation function;
in the third branch, the third fusion feature is horizontally segmented through a pooling operation with an 8×8 kernel to obtain a plurality of second segmentation blocks;
after each second segmentation block passes through a 1×1 point convolution layer, a plurality of 256-dimensional second segmentation features are obtained;
the second fusion feature is horizontally segmented through a pooling operation with a 12×8 kernel to obtain a plurality of first segmentation blocks;
after each first segmentation block passes through a 1×1 point convolution layer, a plurality of first segmentation features reduced to 256 dimensions are obtained;
the first fusion feature passes through a pooling layer with a 12×4 kernel and a 1×1 point convolution layer to obtain a 256-dimensional end feature.
7. The pedestrian re-recognition method of claim 6 wherein the third feature extracted by the third convolution layer and the fourth feature extracted by the fourth convolution layer are optimized using a self-attention mechanism;
the representativeness of the features optimized by the self-attention mechanism is enhanced using orthogonal regularization.
8. The pedestrian re-recognition method according to claim 7, further comprising a loss calculation method of:
the third fusion feature is subjected to 24 x 8 pooling and 1x1 point convolution dimension reduction operation, and then a third measurement feature is obtained;
the second fusion feature is subjected to 24 x 8 pooling and 1x1 convolution dimension reduction operation, and a second metric feature is obtained;
after the downstream features of each branch sequentially pass through the batch normalization layer and the full connection layer, calculating classification loss through a Softmax loss function;
Calculating a loss of a metric feature from a triplet loss function, the metric feature comprising: an end feature, a second metric feature, and a third metric feature;
The calculation method of the total loss is expressed as:

L_total = (1/m) Σ_{i=1}^{m} L_softmax^(i) + w·(1/n) Σ_{j=1}^{n} L_triplet^(j) + L_or

where L_total is the total loss, L_softmax is the classification loss of a downstream feature, m is the number of downstream features, L_triplet is the loss of a metric feature, n is the number of metric features, w is the weight parameter of the triplet loss, and L_or is the penalty term of orthogonal regularization.
9. A system for implementing the pedestrian re-recognition method of any one of claims 1-8, comprising a multi-branch network acquisition module, a training module, a feature recognition module, and a re-recognition module;
the multi-branch network acquisition module is used for acquiring a multi-branch network based on a residual neural network;
the training module is used for training an identification model based on the multi-branch network;
the feature recognition module is used for carrying out feature recognition on the pedestrian image and the image to be recognized according to the recognition model;
And the re-recognition module is used for carrying out pedestrian re-recognition according to the similarity of the recognition features.
Publications (2)

Publication Number Publication Date
CN114241278A CN114241278A (en) 2022-03-25
CN114241278B true CN114241278B (en) 2024-05-07






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant