CN114241278A - Multi-branch pedestrian re-identification method and system - Google Patents

Multi-branch pedestrian re-identification method and system

Info

Publication number
CN114241278A
Authority
CN
China
Prior art keywords
features
feature
branch
convolution
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111641939.2A
Other languages
Chinese (zh)
Other versions
CN114241278B (en)
Inventor
何东之
王鹏飞
孙亚茹
张震
郭隆杭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202111641939.2A priority Critical patent/CN114241278B/en
Publication of CN114241278A publication Critical patent/CN114241278A/en
Application granted granted Critical
Publication of CN114241278B publication Critical patent/CN114241278B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-branch pedestrian re-identification method and system, belonging to the technical field of machine learning. The method comprises the following steps: obtaining a multi-branch network based on a residual neural network, in which the features extracted by one or more intermediate convolutional layers enter a branch and undergo downstream operations, and the resulting segmentation features are fused with the end features; training a recognition model based on the multi-branch network; performing feature recognition on the pedestrian image and the image to be identified according to the recognition model; and re-identifying the pedestrian according to the similarity of the features. The features extracted by the intermediate convolutional layers carry good foreground spatial information and local pedestrian features, and after entering the branches yield the segmentation features, while the end features extracted by the end convolutional layer carry rich semantic information. Fusing the segmentation features with the end features gives the final features both good spatial information and rich semantic information, and avoids the interference of the background with feature matching.

Description

Multi-branch pedestrian re-identification method and system
Technical Field
The invention relates to the technical field of machine learning, in particular to a multi-branch pedestrian re-identification method and system.
Background
With the rapid development of urban electronic information systems, video surveillance analysis has become one of the principal means of investigation and evidence collection for the new generation. Pedestrian re-identification is a key technology for video analysis: given a pedestrian image shot by a camera, it determines, on the one hand, whether the pedestrian target reappears in that camera's monitored area and, on the other hand, whether the target appears in the areas monitored by other cameras, after which the target's position is located from the positions of the cameras containing it. The task is widely regarded as a sub-problem of image retrieval: given a pedestrian image, for example, the goal is to detect the pedestrian and recover its motion trajectory across devices. Owing to the complexity of real-world scenes, pedestrian re-identification faces many different challenges, chiefly large variations in pose and clothing, severe occlusion, and background clutter.
The pedestrian re-identification task can be divided into two stages: feature extraction and metric learning. Feature extraction obtains stable visual features with high discriminability; metric learning computes the similarity of the extracted pedestrian feature vectors under different metric criteria, so that feature vectors of same-identity samples lie closer together than those of different identities. To extract more distinctive features, many approaches focus on learning to combine global and local feature representations to enhance the overall representation. In recent years, convolutional neural networks (CNNs) have shown strong feature representation capability in various deep learning tasks; existing methods fall into the following categories:
One category uses prior knowledge of the pose or body landmarks to locate body parts. In this case, however, the performance of the convolutional neural network depends heavily on the robustness of the pose or body-landmark estimation model, and unexpected errors such as pose estimation failures may greatly affect the recognition result.
The other category assumes image alignment, divides the original picture or feature map into strips, performs metric learning on each strip, and finally concatenates the strip features as the final feature representation. Although such convolutional-network-based methods have been successful, the successive convolution and down-sampling operations lose detail information, so the final features of the convolutional network are only blurry global features, which is unfavorable for detail matching. In addition, different backgrounds strongly affect the appearance of the same pedestrian, introducing considerable noise into the identification features and interfering with feature matching.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides a multi-branch pedestrian re-identification method and system that extract, through different branches, high-level features with rich semantic information and low-level features with good spatial information, and fuse the two to avoid the interference of the background with feature matching.
The invention discloses a multi-branch pedestrian re-identification method. A multi-branch network based on a residual neural network is obtained, the multi-branch network comprising a plurality of intermediate convolutional layers and an end convolutional layer connected in sequence. The features extracted by one or more of the intermediate convolutional layers branch off and undergo downstream operations comprising: pooling and segmenting the features entering the branch and performing a point convolution operation to obtain a plurality of segmentation features; performing a pooling operation followed by a point convolution operation on the features extracted by the end convolutional layer to obtain end features; and fusing the end features with the segmentation features to obtain identification features. A recognition model is trained based on the multi-branch network; feature recognition is performed on the pedestrian image and the image to be identified according to the recognition model; and the pedestrian is re-identified according to the similarity of the identification features.
Preferably, the intermediate convolutional layers comprise a third convolutional layer conv3 and a fourth convolutional layer conv4, the end convolutional layer comprises a fifth convolutional layer conv5, and the branches comprise a second branch,
the third convolution layer, the fourth convolution layer and the end convolution layer are connected in sequence;
the fifth convolutional layer with its down-sampling operation deleted serves as a sixth convolutional layer conv6 of the second branch;
the features extracted by the fourth convolution layer are sent into the sixth convolution layer for feature extraction to obtain sixth features;
and after the sixth feature is segmented through pooling operation, performing point convolution to obtain a plurality of first segmentation features.
Preferably, the branches further comprise a third branch,
in the third branch, after the third feature extracted by the third convolutional layer and the fourth feature extracted by the fourth convolutional layer are fused, pooling operation, segmentation and point convolution operation are sequentially carried out to obtain a plurality of second segmentation features;
fusing the first segmentation feature, the second segmentation feature and the end feature.
Preferably, the feature fusion method between the branches comprises:
the fourth feature, after an up-sampling operation or an interpolation operation, is fused with the third feature to obtain a process feature;
after a convolution operation, the process feature is fused with the fourth feature again to obtain a third fusion feature entering the third branch;
after a pooling operation, the third fusion feature is fused with the feature extracted by the sixth convolutional layer to obtain a second fusion feature entering the second branch;
after a pooling operation, the second fusion feature is fused with the feature extracted by the end convolutional layer to obtain a first fusion feature entering the main branch.
Preferably, the semantic supervision method comprises:
the multi-branch network packet comprises a first convolution layer, a second convolution layer and a third convolution layer which are connected in sequence;
the features extracted from the first, second and third convolutional layers are feat0, feat1 and feat2, respectively;
after a convolution operation and an up-sampling operation, feat2 is fused with feat1 to obtain a feature de_feat1;
after a convolution operation and an up-sampling operation of de_feat1, it is fused with feat0 to obtain a feature de_feat0;
after a convolution operation and an up-sampling operation of de_feat0, a feature de_feat consistent with the size of the pedestrian image is obtained;
obtaining a foreground mask image of the pedestrian image according to the segmentation model;
calculating the loss between de_feat and the foreground mask map by using a cross-entropy loss function;
and supervising the extraction of the features according to the loss.
Preferably, the convolution block of the convolution operation comprises a convolution with a 3x3 kernel, a BN layer and a ReLU activation function;
in the third branch, the third fusion feature is horizontally segmented through a pooling operation with an 8x8 kernel to obtain a plurality of second segmentation blocks;
after each second segmentation block passes through a 1x1 point convolution layer, a plurality of 256-dimensional second segmentation features are obtained;
the second fusion feature is horizontally segmented through a pooling operation with a 12x8 kernel to obtain a plurality of first segmentation blocks;
after each first segmentation block passes through a 1x1 point convolution layer, a plurality of first segmentation features reduced to 256 dimensions are obtained;
the first fusion feature passes through a pooling layer with a 12x4 kernel and a 1x1 point convolution layer to obtain a 256-dimensional end feature.
Preferably, the third feature extracted by the third convolutional layer and the fourth feature extracted by the fourth convolutional layer are weighted by using a self-attention mechanism;
the representativeness of the features after the self-attention mechanism measurement is enhanced by utilizing orthogonal regularization.
Preferably, the method of loss calculation comprises:
after 24x8 pooling and 1x1 point convolution dimension-reduction operations are performed on the third fusion feature, a third metric feature fg_L3 is obtained;
after 24x8 pooling and 1x1 convolution dimension-reduction operations are performed on the second fusion feature, a second metric feature fg_L2 is obtained;
after the downstream features of each branch sequentially pass through the batch processing normalization layer and the full connection layer, calculating classification loss through a Softmax loss function;
computing, by a triplet loss function, the loss of the metric features, the metric features comprising: the end feature, the second metric feature and the third metric feature;
the total loss is calculated as:

L_total = (1/m) Σ_{i=1..m} L_softmax^(i) + (λ/n) Σ_{j=1..n} L_triplet^(j) + L_ortho

where L_total is the total loss, L_softmax^(i) is the classification loss of the i-th downstream feature, m is the number of downstream features, L_triplet^(j) is the loss of the j-th metric feature, n is the number of metric features, λ is the weight of the triplet loss, and L_ortho is the penalty term of the orthogonal regularization.
The invention also provides a system for implementing the pedestrian re-identification method, comprising a multi-branch network acquisition module, a training module, a feature recognition module and a re-identification module;
the multi-branch network acquisition module is used for acquiring a multi-branch network based on a residual error neural network;
the training module is used for training a recognition model based on the multi-branch network;
the characteristic identification module is used for carrying out characteristic identification on the pedestrian image and the image to be identified according to the identification model;
and the re-identification module is used for re-identifying the pedestrian according to the similarity of the identification features.
Preferably, the recognition model comprises a main branch, a second branch, a third branch, a self-attention module, an orthogonal module and a loss calculation module;
the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer are connected in sequence and used as main branches;
deleting the downsampling operation of the fifth convolutional layer as a sixth convolutional layer,
a sixth convolutional layer connected to the fourth convolutional layer as a second branch;
fusing the extracted features of the third convolutional layer and the fourth convolutional layer, and enabling the obtained third fused feature to enter a third branch;
the third fusion feature is subjected to pooling operation, segmentation and dimension reduction operation in a third branch in sequence to obtain a plurality of second segmentation features;
after the third fusion feature is fused with the feature extracted by the sixth convolutional layer, the obtained second fusion feature enters the second branch;
the second fusion features are subjected to pooling operation, segmentation and dimension reduction operation in the second branch in sequence to obtain a plurality of first segmentation features;
fusing the second fusion feature with the feature extracted from the fifth convolution layer in the main branch to obtain a first fusion feature;
the first fusion feature is in a main branch, and after pooling operation and dimension reduction operation are sequentially carried out, a terminal feature is obtained;
after the terminal features, the first segmentation features and the second segmentation features are fused, identification features of the pedestrian image and the image to be identified are obtained;
the self-attention module is to: optimizing the third and fourth features extracted from the third and fourth convolutional layers based on a self-attention mechanism;
the orthogonal module is used for re-optimizing the optimized characteristics of the self-attention module according to a regularized orthogonalization method;
and the loss calculation module is used for calculating the loss of the recognition model according to the features extracted by each branch.
Compared with the prior art, the invention has the following beneficial effects: in the multi-branch network based on a residual neural network, the features extracted by the intermediate convolutional layers carry good foreground spatial information and local pedestrian features and, after entering the branches, yield segmentation features with good spatial information, while the end features extracted by the end convolutional layer carry rich semantic information; fusing the segmentation features with the end features gives the final features both good spatial information and rich semantic information, effectively avoiding the interference of the background with feature matching.
Drawings
FIG. 1 is a flow chart of a multi-branch pedestrian re-identification method of the present invention;
FIG. 2 is a logical block diagram of the multi-branch network;
FIG. 3 is a logical block diagram of a multi-branch network in embodiment 1;
FIG. 4 is a logical block diagram of a multi-branch network in embodiment 2;
FIG. 5 is a schematic diagram of an attention mechanism;
FIG. 6 is a logical block diagram of the system of the present invention;
fig. 7 is a schematic diagram of semantic supervision.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention is described in further detail below with reference to the attached drawing figures:
a multi-branch pedestrian re-identification method, as shown in fig. 1, the method comprising:
step 101: a multi-branch network based on a residual neural network (ResNet) is obtained. Wherein the residual neural network comprises ResNet18, ResNet34, ResNet50, ResNet101 or ResNet 152.
Fig. 2 shows the logical structure of the multi-branch network, which comprises a plurality of intermediate convolutional layers conv2, conv3, conv4 and an end convolutional layer conv5 connected in sequence,
wherein the features extracted by one or more of the intermediate convolutional layers enter a branch and undergo downstream operations;
the downstream operations include: performing pooling and segmentation on the features entering the branches, and performing point convolution operation to obtain a plurality of segmentation features;
performing a pooling operation on the features extracted by the end convolutional layer, and then a point convolution operation to obtain the end features;
and fusing the end features and the segmentation features to obtain the identification features.
Step 102: a recognition model is trained based on the multi-branch network. When training the recognition model, a training set is constructed and the model is trained on it; the method of obtaining a recognition model in this way is prior art and is not repeated in the invention.
Step 103: feature recognition is performed on the pedestrian image and the image to be identified according to the recognition model, obtaining the identification features.
Step 104: the pedestrian is re-identified according to the similarity of the identification features.
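As an illustrative sketch only (assuming PyTorch; model denotes a trained recognition model that returns the fused identification feature, and all names here are hypothetical), steps 103-104 can be realized by ranking gallery images by cosine similarity:

    import torch
    import torch.nn.functional as F

    def rank_gallery(model, query_img, gallery_imgs):
        # Step 103: extract identification features for the query image
        # and for every gallery (pedestrian) image.
        model.eval()
        with torch.no_grad():
            q = F.normalize(model(query_img.unsqueeze(0)), dim=1)     # (1, D)
            g = F.normalize(model(torch.stack(gallery_imgs)), dim=1)  # (N, D)
        # Step 104: cosine similarity between query and gallery features;
        # gallery indices are returned from most to least similar.
        sims = (q @ g.t()).squeeze(0)                                 # (N,)
        return torch.argsort(sims, descending=True)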
In the multi-branch network based on a residual neural network, the low-level features extracted by the intermediate convolutional layers have good foreground spatial information and local pedestrian features and, after entering the branches, yield segmentation features with good spatial information, while the end features extracted by the end convolutional layer have rich semantic information; fusing the segmentation features with the end features gives the final features both good spatial information and rich semantic information, effectively avoiding the interference of the background with feature matching.
In fig. 2, the convolutional layers comprise the first convolutional layer conv1, the second convolutional layer conv2, the third convolutional layer conv3 and the fourth convolutional layer conv4, connected in sequence; the end convolutional layer is the fifth convolutional layer conv5.
Example 1
As shown in fig. 3, the features extracted by conv4 enter the second branch for the second downstream operation. The fifth convolutional layer conv5 with its down-sampling operation deleted serves as the sixth convolutional layer conv6 of the second downstream operation; the features extracted by conv4 enter conv6, and the sixth features are extracted. The second downstream operation may further include a pooling operation, a segmentation operation and a point convolution, yielding a plurality of first segmentation features. The operations of conv5 include a down-sampling operation and convolution operations; down-sampling shrinks the image and loses part of the spatial detail of the features. Because the down-sampling operation is deleted in conv6, the size of the features processed by conv6 stays consistent with that in conv4, preserving the spatial details of conv4, as sketched below.
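A minimal sketch of this construction, assuming torchvision's ResNet-50 (the name conv6 and the strides touched follow the text; everything else, including weights=None, is an assumption for brevity):

    import copy
    import torchvision

    resnet = torchvision.models.resnet50(weights=None)

    # conv6: a copy of conv5 (layer4 in torchvision) with the down-sampling
    # removed, so its output keeps the spatial size of the conv4 feature map.
    conv6 = copy.deepcopy(resnet.layer4)
    conv6[0].conv2.stride = (1, 1)          # 3x3 conv of the first bottleneck
    conv6[0].downsample[0].stride = (1, 1)  # 1x1 conv on the residual shortcut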
The features extracted by the intermediate convolutional layers conv4 and conv3 enter the third branch for the third downstream operation, which includes fusion, pooling, segmentation and point convolution and yields a plurality of second segmentation features. The spatial structure information of the fourth feature extracted by conv4 and the third feature extracted by conv3 preserves good local pedestrian features, and the third branch carries more spatial information than the second branch.
The downstream operations of the main branch include pooling and point convolution.
In the second branch and the third branch, segmentation divides the features into a plurality of feature blocks, for example 2 or 3; metric learning is performed on each block separately, and the segmentation features generated from the blocks are finally concatenated. The pooling operation may employ average pooling and maximum pooling together so that both share the pooled features; in the invention, pooling is used to perform the feature segmentation, as in the sketch below. After the point convolution operations of the different branches, the features of each branch are reduced to a uniform dimension, such as 256, which facilitates feature fusion.
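A sketch of such a split-and-reduce step (a hedged illustration: adaptive average pooling stands in for the fixed pooling kernels given later, and the module name is hypothetical):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SplitReduce(nn.Module):
        # Horizontally split a feature map into stripes by pooling, then
        # reduce each stripe to a uniform dimension by 1x1 point convolution.
        def __init__(self, in_ch, n_stripes=3, out_ch=256):
            super().__init__()
            self.n_stripes = n_stripes
            self.reduce = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):                    # x: (B, C, H, W)
            stripes = F.adaptive_avg_pool2d(x, (self.n_stripes, 1))
            stripes = self.reduce(stripes)       # (B, out_ch, n_stripes, 1)
            # One 256-dimensional segmentation feature per stripe.
            return [stripes[:, :, i, 0] for i in range(self.n_stripes)]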
The segmentation features comprise the first segmentation features of the second branch and the second segmentation features of the third branch. The first segmentation features, the second segmentation features and the end feature are fused to obtain the identification features.
Example 2
This embodiment takes ResNet50 as the base residual neural network, reusing its conv1-conv5 and deleting the final pooling layer and fully connected layer; the residual neural network is, however, not limited thereto.
Inter-branch feature fusion
As shown in fig. 4, the feature fusion method between branches includes:
step 201: and fusing the fourth feature with the third feature after the up-sampling operation (up-sampling) or interpolation operation is carried out on the fourth feature to obtain the process feature. And after the up-sampling operation of the fourth feature, the size of the fourth feature is consistent with that of the third feature.
Step 202: and after the convolution operation of the process features, fusing the process features with the fourth features to obtain third fused features entering the third branch.
In the third downstream operation of the third branch: the third fused feature is horizontally segmented through a pooling operation with an 8x8 kernel to obtain a plurality of second segmentation blocks f3_p1, f3_p2 and f3_p3; after each second segmentation block passes through a 1x1 point convolution layer, a plurality of second segmentation features reduced to 256 dimensions are obtained, namely l3_p1, l3_p2 and l3_p3. After the second segmentation features pass through BN and FC layers, the second segmentation loss features l3_p1L, l3_p2L and l3_p3L are obtained. After 24x8 pooling and 1x1 point convolution dimension reduction (Dim Reduction) of the third fused feature, the third metric feature fg_L3 is obtained; after fg_L3 passes through the BN layer and the FC layer, the third loss feature l3_pt is obtained.
Step 203: after a pooling operation on the third fused feature, it is fused with the features extracted by the sixth convolutional layer conv6 to obtain the second fused feature entering the second branch. The pooling operation makes the sizes of the third fused feature and the sixth feature extracted by conv6 consistent when they are fused; if the third fused feature already has the same size as the sixth feature, the pooling may be omitted.
In the downstream operation of the second branch, the second fused feature is horizontally segmented through a pooling operation with a 12x8 kernel to obtain a plurality of first segmentation blocks f2_p1 and f2_p2; after each first segmentation block passes through a 1x1 point convolution layer, a plurality of first segmentation features l2_p1 and l2_p2 reduced to 256 dimensions are obtained; after the first segmentation features pass through BN and FC layers, the first segmentation loss features l2_p1L and l2_p2L are obtained; after 24x8 pooling and 1x1 convolution dimension reduction of the second fused feature, the second metric feature fg_L2 is obtained; after fg_L2 passes through the BN layer and the FC layer, the second loss feature l2_pt is obtained.
Step 204: after a pooling operation on the second fused feature, it is fused with the features extracted by the end convolutional layer to obtain the first fused feature entering the main branch.
In the downstream operation of the main branch: the first fused feature passes through a pooling layer with a 12x4 kernel and a 1x1 point convolution layer to obtain the 256-dimensional end feature fg_L1; after the end feature passes through a batch normalization (BN) layer and a fully connected (FC) layer, the first loss feature l1_pt is obtained.
Spatial information in the main branch and the second branch is improved by the feature fusion between branches. The multi-branch structure forms a lightweight bidirectional feature pyramid network (LB-FPN). The numbers of input feature-map channels of the branches differ, so downstream of each branch the feature maps are unified to 256 channels by point convolution, and the features of different branches are fused after a batch normalization (BN) operation. The final fused features are shown in fig. 4.
In the invention, feature maps of different sizes or resolutions are fused in two ways: first, a top-down mode, in which a small or low-resolution feature map is enlarged by an interpolation operation before fusion; second, a bottom-up mode, in which a large or high-resolution feature map is reduced by a pooling operation before fusion. Both modes are sketched below.
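The two fusion modes might be sketched as follows (an assumption-laden illustration: element-wise addition as the fusion operation, bilinear interpolation for up-sampling and average pooling for down-sampling):

    import torch.nn.functional as F

    def fuse_top_down(small, large):
        # Enlarge the low-resolution map by interpolation, then fuse.
        up = F.interpolate(small, size=large.shape[-2:],
                           mode='bilinear', align_corners=False)
        return up + large

    def fuse_bottom_up(large, small):
        # Shrink the high-resolution map by pooling, then fuse.
        down = F.adaptive_avg_pool2d(large, small.shape[-2:])
        return down + small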
Semantic supervision
This embodiment further introduces a semantic supervision method that strengthens the extraction of the foreground by conv2 and conv3 and weakens the role of the background in the features. As shown in fig. 7, the semantic supervision method comprises:
step 301: the features extracted from the first, second and third convolutional layers are feat0, feat1 and feat2, respectively.
Step 302: after a convolution operation and an up-sampling operation, feat2 is fused with feat1 to obtain the feature de_feat1. The convolution block of the convolution operation comprises a convolution layer with a 3x3 kernel, a BN layer and a ReLU activation function; the up-sampling operation makes the feature consistent with the size of feat1, specifically doubling its size by bilinear interpolation.
Step 303: after a convolution operation and an up-sampling operation of de_feat1, it is fused with feat0 to obtain the feature de_feat0; here the up-sampling matches the size of feat0.
Step 304: after a convolution operation and an up-sampling operation of de_feat0, a feature de_feat consistent with the size of the pedestrian image is obtained.
Step 305: a foreground mask map of the pedestrian image is obtained according to the segmentation model.
Step 306: the loss between de_feat and the foreground mask map is calculated using a cross-entropy loss function.
Step 307: the feature extraction is supervised according to the loss.
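A sketch of steps 301-307 (hedged: the channel counts, the fusion by addition and the two-class mask encoding are assumptions; only the decoder shape follows the text):

    import torch.nn as nn
    import torch.nn.functional as F

    def conv_block(in_ch, out_ch):
        # Convolution block from the text: 3x3 convolution, BN layer, ReLU.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    class SemanticSupervision(nn.Module):
        # Decode shallow features back to image size and supervise them with
        # a foreground mask (2 classes: background / pedestrian).
        def __init__(self, ch0, ch1, ch2):
            super().__init__()
            self.dec2 = conv_block(ch2, ch1)
            self.dec1 = conv_block(ch1, ch0)
            self.dec0 = conv_block(ch0, 2)   # two-way per-pixel logits

        def forward(self, feat0, feat1, feat2, mask):
            def up(x, size):
                return F.interpolate(x, size=size, mode='bilinear',
                                     align_corners=False)
            de_feat1 = up(self.dec2(feat2), feat1.shape[-2:]) + feat1
            de_feat0 = up(self.dec1(de_feat1), feat0.shape[-2:]) + feat0
            de_feat = up(self.dec0(de_feat0), mask.shape[-2:])
            # Step 306: cross entropy between de_feat and the foreground mask.
            return F.cross_entropy(de_feat, mask.long())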
The foreground mask map produced by an existing human-body segmentation model is used to supervise the feature extraction of the convolutional layers, so that the extracted features are closer to foreground features: the extraction of the foreground is enhanced and the role of the background in the features is weakened. Background information also plays some role in identification, so the invention weakens rather than completely removes it.
The semantic supervision method is used only during model training, not during feature recognition and inference, which improves the inference speed and efficiency.
Self-attention mechanism and orthogonal regularization
This embodiment also weights the third feature extracted by the third convolutional layer conv3 and the fourth feature extracted by the fourth convolutional layer conv4 with a self-attention mechanism, and enhances the representativeness of the weighted features with orthogonal regularization.
Attention modules are an effective mechanism in various machine learning scenarios: they exploit the correlations between features to help the model focus on the more important ones. In the invention, the self-attention modules effectively improve the generalization capability of the model by making it attend to the relationships between features. As shown in fig. 4, self-attention modules are inserted into the network backbone, each consisting of a position attention module (PAM) and a channel attention module (CAM).
Position attention module (PAM): given an input feature map X ∈ R^{C×H×W}, where C, H and W are the number of channels, the height and the width, respectively, and R denotes the set of real numbers, the PAM maps the features of each location into a lower-dimensional subspace, producing three lower-dimensional feature maps, which are then correlated:

Attention_p(X) = V ⊗ σ(A_p) = V ⊗ σ(Q^T K)    (2)

where Attention_p(X) denotes the spatial attention output, ⊗ denotes position-wise multiplication, σ(·) is the Softmax function, Q, K, V ∈ R^{C×S} are feature transforms of the same size as X with S = H × W, the superscript T denotes the matrix transpose, and A_p denotes the spatial attention map. The PAM derives the correlations between different positions of the feature map and re-weights the feature at each location by the features at all other locations, as shown in fig. 5(a).
Channel attention module (CAM): similar to the position attention. As shown in fig. 5(b), the CAM does not perform dimension mapping on the features; it directly reshapes and transposes the feature map and multiplies it with the reshaped original to compute the correlations between the channels:

Attention_c(X) = (σ(A_c)X)^T = (σ(XX^T)X)^T    (3)

where Attention_c(X) denotes the channel attention output, A_c = XX^T denotes the channel attention map, and the superscript T denotes the matrix transpose. Compared with the PAM, the CAM has few parameters yet improves performance significantly.
Orthogonal regularization aims to enhance the representativeness of features by reducing the correlation between different channels of a feature map; the effect is particularly apparent for the features behind the attention modules. Given a feature map X ∈ R^{c×s}, conventional hard regularization typically relies on singular value decomposition (SVD), which is very expensive to compute, especially for high-dimensional features. Instead, the condition number of XX^T may be optimized with a soft orthogonal regularization:

L_ortho = β(λ1 − λ2)    (4)

where XX^T denotes the orthogonal (Gram) transform of the feature X, β is a small constant, λ1 and λ2 are the maximum and minimum eigenvalues of the matrix XX^T, and L_ortho is the orthogonal regularization penalty term that penalizes the correlation between the feature-map channels. Throughout, ||·||_2^2 denotes the squared Euclidean 2-norm; the 2-norm of a vector X is the square root of the sum of the squares of its elements. By solving for the eigenvalues with a fast iterative method, the regularization can be computed quickly during training. In this embodiment, instead of applying the regularization to a single feature map alone, it is applied after each self-attention module to enhance the orthogonality of all feature channels at the different stages.
Loss calculation
In the training phase of the network, this embodiment uses L_softmax (the Softmax loss function) for classification:

L_softmax = −(1/N) Σ_{i=1..N} log( exp(W_{y_i}^T f_i) / Σ_{k=1..C} exp(W_k^T f_i) )    (5)

where W_k denotes the weight vector of pedestrian class k in the dataset, f_i denotes the feature vector of the i-th pedestrian image and y_i its identity label, C denotes the total number of pedestrian identities in the dataset, and N denotes the number of images in one batch during training. The loss function is applied to the downstream features of each branch after they pass through the BN layer and the FC layer, namely {l1_pt, l2_pt, l2_p1L, l2_p2L, l3_pt, l3_p1L, l3_p2L, l3_p3L}; the corresponding downstream features are {fg_L1, fg_L2, l2_p1, l2_p2, fg_L3, l3_p1, l3_p2, l3_p3}. Among these, the downstream features participating in concatenation and fusion are {fg_L1, l2_p1, l2_p2, l3_p1, l3_p2, l3_p3}.
In addition, this embodiment uses L_triplet (the batch-hard triplet loss) for metric learning:

L_triplet = Σ_{i=1..p} Σ_{a=1..k} [ α + max_{p'=1..k} ||f_a^(i) − f_{p'}^(i)||_2^2 − min_{j≠i, n=1..k} ||f_a^(i) − f_n^(j)||_2^2 ]_+    (6)

where, in mini-batch gradient descent with a batch size of p × k, p denotes the number of pedestrians selected from the dataset and k the number of photographs selected per pedestrian; f^(i) and f^(j) denote features of pedestrians of different identities; f_a is a randomly selected pedestrian feature anchor, f_p is a feature with the same pedestrian identity as the anchor, and f_n is a feature with a different pedestrian identity from the anchor; ||·||_2^2 denotes the squared Euclidean 2-norm; α is a hyperparameter controlling the margin between the two Euclidean distances; and [·]_+ = max(·, 0). The triplet loss function in each branch is applied to the features after the 1x1 convolutional layer, namely {fg_L1, fg_L2, fg_L3}, i.e. the end feature, the second metric feature and the third metric feature, respectively.
For image pairs in the embedding space, the triplet loss mainly optimizes the Euclidean distance, while the Softmax loss function mainly optimizes the cosine distance. If the two loss functions are used together to optimize the network output feature vectors, one loss may decrease while the other oscillates. To solve this problem, a batch normalization (BN) layer is added before the fully connected (FC) layer: batch normalization balances each dimension of the feature vector and keeps the distribution of persons with the same identity compact, normalizing a batch of image features during training. During training, several pictures (Batch_Size) are processed at the same time. Since pedestrians may suffer from misalignment, applying the triplet loss to the local features of horizontally segmented pedestrian pictures may cause the model to learn strange features during training; therefore this embodiment does not apply the triplet loss to the segmentation features of the second and third branches.
The total loss is the sum of the arithmetic means of the individual losses; to reduce mutual interference between branches during training, a weight parameter is placed in front of the triplet loss. The total loss is calculated as:

L_total = (1/m) Σ_{i=1..m} L_softmax^(i) + (λ/n) Σ_{j=1..n} L_triplet^(j) + L_ortho    (7)

where L_total is the total loss, L_softmax^(i) is the classification loss of the i-th downstream feature, m is the number of downstream features, L_triplet^(j) is the triplet loss of the j-th metric feature, n is the number of metric features, λ is the weight of the triplet loss, and L_ortho is the penalty term of the orthogonal regularization.
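A sketch of eq. (7) (tri_weight corresponds to the weight parameter mentioned above; its default value is an assumption):

    def total_loss(cls_losses, tri_losses, ortho_pen, tri_weight=1.0):
        # Arithmetic mean of the m classification losses, weighted mean of
        # the n triplet losses, plus the orthogonal regularization penalty.
        return (sum(cls_losses) / len(cls_losses)
                + tri_weight * sum(tri_losses) / len(tri_losses)
                + ortho_pen)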
Example 3
This embodiment provides a system for implementing the pedestrian re-identification method, as shown in fig. 6, comprising a multi-branch network acquisition module 1, a training module 2, a feature recognition module 3 and a re-identification module 4;
the multi-branch network acquisition module 1 is used for acquiring a multi-branch network based on a residual error neural network;
the training module 2 is used for training a recognition model based on the multi-branch network;
the feature recognition module 3 is used for carrying out feature recognition on the pedestrian image and the image to be recognized according to the recognition model;
and the re-identification module 4 is used for re-identifying the pedestrian according to the similarity of the identification features.
Wherein, as shown in fig. 4, the recognition model comprises a main branch, a second branch, a third branch, a self-attention module, an orthogonal module and a loss calculation module;
the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer are connected in sequence and used as main branches;
deleting the downsampling operation of the fifth convolutional layer as a sixth convolutional layer,
a sixth convolutional layer connected to the fourth convolutional layer as a second branch;
fusing the extracted features of the third convolutional layer and the fourth convolutional layer, and enabling the obtained third fused feature to enter a second branch;
the third fusion feature is subjected to pooling operation, segmentation and dimension reduction operation in the second branch in sequence to obtain a plurality of second segmentation features;
after the third fusion feature is fused with the feature extracted by the sixth convolutional layer, the third fusion feature is used as a second fusion feature to enter a second branch;
the second fusion features are subjected to pooling operation, segmentation and dimension reduction operation in the second branch in sequence to obtain a plurality of first segmentation features;
fusing the second fusion feature with the feature extracted from the fifth convolution layer in the main branch to obtain a first fusion feature;
the first fusion feature is in a main branch, and after pooling operation and dimension reduction operation are sequentially carried out, a terminal feature is obtained;
after the terminal features, the first segmentation features and the second segmentation features are fused, identification features of the pedestrian image and the image to be identified are obtained;
the self-attention module is to: optimizing the third and fourth features extracted from the third and fourth convolutional layers based on a self-attention mechanism;
the orthogonal module is used for re-optimizing the optimized characteristics of the self-attention module according to a regularized orthogonalization method;
and the loss calculation module is used for calculating the loss of the recognition model according to the features extracted by each branch.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A multi-branch pedestrian re-identification method, the method comprising:
a multi-branch network based on a residual neural network is obtained,
the multi-branch network comprises: a plurality of intermediate convolutional layers and an end convolutional layer connected in sequence,
the features extracted by one or more of the intermediate convolutional layers enter a branch and undergo downstream operations comprising: pooling and segmenting the features entering the branch and performing a point convolution operation to obtain a plurality of segmentation features,
performing a pooling operation on the features extracted by the end convolutional layer and then a point convolution operation to obtain end features,
fusing the end features and the segmentation features to obtain identification features;
training a recognition model based on the multi-branch network;
according to the identification model, carrying out feature identification on the pedestrian image and the image to be identified;
and re-identifying the pedestrian according to the similarity of the identification features.
2. The pedestrian re-identification method according to claim 1, wherein the intermediate convolutional layers include a third convolutional layer and a fourth convolutional layer, the end convolutional layer includes a fifth convolutional layer, the branch includes a second branch,
the third convolution layer, the fourth convolution layer and the end convolution layer are connected in sequence;
the fifth convolutional layer is used as a sixth convolutional layer of the second branch after deleting the down-sampling operation;
the features extracted by the fourth convolution layer are sent into the sixth convolution layer for feature extraction to obtain sixth features;
and after the sixth feature is segmented through pooling operation, performing point convolution to obtain a plurality of first segmentation features.
3. The pedestrian re-identification method according to claim 2, wherein the branches further include a third branch,
in the third branch, after the third feature extracted by the third convolutional layer and the fourth feature extracted by the fourth convolutional layer are fused, pooling operation, segmentation and point convolution operation are sequentially carried out to obtain a plurality of second segmentation features;
and fusing the first segmentation feature, the second segmentation feature and the end feature to obtain the identification feature.
4. The pedestrian re-identification method according to claim 3, further comprising a feature fusion method between branches:
the fourth feature is fused with the third feature after the up-sampling operation or the interpolation operation, so as to obtain a process feature;
after convolution operation, the process features are fused with the fourth features to obtain third fusion features entering a third branch;
after pooling operation, the third fusion feature is fused with the feature extracted from the sixth convolution layer to obtain a second fusion feature entering the second branch;
the second fused feature is fused with the feature extracted from the end convolution layer after pooling operation to obtain a first fused feature entering the main branch.
5. The pedestrian re-identification method according to claim 4, further comprising a method of semantic supervision:
the multi-branch network packet comprises a first convolution layer, a second convolution layer and a third convolution layer which are connected in sequence;
the features extracted from the first, second and third convolutional layers are feat0, feat1 and feat2, respectively;
after a convolution operation and an up-sampling operation, feat2 is fused with feat1 to obtain a feature de_feat1;
after a convolution operation and an up-sampling operation of de_feat1, it is fused with feat0 to obtain a feature de_feat0;
after a convolution operation and an up-sampling operation of de_feat0, a feature de_feat consistent with the size of the pedestrian image is obtained;
obtaining a foreground mask image of the pedestrian image according to the segmentation model;
calculating the loss between de_feat and the foreground mask map by using a cross-entropy loss function;
and supervising the extraction of the features according to the loss.
6. The pedestrian re-identification method according to claim 5, wherein the convolution block of the convolution operation comprises a convolution with a 3x3 kernel, a BN layer and a ReLU activation function;
in the third branch, the third fusion feature is horizontally segmented through a pooling operation with an 8x8 kernel to obtain a plurality of second segmentation blocks;
after each second segmentation block passes through a 1x1 point convolution layer, a plurality of 256-dimensional second segmentation features are obtained;
the second fusion feature is horizontally segmented through a pooling operation with a 12x8 kernel to obtain a plurality of first segmentation blocks;
after each first segmentation block passes through a 1x1 point convolution layer, a plurality of first segmentation features reduced to 256 dimensions are obtained;
the first fusion feature passes through a pooling layer with a 12x4 kernel and a 1x1 point convolution layer to obtain a 256-dimensional end feature.
7. The pedestrian re-identification method according to claim 6, wherein the third feature extracted by the third convolutional layer and the fourth feature extracted by the fourth convolutional layer are weighted by a self-attention mechanism;
the representativeness of the features after the self-attention mechanism measurement is enhanced by utilizing orthogonal regularization.
8. The pedestrian re-identification method according to claim 7, further comprising a loss calculation method of:
after 24x8 pooling and 1x1 point convolution dimension-reduction operations are carried out on the third fusion feature, a third metric feature is obtained;
after 24x8 pooling and 1x1 convolution dimension-reduction operations are carried out on the second fusion feature, a second metric feature is obtained;
after the downstream features of each branch sequentially pass through the batch processing normalization layer and the full connection layer, calculating classification loss through a Softmax loss function;
computing, by a triplet loss function, the loss of the metric features, the metric features comprising: the end feature, the second metric feature and the third metric feature;
the total loss is calculated as:

L_total = (1/m) Σ_{i=1..m} L_softmax^(i) + (λ/n) Σ_{j=1..n} L_triplet^(j) + L_ortho

where L_total is the total loss, L_softmax^(i) is the classification loss of the i-th downstream feature, m is the number of downstream features, L_triplet^(j) is the loss of the j-th metric feature, n is the number of metric features, λ is the weight of the triplet loss, and L_ortho is the penalty term of the orthogonal regularization.
9. A system for implementing the pedestrian re-identification method according to any one of claims 1 to 8, comprising a multi-branch network acquisition module, a training module, a feature recognition module and a re-identification module;
the multi-branch network acquisition module is used for acquiring a multi-branch network based on a residual error neural network;
the training module is used for training a recognition model based on the multi-branch network;
the characteristic identification module is used for carrying out characteristic identification on the pedestrian image and the image to be identified according to the identification model;
and the re-identification module is used for re-identifying the pedestrian according to the similarity of the identification features.
10. The system of claim 9, wherein the recognition model comprises a primary branch, a second branch, a third branch, a self-attention module, an orthogonality module, and a loss calculation module;
the first convolutional layer, the second convolutional layer, the third convolutional layer, the fourth convolutional layer and the fifth convolutional layer are connected in sequence and serve as the main branch;
the fifth convolutional layer with its down-sampling operation deleted serves as a sixth convolutional layer,
and the sixth convolutional layer, connected to the fourth convolutional layer, serves as the second branch;
fusing the extracted features of the third convolutional layer and the fourth convolutional layer, and enabling the obtained third fused feature to enter a third branch;
the third fusion feature is subjected to pooling operation, segmentation and dimension reduction operation in a third branch in sequence to obtain a plurality of second segmentation features;
after the third fusion feature is fused with the feature extracted by the sixth convolutional layer, the obtained second fusion feature enters a second branch;
the second fusion features are subjected to pooling operation, segmentation and dimension reduction operation in the second branch in sequence to obtain a plurality of first segmentation features;
fusing the second fusion feature with the feature extracted from the fifth convolution layer in the main branch to obtain a first fusion feature;
the first fusion feature is in a main branch, and after pooling operation and dimension reduction operation are sequentially carried out, a terminal feature is obtained;
after the terminal features, the first segmentation features and the second segmentation features are fused, identification features of the pedestrian image and the image to be identified are obtained;
the self-attention module is to: optimizing the third and fourth features extracted from the third and fourth convolutional layers based on a self-attention mechanism;
the orthogonal module is used for re-optimizing the optimized characteristics of the self-attention module according to a regularized orthogonalization method;
and the loss calculation module is used for calculating the loss of the recognition model according to the features extracted by each branch.
CN202111641939.2A 2021-12-29 2021-12-29 Multi-branch pedestrian re-identification method and system Active CN114241278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111641939.2A CN114241278B (en) 2021-12-29 2021-12-29 Multi-branch pedestrian re-identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111641939.2A CN114241278B (en) 2021-12-29 2021-12-29 Multi-branch pedestrian re-identification method and system

Publications (2)

Publication Number Publication Date
CN114241278A true CN114241278A (en) 2022-03-25
CN114241278B CN114241278B (en) 2024-05-07

Family

ID=80744360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111641939.2A Active CN114241278B (en) 2021-12-29 2021-12-29 Multi-branch pedestrian re-identification method and system

Country Status (1)

Country Link
CN (1) CN114241278B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294601A (en) * 2022-07-22 2022-11-04 苏州大学 Pedestrian re-identification method based on multi-scale feature dynamic fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738146A (en) * 2019-09-27 2020-01-31 华中科技大学 target re-recognition neural network and construction method and application thereof
CN111178432A (en) * 2019-12-30 2020-05-19 武汉科技大学 Weak supervision fine-grained image classification method of multi-branch neural network model
US20210248751A1 (en) * 2019-01-25 2021-08-12 Tencent Technology (Shenzhen) Company Limited Brain image segmentation method and apparatus, network device, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210248751A1 (en) * 2019-01-25 2021-08-12 Tencent Technology (Shenzhen) Company Limited Brain image segmentation method and apparatus, network device, and storage medium
CN110738146A (en) * 2019-09-27 2020-01-31 华中科技大学 target re-recognition neural network and construction method and application thereof
CN111178432A (en) * 2019-12-30 2020-05-19 武汉科技大学 Weak supervision fine-grained image classification method of multi-branch neural network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
边小勇 et al.: "Weakly supervised fine-grained image classification method based on multi-branch neural network model" (基于多分支神经网络模型的弱监督细粒度图像分类方法), 《计算机应用》 (Journal of Computer Applications), no. 05, 10 May 2020 (2020-05-10) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294601A (en) * 2022-07-22 2022-11-04 苏州大学 Pedestrian re-identification method based on multi-scale feature dynamic fusion
CN115294601B (en) * 2022-07-22 2023-07-11 苏州大学 Pedestrian re-recognition method based on multi-scale feature dynamic fusion

Also Published As

Publication number Publication date
CN114241278B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
CN111639692B (en) Shadow detection method based on attention mechanism
Deng et al. Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
Bulan et al. Segmentation-and annotation-free license plate recognition with deep localization and failure identification
Ci et al. Video object segmentation by learning location-sensitive embeddings
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN114202672A (en) Small target detection method based on attention mechanism
CN113239981B (en) Image classification method of local feature coupling global representation
CN113642634A (en) Shadow detection method based on mixed attention
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
US20220129689A1 (en) Method and apparatus for face recognition robust to alignment status of the face
CN116758130A (en) Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112580480A (en) Hyperspectral remote sensing image classification method and device
CN116309705A (en) Satellite video single-target tracking method and system based on feature interaction
CN114241278B (en) Multi-branch pedestrian re-identification method and system
CN111291651B (en) Multi-task neural network framework for remote sensing scene classification and classification method
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN117373062A (en) Real-time end-to-end cross-resolution pedestrian re-identification method based on joint learning
Zhou et al. Deep low-rank and sparse patch-image network for infrared dim and small target detection
Hou et al. A novel UAV aerial vehicle detection method based on attention mechanism and multi-scale feature cross fusion
CN111126310B (en) Pedestrian gender identification method based on scene migration
Wang et al. Learning to remove reflections for text images
Mutreja et al. Comparative Assessment of Different Deep Learning Models for Aircraft Detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant