CN114299542B - Video pedestrian re-identification method based on multi-scale feature fusion - Google Patents


Info

Publication number
CN114299542B
Authority
CN
China
Prior art keywords
feature
local
features
time sequence
branches
Prior art date
2021-12-29
Legal status
Active
Application number
CN202111635259.XA
Other languages
Chinese (zh)
Other versions
CN114299542A (en)
Inventor
Ai Mingjing (艾明晶)
Liu Penggao (刘鹏高)
Current Assignee
Beihang University
Original Assignee
Beihang University
Application filed by Beihang University
Priority to CN202111635259.XA (filed 2021-12-29)
Publication of CN114299542A
Application granted
Publication of CN114299542B


Abstract

The invention discloses a video pedestrian re-identification method based on multi-scale feature fusion. Aiming at the problem that traditional methods perform poorly when temporally fusing complex appearance features, a video pedestrian re-identification network model based on multi-scale feature fusion is proposed. The model leads three branches out of the end of the backbone network: a global feature branch, a local feature branch and a temporal attention branch, which respectively extract image-level re-identification features of different scales and temporal attention weights. The re-identification feature vectors of different scales are spliced and then fused according to the temporal attention weights; accurate pedestrian re-identification is finally realized through a multi-feature independent training strategy, and structural parameters of the network, such as the number of local features, the size of the local features and the number of Bottleneck blocks, are optimized through comparison experiments. Experiments show that the mAP and rank-1 indexes reach 78.7% and 85.1% respectively on the Mars dataset, which is superior to most existing methods.

Description

Video pedestrian re-identification method based on multi-scale feature fusion
Technical Field
The invention relates to the fields of computer vision and image processing, and in particular to a video pedestrian re-identification method based on multi-scale feature fusion. The discriminability of image-level pedestrian appearance features is improved mainly through the global and local feature branches of the proposed network model; the performance of the model is further improved by optimizing the number and size of the local features and the number of Bottleneck structures. This addresses the difficulty of effectively performing temporal fusion on complex image-level features, so that the resulting video-level pedestrian re-identification features achieve higher re-identification accuracy.
Background
As a biometric technology, pedestrian re-identification (ReID) differs from recognition based on unique traits such as the face, iris or fingerprint: it relies mainly on the appearance of pedestrians and is closely related to appearance attributes such as clothing and posture, which gives it a wider range of application scenarios.
Image-based pedestrian re-identification performs well, but its robustness is poor in practical scenarios such as video surveillance, where false matches occur easily in crowded environments. The information in a single frame is limited, so video-based pedestrian re-identification is of great research significance. A typical video-based pedestrian re-identification system consists of two parts: an image-level feature extractor (e.g., a convolutional neural network) and a modeling approach that aggregates temporal features. The core advantage of this approach is that it considers not only the content of individual frames but also inter-frame information such as motion.
At present, the main research work on video-oriented pedestrian re-identification algorithms concentrates on processing the temporal information in an image sequence. Complex appearance features fuse poorly over time, so most image-level features in use are global features. This results in low discriminability of the re-identification features; optimization should consider introducing local features, and the introduced local features should be combined effectively with the temporal information extraction model to improve algorithm performance.
Early pedestrian re-identification methods obtained a single feature vector from the whole image and focused mainly on global features. As the volume of pedestrian data grows and network structures deepen, the required re-identification performance can only be met by introducing local features. Common local feature extraction schemes include skeleton key point localization, image partitioning and pose correction.
In 2016, Varior et al. proposed a Siamese long short-term memory (LSTM) network for pedestrian re-identification: vertical slices of the picture are fed into the network in sequence, and the resulting feature fuses the local features of the input image strips. The method places relatively high demands on image alignment. (Reference: R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, "A Siamese Long Short-Term Memory Architecture for Human Re-identification," in Proceedings of the European Conference on Computer Vision, Springer, Cham, 2016, pp. 135-153.)
In 2017, Zhao et al. proposed Spindle Net, which locates 14 key points of a pedestrian, estimates the pose from them and segments the human body into 7 regions; local features are extracted on different scales and fused with a global feature extracted from the whole picture. (Reference: H. Zhao, M. Tian, S. Sun et al., "Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 907-915.)
In 2017, Zhang et al. proposed AlignedReID, a local-stripe alignment method that surpasses human-level performance: after uniform partitioning, the local stripes are aligned automatically by computing the shortest path, without supervision or pose estimation. (Reference: X. Zhang, H. Luo, X. Fan et al., "AlignedReID: Surpassing Human-Level Performance in Person Re-Identification," arXiv preprint arXiv:1711.08184, 2017.)
In 2018, Sun et al. proposed the Part-based Convolutional Baseline (PCB) method of uniform partitioning, discussed better inter-block combinations, and further proposed the Refined Part Pooling (RPP) method based on soft partitioning, which uses an attention mechanism to align the individual part partitions. (Reference: Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, "Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 501-518.)
In 2018, Wei et al. proposed the GLAD (Global-Local-Alignment Descriptor) network, which, after extracting key points, divides the pedestrian into three parts (head, upper body and lower body) and extracts local features to assist the global feature. (Reference: L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian, "GLAD: Global-Local-Alignment Descriptor for Scalable Person Re-Identification," IEEE Transactions on Multimedia, 2018, pp. 1-1.)
Notably, these local feature methods were proposed for image-based pedestrian re-identification; transferring and applying them to video pedestrian re-identification is a valuable research direction.
A typical video-based pedestrian re-identification system consists of an image-level feature extractor and a module that aggregates temporal features. Most recent video-based person ReID methods are based on deep neural networks, and research has focused mainly on temporal modeling, i.e., how to aggregate a series of image-level features into a video-level feature. The study by Gao et al. showed that, with the other modules held consistent, temporal attention (TA) weighting is the most accurate aggregation method. (Reference: J. Gao and R. Nevatia, "Revisiting Temporal Modeling for Video-based Person ReID," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.) The invention therefore adopts a temporal-attention modeling framework, introduces local features in the image-level feature extractor, splices the features of different scales, and fuses them according to the temporal attention mechanism, thereby improving the accuracy of video pedestrian re-identification.
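As a minimal sketch of how temporal attention aggregates image-level features into a video-level feature (the layer configuration that produces the scores is described in Step 1 below; all shapes here are illustrative assumptions):

```python
import torch

def temporal_attention_fusion(frame_feats: torch.Tensor,
                              scores: torch.Tensor) -> torch.Tensor:
    """Weighted average of per-frame features using attention scores.

    frame_feats: (T, D) image-level feature vectors for T frames.
    scores:      (T,)   raw per-frame attention scores.
    """
    weights = torch.softmax(scores, dim=0)                  # normalize over time
    return (weights.unsqueeze(1) * frame_feats).sum(dim=0)  # (D,) video feature

# Illustrative usage: a 4-frame clip of 4096-dimensional fused features.
video_feat = temporal_attention_fusion(torch.randn(4, 4096), torch.randn(4))
print(video_feat.shape)  # torch.Size([4096])
```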
Disclosure of Invention
The invention aims to solve the problem that complex image-level appearance features are difficult to fuse effectively over time in the video pedestrian re-identification task. By providing a video pedestrian re-identification network model based on multi-scale feature fusion, and by synchronously extracting image-level pedestrian appearance features of different scales together with temporal attention weights, the video-level pedestrian re-identification features generated by passing the multi-scale features of an image sequence through the temporal module achieve a higher degree of discrimination.
The problem of video pedestrian re-identification is studied here mainly from the viewpoint of the complexity of image-level pedestrian appearance features. Large-scale appearance features attend to the global information of a pedestrian, while small-scale appearance features attend to local information. It is therefore reasonable to infer that effectively organizing features of different scales can provide richer feature information for the re-identification task and improve its accuracy.
Based on the PCB-RPP local feature extraction method, the idea of multi-scale feature fusion was verified experimentally: with ResNet as the backbone network, two branches extract global and local features respectively, and the global and local feature vectors are finally spliced into a multi-scale feature vector. As shown in Table 1, the re-identification accuracy of the multi-scale features on the Market-1501 dataset is better than that of the single-scale global and local features.
TABLE 1 Multi-scale feature fusion validation results
Method                    mAP   Rank-1  Rank-5  Rank-10
ResNet50 global features  77.9  92.1    96.9    97.8
PCB-RPP local features    79.1  91.8    97.1    98.0
Multi-scale features      79.8  92.5    96.9    98.0
Based on the above considerations, the invention provides a video pedestrian re-identification network model based on multi-scale feature fusion, consisting of a shared backbone network and three branches. The backbone network is modified from ResNet, and its end is connected to three branches, namely a global feature branch, a local feature branch and a temporal attention branch, which respectively extract global features, local features and temporal attention weights. The model splices the global and local feature vectors of each frame into a multi-scale image-level feature vector, and finally the multi-scale feature vectors of all frames are fused by weighted averaging according to the temporal attention weights to obtain the video-level pedestrian re-identification vector.
The main content of the invention specifically comprises the following steps:
Step 1: video pedestrian re-identification network design based on multi-scale fusion
The designed video pedestrian re-identification network model based on multi-scale feature fusion is shown in fig. 1, and consists of a shared backbone network and three branches.
On the basis of the ResNet network, the backbone cancels the downsampling operation in the last residual stage, so that the spatial size of the output feature map is doubled, providing more room for partitioning when extracting local features.
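A minimal sketch of this modification, assuming a torchvision ResNet-50 backbone (the text says only "ResNet"): setting the stride of the first block of the last residual stage to 1 cancels its downsampling, doubling each spatial dimension of the output feature map.

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)
backbone.layer4[0].conv2.stride = (1, 1)          # 3x3 conv of the first block
backbone.layer4[0].downsample[0].stride = (1, 1)  # 1x1 conv on the shortcut

trunk = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
print(trunk(torch.randn(1, 3, 256, 128)).shape)
# torch.Size([1, 2048, 16, 8]) instead of the default (1, 2048, 8, 4)
```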
Three branches are led out from the feature map obtained at the end of the backbone network, used respectively to extract global features, local features and temporal information.
On the global feature branch, the feature map passes through one round of convolution, normalization and pooling to generate a 2048-dimensional global feature vector.
On the local feature branch, the feature map is first decoupled by Bottleneck blocks and then softly partitioned by the PCB-RPP algorithm, generating a 2048-dimensional local feature vector in which the two local features occupy 1024 dimensions each.
On the temporal attention branch, the feature map passes through a temporal convolution followed by a spatial convolution to generate one temporal attention score per frame of the input picture sequence, yielding the temporal weights required for temporal fusion; a sketch of such a score head follows.
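A minimal sketch of such a temporal attention branch head; the kernel sizes, hidden width and pooling are assumptions, since the text only specifies a temporal convolution followed by a spatial convolution reducing each frame to one score.

```python
import torch
import torch.nn as nn

class TemporalAttentionHead(nn.Module):
    """Reduce a (T, C, H, W) feature-map sequence to T scalar scores."""

    def __init__(self, in_channels: int = 2048, hidden: int = 256):
        super().__init__()
        # Convolution over the time axis, applied at every spatial location.
        self.temporal_conv = nn.Conv1d(in_channels, hidden,
                                       kernel_size=3, padding=1)
        # Convolution over the spatial axes, collapsing to one channel.
        self.spatial_conv = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t, c, h, w = x.shape
        # (T, C, H, W) -> (H*W, C, T): run the temporal conv per location.
        seq = x.permute(2, 3, 1, 0).reshape(h * w, c, t)
        seq = torch.relu(self.temporal_conv(seq))            # (H*W, hidden, T)
        maps = seq.reshape(h, w, -1, t).permute(3, 2, 0, 1)  # (T, hidden, H, W)
        scores = self.spatial_conv(maps)                     # (T, 1, H, W)
        return scores.mean(dim=(1, 2, 3))                    # (T,)

# Illustrative usage: 4 frames of 2048-channel, 16x8 feature maps.
head = TemporalAttentionHead()
print(head(torch.randn(4, 2048, 16, 8)).shape)  # torch.Size([4])
```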
Furthermore, the head of the local feature branch adds a two-layer Bottleneck structure; Bottleneck is the basic residual building block of ResNet, as shown in fig. 2. Adding this structure at the front end of the local feature branch reduces the coupling between the global and local features; otherwise, with the two branches led directly out of the backbone output at the same point, the network is difficult to converge during training. The Bottleneck structure is chosen because it can go deep enough to decouple the features, while its residual form greatly reduces the computational cost of deepening the network.
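A minimal sketch of the standard ResNet Bottleneck block referred to here (1x1 reduce, 3x3, 1x1 expand, identity shortcut); the channel widths are the usual ResNet defaults and are assumptions in this context.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Standard ResNet bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand + shortcut."""

    def __init__(self, channels: int = 2048, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.block(x))  # identity shortcut

# The local branch head stacks two of these before the PCB-RPP partition.
local_head = nn.Sequential(Bottleneck(), Bottleneck())
```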
The global and local feature vectors of each frame, obtained from the global and local feature branches, are spliced to generate a 4096-dimensional single-frame fused feature; the fused features of the different frames are then combined by a weighted average using the temporal attention scores from the temporal attention branch, giving the final 4096-dimensional video-level pedestrian re-identification feature vector.
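Putting the branch outputs together, a minimal sketch of the splice-and-fuse step with the dimensions stated above (a 2048-dimensional global feature plus two 1024-dimensional local features per frame; the weighted average is the same operation as in the earlier aggregation sketch):

```python
import torch

T = 4                                  # illustrative clip length
global_feats = torch.randn(T, 2048)    # global feature branch output
local_feats = torch.randn(T, 2, 1024)  # two local features per frame
scores = torch.randn(T)                # temporal attention branch output

# Splice: 2048 + 1024 + 1024 = 4096 dimensions per frame.
frame_feats = torch.cat([global_feats, local_feats.flatten(1)], dim=1)

# Fuse over time with softmax-normalized attention weights.
weights = torch.softmax(scores, dim=0)
video_feat = (weights.unsqueeze(1) * frame_feats).sum(dim=0)
print(video_feat.shape)  # torch.Size([4096])
```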
Step 2: multi-feature independent training strategy design
Because the feature vector finally generated by the network model is formed by splicing several feature vectors, the fused vector is split back into its parts, each trained independently, to guarantee the training effect of the multiple features.
(1) Classifier design
In the training stage, a separate classifier is set up for each spliced part of the temporally fused feature vector output by the model; that is, the features of each scale are trained independently, and classifier parameters are not shared. Each classifier is a fully connected layer of the neural network.
(2) Loss function
For the feature of each scale, the training loss function consists of two parts, as shown in equation (1):
Loss_i = Loss_cross-entropy + Loss_triplet   (1)
where Loss_cross-entropy and Loss_triplet denote the cross-entropy loss function and the triplet loss function respectively.
The final loss function is obtained by summing the loss functions of the features of all parts, as shown in equation (2):
Loss = Σ_{i=1}^{N} Loss_i   (2)
where N denotes the number of features before splicing; since the invention uses one global feature and two local features, N is 3.
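A minimal sketch of this multi-feature independent training head under stated assumptions: the 2048/1024/1024 split from Step 1, a hypothetical number of identity classes, PyTorch's built-in triplet loss as a stand-in for the triplet term, and an unspecified pair-mining scheme (the margin and mining strategy are not given in the text).

```python
import torch
import torch.nn as nn

num_classes = 625               # e.g. Mars training identities (assumption)
part_dims = [2048, 1024, 1024]  # one global + two local features, N = 3

# One independent fully connected classifier per spliced part.
classifiers = nn.ModuleList(nn.Linear(d, num_classes) for d in part_dims)
ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.3)  # margin value is an assumption

def multi_feature_loss(feats, pos_feats, neg_feats, labels):
    """Sum Loss_i = Loss_cross-entropy + Loss_triplet over the N parts
    (equations (1) and (2)); feats/pos_feats/neg_feats are (B, 4096)
    fused features for anchor, positive and negative samples."""
    total = feats.new_zeros(())
    for part, pos, neg, fc in zip(torch.split(feats, part_dims, dim=1),
                                  torch.split(pos_feats, part_dims, dim=1),
                                  torch.split(neg_feats, part_dims, dim=1),
                                  classifiers):
        total = total + ce(fc(part), labels) + triplet(part, pos, neg)
    return total

# Illustrative usage with a batch of 8 clips:
f, p, n = (torch.randn(8, 4096) for _ in range(3))
print(multi_feature_loss(f, p, n, torch.randint(0, num_classes, (8,))))
```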
(3) Training method
Because the local branch partitions features according to the PCB-RPP idea, training of the model is divided into two stages. In the first stage, the local feature branch uses hard partitioning, uniformly dividing the feature map into an upper and a lower local feature; the second stage is trained on the basis of a converged first stage, that is, a classifier replaces the uniform division of the first stage and assigns each point on the feature map to each local feature in the form of probabilities (a sketch of both partitioning modes follows below).
Furthermore, in both training phases, all parameters of the network model participate in the iteration.
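A minimal sketch of the two partitioning modes under assumed shapes: stage one pools the upper and lower halves of the feature map (hard division), and stage two replaces this with a 1x1-convolution part classifier whose softmax assigns every spatial location to each part by probability, following the RPP idea.

```python
import torch
import torch.nn as nn

def hard_partition(fmap: torch.Tensor) -> torch.Tensor:
    """Stage 1: uniformly split (B, C, H, W) into upper/lower part vectors."""
    upper, lower = fmap.chunk(2, dim=2)  # split along the height axis
    return torch.stack([upper.mean(dim=(2, 3)), lower.mean(dim=(2, 3))], dim=1)

class SoftPartition(nn.Module):
    """Stage 2 (RPP): assign each location to the parts by probability."""

    def __init__(self, channels: int = 2048, parts: int = 2):
        super().__init__()
        self.part_classifier = nn.Conv2d(channels, parts, kernel_size=1)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.part_classifier(fmap), dim=1)  # (B, P, H, W)
        # Probability-weighted average pooling per part -> (B, P, C).
        weighted = torch.einsum('bphw,bchw->bpc', probs, fmap)
        area = probs.sum(dim=(2, 3)).clamp_min(1e-6)              # (B, P)
        return weighted / area.unsqueeze(-1)

fmap = torch.randn(8, 2048, 16, 8)
print(hard_partition(fmap).shape)   # torch.Size([8, 2, 2048])
print(SoftPartition()(fmap).shape)  # torch.Size([8, 2, 2048])
```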
Step 3: network model structural parameter optimization
Comparison experiments are carried out on the influence of three parameters, namely the number of local features, the size of the local features and the number of Bottleneck blocks, on model performance, with training and testing on the Mars dataset.
Experimental optimization proceeds in the order of the number of local features, the size of local features and the number of Bottleneck blocks; after each parameter is optimized, its result is fixed before entering the optimization experiment for the next parameter.
Drawings
FIG. 1 is a diagram of the video pedestrian re-identification network model based on multi-scale feature fusion.
Fig. 2 is a schematic diagram of the Bottleneck structure.
Fig. 3 shows examples of the dataset samples used in the experiments.
Fig. 4 shows visualization heat maps of feature extraction on an image sequence.
Detailed Description
The technical scheme, experimental method and test result of the present invention will be described in further detail with reference to the accompanying drawings and specific experimental embodiments.
The experimental procedure is specifically described below.
Step one: construct the three-branch convolutional neural network, input the training set samples into the network for training, observe the training behavior, and iterate continuously to obtain a trained model.
Step two: test according to the training result: for each query image sequence, search the gallery for pedestrian image sequences with the same id to form a result sequence, and calculate the corresponding evaluation indexes.
Step three: carry out comparison experiments on the network structure parameters according to the evaluation indexes and determine the optimal network structure parameters.
The experimental conditions and conclusions drawn are described in detail below.
3.1 Pedestrian re-identification datasets and evaluation indexes
The test datasets and evaluation indexes used in the ReID experiments are described next. As shown in FIG. 3, the proposed method was tested on two large public datasets, Market-1501 and Mars. Market-1501 contains 1501 pedestrians captured by 6 cameras and 32668 detected pedestrian bounding boxes; the training set contains 751 people with 12,936 images, an average of 17.2 training images per person, and the test set contains 750 people with 19732 images, an average of 26.3 test images per person. Mars is the largest video-based ReID dataset; its training set contains 8298 tracklets of 625 pedestrians with 509914 pictures, and its test set contains 12180 tracklets of 636 pedestrians with 681089 pictures.
In the pedestrian re-identification task, the test procedure generally gives a set of images to be queried (for video ReID, a set of image sequences), computes the similarity to the images in the candidate set (gallery) according to the model, and arranges them in descending order of similarity, so that earlier entries are closer to the query. To evaluate pedestrian re-identification algorithms, the practice is to compute the corresponding indexes on public datasets and compare them with other models. The CMC curve (Cumulative Matching Characteristics) and mAP (mean Average Precision) are the two most commonly used evaluation criteria.
The experiments mainly use the most common rank-1 and rank-5 points of the CMC curve together with mAP. rank-k is the probability that a correct result appears among the top k (highest-confidence) images of the retrieval list; mAP reflects an average level over all queries: the higher the mAP, the earlier the results with the same identity as the query appear in the whole ranking list, and the better the model.
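A minimal sketch of both indexes for a single query; real evaluation protocols also filter same-camera and junk gallery entries, which is omitted here as an assumption.

```python
import numpy as np

def rank_k_and_ap(similarities: np.ndarray, matches: np.ndarray, k: int = 1):
    """Compute the rank-k hit and Average Precision for one query.

    similarities: (G,) query-to-gallery similarity scores.
    matches:      (G,) 1 where the gallery item shares the query identity.
    """
    order = np.argsort(-similarities)     # descending similarity
    hits = matches[order]
    rank_k = float(hits[:k].any())        # correct result in the top k?
    # AP: precision averaged at each correct retrieval position.
    positions = np.flatnonzero(hits) + 1  # 1-based ranks of true matches
    precisions = np.arange(1, len(positions) + 1) / positions
    ap = precisions.mean() if len(positions) else 0.0
    return rank_k, ap

# mAP is the mean of AP over all queries; rank-k is averaged the same way.
sims = np.array([0.9, 0.2, 0.75, 0.4])
good = np.array([0, 0, 1, 1])
print(rank_k_and_ap(sims, good, k=1))  # (0.0, ...): the top hit is a non-match
```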
3.2 Main parameter configuration of the ReID experiments
The specific training parameters are as follows:
The learning rate decay strategy uses the lr_scheduler.StepLR function with an initial learning rate of 0.0003, decaying to one tenth every 100 training epochs; the video clip sequence length is set to 4, sampled randomly from the dataset; the batch size is set to 32; and the PCB stage and the RPP stage are each trained for 400 epochs.
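A minimal sketch of this configuration (the optimizer type is not stated in the text, so Adam is assumed; the model is a stand-in):

```python
import torch

model = torch.nn.Linear(4096, 625)  # stand-in for the full re-ID network
optimizer = torch.optim.Adam(model.parameters(), lr=0.0003)
# Decay the learning rate to one tenth every 100 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

SEQ_LEN = 4             # frames randomly sampled per video clip
BATCH_SIZE = 32
EPOCHS_PER_STAGE = 400  # PCB stage first, then the RPP stage

for epoch in range(EPOCHS_PER_STAGE):
    # ... one pass over the training loader goes here ...
    scheduler.step()
```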
3.3 Re-identification network experimental results
Based on the above evaluation indexes and experimental details, tests were performed on the Mars dataset, giving the comparison results for each parameter below.
(1) Number of local features
Other parameters in the experiment were configured as follows: the global feature vector size is 2048 and the local feature vector size is 2048.
As shown in Table 2, two local features perform best. As the number of local features increases, the feature scale shrinks; for fine-grained local features, a person's limbs change greatly while walking, and temporal weighted fusion then blurs the local information.
TABLE 2 influence of local feature quantity on Performance
Number  mAP   Rank-1  Rank-5
2       75.0  82.0    93.8
3       73.4  81.1    92.9
4       71.3  79.1    92.2
(2) Local feature size
Other parameters in the experiment were configured as follows: the number of local features is 2, the number of Bottleneck blocks is 1, and the length of the global feature vector is 2048.
The test results are shown in Table 3: performance improves markedly after the local feature size is halved, indicating that the global feature has the greater influence on re-identification performance.
TABLE 3 impact of local feature size on performance
Size   mAP   Rank-1  Rank-5
2048   75.0  82.0    93.8
1024   77.7  83.8    94.3
(3) Number of Bottleneck blocks
Other parameters in the experiment were configured as follows: the number of local features is 2, the global feature vector size is 2048, and the local feature vector size is 1024.
As shown in Table 4, adding Bottleneck structures at the front end of the local feature branch reduces the coupling between the global and local features; the two-layer Bottleneck structure performs best, while a three-layer structure makes the network difficult to converge.
TABLE 4 Influence of the number of Bottleneck blocks on performance
Number  mAP   Rank-1  Rank-5
0       77.1  82.7    93.8
1       77.7  83.8    94.3
2       78.7  85.1    94.6
3       74.1  81.3    93.3
In summary, the model of the present invention performs best using two halved local features and a two-layer Bottleneck structure.
3.4 Feature extraction visualization
To observe whether the global and local feature branches extract features of different scales from an image sequence as designed, the Class Activation Mapping (CAM) algorithm (Reference: B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba, "Learning Deep Features for Discriminative Localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921-2929.) is used to visualize the image regions to which each feature is sensitive. As shown in fig. 4, the global feature extracts features from the whole body of the person, while the two local features focus on the head and leg regions respectively, indicating that the features of each scale effectively capture different positions and granularities of the pedestrian.
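A minimal sketch of CAM for one branch's feature map, assuming global-average-pooled features feeding a linear classifier as in the cited paper; all shapes are illustrative.

```python
import numpy as np

def class_activation_map(fmap: np.ndarray, fc_weights: np.ndarray,
                         class_idx: int) -> np.ndarray:
    """CAM heat map: weight each channel of the (C, H, W) feature map by the
    classifier weight of the target class and sum over channels."""
    cam = np.tensordot(fc_weights[class_idx], fmap, axes=1)  # (H, W)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)  # normalize to [0, 1] for display

fmap = np.random.rand(2048, 16, 8)   # e.g. one branch's feature map
weights = np.random.rand(625, 2048)  # classifier weights (625 identities)
heat = class_activation_map(fmap, weights, class_idx=3)
print(heat.shape)  # (16, 8), upsampled onto the frame for visualization
```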
3.5 Comparison with other methods
Without tricks such as re-ranking, the proposed method reaches a competitive level compared with mainstream methods on the Mars dataset; as shown in Table 5, the mAP and Rank-1 indexes improve by 3.3% and 3.4% respectively over the baseline.
Table 5 results of comparison with other methods
In summary, the invention provides a video pedestrian re-identification model based on multi-scale feature fusion. Its temporal information processing module adopts temporal-attention feature aggregation, so the single-frame feature extraction module can effectively extract multi-scale fused features adapted to it; the multi-scale features improve discriminability while cooperating effectively with the temporal module. In addition, comparison experiments on the number and size of the local features yield the optimal local feature parameters under the algorithm framework of the invention, and the connection structure between the backbone network and the feature extraction branches of different scales is optimized, reducing the coupling of the branches' dependence on the backbone. Finally, testing shows that the invention significantly outperforms the baseline and reaches a competitive level.

Claims (1)

1. A video pedestrian re-identification method based on multi-scale feature fusion, characterized by comprising the following steps:
Aiming at the problem that traditional methods perform poorly when temporally fusing complex appearance features, a video pedestrian re-identification network model based on multi-scale feature fusion is provided: three branches are led out from the end of a backbone network to respectively extract image-level re-identification features of different scales and temporal attention weights; the re-identification feature vectors of different scales are spliced and fused according to the temporal attention weights; accurate pedestrian re-identification is finally realized through a multi-feature independent training strategy, and the structural parameters of the network are optimized through comparison experiments;
The method specifically comprises the following steps:
Step 1, video pedestrian re-identification network design based on multi-scale fusion
The designed video pedestrian re-identification network model based on multi-scale feature fusion consists of a shared backbone network and three branches, wherein the three branches are a global feature branch, a local feature branch and a temporal attention branch;
the shared backbone network cancels the downsampling operation in the last residual stage on the basis of the ResNet network, so that the size of the output feature map is doubled, thereby providing more room for partitioning when extracting local features;
three branches are led out from the feature map obtained at the end of the backbone network, used respectively to extract global features, local features and temporal information; on the global feature branch, the feature map passes through convolution, normalization and pooling to generate a 2048-dimensional global feature vector; on the local feature branch, after Bottleneck decoupling, the feature map is softly partitioned by the PCB-RPP algorithm (Part-based Convolutional Baseline with Refined Part Pooling) to generate a 2048-dimensional local feature vector, in which the two local features occupy 1024 dimensions each; on the temporal attention branch, the feature map passes through a temporal convolution and a spatial convolution in sequence to generate one temporal attention score per frame of the input picture sequence, yielding the temporal weights required for temporal fusion;
the global and local feature vectors of each frame, obtained from the global and local feature branches, are spliced to generate a 4096-dimensional single-frame fused feature; a weighted average is then carried out over the temporal attention scores of the different frames obtained from the temporal attention branch to obtain the final 4096-dimensional video-level pedestrian re-identification feature vector;
Step 2, multi-feature independent training strategy design
Because the feature vector finally generated by the network model is formed by splicing several feature vectors, the fused feature vector is split into its parts, each trained independently, to guarantee the training effect of the multiple features;
classifier design: in the training stage, a separate classifier is set for each spliced part of the temporally fused feature vector output by the model, that is, the features of each scale are trained independently and classifier parameters are not shared; each classifier is a fully connected layer of the neural network;
loss function: for the feature of each scale, the training loss function consists of two parts, as shown in equation (1):
Loss_i = Loss_cross-entropy + Loss_triplet   (1)
where Loss_cross-entropy and Loss_triplet denote the cross-entropy loss function and the triplet loss function respectively;
the final loss function is obtained by summing the loss functions of the features of all parts, as shown in equation (2):
Loss = Σ_{i=1}^{N} Loss_i   (2)
where N denotes the number of features before splicing; since the method uses one global feature and two local features, N is 3;
the training method: because the local branch partitions features according to the PCB-RPP idea, training of the model is divided into two stages; in the first stage, the local feature branch uses hard partitioning, uniformly dividing the feature map into upper and lower local features; the second stage is trained on the basis of a converged first stage, that is, a classifier replaces the uniform division of the first stage and assigns each point on the feature map to each local feature in the form of probabilities;
in addition, in the two training phases, all parameters of the network model participate in iteration;
Step 3, network model structural parameter optimization
performing comparison experiments on the influence of three parameters, namely the number of local features, the size of the local features and the number of Bottleneck blocks, on model performance, with training and testing on the Mars dataset;
experimental optimization proceeds in the order of the number of local features, the size of local features and the number of Bottleneck blocks; after each parameter is optimized, its result is kept fixed when entering the comparison experiment for the next parameter.
CN202111635259.XA 2021-12-29 Video pedestrian re-identification method based on multi-scale feature fusion Active CN114299542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111635259.XA CN114299542B (en) 2021-12-29 Video pedestrian re-identification method based on multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111635259.XA CN114299542B (en) 2021-12-29 Video pedestrian re-identification method based on multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN114299542A (en) 2022-04-08
CN114299542B (en) 2024-07-05


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101150A (en) * 2020-09-01 2020-12-18 北京航空航天大学 Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN112163498A (en) * 2020-09-23 2021-01-01 华中科技大学 Foreground guiding and texture focusing pedestrian re-identification model establishing method and application thereof


Similar Documents

Publication Publication Date Title
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN108764065B (en) Pedestrian re-recognition feature fusion aided learning method
CN110598543B (en) Model training method based on attribute mining and reasoning and pedestrian re-identification method
CN109472191A (en) A kind of pedestrian based on space-time context identifies again and method for tracing
CN111709331B (en) Pedestrian re-recognition method based on multi-granularity information interaction model
Shen et al. Person re-identification with deep kronecker-product matching and group-shuffling random walk
Jiang et al. Rethinking temporal fusion for video-based person re-identification on semantic and time aspect
CN112070010A (en) Pedestrian re-recognition method combining multi-loss dynamic training strategy to enhance local feature learning
CN116030495A (en) Low-resolution pedestrian re-identification algorithm based on multiplying power learning
Liu et al. Co-saliency spatio-temporal interaction network for person re-identification in videos
Pang et al. Analysis of computer vision applied in martial arts
CN117351518B (en) Method and system for identifying unsupervised cross-modal pedestrian based on level difference
Tian et al. Self-regulation feature network for person reidentification
Luo et al. Spatial constraint multiple granularity attention network for clothesretrieval
CN114299542B (en) Video pedestrian re-identification method based on multi-scale feature fusion
CN117333908A (en) Cross-modal pedestrian re-recognition method based on attitude feature alignment
CN115830643A (en) Light-weight pedestrian re-identification method for posture-guided alignment
CN117115850A (en) Lightweight pedestrian re-identification method based on off-line distillation
CN114821632A (en) Method for re-identifying blocked pedestrians
CN112016661B (en) Pedestrian re-identification method based on erasure significance region
CN114299542A (en) Video pedestrian re-identification method based on multi-scale feature fusion
CN111428675A (en) Pedestrian re-recognition method integrated with pedestrian posture features
Wang et al. Cross-modal local shortest path and global enhancement for visible-thermal person re-identification
CN113051962A (en) Pedestrian re-identification method based on twin Margin-Softmax network combined attention machine
Liu et al. Multi-Scale Feature Fusion Network for Video-Based Person Re-Identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant