CN114299542A - Video pedestrian re-identification method based on multi-scale feature fusion - Google Patents

Video pedestrian re-identification method based on multi-scale feature fusion

Info

Publication number
CN114299542A
CN114299542A, CN202111635259.XA, CN114299542B
Authority
CN
China
Prior art keywords
feature
local
time sequence
branch
features
Prior art date
Legal status
Granted
Application number
CN202111635259.XA
Other languages
Chinese (zh)
Other versions
CN114299542B (en)
Inventor
艾明晶
刘鹏高
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111635259.XA priority Critical patent/CN114299542B/en
Priority claimed from CN202111635259.XA external-priority patent/CN114299542B/en
Publication of CN114299542A publication Critical patent/CN114299542A/en
Application granted granted Critical
Publication of CN114299542B publication Critical patent/CN114299542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video pedestrian re-identification method based on multi-scale feature fusion, and proposes a video pedestrian re-identification network model based on multi-scale feature fusion to address the poor performance of traditional methods when temporally fusing complex appearance features. The model leads out three branches at the end of the backbone network: a global feature branch, a local feature branch and a temporal attention branch, which respectively extract image-level re-identification features of different scales and temporal attention weights. The re-identification feature vectors of different scales are concatenated and then fused according to the temporal attention weights; accurate pedestrian re-identification is achieved through a multi-feature independent training strategy, and structural parameters of the network, such as the number of local features, the local feature size and the number of Bottleneck blocks, are optimized through comparative experiments. Experiments show that the mAP and rank-1 indices of the invention reach 78.7% and 85.1% respectively on the Mars dataset, which is superior to most existing methods.

Description

Video pedestrian re-identification method based on multi-scale feature fusion
Technical Field
The invention relates to the fields of computer vision and image processing, and in particular to a video pedestrian re-identification method based on multi-scale feature fusion. The global feature branch and the local feature branch of the proposed network model mainly improve the discriminability of the image-level pedestrian appearance features, and the performance of the network model is further improved by optimizing the number and size of the local features and the number of Bottleneck structures. This addresses the difficulty of performing effective temporal fusion on complex image-level features, so that the resulting video-level pedestrian re-identification features achieve higher re-identification accuracy.
Background
As a biometric recognition technology, pedestrian re-identification (ReID) differs from unique identifiers such as the face, iris and fingerprint: it relies mainly on pedestrian appearance features, which are closely related to clothing, posture and similar characteristics, and it therefore has broader application prospects.
Image-based pedestrian re-identification performs well, but its robustness is poor in practical scenarios such as video surveillance, and mismatches occur easily when the pedestrian-flow environment is complex. The information in a single frame is generally limited, so video-based pedestrian re-identification has important research significance. A typical video-based pedestrian re-identification system consists of two parts: an image-level feature extractor (such as a convolutional neural network) and a modeling method that aggregates the temporal features. The central advantage of this approach is that it considers not only the content of a single frame but also the motion information between frames.
At present, research on video pedestrian re-identification algorithms focuses mainly on processing the temporal information in the image sequence. Because complex appearance features fuse poorly over time, most methods use only global image-level features, which leaves the re-identification features with low discriminability. Introducing local features, and combining them effectively with the temporal information extraction model, can therefore improve algorithm performance.
Early pedestrian re-identification methods obtained an identification feature vector from the whole image and focused mainly on global features. With larger pedestrian datasets and deeper network structures, local features must be introduced to meet the performance requirements of pedestrian re-identification; common local feature extraction approaches include skeleton key-point localization, image partitioning and pose correction.
In 2016, Varior et al. proposed a Siamese Long Short-Term Memory network for pedestrian re-identification: pictures are cut vertically and fed into the network, and the resulting features fuse the local features of the input image strips; the method requires a high degree of image alignment (see R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, "A Siamese Long Short-Term Memory Architecture for Human Re-identification," in Proceedings of the European Conference on Computer Vision. Springer, Cham, 2016, pp. 135-153).
In 2017, Zhao et al. proposed Spindle Net: 14 key points of a pedestrian are located, the pose is estimated from them and the human body is segmented into 7 regions; local features of different scales are extracted from these regions, a global feature is extracted from the whole picture, and the two are fused (see H. Zhao, M. Tian, S. Sun et al., "Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 907-915).
In 2017, Zhang et al. proposed AlignedReID, a local-stripe alignment method that surpasses human-level performance: after uniform partitioning, the parts are aligned automatically by computing the shortest path between local stripes, without supervision or pose estimation (see X. Zhang, H. Luo, X. Fan et al., "AlignedReID: Surpassing Human-Level Performance in Person Re-Identification," arXiv preprint arXiv:1711.08184, 2017).
In 2018, Sun et al. proposed the uniformly partitioned Part-based Convolutional Baseline (PCB) method, discussed better ways of combining the blocks, and further proposed the soft-partitioning Refined Part Pooling (RPP) method, which aligns each local block with an attention mechanism (see Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, "Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 501-518).
In 2018, Wei et al. proposed the Global-Local-Alignment Descriptor (GLAD) network, which, after extracting key points, divides the pedestrian into three parts (head, upper body and lower body), extracts local features from them and uses these to assist the global feature (see L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian, "GLAD: Global-Local-Alignment Descriptor for Scalable Person Re-Identification," IEEE Transactions on Multimedia, 2018, pp. 1-1).
It should be noted that these local-feature methods were proposed for image-based pedestrian re-identification; adapting and applying them to the video pedestrian re-identification problem is a valuable research direction.
A typical video-based pedestrian re-identification system consists of an image-level feature extractor and a module that aggregates temporal features. Most recent video-based person ReID methods are based on deep neural networks, and research has focused mainly on the temporal modeling part, i.e. how to aggregate a series of image-level features into a video-level feature. Published comparative studies show that, with the other modules held fixed, the temporal attention (TA) weighting method achieves the highest accuracy. The invention therefore adopts a temporal-attention temporal modeling framework, introduces local features into the image-level feature extractor, concatenates the features of different scales, and fuses them according to the temporal attention mechanism, thereby improving the accuracy of video pedestrian re-identification.
Disclosure of Invention
The invention aims to solve the difficulty of performing effective temporal fusion of complex image-level appearance features in the video pedestrian re-identification task. By proposing a video pedestrian re-identification network model based on multi-scale feature fusion, image-level pedestrian appearance features of different scales and temporal attention weights are extracted synchronously, so that the video-level re-identification features generated after the multi-scale features of an image sequence pass through the temporal module have higher discriminability.
The invention studies the video pedestrian re-identification problem mainly from the perspective of the complexity of the image-level appearance features. Large-scale appearance features attend to the global information of a pedestrian, while small-scale appearance features attend to local information, so it is reasonable to infer that effectively organizing features of different scales provides richer feature information for the re-identification task and thus improves re-identification accuracy.
Based on the PCB-RPP local feature extraction method, the idea of multi-scale feature fusion was first verified experimentally: with ResNet50 as the backbone network, two branches extract global and local features respectively, and the global and local feature vectors are finally concatenated into a multi-scale feature vector. As shown in Table 1, the re-identification accuracy of the multi-scale features on the Market-1501 dataset is better than that of the single-scale global and local features.
TABLE 1 Multi-scale feature fusion verification results (Market-1501)
Feature                    mAP    Rank-1  Rank-5  Rank-10
ResNet50 global features   77.9   92.1    96.9    97.8
PCB-RPP local features     79.1   91.8    97.1    98.0
Multi-scale features       79.8   92.5    96.9    98.0
Based on these considerations, the invention proposes a video pedestrian re-identification network model based on multi-scale feature fusion, consisting of a shared backbone network and three branches. The backbone network is modified from ResNet50, and its end is connected to three branches (a global feature branch, a local feature branch and a temporal attention branch) that respectively extract global features, local features and temporal attention weights. The model concatenates the global and local feature vectors of each frame to obtain a multi-scale image-level feature vector, and finally the multi-scale feature vectors of the frames are weighted and fused according to the temporal attention weights to obtain the video-level pedestrian re-identification vector.
The main content of the invention specifically comprises the following steps:
step 1: video pedestrian re-identification network design based on multi-scale fusion
The designed video pedestrian re-identification network model based on multi-scale feature fusion is shown in fig. 1 and consists of a shared backbone network and three branches.
On the basis of the Resnet50 network, the backbone network removes the down-sampling operation in the last residual stage, so that the output feature map is twice its original size, providing more room for dividing local features.
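As an illustration only (the patent provides no source code), the following minimal sketch shows one way such a modification can be made to a torchvision ResNet-50; the input resolution and layer names are assumptions of the sketch, not part of the invention.

# Sketch: remove the stride-2 down-sampling in the last residual stage (layer4) of a
# torchvision ResNet-50 (torchvision >= 0.13), so a 256x128 crop yields a 16x8 feature
# map instead of 8x4, leaving more room for part division.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)
backbone.layer4[0].conv2.stride = (1, 1)            # main path of the first block in layer4
backbone.layer4[0].downsample[0].stride = (1, 1)    # matching shortcut convolution

stem = torch.nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc
x = torch.randn(1, 3, 256, 128)                     # one pedestrian crop
print(stem(x).shape)                                # torch.Size([1, 2048, 16, 8])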
Three branches are led out from the feature map obtained at the end of the backbone network and are used respectively to extract global features, local features and temporal information.
On the global feature branch, the feature map passes through one convolution, normalization and pooling operation to generate a group of 2048-dimensional global feature vectors.
On the local feature branch, the feature map is first decoupled by Bottleneck blocks and then soft-divided by the PCB-RPP algorithm to generate a group of 2048-dimensional local feature vectors, in which the two local features occupy 1024 dimensions each.
On the temporal attention branch, the feature map sequentially passes through a temporal convolution and a spatial convolution to generate temporal attention scores over the length of the input picture sequence, giving the temporal weights required for temporal fusion.
In addition, the head of the local branch adds two Bottleneck layers, which are the basic residual structure of ResNet50, as shown in fig. 2. Adding this structure at the front of the local feature branch reduces the coupling between the global and local features; if the two branches were both led directly from the output of the backbone network, the network would be difficult to converge during training. The Bottleneck structure is chosen because, with sufficient depth, it removes the coupling between the features, while its residual form greatly reduces the extra computation caused by deepening the network.
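For reference, a self-contained sketch of the standard ResNet-style Bottleneck block referred to in fig. 2 is given below (1x1, 3x3 and 1x1 convolutions with a skip connection); the channel widths are assumptions chosen to match the 2048-channel feature map, and the exact configuration used in the model is not specified by the text.

# Sketch of a ResNet-style Bottleneck residual block and the two-layer head that
# decouples the local branch from the shared backbone feature map (assumed sizes).
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels=2048, mid=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # the residual connection keeps added depth cheap

local_head = nn.Sequential(Bottleneck(), Bottleneck())   # two stacked Bottleneck layers
print(local_head(torch.randn(2, 2048, 16, 8)).shape)     # torch.Size([2, 2048, 16, 8])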
The global and local feature vectors of each frame obtained from the global feature branch and the local feature branch are concatenated to generate a 4096-dimensional single-frame fusion feature; a weighted average is then taken according to the temporal attention scores of the different frames obtained from the temporal attention branch, giving the final 4096-dimensional video-level pedestrian re-identification feature vector.
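To make the data flow of Step 1 concrete, the following sketch assembles the three branches for a clip of T frames; all layer sizes, the hard horizontal split used here in place of the RPP soft division, and the order and kernel sizes of the convolutions in the attention branch are assumptions of this sketch rather than the exact claimed configuration.

# Sketch: per-frame 2048-d global and two 1024-d local vectors are concatenated into a
# 4096-d frame feature and averaged with temporal-attention weights (assumed shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionHead(nn.Module):
    def __init__(self, in_ch=2048, parts=2):
        super().__init__()
        self.parts = parts
        # Global branch: 1x1 convolution + BN, then global pooling -> 2048-d.
        self.global_conv = nn.Sequential(nn.Conv2d(in_ch, 2048, 1, bias=False),
                                         nn.BatchNorm2d(2048), nn.ReLU(inplace=True))
        # Local branch: per-part embedding down to 1024-d (hard split shown for brevity).
        self.local_embed = nn.ModuleList([nn.Linear(in_ch, 1024) for _ in range(parts)])
        # Temporal attention branch: a spatial conv collapses the assumed 16x8 map,
        # then a temporal conv scores the frames of the clip.
        self.spatial_conv = nn.Conv2d(in_ch, 256, kernel_size=(16, 8))
        self.temporal_conv = nn.Conv1d(256, 1, kernel_size=3, padding=1)

    def forward(self, fmap):                                   # fmap: (B, T, C, H, W)
        B, T, C, H, W = fmap.shape
        flat = fmap.view(B * T, C, H, W)
        g = F.adaptive_avg_pool2d(self.global_conv(flat), 1).flatten(1)       # (B*T, 2048)
        stripes = flat.chunk(self.parts, dim=2)                # upper / lower stripes
        locs = [emb(F.adaptive_avg_pool2d(s, 1).flatten(1))
                for emb, s in zip(self.local_embed, stripes)]  # each (B*T, 1024)
        frame_feat = torch.cat([g] + locs, dim=1).view(B, T, -1)              # (B, T, 4096)
        s = self.spatial_conv(flat).flatten(1).view(B, T, -1).transpose(1, 2) # (B, 256, T)
        att = torch.softmax(self.temporal_conv(s).squeeze(1), dim=1)          # (B, T) weights
        return (frame_feat * att.unsqueeze(-1)).sum(dim=1), frame_feat, att   # (B, 4096)

head = MultiScaleFusionHead()
video_feat, _, _ = head(torch.randn(2, 4, 2048, 16, 8))        # 2 clips of 4 frames
print(video_feat.shape)                                        # torch.Size([2, 4096])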
Step 2: multi-feature independent training strategy design
Because the feature vector finally generated by the network model is formed by concatenating and fusing several feature vectors, the fused feature vector is split and each part is trained independently in order to guarantee the training effect of the multiple features.
(1) Classifier design
In the training stage, a separate classifier is set for each concatenated part of the temporally fused feature vector output by the model, i.e. the features of each scale are trained independently and the classifier parameters are not shared. Each classifier is a fully connected layer of the neural network.
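A minimal sketch of such unshared per-scale classifiers is shown below; the identity count of 625 (the Mars training set) and the 2048/1024/1024 segment split follow the text, while the variable names are illustrative.

# Sketch: one independent fully connected classifier per concatenated feature segment.
import torch
import torch.nn as nn

num_ids = 625                        # identities in the Mars training set
segment_dims = [2048, 1024, 1024]    # global feature + two local features

classifiers = nn.ModuleList([nn.Linear(d, num_ids) for d in segment_dims])

video_feat = torch.randn(32, 4096)                             # fused video-level features
segments = torch.split(video_feat, segment_dims, dim=1)
logits = [clf(seg) for clf, seg in zip(classifiers, segments)] # three unshared predictions
print([tuple(l.shape) for l in logits])                        # [(32, 625), (32, 625), (32, 625)]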
(2) Loss function
For the features of each scale, the loss function used for training consists of two parts, as shown in equation (1).
Loss_i = Loss_cross-entropy + Loss_triplet    (1)
where Loss_cross-entropy and Loss_triplet denote the cross-entropy loss function and the triplet loss function respectively.
The final loss function is obtained by summing the loss functions of all feature parts, as shown in equation (2).
Loss = Σ_{i=1}^{N} Loss_i    (2)
where N denotes the number of features before concatenation; the invention uses one global feature and two local features, so N = 3.
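The following sketch computes this objective for the N = 3 segments; the triplet margin and the use of pre-selected positive and negative clips (rather than a particular mining strategy) are assumptions, since the text does not specify them.

# Sketch of equations (1) and (2): per-segment cross-entropy plus triplet loss, summed.
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.3)          # margin is an assumed value

def total_loss(logits_list, anchors, positives, negatives, labels):
    """Per-segment logits and embeddings for anchor clips, plus embeddings of
    positive/negative clips; returns the sum over the N segments (equation (2))."""
    loss = torch.zeros(())
    for logits, a, p, n in zip(logits_list, anchors, positives, negatives):
        loss = loss + ce(logits, labels) + triplet(a, p, n)    # equation (1) for segment i
    return loss

dims = (2048, 1024, 1024)
labels = torch.randint(0, 625, (32,))
anchors   = [torch.randn(32, d) for d in dims]
positives = [torch.randn(32, d) for d in dims]
negatives = [torch.randn(32, d) for d in dims]
logits    = [torch.randn(32, 625) for _ in dims]
print(total_loss(logits, anchors, positives, negatives, labels))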
(3) Training method
Because the local branch divides features according to the PCB-RPP idea, training of the model is divided into two stages: in the first stage, the local feature branch uniformly divides the feature map into an upper and a lower local feature by hard division; the second stage is trained on the basis of the converged first stage, i.e. a classifier replaces the uniform division of the first stage and each point on the feature map is assigned to the local features in a probabilistic manner.
Furthermore, in both training phases, all parameters of the network model participate in the iteration.
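As an illustration of the second-stage soft division, the sketch below scores every spatial position against each part with a 1x1 convolution and pools the feature map with the resulting probabilities; the exact layer form of the RPP-style part classifier is an assumption here.

# Sketch: probabilistic (soft) assignment of feature-map positions to local parts.
import torch
import torch.nn as nn

class SoftPartPooling(nn.Module):
    def __init__(self, channels=2048, parts=2):
        super().__init__()
        self.part_classifier = nn.Conv2d(channels, parts, kernel_size=1)   # scores per position

    def forward(self, fmap):                                    # fmap: (B, C, H, W)
        B, C, H, W = fmap.shape
        prob = torch.softmax(self.part_classifier(fmap), dim=1) # (B, parts, H, W)
        prob_flat = prob.view(B, -1, H * W)
        fmap_flat = fmap.view(B, C, H * W)
        # Probability-weighted average of all positions for each part: (B, parts, C).
        part_feats = torch.einsum('bph,bch->bpc', prob_flat, fmap_flat)
        return part_feats / (prob_flat.sum(dim=2, keepdim=True) + 1e-6)

pool = SoftPartPooling()
print(pool(torch.randn(2, 2048, 16, 8)).shape)                  # torch.Size([2, 2, 2048])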
And step 3: network model structure parameter optimization
Comparative experiments are carried out on the influence of three parameters, the number of local features, the local feature size and the number of Bottleneck layers, on model performance, with training and testing on the Mars dataset.
Specifically, the experimental optimization proceeds in the order of the number of local features, the local feature size and the number of Bottleneck layers; after each parameter is optimized, the result is fixed before the optimization experiment for the next parameter.
Drawings
Fig. 1 is a diagram of a video pedestrian re-identification network model based on multi-scale feature fusion.
FIG. 2 is a schematic diagram of the structure of Bottleneck.
Fig. 3 is an example of a data set sample used in an experiment.
Fig. 4 is a visualization heat map of feature extraction from an image sequence according to the invention.
Detailed Description
The technical scheme, the experimental method and the test result of the invention are further described in detail with reference to the accompanying drawings and specific experimental embodiments.
The experimental procedure is specifically described below.
Step one: construct the three-branch convolutional neural network, input the training set samples into the network for training, observe the training process, and iterate continuously to obtain a trained model.
Step two: and testing according to the training result, searching a pedestrian image sequence with the same id as each group of query image sequence in the query from the galery library to form a result sequence, and simultaneously calculating to obtain a corresponding evaluation index.
Step three: and performing a comparison experiment on the network structure parameters according to the evaluation indexes to determine the optimal network structure parameters.
The experimental conditions and conclusions obtained are described in detail below.
3.1 pedestrian re-identification dataset and evaluation index
The test datasets and evaluation indices used in the ReID experiments are presented next. As shown in FIG. 3, the proposed method is tested on two large public datasets, Market-1501 and Mars. Market-1501 contains 1501 pedestrians captured by 6 cameras and 32668 detected pedestrian bounding boxes; its training set has 751 identities and 12936 images, on average 17.2 training images per person, and its test set has 750 identities and 19732 images, on average 26.3 test images per person. Mars is the largest video-based ReID dataset: its training set has 8298 tracklets of 625 pedestrians containing 509914 images, and its test set has 12180 tracklets of 636 pedestrians containing 681089 images.
In the pedestrian re-identification task, the test procedure is generally to give an image to be queried (a group of images for video ReID), compute the similarity between the query and the images in the candidate set (gallery) according to the model, and sort the gallery images by similarity in descending order, so that images nearer the front are closer to the query. To evaluate the performance of a pedestrian re-identification algorithm, the current practice is to compute the corresponding indices on public datasets and compare with other models. The CMC curve (Cumulative Matching Characteristics) and mAP (mean Average Precision) are the two most commonly used evaluation criteria.
The experiments mainly use the most common CMC indices rank-1 and rank-5 together with mAP. rank-k is the probability that a correct result appears among the top k (highest-confidence) search results, while mAP reflects an average level: the higher the mAP, the higher the gallery results belonging to the same person as the query are ranked overall, and the better the model.
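For clarity, a simplified sketch of how rank-k and average precision can be computed for a single query is given below; standard Market-1501/Mars evaluation additionally filters same-camera and junk entries, which is omitted here, and mAP is the mean of the per-query AP values.

# Sketch: rank-k and AP for one query from a similarity-ranked gallery (no filtering).
import numpy as np

def rank_k_and_ap(sim, gallery_ids, query_id, k=1):
    order = np.argsort(-sim)                         # gallery indices, most similar first
    matches = gallery_ids[order] == query_id
    rank_k = bool(matches[:k].any())                 # correct result within the top k
    hits = np.cumsum(matches)
    precision_at_hits = hits[matches] / (np.flatnonzero(matches) + 1)
    ap = float(precision_at_hits.mean()) if matches.any() else 0.0
    return rank_k, ap

sim = np.array([0.9, 0.2, 0.75, 0.4])                # similarity of 4 gallery sequences
gallery_ids = np.array([7, 3, 7, 5])
print(rank_k_and_ap(sim, gallery_ids, query_id=7))   # (True, 1.0) for this toy query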
3.2 Primary parameter configuration for ReiD experiments
The specific training parameters are as follows:
The learning-rate decay strategy uses the lr_scheduler.StepLR function with 0.0003 as the initial learning rate, decaying the learning rate to one tenth of its previous value every 100 training epochs; the video clip sequence length is set to 4, sampled randomly from the dataset; the batch size is set to 32; the PCB stage and the RPP stage are each trained for 400 epochs.
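The corresponding configuration could be set up roughly as follows; the model and the optimizer type (Adam here) are placeholders and assumptions, while the learning rate, StepLR schedule, clip length, batch size and epoch counts follow the text.

# Sketch of the training configuration described above (placeholder model).
import torch

model = torch.nn.Linear(10, 10)                      # stands in for the re-identification network
optimizer = torch.optim.Adam(model.parameters(), lr=0.0003)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

SEQ_LEN = 4              # frames per randomly sampled clip
BATCH_SIZE = 32
EPOCHS_PER_STAGE = 400   # 400 epochs each for the PCB stage and the RPP stage

for epoch in range(EPOCHS_PER_STAGE):
    # ... forward/backward passes over the training clips would go here ...
    optimizer.step()
    scheduler.step()                                 # lr drops to one tenth every 100 epochs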
3.3 re-identification of network experiment results
Based on the above evaluation indices and experimental details, tests were carried out on the Mars dataset and comparative experimental results for each parameter were obtained.
(1) Local feature quantity
Other parameters in the experiment were configured as follows: the global feature vector size is 2048 and the local feature vector size is 2048.
The test results are shown in Table 2. Two local features perform best; as the number of local features increases, the feature scale shrinks, and for fine-grained local features the large limb movements of a walking person mean that temporal weighted fusion blurs the local information.
TABLE 2 Influence of the number of local features on performance
Number  mAP   Rank-1  Rank-5
2       75.0  82.0    93.8
3       73.4  81.1    92.9
4       71.3  79.1    92.2
(2) Local feature size
Other parameters in the experiment were configured as follows: the number of local features is 2, the number of Bottleneck layers is 1, and the length of the global feature vector is 2048.
The test results are shown in Table 3: after the size of the local features is halved, performance improves noticeably, indicating that the global feature has the larger influence on re-identification performance.
TABLE 3 Influence of local feature size on performance
Size   mAP   Rank-1  Rank-5
2048   75.0  82.0    93.8
1024   77.7  83.8    94.3
(3) Number of Bottleneck layers
Other parameters in the experiment were configured as follows: the number of local features is 2, the global feature vector size is 2048, and the local feature vector size is 1024.
The test results are shown in Table 4. Adding a Bottleneck structure at the front of the local feature branch reduces the coupling between the global and local features; two Bottleneck layers perform best, while three layers make the network difficult to converge.
TABLE 4 Influence of the number of Bottleneck layers on performance
Number  mAP   Rank-1  Rank-5
0       77.1  82.7    93.8
1       77.7  83.8    94.3
2       78.7  85.1    94.6
3       74.1  81.3    93.3
In summary, the performance of the model of the invention is optimal with two local features, the local feature size halved, and two Bottleneck layers.
3.4 feature extraction visualization
To verify whether the global feature branch and the local feature branch extract features of different scales from the image sequence as designed, the invention uses the Class Activation Mapping algorithm (see B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba, "Learning Deep Features for Discriminative Localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921-2929) to visualize the regions of the image sequence to which the different features are sensitive. As shown in FIG. 4, the global feature responds to the whole human body, while the two local features focus on the head and the legs respectively, showing that the features of each scale effectively extract information from different positions and granularities of the pedestrian.
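A simplified sketch of the Class Activation Mapping computation behind the Fig. 4 heat maps follows; the feature-map size, identity count and upsampled resolution are assumptions used only for illustration.

# Sketch: weight the channels of a frame's feature map by the classifier weights of one
# identity and upsample the result to image resolution to obtain a heat map.
import torch
import torch.nn.functional as F

def class_activation_map(fmap, fc_weight, class_idx, out_size=(256, 128)):
    """fmap: (C, H, W) feature map; fc_weight: (num_ids, C) classifier weight matrix."""
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], fmap)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)       # normalise to [0, 1]
    return F.interpolate(cam[None, None], size=out_size,
                         mode='bilinear', align_corners=False)[0, 0]

heat = class_activation_map(torch.randn(2048, 16, 8), torch.randn(625, 2048), class_idx=3)
print(heat.shape)                                                  # torch.Size([256, 128])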
3.5 comparison with other methods
Without tricks such as re-ranking, the method reaches a competitive level compared with mainstream methods on the Mars dataset; as shown in Table 5, the mAP and Rank-1 indices improve by 3.3% and 3.4% respectively over the baseline.
Table 5 comparison with other methods
In summary, the invention provides a video pedestrian re-identification model based on multi-scale feature fusion. Its temporal information processing module adopts temporal-attention feature aggregation, and its single-frame feature extraction module effectively extracts multi-scale fusion features adapted to it, improving the discriminability of the features while cooperating effectively with the temporal module. In addition, comparative experiments on the number and size of the local features give the optimal local feature parameters under this algorithm framework, and the connection structure between the backbone network and the feature extraction branches of different scales is optimized, reducing the coupling of the different branches on the backbone network. Finally, testing shows that the performance of the invention improves significantly over the baseline and reaches a competitive level.

Claims (1)

1. A video pedestrian re-identification method based on multi-scale feature fusion is characterized by comprising the following steps:
Aiming at the problem that traditional methods perform poorly when temporally fusing complex appearance features, a video pedestrian re-identification network model based on multi-scale feature fusion is provided; three branches are led out from the end of the backbone network of the model to respectively extract image-level re-identification features of different scales and temporal attention weights; the re-identification feature vectors of different scales are concatenated and fused according to the temporal attention weights, accurate pedestrian re-identification is realized through a multi-feature independent training strategy, and the structural parameters of the network are optimized through comparative experiments;
the method specifically comprises the following steps:
step 1, video pedestrian re-identification network design based on multi-scale fusion
The designed video pedestrian re-identification network model based on multi-scale feature fusion is composed of a shared backbone network and three branches, the three branches being a global feature branch, a local feature branch and a temporal attention branch;
The shared backbone network removes the down-sampling operation in the last residual stage on the basis of the Resnet50 network, so that the output feature map is twice its original size, providing more room for dividing local features;
Three branches are led out from the feature map obtained at the end of the backbone network and are used respectively to extract global features, local features and temporal information; on the global feature branch, the feature map passes through one convolution, normalization and pooling operation to generate a group of 2048-dimensional global feature vectors; on the local feature branch, the feature map is decoupled by Bottleneck blocks and then soft-divided by the PCB-RPP algorithm, i.e. the Part-based Convolutional Baseline with Refined Part Pooling, to generate a group of 2048-dimensional local feature vectors in which the two local features occupy 1024 dimensions each; on the temporal attention branch, the feature map sequentially passes through a temporal convolution and a spatial convolution to generate temporal attention scores over the length of the input picture sequence and obtain the temporal weights required for temporal fusion;
The global and local feature vectors of each frame obtained from the global feature branch and the local feature branch are concatenated to generate a 4096-dimensional single-frame fusion feature; a weighted average is taken according to the temporal attention scores of the different frames obtained from the temporal attention branch to obtain the final 4096-dimensional video-level pedestrian re-identification feature vector;
step 2, designing a multi-feature independent training strategy
Because the feature vector finally generated by the network model is formed by concatenating and fusing several feature vectors, the fused feature vector is split and each part is trained independently in order to guarantee the multi-feature training effect;
Classifier design: in the training stage, a separate classifier is set for each concatenated part of the temporally fused feature vector output by the model, i.e. the features of each scale are trained independently and the classifier parameters are not shared; each classifier is a fully connected layer of the neural network;
Loss function: for the features of each scale, the loss function used for training consists of two parts, as shown in equation (1);
Loss_i = Loss_cross-entropy + Loss_triplet    (1)
where Loss_cross-entropy and Loss_triplet denote the cross-entropy loss function and the triplet loss function respectively;
The final loss function is obtained by summing the loss functions of all feature parts, as shown in equation (2);
Loss = Σ_{i=1}^{N} Loss_i    (2)
where N denotes the number of features before concatenation; since the method uses one global feature and two local features, N = 3;
Training method: the local branch divides features according to the PCB-RPP idea, so training of the model is divided into two stages; in the first stage, the local feature branch uniformly divides the feature map into an upper and a lower local feature by hard division; the second stage is trained on the basis of the converged first stage, i.e. a classifier replaces the uniform division of the first stage and each point on the feature map is assigned to the local features in a probabilistic manner;
In addition, in both training stages all parameters of the network model participate in the iterations;
step 3, optimizing the structural parameters of the network model
Performing a comparison experiment aiming at the influence of three parameters of the number of local features, the size of the local features and the number of Bottleneck on the performance of the model, and training and testing on a Mars data set;
specifically, experiment optimization is carried out according to the sequence of the local feature quantity, the local feature size and the Bottleneck quantity, and after each parameter is optimized, the optimization result is kept to enter a comparison experiment of the next parameter.
CN202111635259.XA 2021-12-29 Video pedestrian re-identification method based on multi-scale feature fusion Active CN114299542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111635259.XA CN114299542B (en) 2021-12-29 Video pedestrian re-identification method based on multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111635259.XA CN114299542B (en) 2021-12-29 Video pedestrian re-identification method based on multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN114299542A true CN114299542A (en) 2022-04-08
CN114299542B CN114299542B (en) 2024-07-05




Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390338A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Deep network lung texture recogniton method combined with multi-scale attention
CN112101150A (en) * 2020-09-01 2020-12-18 北京航空航天大学 Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN112163498A (en) * 2020-09-23 2021-01-01 华中科技大学 Foreground guiding and texture focusing pedestrian re-identification model establishing method and application thereof
CN112200111A (en) * 2020-10-19 2021-01-08 厦门大学 Global and local feature fused occlusion robust pedestrian re-identification method
CN112818790A (en) * 2021-01-25 2021-05-18 浙江理工大学 Pedestrian re-identification method based on attention mechanism and space geometric constraint
CN113298235A (en) * 2021-06-10 2021-08-24 浙江传媒学院 Neural network architecture of multi-branch depth self-attention transformation network and implementation method
CN113610144A (en) * 2021-08-02 2021-11-05 合肥市正茂科技有限公司 Vehicle classification method based on multi-branch local attention network

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100730A (en) * 2022-07-21 2022-09-23 北京万里红科技有限公司 Iris living body detection model training method, iris living body detection method and device
CN115100730B (en) * 2022-07-21 2023-08-08 北京万里红科技有限公司 Iris living body detection model training method, iris living body detection method and device
CN115294601A (en) * 2022-07-22 2022-11-04 苏州大学 Pedestrian re-identification method based on multi-scale feature dynamic fusion
CN115294601B (en) * 2022-07-22 2023-07-11 苏州大学 Pedestrian re-recognition method based on multi-scale feature dynamic fusion
CN115393953A (en) * 2022-07-28 2022-11-25 深圳职业技术学院 Pedestrian re-identification method, device and equipment based on heterogeneous network feature interaction
CN115393953B (en) * 2022-07-28 2023-08-08 深圳职业技术学院 Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction
WO2024021283A1 (en) * 2022-07-28 2024-02-01 深圳职业技术学院 Person re-identification method, apparatus, and device based on heterogeneous network feature interaction
CN115424022A (en) * 2022-11-03 2022-12-02 南方电网数字电网研究院有限公司 Power transmission corridor ground point cloud segmentation method and device and computer equipment
CN115424022B (en) * 2022-11-03 2023-03-03 南方电网数字电网研究院有限公司 Power transmission corridor ground point cloud segmentation method and device and computer equipment
CN117746462A (en) * 2023-12-19 2024-03-22 深圳职业技术大学 Pedestrian re-recognition method and device based on complementary feature dynamic fusion network model

Similar Documents

Publication Publication Date Title
Yang et al. Towards rich feature discovery with class activation maps augmentation for person re-identification
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
Wang et al. Large-scale isolated gesture recognition using convolutional neural networks
CN110598543B (en) Model training method based on attribute mining and reasoning and pedestrian re-identification method
CN111738143B (en) Pedestrian re-identification method based on expectation maximization
CN111310668B (en) Gait recognition method based on skeleton information
CN110378208B (en) Behavior identification method based on deep residual error network
US20230267725A1 (en) Person re-identification method based on perspective-guided multi-adversarial attention
CN113239784A (en) Pedestrian re-identification system and method based on space sequence feature learning
CN111460914A (en) Pedestrian re-identification method based on global and local fine-grained features
CN110163117B (en) Pedestrian re-identification method based on self-excitation discriminant feature learning
Jiang et al. Rethinking temporal fusion for video-based person re-identification on semantic and time aspect
CN116030495A (en) Low-resolution pedestrian re-identification algorithm based on multiplying power learning
CN112070010A (en) Pedestrian re-recognition method combining multi-loss dynamic training strategy to enhance local feature learning
CN114818963A (en) Small sample detection algorithm based on cross-image feature fusion
Tian et al. Self-regulation feature network for person reidentification
CN117333908A (en) Cross-modal pedestrian re-recognition method based on attitude feature alignment
CN114821632A (en) Method for re-identifying blocked pedestrians
CN116343294A (en) Pedestrian re-identification method suitable for generalization of field
CN115830643A (en) Light-weight pedestrian re-identification method for posture-guided alignment
CN116343135A (en) Feature post-fusion vehicle re-identification method based on pure vision
CN114299542A (en) Video pedestrian re-identification method based on multi-scale feature fusion
CN111738039A (en) Pedestrian re-identification method, terminal and storage medium
CN114299542B (en) Video pedestrian re-identification method based on multi-scale feature fusion
CN113051962B (en) Pedestrian re-identification method based on twin Margin-Softmax network combined attention machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant