CN113221787A - Pedestrian multi-target tracking method based on multivariate difference fusion - Google Patents

Pedestrian multi-target tracking method based on multivariate difference fusion

Info

Publication number
CN113221787A
Authority
CN
China
Prior art keywords
net
pedestrian
detection
fusion
key point
Prior art date
Legal status
Granted
Application number
CN202110556574.7A
Other languages
Chinese (zh)
Other versions
CN113221787B (en)
Inventor
韩红
迟勇欣
张齐驰
王毅飞
范迎春
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202110556574.7A
Publication of CN113221787A
Application granted
Publication of CN113221787B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F18/2155 Generating training patterns; Bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian multi-target tracking method based on multivariate difference fusion, comprising the steps of: (1) acquiring a training sample set and a test sample set; (2) constructing a detection and re-identification integrated network model based on multivariate difference fusion; (3) iteratively training the detection and re-identification integrated network model based on multivariate difference fusion; and (4) acquiring the pedestrian multi-target tracking result. When the detection and re-identification integrated network model based on multivariate difference fusion is constructed, differences in training data, training mode and network structure are introduced so that the two keypoint heat map prediction sub-networks form prediction preferences for targets of different sizes; the prediction results of the two sub-networks are added and fused to obtain a multivariate-difference-fusion keypoint heat map. This solves the problem of low detection recall caused by using only a single keypoint heat map prediction sub-network in the prior art, and improves the tracking accuracy of the algorithm.

Description

Pedestrian multi-target tracking method based on multivariate difference fusion
Technical Field
The invention belongs to the technical field of computer vision and relates to a pedestrian multi-target tracking method based on multivariate difference fusion, which can be used for pedestrian multi-target tracking tasks in fields such as security surveillance, video content understanding and human-computer interaction.
Background
Pedestrian multi-target tracking algorithms are widely applied in fields such as security surveillance, video content understanding, human-computer interaction and intelligent nursing. In recent years, with the rise and popularization of deep learning, pedestrian multi-target tracking has gradually converged on an algorithmic paradigm combining three basic modules: target detection, re-identification feature extraction and data association. The target detection module detects and localizes all pedestrian targets in the scene; the re-identification feature extraction module extracts and encodes pedestrian appearance information; and the data association module estimates the similarity between historical trajectories and the pedestrians detected in the current frame according to the information provided by the detection and re-identification feature extraction modules, and performs optimal association matching according to this similarity to form trajectories.
In "FairMOT", published in 2020 at the IEEE Conference on Computer Vision and Pattern Recognition, Yifu Zhang et al. disclose a multi-target tracking algorithm that integrates the detection and re-identification tasks into one network. It adds a re-identification feature extraction sub-network on top of the CenterNet detection network so that the detection and re-identification tasks share a large number of convolutional layer parameters and features, thereby reducing the number of network parameters and the amount of computation, improving the execution efficiency of the system, and achieving a good balance between speed and accuracy.
However, the FairMOT algorithm merely integrates the detection and re-identification feature extraction tasks: the four prediction-task branch sub-networks share only one fused feature map, which causes intense competition for features among the tasks and inhibits the further learning of each task. In addition, in scenes where target scales differ greatly, FairMOT ignores the feature differences between targets of very different sizes and uses only one target-center-point heat map prediction sub-network to recall targets of all scales. Although convolutional neural networks can learn to adapt to changes in scale, texture and the like, when learning from targets with large differences the network tends to seek a compromise between them, which suppresses the detection recall of pedestrian targets and reduces the accuracy of multi-target tracking.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a pedestrian multi-target tracking method based on multivariate difference fusion, so as to solve the technical problem of low detection recall in scenes with large differences in target scale.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) Obtain a training sample set D_train and a test sample set D_test:
(1a) Preprocess the selected V RGB image sequences carrying pedestrian detection box labels and identity labels to obtain a preprocessed RGB image frame sequence set S = {S_v | 1 ≤ v ≤ V}. Take the RGB image frames contained in I of the preprocessed RGB image frame sequences as the training sample set D_train, and take the remaining K preprocessed RGB image frame sequences as the test sample set D_test, where S_v = {f^(n) | 1 ≤ n ≤ L_v} denotes the v-th preprocessed RGB image frame sequence containing L_v frames, f^(n) denotes the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_v > 200;
(2) Construct a detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) Construct the structure of the detection and re-identification integrated network model O based on multivariate difference fusion:
The model O comprises a backbone network Net_backbone; a first feature fusion sub-network A_s and a second feature fusion sub-network A_l, arranged in parallel with identical structure and cascaded with Net_backbone; and a fusion module. The output of the first feature fusion sub-network A_s is connected to a keypoint offset prediction sub-network Net_offset, a small-target-preference keypoint heat map prediction sub-network Net_hm_s and a bounding box prediction sub-network Net_bbox arranged in parallel; the output of the second feature fusion sub-network A_l is connected to a large-target-preference keypoint heat map prediction sub-network Net_hm_l and a re-identification feature extraction sub-network Net_reid arranged in parallel, wherein:
The backbone network Net_backbone adopts a tree-structured aggregation-iteration network composed of a plurality of two-dimensional convolution layers, batch normalization layers, two-dimensional pooling layers, deformable convolution layers and transposed convolution layers;
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each comprise a plurality of spatial attention sub-networks Net_sam and one channel attention sub-network Net_cam. The spatial attention sub-network Net_sam comprises a global average pooling layer and a global max pooling layer arranged in parallel, and a two-dimensional convolution layer connected to both pooling layers; the channel attention sub-network Net_cam comprises a global average pooling layer and a global max pooling layer arranged in parallel, each cascaded with two two-dimensional convolution layers. The Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks all adopt a structure consisting of a first convolution layer, a ReLU activation layer and a second convolution layer cascaded in sequence; a fully connected layer is cascaded to the output of Net_reid, and the outputs of Net_hm_s and Net_hm_l are cascaded to the fusion module;
(2b) Define a loss function L_heatmap for the keypoint heat map prediction task:

L_heatmap = -(1/N) Σ_xy { (1 - Y_xy)^α · log(Y_xy),                  if Ŷ_xy = 1
                          (1 - Ŷ_xy)^β · (Y_xy)^α · log(1 - Y_xy),   otherwise }

where N denotes the number of keypoints in the predicted keypoint heat map, α and β denote hyper-parameters, Ŷ_xy and Y_xy respectively denote the label and the response value of the keypoint at coordinate (x, y) in the predicted keypoint heat map, Σ denotes the summation operation, and log denotes the logarithm operation;
(3) Perform iterative training on the detection and re-identification integrated network model O based on multivariate difference fusion:
(3a) Initialize the weight parameters θ_J of the detection and re-identification integrated network model O, let the iteration index be t and the maximum number of iterations be T, T ≥ 50000, and set t = 0;
(3b) Randomly select bs ∈ [16, 64] training samples from the training sample set D_train, perform random data enhancement on each training sample, and update the detection box information of each training sample according to the enhancement applied, obtaining bs data-enhanced training samples with updated detection box information. A pedestrian target whose ratio of updated detection box height to image frame height is greater than a threshold th_ratio is taken as a large target, and one whose ratio is smaller than th_ratio is taken as a small target. Finally, according to the updated detection box information, the updated identity information and the large/small target division results, determine the small-target-preference keypoint heat map label label_hm_s, the large-target-preference keypoint heat map label label_hm_l, the difference-fusion keypoint heat map label label_hm, the bounding box label label_bbox, the keypoint offset label label_offset and the re-identification identity label label_id;
(3c) Take the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O: the backbone network Net_backbone extracts features from each training sample, obtaining three feature maps of different scales, Feat1, Feat2 and Feat3, for the training sample;
(3d) The first feature fusion sub-network A_s adaptively fuses Feat1, Feat2 and Feat3 to obtain the feature map Feat_s. The keypoint offset prediction sub-network Net_offset, the small-target-preference keypoint heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference, obtaining the keypoint offset prediction vector Vec_offset corresponding to Net_offset, the small-target-preference keypoint heat map prediction result Hm_S corresponding to Net_hm_s, and the vector Vec_dis_bbox of distances from the keypoints to the top, bottom, left and right sides of the target box corresponding to Net_bbox. At the same time, the second feature fusion sub-network A_l adaptively fuses Feat1, Feat2 and Feat3 to obtain the feature map Feat_l. The large-target-preference keypoint heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference, obtaining the large-target-preference keypoint heat map prediction result Hm_L corresponding to Net_hm_l and the re-identification feature vector Vec_reid corresponding to Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result. The fusion module fuses the Hm_S and Hm_L keypoint heat maps to obtain the fused keypoint heat map Hm;
(3e) Using the L1 loss function, compute the loss value L_off of the keypoint offset prediction result from the keypoint offset predictions and their label label_offset, and compute the loss value L_bbox of the bounding box prediction result from the bounding box predictions and their label label_bbox; using the cross-entropy loss function, compute the loss value L_reid of the re-identification feature extraction result from the pedestrian identity classification results and their label label_id; then, using the loss function L_heatmap of the keypoint heat map prediction task, feed Hm_S, Hm_L and Hm together with their corresponding labels label_hm_s, label_hm_l and label_hm to compute the respective loss values L_hm_s, L_hm_l and L_hm; finally, perform adaptive weighted summation of L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm to obtain the loss value L_total of the detection and re-identification integrated network model O;
(3f) Using the back-propagation method, compute the gradient of the weight parameters of the detection and re-identification integrated network model O from the loss value L_total, and then update the weight parameters θ_J with the gradient descent algorithm using the computed gradient of the weight parameters of O;
(3g) Judge whether t > T; if so, the trained detection and re-identification integrated network model O' is obtained; otherwise let t = t + 1 and go to step (3b);
(4) Obtain the pedestrian multi-target tracking result:
(4a) Initialize the test sample set D_test: the k-th test sample S_k contains P RGB image frames, the p-th RGB image frame is f^(p), P > 200; let k = 1 and initialize the historical trajectory set Tra^(k) = {};
(4b) Let p = 1;
(4c) Take the p-th RGB image frame f^(p) of the test sample S_k as the input of the trained detection and re-identification integrated network model O' and propagate forward, obtaining the keypoint offset prediction Vec_offset, the distances Vec_dis_bbox from the keypoints to the top, bottom, left and right sides of the target boxes, the keypoint heat map prediction result Hm and the re-identification feature vectors Vec_reid of f^(p); decode Vec_offset, Vec_dis_bbox and Hm to obtain the pedestrian detection box set Det = {det_i | 0 ≤ i ≤ DN - 1} of f^(p), where det_i is the detection box of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
(4d) Screen out the pedestrian targets of f^(p) whose detected keypoint response value conf_i is greater than a response threshold th_conf, Object = {object_i | conf_i > th_conf, 0 ≤ i ≤ DN - 1}, and obtain the detection boxes and the re-identification feature vector information corresponding to these pedestrian targets from the Det set and the Vec_reid vectors;
(4e) According to the detection boxes and the re-identification feature vector information of the screened pedestrian targets, use an online association method to perform data association between the screened pedestrian target set Object and Tra^(k), obtaining the pedestrian multi-target tracking result of f^(p);
(4f) Judge whether p ≥ P; if so, the pedestrian multi-target tracking result of the test sample S_k is obtained; otherwise let p = p + 1, update the historical trajectory set Tra^(k), and go to step (4c);
(4g) Judge whether k ≥ K; if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise let k = k + 1 and go to step (4b).
Compared with the prior art, the invention has the following advantages:
1. When constructing the detection and re-identification integrated network model based on multivariate difference fusion, the two feature fusion sub-networks arranged in parallel are each cascaded with a keypoint heat map prediction sub-network, and differences in training mode and training data are introduced through the design of the loss function and the training procedure, so that the two keypoint heat map prediction sub-networks form prediction preferences for targets of different sizes; the differentiated results of the two sub-networks are added and fused to obtain the multivariate-difference-fusion keypoint heat map.
2. In addition, separating the prediction sub-networks and cascading them to the two feature fusion sub-networks arranged in parallel reduces the degree of feature competition among the multiple prediction tasks and adds a network-structure difference to the multivariate-difference-fusion keypoint heat map, further improving the tracking accuracy of the algorithm.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Fig. 2 is a schematic structural diagram of the integrated network for detecting and re-identifying based on multivariate difference fusion according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the present invention includes the steps of:
Step 1) Obtain a training sample set D_train and a test sample set D_test:
Step 1a) Preprocess the selected V RGB image sequences carrying pedestrian detection box labels and identity labels to obtain a preprocessed RGB image frame sequence set S = {S_v | 1 ≤ v ≤ V}. Take the RGB image frames contained in I of the preprocessed RGB image frame sequences as the training sample set D_train and the remaining K preprocessed RGB image frame sequences as the test sample set D_test, where S_v = {f^(n) | 1 ≤ n ≤ L_v} denotes the v-th preprocessed RGB image frame sequence containing L_v frames, f^(n) denotes the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_v > 200. In this example, the CrowdHuman, ETH, CityPersons, CalTech, CUHK-SYSU, PRW and MOT17 train datasets, which cover rich scenes, are used as the training data to improve the generalization ability of the model, and the MOT17 test dataset, which contains 7 image sequences from different scenes with an average sequence length of 845 frames, is used for testing so that the tracking accuracy is evaluated reasonably.
The preprocessing of the selected V RGB image sequences carrying pedestrian detection box labels and identity labels is implemented as follows:
(1a1) Resize every RGB image frame in each RGB image sequence by bilinear interpolation so that all RGB image frames are 608 × 1088, obtaining an RGB image frame sequence set S_v' whose frames are consistent with the network input size.
(1a2) In the RGB image frame sequence set S_v', update the pedestrian detection box labels synchronously with the scale change of the images, and encode the pedestrian identity labels uniformly: the identity label of a data sample with missing identity information is set to -1, and each pedestrian with a distinct identity is assigned an increasing code starting from 1. This yields the RGB image frame sequence set S = {S_v | 1 ≤ v ≤ V} in which the RGB image frames are resized and the detection box labels and identity labels are updated.
Preprocessing the V RGB image sequences carrying pedestrian detection box labels and identity labels in this way guarantees the consistency of pedestrian identities and the correspondence between labels and images during training and testing.
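A minimal sketch of this preprocessing step is given below; the function name, the data layout (boxes stored as (x1, y1, x2, y2) pixel coordinates) and the use of OpenCV are illustrative assumptions rather than the patent's implementation, and the global re-encoding of identities to consecutive codes is omitted:

```python
import cv2
import numpy as np

TARGET_H, TARGET_W = 608, 1088  # network input size stated in the patent

def preprocess_frame(img, boxes, ids):
    """Resize a frame with bilinear interpolation and update its labels.

    img:   HxWx3 uint8 RGB frame
    boxes: (N, 4) float array of (x1, y1, x2, y2) pixel coordinates
    ids:   (N,) identity labels; missing identities are assumed to be negative
    """
    h, w = img.shape[:2]
    img = cv2.resize(img, (TARGET_W, TARGET_H), interpolation=cv2.INTER_LINEAR)
    sx, sy = TARGET_W / w, TARGET_H / h
    boxes = boxes.astype(np.float32) * np.array([sx, sy, sx, sy])  # scale boxes with the image
    ids = np.where(ids < 0, -1, ids)  # samples with missing identity get label -1
    return img, boxes, ids
```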
Step 2), constructing a detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) constructing a structure of a detection and re-identification integrated network model O based on multi-element difference fusion:
structure for constructing detection and re-identification integrated network model O, including backbone network NetbackboneAnd NetbackboneCascaded parallel-arranged and same-structure first feature fusion sub-network AsAnd a second feature fusion sub-network AlAnd a convergence module, wherein the first feature converges the subnetwork AsThe output end of the network is connected with a parallelly arranged key point deviation prediction sub-network NetoffsetSmall target preference keypoint heatmap prediction subnetwork Nethm_sAnd bounding Box prediction subnetwork NetbboxSecond feature fusion subnetwork AlThe output end of the network is connected with a large target preference key point heat map prediction sub-network Net which is arranged in parallelhm_lAnd re-identifying feature extraction sub-network NetreidWherein:
backbone network NetbackboneAdopting a tree-shaped polymerization iterative network consisting of a plurality of two-dimensional convolution layers, a plurality of batch normalization layers, a plurality of two-dimensional pooling layers, a plurality of deformable convolution layers and a plurality of transposition convolution layers;
first feature fusion subnet AsAnd a second feature fusion sub-network AlEach comprising a plurality of spatial attention sub-networks NetsamAnd a channel attention subnetwork Netcam(ii) a Spatial attention subnetwork NetsamIncluding global planes arranged in parallelAn average pooling layer and a global maximum pooling layer, and a two-dimensional convolutional layer connected to the two pooling layers, a channel attention subnetwork NetcamThe system comprises a global average pooling layer and a global maximum pooling layer which are arranged in parallel, wherein the global average pooling layer and the global maximum pooling layer are respectively connected with two-dimensional convolutional layers in a cascade mode; netoffset、Nethm_s、Netbbox、Nethm_lAnd NetreidThe sub-networks all adopt a structure comprising a first convolutional layer, a rule active layer and a second convolutional layer which are sequentially cascaded, and NetreidThe output end is cascaded with a full connection layer, Nethm_sAnd Nethm_lThe output end is cascaded with the fusion module;
In the structure of the detection and re-identification integrated network model O based on multivariate difference fusion:
The backbone network Net_backbone contains 27 two-dimensional convolution layers, 37 batch normalization layers, 6 two-dimensional pooling layers, 4 deformable convolution layers and 2 transposed convolution layers. This network extracts basic features that serve as the input of the feature fusion sub-networks; the DLA-34 backbone used in the FairMOT algorithm is adopted, and other backbones such as ResNet can be used instead.
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each contain three structurally identical spatial attention sub-networks Net_sam; the two-dimensional convolution layer in Net_sam has a 3x3 kernel, stride 1 and output dimension 1. The channel attention sub-network Net_cam contains 4 two-dimensional convolution layers, each with a 1x1 kernel and stride 1. These fusion sub-networks replace the equal-ratio feature fusion sub-network in FairMOT and provide more suitable feature maps for the subsequent tasks; because their structural parameters are influenced by the subsequent multi-task branches, they also provide a network-structure difference for training the two subsequent keypoint heat map prediction sub-networks.
The first and second convolution layers of the Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks have 3x3 kernels with stride 1. The output channels of the first convolution layer of each of these sub-networks are all set to 256, and the output channels of the second convolution layer are 2, 1, 4, 1 and 128 respectively. The detection result can be obtained by decoding the offset prediction, the bounding box prediction and the heat map prediction, and the appearance similarity of pedestrians can be measured by taking the re-identification feature vectors produced by the network and computing their cosine similarity.
The fully connected layer cascaded to the output of Net_reid is used only to assist classification during training; its output dimension equals the number of pedestrian identities in the training dataset, and it is discarded during testing.
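As a concrete illustration of these head sub-networks, the sketch below builds the five prediction branches with the kernel sizes and channel widths listed above; the input channel count in_ch (64 is an assumed value for a DLA-34-style backbone output) and all module names are illustrative, not the patent's code:

```python
import torch.nn as nn

def make_head(in_ch: int, out_ch: int) -> nn.Sequential:
    """First 3x3 conv (256 channels) -> ReLU -> second 3x3 conv (task-specific channels)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 256, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, out_ch, kernel_size=3, stride=1, padding=1),
    )

in_ch = 64  # assumed channel width of the fused feature maps Feat_s / Feat_l
heads = nn.ModuleDict({
    "offset": make_head(in_ch, 2),    # Net_offset: keypoint offset (dx, dy)
    "hm_s":   make_head(in_ch, 1),    # Net_hm_s: small-target-preference heat map
    "bbox":   make_head(in_ch, 4),    # Net_bbox: distances to the four box sides
    "hm_l":   make_head(in_ch, 1),    # Net_hm_l: large-target-preference heat map
    "reid":   make_head(in_ch, 128),  # Net_reid: re-identification embedding
})
```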
(2b) Define a loss function L_heatmap for the keypoint heat map prediction task:

L_heatmap = -(1/N) Σ_xy { (1 - Y_xy)^α · log(Y_xy),                  if Ŷ_xy = 1
                          (1 - Ŷ_xy)^β · (Y_xy)^α · log(1 - Y_xy),   otherwise }

where N denotes the number of keypoints in the predicted keypoint heat map, α and β denote hyper-parameters, taken here as 2 and 4 respectively, Ŷ_xy and Y_xy respectively denote the label and the response value of the keypoint at coordinate (x, y) in the predicted keypoint heat map, Σ denotes the summation operation, and log denotes the logarithm operation;
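The following PyTorch sketch is one way to realize this loss with α = 2 and β = 4; it assumes the CenterNet-style piecewise form suggested by the symbols above (network response Y, Gaussian-rendered label Ŷ) and is not reproduced from the patent drawing:

```python
import torch

def heatmap_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Focal-style keypoint heat map loss.

    pred: predicted response map Y in (0, 1), shape (B, 1, H, W)
    gt:   Gaussian-rendered label map Y_hat in [0, 1], same shape
    """
    pred = pred.clamp(eps, 1.0 - eps)
    pos = gt.eq(1.0).float()                      # pixels that are exact keypoints
    neg = 1.0 - pos
    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)            # N = number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```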
Step 3) Perform iterative training on the detection and re-identification integrated network model O based on multivariate difference fusion:
(3a) Initialize the weight parameters θ_J of the detection and re-identification integrated network model O, let the iteration index be t and the maximum number of iterations be T, T ≥ 50000, and set t = 0;
(3b) Randomly select bs ∈ [16, 64] training samples from the training sample set D_train, perform random data enhancement on each training sample, and update the detection box information of each training sample according to the enhancement applied, obtaining bs data-enhanced training samples with updated detection box information. A pedestrian target whose ratio of updated detection box height to image frame height is greater than a threshold th_ratio is taken as a large target, and one whose ratio is smaller than th_ratio is taken as a small target. Finally, according to the updated detection box information, the updated identity information and the large/small target division results, determine the small-target-preference keypoint heat map label label_hm_s, the large-target-preference keypoint heat map label label_hm_l, the difference-fusion keypoint heat map label label_hm, the bounding box label label_bbox, the keypoint offset label label_offset and the re-identification identity label label_id;
The concrete implementation steps are as follows:
(3b1) Rotate each training sample by a random angle θ, θ ∈ [-5, 5]; apply a random scale change with coefficient s, s ∈ [0.9, 1.1], to each rotated training sample; then apply a random image brightness change with coefficient r, r ∈ [-0.2, 0.2], to each scaled training sample, obtaining bs training samples after random data enhancement;
(3b2) Synchronously update the detection box labels according to the values of θ and s, obtaining bs data-enhanced training samples with updated detection box information.
(3b3) Determine the training labels of the large-target and small-target keypoint heat map prediction sub-networks.
The large and small targets are divided by using the ratio of the height h_i of the pedestrian target box to the height H of the input image as the division criterion:

divide_i = HmL, if h_i / H > th_ratio;  divide_i = HmS, otherwise

where divide_i denotes the division result. If a target is divided into HmL, it serves as a supervised training sample for the prediction result of the large-target-preference keypoint heat map prediction sub-network and is otherwise ignored; if it is divided into HmS, it serves as a supervised training sample for the prediction result of the small-target-preference keypoint heat map prediction sub-network and is otherwise ignored. For the predicted heat map Hm obtained by the fusion module, all target samples are supervised training samples. Through the above process, the target sample division result of each keypoint heat map is obtained;
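A small sketch of this division rule follows; the value of th_ratio is illustrative, since the patent only states that a threshold th_ratio is used:

```python
def split_targets(boxes, img_h, th_ratio=0.3):
    """Divide pedestrian boxes into large / small targets by height ratio.

    boxes: list of (x1, y1, x2, y2) tuples; th_ratio = 0.3 is an assumed value.
    """
    large, small = [], []
    for b in boxes:
        h = b[3] - b[1]
        (large if h / img_h > th_ratio else small).append(b)
    return large, small
```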
To generate the keypoint heat map training labels: for each pedestrian detection box label (x1, y1, x2, y2) in the RGB image, compute the center point of the detection box, c = (cx, cy) with cx = (x1 + x2) / 2 and cy = (y1 + y2) / 2, and treat it as the target keypoint. The keypoint training label position is determined as (cx', cy') = (floor(cx / R), floor(cy / R)), where floor(·) denotes rounding down and R is the down-sampling rate. The keypoint heat map label is finally obtained as

Ŷ_xy = exp(-((x - cx')^2 + (y - cy')^2) / (2σ_c^2))

where x and y are the coordinate index values on the keypoint heat map, Ŷ_xy is the label value of the keypoint heat map at coordinate (x, y), and σ_c is a target-size-adaptive standard deviation value. Using the target sample division results of the training samples and the keypoint heat map label function, the keypoint heat map labels corresponding to the Hm_S, Hm_L and Hm keypoint heat maps are computed, yielding label_hm_s, label_hm_l and label_hm respectively;
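The Gaussian label rendering described above can be sketched as follows; the down-sampling rate R = 4 and the per-pixel maximum when Gaussians overlap are assumptions in the spirit of CenterNet, not values taken from the patent:

```python
import numpy as np

def render_heatmap(centers, sigmas, out_h, out_w, R=4):
    """Render a keypoint heat map label from target box centers.

    centers: list of (cx, cy) box centers in input-image pixels
    sigmas:  per-target size-adaptive standard deviations sigma_c
    R:       down-sampling rate (assumed value)
    """
    hm = np.zeros((out_h, out_w), dtype=np.float32)
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    for (cx, cy), sigma in zip(centers, sigmas):
        cx_q, cy_q = int(cx / R), int(cy / R)          # quantized keypoint position
        g = np.exp(-((xs - cx_q) ** 2 + (ys - cy_q) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)                          # keep the strongest response per pixel
    return hm
```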
(3b4) Determine the training labels of the re-identification feature extraction sub-network Net_reid. Suppose a target identity label is ID_i and the smallest identity label value in the training set is ID_x; the result of ID_i - ID_x is taken as the label value label_id_i of sub-network Net_reid corresponding to that target, and the set of identity labels of all targets in the predicted image is label_id = {label_id_i};
(3b5) Determine the training labels of the keypoint offset prediction sub-network Net_offset. Suppose the center point coordinates of a target are p = (cx, cy) and the quantized coordinates are p' = floor(p / R), where floor(·) denotes rounding down and R denotes the down-sampling stride; the result of p / R - p' is taken as the training label value label_offset_i of sub-network Net_offset corresponding to that target, and the keypoint offset prediction labels of all targets in the predicted image are label_offset = {label_offset_i};
(3b6) Determine the training labels of the bounding box prediction sub-network Net_bbox. Suppose the upper-left and lower-right corner coordinates of a target box are (x1, y1) and (x2, y2); from these coordinates and the target keypoint, the distances from the keypoint to the top, bottom, left and right sides of the box are computed and taken as the training label label_bbox_i of sub-network Net_bbox corresponding to that target, and the set of bounding box prediction labels of all targets in the predicted image is label_bbox = {label_bbox_i};
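The offset and bounding box labels of steps (3b5) and (3b6) can be sketched together as follows; the ordering of the four side distances and the value R = 4 are assumptions:

```python
import numpy as np

def make_offset_and_box_labels(boxes, R=4):
    """Per-target keypoint-offset and box-side-distance labels.

    boxes: (N, 4) array of (x1, y1, x2, y2); R is the down-sampling stride.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    cx_q, cy_q = np.floor(cx / R), np.floor(cy / R)        # quantized keypoint coordinates
    label_offset = np.stack([cx / R - cx_q, cy / R - cy_q], axis=1)
    # distances from the keypoint to the left, top, right and bottom sides of the box
    label_bbox = np.stack([cx - boxes[:, 0], cy - boxes[:, 1],
                           boxes[:, 2] - cx, boxes[:, 3] - cy], axis=1)
    return label_offset, label_bbox
```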
(3c) Take the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O: the backbone network Net_backbone extracts features from each training sample, obtaining three feature maps of different scales, Feat1, Feat2 and Feat3, for the training sample;
(3d) The first feature fusion sub-network A_s adaptively fuses Feat1, Feat2 and Feat3 to obtain the feature map Feat_s. The keypoint offset prediction sub-network Net_offset, the small-target-preference keypoint heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference, obtaining the keypoint offset prediction vector Vec_offset corresponding to Net_offset, the small-target-preference keypoint heat map prediction result Hm_S corresponding to Net_hm_s, and the vector Vec_dis_bbox of distances from the keypoints to the top, bottom, left and right sides of the target box corresponding to Net_bbox. At the same time, the second feature fusion sub-network A_l adaptively fuses Feat1, Feat2 and Feat3 to obtain the feature map Feat_l. The large-target-preference keypoint heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference, obtaining the large-target-preference keypoint heat map prediction result Hm_L corresponding to Net_hm_l and the re-identification feature vector Vec_reid corresponding to Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result. The fusion module fuses the Hm_S and Hm_L keypoint heat maps to obtain the fused keypoint heat map Hm;
The first feature fusion sub-network A_s performs adaptive fusion of Feat1, Feat2 and Feat3 as follows:
(3c1) The feature fusion sub-network A_s contains three spatial attention sub-networks, denoted Net_sam1, Net_sam2 and Net_sam3. The spatial attention sub-network Net_sam1 takes the backbone output feature map Feat1 as its input, and the processing order is: Feat1 is fed into Net_sam1 → Feat1 is multiplied with the output of Net_sam1 to obtain Feat1' → Feat1 and Feat1' are added to obtain the feature map Feat1''. The other two spatial attention sub-networks Net_sam2 and Net_sam3 take Feat2 and Feat3 as inputs respectively, and the same process yields the feature maps Feat2'' and Feat3''. These three feature maps are then used as the input of the channel attention sub-network Net_cam;
(3c2) The feature maps Feat2'' and Feat3'' are up-sampled by factors of 2 and 4 respectively using transposed convolutions, giving feature maps Feat2''' and Feat3''' whose sizes are consistent with Feat1'' → Feat1'', Feat2''' and Feat3''' are concatenated to obtain Feat_sam → Feat_sam is fed into the channel attention sub-network Net_cam → Feat_sam is multiplied with the output of Net_cam to obtain Feat_cam → Feat_sam and Feat_cam are added, yielding the feature map Feat_s.
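A compact PyTorch-style sketch of this adaptive fusion is given below. The patent names the layers but not how the pooling is applied, so a CBAM-style spatial attention (channel-wise average/max maps followed by a 3x3 convolution) and a generic channel attention module are assumed; module and variable names are illustrative:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Net_sam-like module: parallel average / max pooling over channels, then a 3x3 conv."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)       # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)      # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

def fuse(feat1, feat2, feat3, sams, up2, up4, cam):
    """Adaptive fusion of the three backbone feature maps into Feat_s (or Feat_l).

    sams: three SpatialAttention modules; up2 / up4: transposed-conv modules for
    2x / 4x up-sampling; cam: channel attention module returning per-channel weights.
    """
    outs = []
    for f, sam in zip((feat1, feat2, feat3), sams):
        f_prime = f * sam(f)           # Feat_i'
        outs.append(f + f_prime)       # Feat_i''
    f1, f2, f3 = outs[0], up2(outs[1]), up4(outs[2])   # bring all maps to Feat1 size
    feat_sam = torch.cat([f1, f2, f3], dim=1)
    feat_cam = feat_sam * cam(feat_sam)
    return feat_sam + feat_cam         # Feat_s
```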
(3e) Using the L1 loss function, compute the loss value L_off of the keypoint offset prediction result from the keypoint offset predictions and their label label_offset, and compute the loss value L_bbox of the bounding box prediction result from the bounding box predictions and their label label_bbox; using the cross-entropy loss function, compute the loss value L_reid of the re-identification feature extraction result from the pedestrian identity classification results and their label label_id; then, using the loss function L_heatmap of the keypoint heat map prediction task, feed Hm_S, Hm_L and Hm together with their corresponding labels label_hm_s, label_hm_l and label_hm to compute the respective loss values L_hm_s, L_hm_l and L_hm; finally, perform adaptive weighted summation of L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm to obtain the loss value L_total of the detection and re-identification integrated network model O.
The loss value L_total of the detection and re-identification integrated network model O is computed as follows:
L_total = (1/2) × (exp(-w1) × L_det + exp(-w2) × L_reid + w1 + w2)
L_det = a × (0.6 × L_hm + 0.15 × L_hm_l + 0.25 × L_hm_s) + b × L_off + c × L_bbox
where a, b and c are constant coefficients, here a = 1, b = 1 and c = 0.1, and w1 and w2 are learnable parameters.
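As a sketch, the adaptive weighted summation can be written as follows, assuming the same uncertainty-style weighting with learnable parameters w1 and w2 used in FairMOT (the patent states only that w1 and w2 are learnable):

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Adaptive weighting of the detection and re-identification losses."""
    def __init__(self, a=1.0, b=1.0, c=0.1):
        super().__init__()
        self.a, self.b, self.c = a, b, c
        self.w1 = nn.Parameter(torch.zeros(1))   # learnable weight for the detection loss
        self.w2 = nn.Parameter(torch.zeros(1))   # learnable weight for the re-id loss

    def forward(self, l_hm, l_hm_l, l_hm_s, l_off, l_bbox, l_reid):
        l_det = self.a * (0.6 * l_hm + 0.15 * l_hm_l + 0.25 * l_hm_s) \
                + self.b * l_off + self.c * l_bbox
        return 0.5 * (torch.exp(-self.w1) * l_det
                      + torch.exp(-self.w2) * l_reid
                      + self.w1 + self.w2)
```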
(3f) Using the back-propagation method, compute the gradient of the weight parameters of the detection and re-identification integrated network model O from the loss value L_total, and then update the weight parameters θ_J with the gradient descent algorithm using the computed gradient of the weight parameters of O;
The weight parameters θ_J are updated with the gradient of the weight parameters of O according to the following formula:

θ_J' = θ_J - α_J · ∇_{θ_J} L_total

where θ_J' denotes the updated network parameters, θ_J denotes the network parameters before the update, α_J denotes the step size, and ∇_{θ_J} L_total denotes the gradient of the network parameters of O.
(3g) Judge whether t > T; if so, the trained detection and re-identification integrated network model O' is obtained; otherwise let t = t + 1 and go to step (3b);
Step 4) Obtain the pedestrian multi-target tracking result:
(4a) Initialize the test sample set D_test: the k-th test sample S_k contains P RGB image frames, the p-th RGB image frame is f^(p), P > 200; let k = 1 and initialize the historical trajectory set Tra^(k) = {};
(4b) Let p be 1;
(4c) Take the p-th RGB image frame f^(p) of the test sample S_k as the input of the trained detection and re-identification integrated network model O' and propagate forward, obtaining the keypoint offset prediction Vec_offset, the distances Vec_dis_bbox from the keypoints to the top, bottom, left and right sides of the target boxes, the keypoint heat map prediction result Hm and the re-identification feature vectors Vec_reid of f^(p); decode Vec_offset, Vec_dis_bbox and Hm to obtain the pedestrian detection box set Det = {det_i | 0 ≤ i ≤ DN - 1} of f^(p), where det_i is the detection box of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
(4d) Screen out the pedestrian targets of f^(p) whose detected keypoint response value conf_i is greater than a response threshold th_conf, Object = {object_i | conf_i > th_conf, 0 ≤ i ≤ DN - 1}, and obtain the detection boxes and the re-identification feature vector information corresponding to these pedestrian targets from the Det set and the Vec_reid vectors;
The detection box and the re-identification feature vector information corresponding to a pedestrian target are obtained from the Det set and the Vec_reid vectors as follows:
(4d1) Use the subscript of the target object_i to index the Det set and obtain its detection box det_i, then use the center point coordinates of det_i as an index to query the Vec_reid vectors and obtain its re-identification feature vector embed_i;
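A simplified sketch of this decoding and screening step follows; peak extraction / non-maximum suppression on the heat map is omitted, and the threshold value and the scale of the side distances are assumptions:

```python
import numpy as np

def decode_detections(hm, offset, dist_bbox, th_conf=0.4, R=4):
    """Decode pedestrian boxes from the fused heat map and filter by confidence.

    hm:        (H, W) fused keypoint heat map Hm
    offset:    (2, H, W) keypoint offset predictions Vec_offset
    dist_bbox: (4, H, W) distances to the box sides Vec_dis_bbox,
               assumed to be expressed at input-image scale
    th_conf:   response threshold (0.4 is an illustrative value)
    """
    ys, xs = np.where(hm > th_conf)                 # keep keypoints above the threshold
    dets, confs = [], []
    for y, x in zip(ys, xs):
        cx = (x + offset[0, y, x]) * R              # refine and map back to input scale
        cy = (y + offset[1, y, x]) * R
        l, t, r, b = dist_bbox[:, y, x]
        dets.append((cx - l, cy - t, cx + r, cy + b))
        confs.append(hm[y, x])
    return np.array(dets), np.array(confs)
```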
(4e) According to the detection boxes and the re-identification feature vector information of the screened pedestrian targets, use an online association method to perform data association between the screened pedestrian target set Object and Tra^(k), obtaining the pedestrian multi-target tracking result of f^(p);
the correlation method adopts the same online correlation method as in the FairMOT algorithm, and specifically comprises the following steps:
(4e1) Define the attributes of a trajectory. The ordered set of detection boxes of one pedestrian in the tracking scene is called a trajectory Tra_i, and each trajectory has the following attributes: the information of the current trajectory target box, i.e. the upper-left and lower-right corner coordinates of the box; the state of the trajectory; the re-identification feature vector of the trajectory; the life-cycle length of the trajectory; the number of consecutively unmatched frames; and motion information.
The state of a trajectory is one of three states: active, lost and inactive. An active-state trajectory is a trajectory that was matched with a detection box in the previous frame; a lost-state trajectory is a trajectory that was not matched with a detection box in the previous frame but whose number of consecutively unmatched frames does not exceed the life-cycle length; a trajectory whose number of consecutively unmatched frames exceeds the life-cycle length is an inactive-state trajectory. The re-identification feature vector of a trajectory represents the appearance of the trajectory target; during association matching, the cosine similarity between the vectors of a trajectory and a detection is computed to judge the likelihood that they belong to the same track. The life-cycle length of a trajectory is the threshold on the maximum number of consecutively unmatched frames; when it is exceeded, the trajectory is set to the inactive state. The motion information is obtained and processed with a Kalman filtering algorithm, which estimates, for all positions of the trajectory, the horizontal and vertical coordinates of the target center, the aspect ratio and height of the current target box, and the velocity variables of these four states; the parameters of the Kalman filtering algorithm are updated according to the final matching result.
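The trajectory attributes listed above can be summarized in a small data structure; the field names and the default life-cycle length are illustrative, not the patent's notation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Track:
    """Per-trajectory attributes used during online association."""
    box: np.ndarray                 # current target box (x1, y1, x2, y2)
    embed: np.ndarray               # re-identification feature vector of the trajectory
    state: str = "active"           # "active", "lost" or "inactive"
    lost_frames: int = 0            # number of consecutively unmatched frames
    life_span: int = 30             # max unmatched frames before deactivation (assumed value)
    kalman_mean: np.ndarray = None  # Kalman state: center x/y, aspect ratio, height + velocities
    kalman_cov: np.ndarray = None   # Kalman state covariance
```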
(4e2) For all active-state trajectories and all lost-state trajectories, first estimate the coordinate positions of the trajectory boxes with the Kalman filtering algorithm, then compute the Mahalanobis distance matrix Matrix_DisMotion between the trajectory boxes and all detected target boxes of the current frame. Entries of the matrix greater than the threshold th_md are modified to an infinite value and the remaining entries are left unchanged, giving the final motion-prediction distance matrix Matrix'_DisMotion. At the same time, compute the cosine similarity distance matrix Matrix_DisEmbed between the re-identification feature vectors of the trajectories and of the detections. Finally the two are fused according to the following formula:

Matrix_Dis = λ · Matrix_DisEmbed + (1 - λ) · Matrix'_DisMotion

The final distance matrix Matrix_Dis is obtained; the Hungarian algorithm is applied to this distance matrix for optimal matching, and the trajectory states are updated;
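A minimal sketch of this fusion-and-matching step is given below; the values of lam, th_md and the final matching threshold are illustrative, since the patent only names λ and th_md:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(dist_embed, dist_motion, lam=0.9, th_md=1e4, max_dist=0.7):
    """Fuse appearance and motion distances and run Hungarian matching.

    dist_embed:  cosine-distance matrix between track and detection embeddings
    dist_motion: Mahalanobis-distance matrix computed from the Kalman prediction
    """
    dist_motion = np.where(dist_motion > th_md, np.inf, dist_motion)
    cost = lam * dist_embed + (1.0 - lam) * dist_motion
    cost = np.nan_to_num(cost, posinf=1e6)          # the Hungarian solver needs finite costs
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]
    return matches
```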
(4e3) For the active-state trajectories left unmatched in the previous step, compute the overlap-ratio matrix between them and the remaining detection boxes; for each such trajectory, find the detection with the largest overlap ratio whose value is greater than the threshold th_iou, match it to the trajectory and update the trajectory state;
(4e4) For a trajectory that is still unmatched: if it is in the active state, change its state to the lost state and start counting its lost frames; if it is in the lost state, increase its lost count by one, and if the number of lost frames is greater than or equal to the life-cycle length of the trajectory, set its state to the inactive state. Detections that remain unmatched are initialized as new trajectories;
(4e5) Output the matching association information of the current frame;
(4f) Judge whether p ≥ P; if so, the pedestrian multi-target tracking result of the test sample S_k is obtained; otherwise let p = p + 1, update the historical trajectory set Tra^(k), and go to step (4c);
(4g) Judge whether k ≥ K; if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise let k = k + 1 and go to step (4b).
The effect of the present invention is further explained by combining the simulation experiment as follows:
1. simulation conditions and contents:
the hardware platform of the simulation experiment is as follows: the graphics card is configured as Nvidia RTX2080Ti × 2, the processor is configured as xeon (r) E5-2620 v4@2.10Ghz × 32, and the memory is configured as 64 GB.
The software platform of the simulation experiment is as follows: the operating system is Ubuntu16.04LTS, the Python version is 3.7, the Pythroch version is 1.2.0, and the OpenCV version is 3.4.0.
The integrated network provided by the invention is characterized in that a training image sequence data set used in a simulation experiment is a crowdHuman, ETH, CityPerson, CalTech, CUHK-SYSU, PRW and MOT17train data set, the integrated network is pre-trained for 60 generations on the crowdHuman data set, and then is trained for 30 generations on the other data sets to obtain test model parameters; and the test image sequence data set is an MOT17test data set, wherein the test image sequence data set comprises image sequences under 7 different tracking scenes, and comprises 5919 frame images and 785 pedestrian tracks.
The Tracking accuracy of the multi-target Tracking method disclosed in the present invention and the paper "FairMOT of On the Fairness of Detection and Re-Identification in Multiple Object Tracking" published in 2020 by Yifu Zhang et al was compared and simulated, and the results are shown in Table 1
2. Simulation result analysis:
In order to evaluate the tracking accuracy, the following evaluation index (tracking accuracy, MOTA) is used to compute the accuracy of the tracking results of the invention and of the prior art, and the results are listed in Table 1:

MOTA = 1 - (FN + FP + IDSW) / GT
Table 1. Tracking accuracy comparison between the invention and the prior art (the table contents are provided as an image in the original document).
Wherein, FN is the number of false negative targets, FP is the number of false positive targets, IDSW is the number of identity switching times, and GT is the number of targets in the truth label.
As can be seen from Table 1, the tracking accuracy of the invention is improved by 0.8 compared with the prior art, a clear improvement over the existing method.
The above simulation experiments show that, when constructing the detection and re-identification integrated network model based on multivariate difference fusion, introducing differences in training data, training mode and network structure makes the two keypoint heat map prediction sub-networks form prediction preferences for targets of different sizes; adding and fusing the prediction results of the two sub-networks yields the multivariate-difference-fusion keypoint heat map, which solves the problem of low detection recall caused by using only a single keypoint heat map prediction sub-network in the prior art and improves the tracking accuracy of the algorithm.

Claims (6)

1. A pedestrian multi-target tracking method based on multivariate difference fusion is characterized by comprising the following steps:
(1) Obtain a training sample set D_train and a test sample set D_test:
(1a) Preprocess the selected V RGB image sequences carrying pedestrian detection box labels and identity labels to obtain a preprocessed RGB image frame sequence set S = {S_v | 1 ≤ v ≤ V}. Take the RGB image frames contained in I of the preprocessed RGB image frame sequences as the training sample set D_train, and take the remaining K preprocessed RGB image frame sequences as the test sample set D_test, where S_v = {f^(n) | 1 ≤ n ≤ L_v} denotes the v-th preprocessed RGB image frame sequence containing L_v frames, f^(n) denotes the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_v > 200;
(2) Construct a detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) Construct the structure of the detection and re-identification integrated network model O based on multivariate difference fusion:
The model O comprises a backbone network Net_backbone; a first feature fusion sub-network A_s and a second feature fusion sub-network A_l, arranged in parallel with identical structure and cascaded with Net_backbone; and a fusion module. The output of the first feature fusion sub-network A_s is connected to a keypoint offset prediction sub-network Net_offset, a small-target-preference keypoint heat map prediction sub-network Net_hm_s and a bounding box prediction sub-network Net_bbox arranged in parallel; the output of the second feature fusion sub-network A_l is connected to a large-target-preference keypoint heat map prediction sub-network Net_hm_l and a re-identification feature extraction sub-network Net_reid arranged in parallel, wherein:
The backbone network Net_backbone adopts a tree-structured aggregation-iteration network composed of a plurality of two-dimensional convolution layers, batch normalization layers, two-dimensional pooling layers, deformable convolution layers and transposed convolution layers;
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each comprise a plurality of spatial attention sub-networks Net_sam and one channel attention sub-network Net_cam. The spatial attention sub-network Net_sam comprises a global average pooling layer and a global max pooling layer arranged in parallel, and a two-dimensional convolution layer connected to both pooling layers; the channel attention sub-network Net_cam comprises a global average pooling layer and a global max pooling layer arranged in parallel, each cascaded with two two-dimensional convolution layers. The Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks all adopt a structure consisting of a first convolution layer, a ReLU activation layer and a second convolution layer cascaded in sequence; a fully connected layer is cascaded to the output of Net_reid, and the outputs of Net_hm_s and Net_hm_l are cascaded to the fusion module;
(2b) Define a loss function L_heatmap for the keypoint heat map prediction task:

L_heatmap = -(1/N) Σ_xy { (1 - Y_xy)^α · log(Y_xy),                  if Ŷ_xy = 1
                          (1 - Ŷ_xy)^β · (Y_xy)^α · log(1 - Y_xy),   otherwise }

where N denotes the number of keypoints in the predicted keypoint heat map, α and β denote hyper-parameters, Ŷ_xy and Y_xy respectively denote the label and the response value of the keypoint at coordinate (x, y) in the predicted keypoint heat map, Σ denotes the summation operation, and log denotes the logarithm operation;
(3) carrying out iterative training on a detection and re-recognition integrated network model O based on multivariate difference fusion:
(3a) initializing detection and re-identifying integrated network model O with weight parameter thetaJThe iteration frequency is T, the maximum iteration frequency is T, T is more than or equal to 50000, and T is made to be 0;
(3b) performing random data enhancement on bs ∈ [16, 64] training samples randomly selected from the training sample set D_train, and updating the detection box information of each training sample according to the enhancement mode, to obtain bs data-enhanced training samples with updated detection box information; a pedestrian target whose updated detection box height divided by the image frame height is greater than a threshold th_ratio is taken as a large target, and one whose ratio is smaller than th_ratio is taken as a small target; finally, according to the updated detection box information, the identity information and the large/small target division result, the small-target-preference keypoint heat map label Y_hm_s, the large-target-preference keypoint heat map label Y_hm_l, the difference fusion keypoint heat map label Y_hm, the bounding box label label_bbox, the keypoint offset label label_offset and the re-identification identity label label_id are determined;
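For illustration only (not part of the claim), a minimal sketch of the large/small target division by box-height ratio; the threshold value 0.2 and the (x1, y1, x2, y2) box format are assumptions.

def split_targets_by_height(boxes, frame_height, th_ratio=0.2):
    # boxes: iterable of (x1, y1, x2, y2); returns (large_targets, small_targets)
    large, small = [], []
    for x1, y1, x2, y2 in boxes:
        ratio = (y2 - y1) / frame_height
        (large if ratio > th_ratio else small).append((x1, y1, x2, y2))
    return large, small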
(3c) taking the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O: the backbone network Net_backbone extracts features from each training sample to obtain three feature maps of different scales, Feat_1, Feat_2 and Feat_3;
(3d) the first feature fusion sub-network A_s adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_s; the keypoint offset prediction sub-network Net_offset, the small-target-preference keypoint heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference to obtain, respectively, the keypoint offset prediction vector Vec_offset of Net_offset, the small-target-preference keypoint heat map prediction result Hm_S of Net_hm_s, and the vector Vec_dis_bbox of distances from the keypoint to the upper, lower, left and right sides of the target box of Net_bbox; at the same time, the second feature fusion sub-network A_l adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_l, and the large-target-preference keypoint heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference to obtain the large-target-preference keypoint heat map prediction result Hm_L of Net_hm_l and the re-identification feature vector Vec_reid of Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result, and the fusion module fuses the heat maps Hm_S and Hm_L to obtain the fused keypoint heat map Hm;
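For illustration only (not part of the claim), a minimal PyTorch-style sketch of the conv-ReLU-conv prediction heads described above; all channel widths and the re-identification embedding dimension are assumptions.

import torch.nn as nn

def make_head(in_channels, mid_channels, out_channels):
    # first conv -> ReLU -> second conv, as each prediction sub-network is described
    return nn.Sequential(
        nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_channels, out_channels, kernel_size=3, stride=1, padding=1),
    )

net_offset = make_head(64, 128, 2)    # keypoint offsets (dx, dy)
net_hm_s   = make_head(64, 128, 1)    # small-target-preference heat map
net_bbox   = make_head(64, 128, 4)    # distances to the upper/lower/left/right box sides
net_hm_l   = make_head(64, 128, 1)    # large-target-preference heat map
net_reid   = make_head(64, 128, 128)  # re-identification embedding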
(3e) using the L1 loss function, the loss value L_off of the keypoint offset prediction result is computed from the keypoint offset prediction value and its label label_offset, and the loss value L_bbox of the bounding box prediction result is computed from the bounding box prediction value and its label label_bbox; using the cross-entropy loss function, the loss value L_reid of the re-identification feature extraction result is computed from the pedestrian identity classification result and its label label_id; then, using the loss function L_heatmap of the keypoint heat map prediction task, Hm_S, Hm_L and Hm together with their corresponding labels Y_hm_s, Y_hm_l and Y_hm are input to compute the respective loss values L_hm_s, L_hm_l and L_hm; finally, L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm are adaptively weighted and summed to obtain the loss value L_total of the detection and re-identification integrated network model O;
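For illustration only (not part of the claim), a sketch of how the offset, bounding box and identity loss terms might be computed; tensor names, shapes and the keypoint mask are assumptions.

import torch.nn.functional as F

def task_losses(pred_offset, gt_offset, pred_bbox, gt_bbox, id_logits, id_labels, keypoint_mask):
    # L1 losses evaluated only at labelled keypoint locations; cross-entropy on identity logits
    norm = keypoint_mask.sum().clamp(min=1.0)
    l_off  = F.l1_loss(pred_offset * keypoint_mask, gt_offset * keypoint_mask, reduction='sum') / norm
    l_bbox = F.l1_loss(pred_bbox * keypoint_mask, gt_bbox * keypoint_mask, reduction='sum') / norm
    l_reid = F.cross_entropy(id_logits, id_labels, ignore_index=-1)  # -1 marks samples without identity
    return l_off, l_bbox, l_reid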
(3f) using the back-propagation method, the gradients of the weight parameters of the detection and re-identification integrated network model O are computed from the loss value L_total, and the weight parameters θ_J are then updated with these gradients using the gradient descent algorithm;
(3g) judging whether t > T: if so, the trained detection and re-identification integrated network model O' is obtained; otherwise, let t = t + 1 and return to step (3b);
(4) acquiring the pedestrian multi-target tracking results:

(4a) initializing the test sample set D_test: the k-th test sample D_test^(k) comprises P RGB image frames, P > 200, the p-th RGB image frame being f^(p); let k = 1 and initialize the historical track set Tra^(k) = {};

(4b) let p = 1;
(4c) the p-th RGB image frame f^(p) of the test sample D_test^(k) is forward-propagated as the input of the trained detection and re-identification integrated network model O' to obtain, for f^(p), the keypoint offset prediction value Vec_offset, the distances Vec_dis_bbox from the keypoint to the upper, lower, left and right sides of the target box, the keypoint heat map prediction result Hm and the re-identification feature vector Vec_reid; Vec_offset, Vec_dis_bbox and Hm are then decoded to obtain the pedestrian detection box set of f^(p), Det = {det_i | 0 ≤ i ≤ DN − 1}, where det_i is the detection box of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
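For illustration only (not part of the claim), a sketch of one possible way to decode Vec_offset, Vec_dis_bbox and Hm into detection boxes; the top-k peak extraction scheme and the tensor layouts are assumptions.

import torch

def decode_detections(hm, offset, dis_bbox, top_k=100):
    # hm: (B, 1, H, W) fused heat map; offset: (B, 2, H, W); dis_bbox: (B, 4, H, W) = (up, down, left, right)
    B, _, H, W = hm.shape
    scores, inds = hm.view(B, -1).topk(top_k)                 # keypoint response values and positions
    ys = torch.div(inds, W, rounding_mode='floor').float()
    xs = (inds % W).float()
    off = offset.view(B, 2, -1).gather(2, inds.unsqueeze(1).expand(-1, 2, -1))
    dis = dis_bbox.view(B, 4, -1).gather(2, inds.unsqueeze(1).expand(-1, 4, -1))
    cx, cy = xs + off[:, 0], ys + off[:, 1]
    boxes = torch.stack([cx - dis[:, 2], cy - dis[:, 0],      # x1 = cx - left, y1 = cy - up
                         cx + dis[:, 3], cy + dis[:, 1]], dim=2)
    return boxes, scores                                      # (B, top_k, 4), (B, top_k)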
(4d) screening out, from f^(p), the pedestrian target set Object = {Object_i | conf_i > th_conf, 0 ≤ i ≤ DN − 1} whose detected keypoint response value conf_i is greater than the response threshold th_conf, and obtaining the detection box and re-identification feature vector information corresponding to each screened pedestrian target from the Det set and the Vec_reid vector;
(4e) according to the detection box and re-identification feature vector information of the screened pedestrian targets, an online association method is used to perform data association between the screened pedestrian target set Object and the historical track set Tra^(k), obtaining the pedestrian multi-target tracking result of f^(p);
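For illustration only (not part of the claim), a sketch of one simple online association scheme between detections and history tracks; the cosine-distance cost, the Hungarian solver and the matching threshold are assumptions, as the claim does not fix the association method.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(det_feats, track_feats, max_cost=0.6):
    # cosine-distance cost between detection and track re-identification features
    det = det_feats / (np.linalg.norm(det_feats, axis=1, keepdims=True) + 1e-12)
    trk = track_feats / (np.linalg.norm(track_feats, axis=1, keepdims=True) + 1e-12)
    cost = 1.0 - det @ trk.T
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]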
(4f) judging whether p ≥ P: if so, the pedestrian multi-target tracking result of the test sample D_test^(k) is obtained; otherwise, let p = p + 1, update the historical track set Tra^(k), and return to step (4c);
(4g) judging whether k ≥ K: if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise, let k = k + 1 and return to step (4b).
2. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the preprocessing of the selected V RGB image sequences with pedestrian detection box labels and identity labels in step (1a) is implemented as follows:
(1a1) resizing each RGB image frame in each RGB image sequence by bilinear interpolation to obtain an RGB image frame sequence set S_v' in which all RGB image frames have the same size;
(1a2) in the RGB image frame sequence set S_v', synchronously updating the pedestrian detection box labels with the rescaled images, and uniformly encoding the pedestrian identity labels, i.e. setting the identity label to -1 for data samples lacking identity information and assigning codes that increase sequentially from 1 for each distinct pedestrian identity, so as to obtain the RGB image frame sequence set in which the RGB image frames have been resized and the detection box labels and identity labels have been updated, thereby completing the preprocessing of the V RGB image sequences with pedestrian detection box labels and identity labels.
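For illustration only (not part of the claim), a minimal sketch of the bilinear resize with synchronous scaling of box labels; the output resolution is an assumption.

import cv2

def resize_frame_and_boxes(frame, boxes, out_w=1088, out_h=608):
    # bilinear resize of one RGB frame and synchronous scaling of its detection-box labels
    h, w = frame.shape[:2]
    resized = cv2.resize(frame, (out_w, out_h), interpolation=cv2.INTER_LINEAR)
    sx, sy = out_w / w, out_h / h
    scaled = [(x1 * sx, y1 * sy, x2 * sx, y2 * sy) for (x1, y1, x2, y2) in boxes]
    return resized, scaled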
3. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein, in the structure of the multivariate-difference-fusion-based detection and re-identification integrated network model O constructed in step (2a):
the backbone network Net_backbone contains 27 two-dimensional convolution layers, 37 batch normalization layers, 6 two-dimensional pooling layers, 4 deformable convolution layers and 2 transposed convolution layers;
the first feature fusion sub-network A_s and the second feature fusion sub-network A_l each contain three structurally identical spatial attention sub-networks Net_sam, in which the two-dimensional convolution layer has a 3x3 convolution kernel, a stride of 1 and an output dimension of 1; each channel attention sub-network Net_cam contains 4 two-dimensional convolution layers, each with a 1x1 convolution kernel and a stride of 1;
the first and second convolution layers contained in the Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks all have 3x3 convolution kernels with a stride of 1.
4. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the random data enhancement of the bs training samples randomly selected from the training sample set D_train and the updating of the detection box information of each training sample according to the enhancement mode in step (3b) are implemented as follows:
(3b1) rotating each training sample by a random angle θ, θ ∈ [-5, 5]; applying a random scale change with coefficient s, s ∈ [0.9, 1.1], to each rotated training sample; and then applying a random image brightness change with coefficient r, r ∈ [-0.2, 0.2], to each rescaled training sample, so as to obtain bs training samples after random data enhancement;
(3b2) synchronously updating the detection box labels according to the values of θ and s to obtain bs data-enhanced training samples with updated detection box information.
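For illustration only (not part of the claim), a minimal sketch of drawing the augmentation coefficients in the ranges stated above; uniform sampling is an assumption.

import random

def sample_augmentation():
    # draws the rotation angle, scale coefficient and brightness coefficient in the stated ranges
    theta = random.uniform(-5.0, 5.0)
    s = random.uniform(0.9, 1.1)
    r = random.uniform(-0.2, 0.2)
    return theta, s, r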
5. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the loss value L_total of the detection and re-identification integrated network model O in step (3e) is calculated as:
L_total = (1/2) × (e^(-w1) × L_det + e^(-w2) × L_reid + w1 + w2)

L_det = a × (0.6 × L_hm + 0.15 × L_hm_l + 0.25 × L_hm_s) + b × L_off + c × L_bbox

where the parameters a, b and c are constant coefficients, and w1 and w2 are learnable parameters.
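For illustration only (not part of the claim), a sketch of this uncertainty-style adaptive weighting; the closed form of L_total shown above is a reconstruction, and the default values of a, b and c are assumptions.

import torch

def total_loss(l_hm, l_hm_l, l_hm_s, l_off, l_bbox, l_reid, w1, w2, a=1.0, b=1.0, c=1.0):
    # w1, w2 are learnable scalar nn.Parameter tensors; a, b, c are constant coefficients
    l_det = a * (0.6 * l_hm + 0.15 * l_hm_l + 0.25 * l_hm_s) + b * l_off + c * l_bbox
    return 0.5 * (torch.exp(-w1) * l_det + torch.exp(-w2) * l_reid + w1 + w2)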
6. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the weight parameters θ_J are updated with the weight parameter gradients of O in step (3f) according to the following formula:

θ_J^new = θ_J^old − α_J × ∇_θ L_total

where θ_J^new denotes the updated network parameters, θ_J^old denotes the network parameters before the update, α_J denotes the step size, and ∇_θ L_total denotes the weight parameter gradient of O.
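For illustration only (not part of the claim), a minimal PyTorch sketch of this plain gradient-descent update applied to every weight parameter of O; momentum, weight decay and any optimizer state are deliberately omitted.

import torch

@torch.no_grad()
def gradient_descent_step(parameters, alpha):
    # theta_new = theta_old - alpha * gradient, applied to every weight parameter of O
    for p in parameters:
        if p.grad is not None:
            p -= alpha * p.grad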


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant