CN113221787A - Pedestrian multi-target tracking method based on multivariate difference fusion - Google Patents

Pedestrian multi-target tracking method based on multivariate difference fusion

Info

Publication number
CN113221787A
Authority
CN
China
Prior art keywords
net
pedestrian
detection
fusion
key point
Prior art date
Legal status
Granted
Application number
CN202110556574.7A
Other languages
Chinese (zh)
Other versions
CN113221787B (en)
Inventor
韩红
迟勇欣
张齐驰
王毅飞
范迎春
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202110556574.7A
Publication of CN113221787A
Application granted
Publication of CN113221787B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F18/2155 Generating training patterns; Bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian multi-target tracking method based on multivariate difference fusion, comprising the steps of: (1) acquiring a training sample set and a test sample set; (2) constructing a detection and re-identification integrated network model based on multivariate difference fusion; (3) iteratively training the detection and re-identification integrated network model based on multivariate difference fusion; and (4) acquiring the pedestrian multi-target tracking result. When the detection and re-identification integrated network model based on multivariate difference fusion is constructed, differences in training data, training mode and network structure are introduced so that the two keypoint heat map prediction sub-networks form prediction preferences for targets of different sizes; the prediction results of the two sub-networks are added and fused to obtain a multivariate-difference-fusion keypoint heat map. This solves the problem of low detection recall caused by using only a single keypoint heat map prediction sub-network in the prior art, and improves the tracking accuracy of the algorithm.

Description

Pedestrian multi-target tracking method based on multivariate difference fusion
Technical Field
The invention belongs to the technical field of computer vision and relates to a pedestrian multi-target tracking method based on multivariate difference fusion, which can be used for pedestrian multi-target tracking tasks in fields such as security surveillance, video content understanding and human-computer interaction.
Background
Pedestrian multi-target tracking algorithms are widely applied in fields such as security surveillance, video content understanding, human-computer interaction and intelligent nursing. In recent years, with the rise and popularization of deep learning, pedestrian multi-target tracking has gradually converged on an algorithmic paradigm combining three basic modules: target detection, re-identification feature extraction and data association. The target detection module detects and localizes all pedestrian targets in the scene; the re-identification feature extraction module extracts and encodes pedestrian appearance information; and the data association module estimates the similarity between historical trajectories and the pedestrians detected in the current frame according to the information provided by the detection and re-identification feature extraction modules, and performs optimal association matching according to this similarity to form trajectories.
In "FairMOT", published in 2020 at the IEEE Conference on Computer Vision and Pattern Recognition, Yifu Zhang et al. disclose a multi-target tracking algorithm that integrates the detection and re-identification tasks into one network. It adds a re-identification feature extraction sub-network on top of the CenterNet detection network so that the detection and re-identification tasks share a large number of convolutional layer parameters and features, thereby reducing the number of network parameters and the amount of computation, improving the execution efficiency of the system, and achieving a good balance between speed and accuracy.
However, the FairMOT algorithm merely integrates the detection and re-identification feature extraction tasks: the four prediction-task branch sub-networks share only one fused feature map, which causes intense competition for features among the tasks and inhibits the further learning of each task. In addition, in scenes where target scales differ greatly, FairMOT ignores the feature differences between targets of very different sizes and uses only one target-center-point heat map prediction sub-network to recall targets of all scales. Although convolutional neural networks can learn to adapt to changes in scale, texture and the like, when learning from targets with large differences the network tends to seek a compromise between them, which suppresses the detection recall of pedestrian targets and reduces the accuracy of multi-target tracking.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a pedestrian multi-target tracking method based on multivariate difference fusion, so as to solve the technical problem of low detection recall in scenes with large differences in target scale.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) Obtain a training sample set D_train and a test sample set D_test:
(1a) Preprocess the selected V RGB image sequences carrying pedestrian detection box labels and identity labels to obtain a preprocessed RGB image frame sequence set S = {S_v | 1 ≤ v ≤ V}. Take the RGB image frames contained in I of the preprocessed RGB image frame sequences as the training sample set D_train, and take the remaining K preprocessed RGB image frame sequences as the test sample set D_test, where S_v = {f^(n) | 1 ≤ n ≤ L_v} denotes the v-th preprocessed RGB image frame sequence containing L_v frames, f^(n) denotes the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_v > 200;
(2) Construct a detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) Construct the structure of the detection and re-identification integrated network model O based on multivariate difference fusion:
The model O comprises a backbone network Net_backbone; a first feature fusion sub-network A_s and a second feature fusion sub-network A_l, arranged in parallel with identical structure and cascaded with Net_backbone; and a fusion module. The output of the first feature fusion sub-network A_s is connected to a keypoint offset prediction sub-network Net_offset, a small-target-preference keypoint heat map prediction sub-network Net_hm_s and a bounding box prediction sub-network Net_bbox arranged in parallel; the output of the second feature fusion sub-network A_l is connected to a large-target-preference keypoint heat map prediction sub-network Net_hm_l and a re-identification feature extraction sub-network Net_reid arranged in parallel, wherein:
The backbone network Net_backbone adopts a tree-structured aggregation-iteration network composed of a plurality of two-dimensional convolution layers, batch normalization layers, two-dimensional pooling layers, deformable convolution layers and transposed convolution layers;
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each comprise a plurality of spatial attention sub-networks Net_sam and one channel attention sub-network Net_cam. The spatial attention sub-network Net_sam comprises a global average pooling layer and a global max pooling layer arranged in parallel, and a two-dimensional convolution layer connected to both pooling layers; the channel attention sub-network Net_cam comprises a global average pooling layer and a global max pooling layer arranged in parallel, each cascaded with two two-dimensional convolution layers. The Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks all adopt a structure consisting of a first convolution layer, a ReLU activation layer and a second convolution layer cascaded in sequence; a fully connected layer is cascaded to the output of Net_reid, and the outputs of Net_hm_s and Net_hm_l are cascaded to the fusion module;
(2b) Define a loss function L_heatmap for the keypoint heat map prediction task:

L_heatmap = -(1/N) Σ_xy { (1 - Y_xy)^α · log(Y_xy),                  if Ŷ_xy = 1
                          (1 - Ŷ_xy)^β · (Y_xy)^α · log(1 - Y_xy),   otherwise }

where N denotes the number of keypoints in the predicted keypoint heat map, α and β denote hyper-parameters, Ŷ_xy and Y_xy respectively denote the label and the response value of the keypoint at coordinate (x, y) in the predicted keypoint heat map, Σ denotes the summation operation, and log denotes the logarithm operation;
(3) Perform iterative training on the detection and re-identification integrated network model O based on multivariate difference fusion:
(3a) Initialize the weight parameters θ_J of the detection and re-identification integrated network model O, let the iteration index be t and the maximum number of iterations be T, T ≥ 50000, and set t = 0;
(3b) Randomly select bs ∈ [16, 64] training samples from the training sample set D_train, perform random data enhancement on each training sample, and update the detection box information of each training sample according to the enhancement applied, obtaining bs data-enhanced training samples with updated detection box information. A pedestrian target whose ratio of updated detection box height to image frame height is greater than a threshold th_ratio is taken as a large target, and one whose ratio is smaller than th_ratio is taken as a small target. Finally, according to the updated detection box information, the updated identity information and the large/small target division results, determine the small-target-preference keypoint heat map label label_hm_s, the large-target-preference keypoint heat map label label_hm_l, the difference-fusion keypoint heat map label label_hm, the bounding box label label_bbox, the keypoint offset label label_offset and the re-identification identity label label_id;
(3c) Take the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O: the backbone network Net_backbone extracts features from each training sample, obtaining three feature maps of different scales, Feat1, Feat2 and Feat3, for the training sample;
(3d) The first feature fusion sub-network A_s adaptively fuses Feat1, Feat2 and Feat3 to obtain the feature map Feat_s. The keypoint offset prediction sub-network Net_offset, the small-target-preference keypoint heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference, obtaining the keypoint offset prediction vector Vec_offset corresponding to Net_offset, the small-target-preference keypoint heat map prediction result Hm_S corresponding to Net_hm_s, and the vector Vec_dis_bbox of distances from the keypoints to the top, bottom, left and right sides of the target box corresponding to Net_bbox. At the same time, the second feature fusion sub-network A_l adaptively fuses Feat1, Feat2 and Feat3 to obtain the feature map Feat_l. The large-target-preference keypoint heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference, obtaining the large-target-preference keypoint heat map prediction result Hm_L corresponding to Net_hm_l and the re-identification feature vector Vec_reid corresponding to Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result. The fusion module fuses the Hm_S and Hm_L keypoint heat maps to obtain the fused keypoint heat map Hm;
(3e) Using the L1 loss function, compute the loss value L_off of the keypoint offset prediction result from the keypoint offset predictions and their label label_offset, and compute the loss value L_bbox of the bounding box prediction result from the bounding box predictions and their label label_bbox; using the cross-entropy loss function, compute the loss value L_reid of the re-identification feature extraction result from the pedestrian identity classification results and their label label_id; then, using the loss function L_heatmap of the keypoint heat map prediction task, feed Hm_S, Hm_L and Hm together with their corresponding labels label_hm_s, label_hm_l and label_hm to compute the respective loss values L_hm_s, L_hm_l and L_hm; finally, perform adaptive weighted summation of L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm to obtain the loss value L_total of the detection and re-identification integrated network model O;
(3f) Using the back-propagation method, compute the gradient of the weight parameters of the detection and re-identification integrated network model O from the loss value L_total, and then update the weight parameters θ_J with the gradient descent algorithm using the computed gradient of the weight parameters of O;
(3g) Judge whether t > T; if so, the trained detection and re-identification integrated network model O' is obtained; otherwise let t = t + 1 and go to step (3b);
(4) Obtain the pedestrian multi-target tracking result:
(4a) Initialize the test sample set D_test: the k-th test sample S_k contains P RGB image frames, the p-th RGB image frame is f^(p), P > 200; let k = 1 and initialize the historical trajectory set Tra^(k) = {};
(4b) Let p = 1;
(4c) Take the p-th RGB image frame f^(p) of the test sample S_k as the input of the trained detection and re-identification integrated network model O' and propagate forward, obtaining the keypoint offset prediction Vec_offset, the distances Vec_dis_bbox from the keypoints to the top, bottom, left and right sides of the target boxes, the keypoint heat map prediction result Hm and the re-identification feature vectors Vec_reid of f^(p); decode Vec_offset, Vec_dis_bbox and Hm to obtain the pedestrian detection box set Det = {det_i | 0 ≤ i ≤ DN - 1} of f^(p), where det_i is the detection box of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
(4d) Screen out the pedestrian targets of f^(p) whose detected keypoint response value conf_i is greater than a response threshold th_conf, Object = {object_i | conf_i > th_conf, 0 ≤ i ≤ DN - 1}, and obtain the detection boxes and the re-identification feature vector information corresponding to these pedestrian targets from the Det set and the Vec_reid vectors;
(4e) According to the detection boxes and the re-identification feature vector information of the screened pedestrian targets, use an online association method to perform data association between the screened pedestrian target set Object and Tra^(k), obtaining the pedestrian multi-target tracking result of f^(p);
(4f) Judge whether p ≥ P; if so, the pedestrian multi-target tracking result of the test sample S_k is obtained; otherwise let p = p + 1, update the historical trajectory set Tra^(k), and go to step (4c);
(4g) Judge whether k ≥ K; if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise let k = k + 1 and go to step (4b).
Compared with the prior art, the invention has the following advantages:
1. When constructing the detection and re-identification integrated network model based on multivariate difference fusion, the two feature fusion sub-networks arranged in parallel are each cascaded with a keypoint heat map prediction sub-network, and differences in training mode and training data are introduced through the design of the loss function and the training procedure, so that the two keypoint heat map prediction sub-networks form prediction preferences for targets of different sizes; the differentiated results of the two sub-networks are added and fused to obtain the multivariate-difference-fusion keypoint heat map.
2. In addition, separating the prediction sub-networks and cascading them to the two feature fusion sub-networks arranged in parallel reduces the degree of feature competition among the multiple prediction tasks and adds a network-structure difference to the multivariate-difference-fusion keypoint heat map, further improving the tracking accuracy of the algorithm.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Fig. 2 is a schematic structural diagram of the integrated network for detecting and re-identifying based on multivariate difference fusion according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the present invention includes the steps of:
Step 1) Obtain a training sample set D_train and a test sample set D_test:
Step 1a) Preprocess the selected V RGB image sequences carrying pedestrian detection box labels and identity labels to obtain a preprocessed RGB image frame sequence set S = {S_v | 1 ≤ v ≤ V}. Take the RGB image frames contained in I of the preprocessed RGB image frame sequences as the training sample set D_train and the remaining K preprocessed RGB image frame sequences as the test sample set D_test, where S_v = {f^(n) | 1 ≤ n ≤ L_v} denotes the v-th preprocessed RGB image frame sequence containing L_v frames, f^(n) denotes the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_v > 200. In this example, the CrowdHuman, ETH, CityPersons, CalTech, CUHK-SYSU, PRW and MOT17 train datasets, which cover rich scenes, are used as the training data to improve the generalization ability of the model, and the MOT17 test dataset, which contains 7 image sequences from different scenes with an average sequence length of 845 frames, is used for testing so that the tracking accuracy is evaluated reasonably.
The preprocessing of the selected V RGB image sequences carrying pedestrian detection box labels and identity labels is implemented as follows:
(1a1) Resize every RGB image frame in each RGB image sequence by bilinear interpolation so that all RGB image frames are 608 × 1088, obtaining an RGB image frame sequence set S_v' whose frames are consistent with the network input size.
(1a2) In the RGB image frame sequence set S_v', update the pedestrian detection box labels synchronously with the scale change of the images, and encode the pedestrian identity labels uniformly: the identity label of a data sample with missing identity information is set to -1, and each pedestrian with a distinct identity is assigned an increasing code starting from 1. This yields the RGB image frame sequence set S = {S_v | 1 ≤ v ≤ V} in which the RGB image frames are resized and the detection box labels and identity labels are updated.
Preprocessing the V RGB image sequences carrying pedestrian detection box labels and identity labels in this way guarantees the consistency of pedestrian identities and the correspondence between labels and images during training and testing.
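A minimal sketch of this preprocessing step is given below; the function name, the data layout (boxes stored as (x1, y1, x2, y2) pixel coordinates) and the use of OpenCV are illustrative assumptions rather than the patent's implementation, and the global re-encoding of identities to consecutive codes is omitted:

```python
import cv2
import numpy as np

TARGET_H, TARGET_W = 608, 1088  # network input size stated in the patent

def preprocess_frame(img, boxes, ids):
    """Resize a frame with bilinear interpolation and update its labels.

    img:   HxWx3 uint8 RGB frame
    boxes: (N, 4) float array of (x1, y1, x2, y2) pixel coordinates
    ids:   (N,) identity labels; missing identities are assumed to be negative
    """
    h, w = img.shape[:2]
    img = cv2.resize(img, (TARGET_W, TARGET_H), interpolation=cv2.INTER_LINEAR)
    sx, sy = TARGET_W / w, TARGET_H / h
    boxes = boxes.astype(np.float32) * np.array([sx, sy, sx, sy])  # scale boxes with the image
    ids = np.where(ids < 0, -1, ids)  # samples with missing identity get label -1
    return img, boxes, ids
```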
Step 2), constructing a detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) constructing a structure of a detection and re-identification integrated network model O based on multi-element difference fusion:
structure for constructing detection and re-identification integrated network model O, including backbone network NetbackboneAnd NetbackboneCascaded parallel-arranged and same-structure first feature fusion sub-network AsAnd a second feature fusion sub-network AlAnd a convergence module, wherein the first feature converges the subnetwork AsThe output end of the network is connected with a parallelly arranged key point deviation prediction sub-network NetoffsetSmall target preference keypoint heatmap prediction subnetwork Nethm_sAnd bounding Box prediction subnetwork NetbboxSecond feature fusion subnetwork AlThe output end of the network is connected with a large target preference key point heat map prediction sub-network Net which is arranged in parallelhm_lAnd re-identifying feature extraction sub-network NetreidWherein:
backbone network NetbackboneAdopting a tree-shaped polymerization iterative network consisting of a plurality of two-dimensional convolution layers, a plurality of batch normalization layers, a plurality of two-dimensional pooling layers, a plurality of deformable convolution layers and a plurality of transposition convolution layers;
first feature fusion subnet AsAnd a second feature fusion sub-network AlEach comprising a plurality of spatial attention sub-networks NetsamAnd a channel attention subnetwork Netcam(ii) a Spatial attention subnetwork NetsamIncluding global planes arranged in parallelAn average pooling layer and a global maximum pooling layer, and a two-dimensional convolutional layer connected to the two pooling layers, a channel attention subnetwork NetcamThe system comprises a global average pooling layer and a global maximum pooling layer which are arranged in parallel, wherein the global average pooling layer and the global maximum pooling layer are respectively connected with two-dimensional convolutional layers in a cascade mode; netoffset、Nethm_s、Netbbox、Nethm_lAnd NetreidThe sub-networks all adopt a structure comprising a first convolutional layer, a rule active layer and a second convolutional layer which are sequentially cascaded, and NetreidThe output end is cascaded with a full connection layer, Nethm_sAnd Nethm_lThe output end is cascaded with the fusion module;
In the structure of the detection and re-identification integrated network model O based on multivariate difference fusion:
The backbone network Net_backbone contains 27 two-dimensional convolution layers, 37 batch normalization layers, 6 two-dimensional pooling layers, 4 deformable convolution layers and 2 transposed convolution layers. This network extracts basic features that serve as the input of the feature fusion sub-networks; the DLA-34 backbone used in the FairMOT algorithm is adopted, and other backbones such as ResNet can be used instead.
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each contain three structurally identical spatial attention sub-networks Net_sam; the two-dimensional convolution layer in Net_sam has a 3x3 kernel, stride 1 and output dimension 1. The channel attention sub-network Net_cam contains 4 two-dimensional convolution layers, each with a 1x1 kernel and stride 1. These fusion sub-networks replace the equal-ratio feature fusion sub-network in FairMOT and provide more suitable feature maps for the subsequent tasks; because their structural parameters are influenced by the subsequent multi-task branches, they also provide a network-structure difference for training the two subsequent keypoint heat map prediction sub-networks.
The first and second convolution layers of the Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks have 3x3 kernels with stride 1. The output channels of the first convolution layer of each of these sub-networks are all set to 256, and the output channels of the second convolution layer are 2, 1, 4, 1 and 128 respectively. The detection result can be obtained by decoding the offset prediction, the bounding box prediction and the heat map prediction, and the appearance similarity of pedestrians can be measured by taking the re-identification feature vectors produced by the network and computing their cosine similarity.
The fully connected layer cascaded to the output of Net_reid is used only to assist classification during training; its output dimension equals the number of pedestrian identities in the training dataset, and it is discarded during testing.
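As a concrete illustration of these head sub-networks, the sketch below builds the five prediction branches with the kernel sizes and channel widths listed above; the input channel count in_ch (64 is an assumed value for a DLA-34-style backbone output) and all module names are illustrative, not the patent's code:

```python
import torch.nn as nn

def make_head(in_ch: int, out_ch: int) -> nn.Sequential:
    """First 3x3 conv (256 channels) -> ReLU -> second 3x3 conv (task-specific channels)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 256, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, out_ch, kernel_size=3, stride=1, padding=1),
    )

in_ch = 64  # assumed channel width of the fused feature maps Feat_s / Feat_l
heads = nn.ModuleDict({
    "offset": make_head(in_ch, 2),    # Net_offset: keypoint offset (dx, dy)
    "hm_s":   make_head(in_ch, 1),    # Net_hm_s: small-target-preference heat map
    "bbox":   make_head(in_ch, 4),    # Net_bbox: distances to the four box sides
    "hm_l":   make_head(in_ch, 1),    # Net_hm_l: large-target-preference heat map
    "reid":   make_head(in_ch, 128),  # Net_reid: re-identification embedding
})
```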
(2b) Define a loss function L_heatmap for the keypoint heat map prediction task:

L_heatmap = -(1/N) Σ_xy { (1 - Y_xy)^α · log(Y_xy),                  if Ŷ_xy = 1
                          (1 - Ŷ_xy)^β · (Y_xy)^α · log(1 - Y_xy),   otherwise }

where N denotes the number of keypoints in the predicted keypoint heat map, α and β denote hyper-parameters, taken here as 2 and 4 respectively, Ŷ_xy and Y_xy respectively denote the label and the response value of the keypoint at coordinate (x, y) in the predicted keypoint heat map, Σ denotes the summation operation, and log denotes the logarithm operation;
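The following PyTorch sketch is one way to realize this loss with α = 2 and β = 4; it assumes the CenterNet-style piecewise form suggested by the symbols above (network response Y, Gaussian-rendered label Ŷ) and is not reproduced from the patent drawing:

```python
import torch

def heatmap_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Focal-style keypoint heat map loss.

    pred: predicted response map Y in (0, 1), shape (B, 1, H, W)
    gt:   Gaussian-rendered label map Y_hat in [0, 1], same shape
    """
    pred = pred.clamp(eps, 1.0 - eps)
    pos = gt.eq(1.0).float()                      # pixels that are exact keypoints
    neg = 1.0 - pos
    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)            # N = number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```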
Step 3) Perform iterative training on the detection and re-identification integrated network model O based on multivariate difference fusion:
(3a) Initialize the weight parameters θ_J of the detection and re-identification integrated network model O, let the iteration index be t and the maximum number of iterations be T, T ≥ 50000, and set t = 0;
(3b) Randomly select bs ∈ [16, 64] training samples from the training sample set D_train, perform random data enhancement on each training sample, and update the detection box information of each training sample according to the enhancement applied, obtaining bs data-enhanced training samples with updated detection box information. A pedestrian target whose ratio of updated detection box height to image frame height is greater than a threshold th_ratio is taken as a large target, and one whose ratio is smaller than th_ratio is taken as a small target. Finally, according to the updated detection box information, the updated identity information and the large/small target division results, determine the small-target-preference keypoint heat map label label_hm_s, the large-target-preference keypoint heat map label label_hm_l, the difference-fusion keypoint heat map label label_hm, the bounding box label label_bbox, the keypoint offset label label_offset and the re-identification identity label label_id;
The concrete implementation steps are as follows:
(3b1) Rotate each training sample by a random angle θ, θ ∈ [-5, 5]; apply a random scale change with coefficient s, s ∈ [0.9, 1.1], to each rotated training sample; then apply a random image brightness change with coefficient r, r ∈ [-0.2, 0.2], to each scaled training sample, obtaining bs training samples after random data enhancement;
(3b2) Synchronously update the detection box labels according to the values of θ and s, obtaining bs data-enhanced training samples with updated detection box information.
(3b3) Determine the training labels of the large-target and small-target keypoint heat map prediction sub-networks.
The large and small targets are divided by using the ratio of the height h_i of the pedestrian target box to the height H of the input image as the division criterion:

divide_i = HmL, if h_i / H > th_ratio;  divide_i = HmS, otherwise

where divide_i denotes the division result. If a target is divided into HmL, it serves as a supervised training sample for the prediction result of the large-target-preference keypoint heat map prediction sub-network and is otherwise ignored; if it is divided into HmS, it serves as a supervised training sample for the prediction result of the small-target-preference keypoint heat map prediction sub-network and is otherwise ignored. For the predicted heat map Hm obtained by the fusion module, all target samples are supervised training samples. Through the above process, the target sample division result of each keypoint heat map is obtained;
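A small sketch of this division rule follows; the value of th_ratio is illustrative, since the patent only states that a threshold th_ratio is used:

```python
def split_targets(boxes, img_h, th_ratio=0.3):
    """Divide pedestrian boxes into large / small targets by height ratio.

    boxes: list of (x1, y1, x2, y2) tuples; th_ratio = 0.3 is an assumed value.
    """
    large, small = [], []
    for b in boxes:
        h = b[3] - b[1]
        (large if h / img_h > th_ratio else small).append(b)
    return large, small
```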
To generate the keypoint heat map training labels: for each pedestrian detection box label (x1, y1, x2, y2) in the RGB image, compute the center point of the detection box, c = (cx, cy) with cx = (x1 + x2) / 2 and cy = (y1 + y2) / 2, and treat it as the target keypoint. The keypoint training label position is determined as (cx', cy') = (floor(cx / R), floor(cy / R)), where floor(·) denotes rounding down and R is the down-sampling rate. The keypoint heat map label is finally obtained as

Ŷ_xy = exp(-((x - cx')^2 + (y - cy')^2) / (2σ_c^2))

where x and y are the coordinate index values on the keypoint heat map, Ŷ_xy is the label value of the keypoint heat map at coordinate (x, y), and σ_c is a target-size-adaptive standard deviation value. Using the target sample division results of the training samples and the keypoint heat map label function, the keypoint heat map labels corresponding to the Hm_S, Hm_L and Hm keypoint heat maps are computed, yielding label_hm_s, label_hm_l and label_hm respectively;
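The Gaussian label rendering described above can be sketched as follows; the down-sampling rate R = 4 and the per-pixel maximum when Gaussians overlap are assumptions in the spirit of CenterNet, not values taken from the patent:

```python
import numpy as np

def render_heatmap(centers, sigmas, out_h, out_w, R=4):
    """Render a keypoint heat map label from target box centers.

    centers: list of (cx, cy) box centers in input-image pixels
    sigmas:  per-target size-adaptive standard deviations sigma_c
    R:       down-sampling rate (assumed value)
    """
    hm = np.zeros((out_h, out_w), dtype=np.float32)
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    for (cx, cy), sigma in zip(centers, sigmas):
        cx_q, cy_q = int(cx / R), int(cy / R)          # quantized keypoint position
        g = np.exp(-((xs - cx_q) ** 2 + (ys - cy_q) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)                          # keep the strongest response per pixel
    return hm
```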
(3b4) Determine the training labels of the re-identification feature extraction sub-network Net_reid. Suppose a target identity label is ID_i and the smallest identity label value in the training set is ID_x; the result of ID_i - ID_x is taken as the label value label_id_i of sub-network Net_reid corresponding to that target, and the set of identity labels of all targets in the predicted image is label_id = {label_id_i};
(3b5) Determine the training labels of the keypoint offset prediction sub-network Net_offset. Suppose the center point coordinates of a target are p = (cx, cy) and the quantized coordinates are p' = floor(p / R), where floor(·) denotes rounding down and R denotes the down-sampling stride; the result of p / R - p' is taken as the training label value label_offset_i of sub-network Net_offset corresponding to that target, and the keypoint offset prediction labels of all targets in the predicted image are label_offset = {label_offset_i};
(3b6) Determine the training labels of the bounding box prediction sub-network Net_bbox. Suppose the upper-left and lower-right corner coordinates of a target box are (x1, y1) and (x2, y2); from these coordinates and the target keypoint, the distances from the keypoint to the top, bottom, left and right sides of the box are computed and taken as the training label label_bbox_i of sub-network Net_bbox corresponding to that target, and the set of bounding box prediction labels of all targets in the predicted image is label_bbox = {label_bbox_i};
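The offset and bounding box labels of steps (3b5) and (3b6) can be sketched together as follows; the ordering of the four side distances and the value R = 4 are assumptions:

```python
import numpy as np

def make_offset_and_box_labels(boxes, R=4):
    """Per-target keypoint-offset and box-side-distance labels.

    boxes: (N, 4) array of (x1, y1, x2, y2); R is the down-sampling stride.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    cx_q, cy_q = np.floor(cx / R), np.floor(cy / R)        # quantized keypoint coordinates
    label_offset = np.stack([cx / R - cx_q, cy / R - cy_q], axis=1)
    # distances from the keypoint to the left, top, right and bottom sides of the box
    label_bbox = np.stack([cx - boxes[:, 0], cy - boxes[:, 1],
                           boxes[:, 2] - cx, boxes[:, 3] - cy], axis=1)
    return label_offset, label_bbox
```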
(3c) Take the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O: the backbone network Net_backbone extracts features from each training sample, obtaining three feature maps of different scales, Feat1, Feat2 and Feat3, for the training sample;
(3d) The first feature fusion sub-network A_s adaptively fuses Feat1, Feat2 and Feat3 to obtain the feature map Feat_s. The keypoint offset prediction sub-network Net_offset, the small-target-preference keypoint heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference, obtaining the keypoint offset prediction vector Vec_offset corresponding to Net_offset, the small-target-preference keypoint heat map prediction result Hm_S corresponding to Net_hm_s, and the vector Vec_dis_bbox of distances from the keypoints to the top, bottom, left and right sides of the target box corresponding to Net_bbox. At the same time, the second feature fusion sub-network A_l adaptively fuses Feat1, Feat2 and Feat3 to obtain the feature map Feat_l. The large-target-preference keypoint heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference, obtaining the large-target-preference keypoint heat map prediction result Hm_L corresponding to Net_hm_l and the re-identification feature vector Vec_reid corresponding to Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result. The fusion module fuses the Hm_S and Hm_L keypoint heat maps to obtain the fused keypoint heat map Hm;
The first feature fusion sub-network A_s performs adaptive fusion of Feat1, Feat2 and Feat3 as follows:
(3c1) The feature fusion sub-network A_s contains three spatial attention sub-networks, denoted Net_sam1, Net_sam2 and Net_sam3. The spatial attention sub-network Net_sam1 takes the backbone output feature map Feat1 as its input, and the processing order is: Feat1 is fed into Net_sam1 → Feat1 is multiplied with the output of Net_sam1 to obtain Feat1' → Feat1 and Feat1' are added to obtain the feature map Feat1''. The other two spatial attention sub-networks Net_sam2 and Net_sam3 take Feat2 and Feat3 as inputs respectively, and the same process yields the feature maps Feat2'' and Feat3''. These three feature maps are then used as the input of the channel attention sub-network Net_cam;
(3c2) The feature maps Feat2'' and Feat3'' are up-sampled by factors of 2 and 4 respectively using transposed convolutions, giving feature maps Feat2''' and Feat3''' whose sizes are consistent with Feat1'' → Feat1'', Feat2''' and Feat3''' are concatenated to obtain Feat_sam → Feat_sam is fed into the channel attention sub-network Net_cam → Feat_sam is multiplied with the output of Net_cam to obtain Feat_cam → Feat_sam and Feat_cam are added, yielding the feature map Feat_s.
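A compact PyTorch-style sketch of this adaptive fusion is given below. The patent names the layers but not how the pooling is applied, so a CBAM-style spatial attention (channel-wise average/max maps followed by a 3x3 convolution) and a generic channel attention module are assumed; module and variable names are illustrative:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Net_sam-like module: parallel average / max pooling over channels, then a 3x3 conv."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)       # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)      # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

def fuse(feat1, feat2, feat3, sams, up2, up4, cam):
    """Adaptive fusion of the three backbone feature maps into Feat_s (or Feat_l).

    sams: three SpatialAttention modules; up2 / up4: transposed-conv modules for
    2x / 4x up-sampling; cam: channel attention module returning per-channel weights.
    """
    outs = []
    for f, sam in zip((feat1, feat2, feat3), sams):
        f_prime = f * sam(f)           # Feat_i'
        outs.append(f + f_prime)       # Feat_i''
    f1, f2, f3 = outs[0], up2(outs[1]), up4(outs[2])   # bring all maps to Feat1 size
    feat_sam = torch.cat([f1, f2, f3], dim=1)
    feat_cam = feat_sam * cam(feat_sam)
    return feat_sam + feat_cam         # Feat_s
```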
(3e) Using the L1 loss function, compute the loss value L_off of the keypoint offset prediction result from the keypoint offset predictions and their label label_offset, and compute the loss value L_bbox of the bounding box prediction result from the bounding box predictions and their label label_bbox; using the cross-entropy loss function, compute the loss value L_reid of the re-identification feature extraction result from the pedestrian identity classification results and their label label_id; then, using the loss function L_heatmap of the keypoint heat map prediction task, feed Hm_S, Hm_L and Hm together with their corresponding labels label_hm_s, label_hm_l and label_hm to compute the respective loss values L_hm_s, L_hm_l and L_hm; finally, perform adaptive weighted summation of L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm to obtain the loss value L_total of the detection and re-identification integrated network model O.
The loss value L_total of the detection and re-identification integrated network model O is computed as follows:
L_total = (1/2) × (exp(-w1) × L_det + exp(-w2) × L_reid + w1 + w2)
L_det = a × (0.6 × L_hm + 0.15 × L_hm_l + 0.25 × L_hm_s) + b × L_off + c × L_bbox
where a, b and c are constant coefficients, here a = 1, b = 1 and c = 0.1, and w1 and w2 are learnable parameters.
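As a sketch, the adaptive weighted summation can be written as follows, assuming the same uncertainty-style weighting with learnable parameters w1 and w2 used in FairMOT (the patent states only that w1 and w2 are learnable):

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Adaptive weighting of the detection and re-identification losses."""
    def __init__(self, a=1.0, b=1.0, c=0.1):
        super().__init__()
        self.a, self.b, self.c = a, b, c
        self.w1 = nn.Parameter(torch.zeros(1))   # learnable weight for the detection loss
        self.w2 = nn.Parameter(torch.zeros(1))   # learnable weight for the re-id loss

    def forward(self, l_hm, l_hm_l, l_hm_s, l_off, l_bbox, l_reid):
        l_det = self.a * (0.6 * l_hm + 0.15 * l_hm_l + 0.25 * l_hm_s) \
                + self.b * l_off + self.c * l_bbox
        return 0.5 * (torch.exp(-self.w1) * l_det
                      + torch.exp(-self.w2) * l_reid
                      + self.w1 + self.w2)
```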
(3f) Using the back-propagation method, compute the gradient of the weight parameters of the detection and re-identification integrated network model O from the loss value L_total, and then update the weight parameters θ_J with the gradient descent algorithm using the computed gradient of the weight parameters of O;
The weight parameters θ_J are updated with the gradient of the weight parameters of O according to the following formula:

θ_J' = θ_J - α_J · ∇_{θ_J} L_total

where θ_J' denotes the updated network parameters, θ_J denotes the network parameters before the update, α_J denotes the step size, and ∇_{θ_J} L_total denotes the gradient of the network parameters of O.
(3g) Judge whether t > T; if so, the trained detection and re-identification integrated network model O' is obtained; otherwise let t = t + 1 and go to step (3b);
Step 4) Obtain the pedestrian multi-target tracking result:
(4a) Initialize the test sample set D_test: the k-th test sample S_k contains P RGB image frames, the p-th RGB image frame is f^(p), P > 200; let k = 1 and initialize the historical trajectory set Tra^(k) = {};
(4b) Let p be 1;
(4c) Take the p-th RGB image frame f^(p) of the test sample S_k as the input of the trained detection and re-identification integrated network model O' and propagate forward, obtaining the keypoint offset prediction Vec_offset, the distances Vec_dis_bbox from the keypoints to the top, bottom, left and right sides of the target boxes, the keypoint heat map prediction result Hm and the re-identification feature vectors Vec_reid of f^(p); decode Vec_offset, Vec_dis_bbox and Hm to obtain the pedestrian detection box set Det = {det_i | 0 ≤ i ≤ DN - 1} of f^(p), where det_i is the detection box of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
(4d) Screen out the pedestrian targets of f^(p) whose detected keypoint response value conf_i is greater than a response threshold th_conf, Object = {object_i | conf_i > th_conf, 0 ≤ i ≤ DN - 1}, and obtain the detection boxes and the re-identification feature vector information corresponding to these pedestrian targets from the Det set and the Vec_reid vectors;
The detection box and the re-identification feature vector information corresponding to a pedestrian target are obtained from the Det set and the Vec_reid vectors as follows:
(4d1) Use the subscript of the target object_i to index the Det set and obtain its detection box det_i, then use the center point coordinates of det_i as an index to query the Vec_reid vectors and obtain its re-identification feature vector embed_i;
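A simplified sketch of this decoding and screening step follows; peak extraction / non-maximum suppression on the heat map is omitted, and the threshold value and the scale of the side distances are assumptions:

```python
import numpy as np

def decode_detections(hm, offset, dist_bbox, th_conf=0.4, R=4):
    """Decode pedestrian boxes from the fused heat map and filter by confidence.

    hm:        (H, W) fused keypoint heat map Hm
    offset:    (2, H, W) keypoint offset predictions Vec_offset
    dist_bbox: (4, H, W) distances to the box sides Vec_dis_bbox,
               assumed to be expressed at input-image scale
    th_conf:   response threshold (0.4 is an illustrative value)
    """
    ys, xs = np.where(hm > th_conf)                 # keep keypoints above the threshold
    dets, confs = [], []
    for y, x in zip(ys, xs):
        cx = (x + offset[0, y, x]) * R              # refine and map back to input scale
        cy = (y + offset[1, y, x]) * R
        l, t, r, b = dist_bbox[:, y, x]
        dets.append((cx - l, cy - t, cx + r, cy + b))
        confs.append(hm[y, x])
    return np.array(dets), np.array(confs)
```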
(4e) According to the detection boxes and the re-identification feature vector information of the screened pedestrian targets, use an online association method to perform data association between the screened pedestrian target set Object and Tra^(k), obtaining the pedestrian multi-target tracking result of f^(p);
the correlation method adopts the same online correlation method as in the FairMOT algorithm, and specifically comprises the following steps:
(4e1) Define the attributes of a trajectory. The ordered set of detection boxes of one pedestrian in the tracking scene is called a trajectory Tra_i, and each trajectory has the following attributes: the information of the current trajectory target box, i.e. the upper-left and lower-right corner coordinates of the box; the state of the trajectory; the re-identification feature vector of the trajectory; the life-cycle length of the trajectory; the number of consecutively unmatched frames; and motion information.
The state of a trajectory is one of three states: active, lost and inactive. An active-state trajectory is a trajectory that was matched with a detection box in the previous frame; a lost-state trajectory is a trajectory that was not matched with a detection box in the previous frame but whose number of consecutively unmatched frames does not exceed the life-cycle length; a trajectory whose number of consecutively unmatched frames exceeds the life-cycle length is an inactive-state trajectory. The re-identification feature vector of a trajectory represents the appearance of the trajectory target; during association matching, the cosine similarity between the vectors of a trajectory and a detection is computed to judge the likelihood that they belong to the same track. The life-cycle length of a trajectory is the threshold on the maximum number of consecutively unmatched frames; when it is exceeded, the trajectory is set to the inactive state. The motion information is obtained and processed with a Kalman filtering algorithm, which estimates, for all positions of the trajectory, the horizontal and vertical coordinates of the target center, the aspect ratio and height of the current target box, and the velocity variables of these four states; the parameters of the Kalman filtering algorithm are updated according to the final matching result.
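The trajectory attributes listed above can be summarized in a small data structure; the field names and the default life-cycle length are illustrative, not the patent's notation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Track:
    """Per-trajectory attributes used during online association."""
    box: np.ndarray                 # current target box (x1, y1, x2, y2)
    embed: np.ndarray               # re-identification feature vector of the trajectory
    state: str = "active"           # "active", "lost" or "inactive"
    lost_frames: int = 0            # number of consecutively unmatched frames
    life_span: int = 30             # max unmatched frames before deactivation (assumed value)
    kalman_mean: np.ndarray = None  # Kalman state: center x/y, aspect ratio, height + velocities
    kalman_cov: np.ndarray = None   # Kalman state covariance
```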
(4e2) For all active-state trajectories and all lost-state trajectories, first estimate the coordinate positions of the trajectory boxes with the Kalman filtering algorithm, then compute the Mahalanobis distance matrix Matrix_DisMotion between the trajectory boxes and all detected target boxes of the current frame. Entries of the matrix greater than the threshold th_md are modified to an infinite value and the remaining entries are left unchanged, giving the final motion-prediction distance matrix Matrix'_DisMotion. At the same time, compute the cosine similarity distance matrix Matrix_DisEmbed between the re-identification feature vectors of the trajectories and of the detections. Finally the two are fused according to the following formula:

Matrix_Dis = λ · Matrix_DisEmbed + (1 - λ) · Matrix'_DisMotion

The final distance matrix Matrix_Dis is obtained; the Hungarian algorithm is applied to this distance matrix for optimal matching, and the trajectory states are updated;
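A minimal sketch of this fusion-and-matching step is given below; the values of lam, th_md and the final matching threshold are illustrative, since the patent only names λ and th_md:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(dist_embed, dist_motion, lam=0.9, th_md=1e4, max_dist=0.7):
    """Fuse appearance and motion distances and run Hungarian matching.

    dist_embed:  cosine-distance matrix between track and detection embeddings
    dist_motion: Mahalanobis-distance matrix computed from the Kalman prediction
    """
    dist_motion = np.where(dist_motion > th_md, np.inf, dist_motion)
    cost = lam * dist_embed + (1.0 - lam) * dist_motion
    cost = np.nan_to_num(cost, posinf=1e6)          # the Hungarian solver needs finite costs
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]
    return matches
```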
(4e3) For the active-state trajectories left unmatched in the previous step, compute the overlap-ratio matrix between them and the remaining detection boxes; for each such trajectory, find the detection with the largest overlap ratio whose value is greater than the threshold th_iou, match it to the trajectory and update the trajectory state;
(4e4) For a trajectory that is still unmatched: if it is in the active state, change its state to the lost state and start counting its lost frames; if it is in the lost state, increase its lost count by one, and if the number of lost frames is greater than or equal to the life-cycle length of the trajectory, set its state to the inactive state. Detections that remain unmatched are initialized as new trajectories;
(4e5) Output the matching association information of the current frame;
(4f) Judge whether p ≥ P; if so, the pedestrian multi-target tracking result of the test sample S_k is obtained; otherwise let p = p + 1, update the historical trajectory set Tra^(k), and go to step (4c);
(4g) Judge whether k ≥ K; if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise let k = k + 1 and go to step (4b).
The effect of the present invention is further explained by combining the simulation experiment as follows:
1. simulation conditions and contents:
the hardware platform of the simulation experiment is as follows: the graphics card is configured as Nvidia RTX2080Ti × 2, the processor is configured as xeon (r) E5-2620 v4@2.10Ghz × 32, and the memory is configured as 64 GB.
The software platform of the simulation experiment is as follows: the operating system is Ubuntu16.04LTS, the Python version is 3.7, the Pythroch version is 1.2.0, and the OpenCV version is 3.4.0.
The integrated network provided by the invention is characterized in that a training image sequence data set used in a simulation experiment is a crowdHuman, ETH, CityPerson, CalTech, CUHK-SYSU, PRW and MOT17train data set, the integrated network is pre-trained for 60 generations on the crowdHuman data set, and then is trained for 30 generations on the other data sets to obtain test model parameters; and the test image sequence data set is an MOT17test data set, wherein the test image sequence data set comprises image sequences under 7 different tracking scenes, and comprises 5919 frame images and 785 pedestrian tracks.
The Tracking accuracy of the multi-target Tracking method disclosed in the present invention and the paper "FairMOT of On the Fairness of Detection and Re-Identification in Multiple Object Tracking" published in 2020 by Yifu Zhang et al was compared and simulated, and the results are shown in Table 1
2. Simulation result analysis:
In order to evaluate the tracking accuracy, the following evaluation index (tracking accuracy, MOTA) is used to compute the accuracy of the tracking results of the invention and of the prior art, and the results are listed in Table 1:

MOTA = 1 - (FN + FP + IDSW) / GT
Table 1. Tracking accuracy comparison between the invention and the prior art (the table contents are provided as an image in the original document).
Wherein, FN is the number of false negative targets, FP is the number of false positive targets, IDSW is the number of identity switching times, and GT is the number of targets in the truth label.
As can be seen from Table 1, the tracking accuracy of the invention is improved by 0.8 compared with the prior art, a clear improvement over the existing method.
The above simulation experiments show that, when constructing the detection and re-identification integrated network model based on multivariate difference fusion, introducing differences in training data, training mode and network structure makes the two keypoint heat map prediction sub-networks form prediction preferences for targets of different sizes; adding and fusing the prediction results of the two sub-networks yields the multivariate-difference-fusion keypoint heat map, which solves the problem of low detection recall caused by using only a single keypoint heat map prediction sub-network in the prior art and improves the tracking accuracy of the algorithm.

Claims (6)

1. A pedestrian multi-target tracking method based on multivariate difference fusion is characterized by comprising the following steps:
(1) Obtain a training sample set D_train and a test sample set D_test:
(1a) Preprocess the selected V RGB image sequences carrying pedestrian detection box labels and identity labels to obtain a preprocessed RGB image frame sequence set S = {S_v | 1 ≤ v ≤ V}. Take the RGB image frames contained in I of the preprocessed RGB image frame sequences as the training sample set D_train, and take the remaining K preprocessed RGB image frame sequences as the test sample set D_test, where S_v = {f^(n) | 1 ≤ n ≤ L_v} denotes the v-th preprocessed RGB image frame sequence containing L_v frames, f^(n) denotes the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_v > 200;
(2) Construct a detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) Construct the structure of the detection and re-identification integrated network model O based on multivariate difference fusion:
The model O comprises a backbone network Net_backbone; a first feature fusion sub-network A_s and a second feature fusion sub-network A_l, arranged in parallel with identical structure and cascaded with Net_backbone; and a fusion module. The output of the first feature fusion sub-network A_s is connected to a keypoint offset prediction sub-network Net_offset, a small-target-preference keypoint heat map prediction sub-network Net_hm_s and a bounding box prediction sub-network Net_bbox arranged in parallel; the output of the second feature fusion sub-network A_l is connected to a large-target-preference keypoint heat map prediction sub-network Net_hm_l and a re-identification feature extraction sub-network Net_reid arranged in parallel, wherein:
The backbone network Net_backbone adopts a tree-structured aggregation-iteration network composed of a plurality of two-dimensional convolution layers, batch normalization layers, two-dimensional pooling layers, deformable convolution layers and transposed convolution layers;
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each comprise a plurality of spatial attention sub-networks Net_sam and one channel attention sub-network Net_cam. The spatial attention sub-network Net_sam comprises a global average pooling layer and a global max pooling layer arranged in parallel, and a two-dimensional convolution layer connected to both pooling layers; the channel attention sub-network Net_cam comprises a global average pooling layer and a global max pooling layer arranged in parallel, each cascaded with two two-dimensional convolution layers. The Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks all adopt a structure consisting of a first convolution layer, a ReLU activation layer and a second convolution layer cascaded in sequence; a fully connected layer is cascaded to the output of Net_reid, and the outputs of Net_hm_s and Net_hm_l are cascaded to the fusion module;
(2b) Define a loss function L_heatmap for the keypoint heat map prediction task:

L_heatmap = -(1/N) Σ_xy { (1 - Y_xy)^α · log(Y_xy),                  if Ŷ_xy = 1
                          (1 - Ŷ_xy)^β · (Y_xy)^α · log(1 - Y_xy),   otherwise }

where N denotes the number of keypoints in the predicted keypoint heat map, α and β denote hyper-parameters, Ŷ_xy and Y_xy respectively denote the label and the response value of the keypoint at coordinate (x, y) in the predicted keypoint heat map, Σ denotes the summation operation, and log denotes the logarithm operation;
(3) carrying out iterative training on a detection and re-recognition integrated network model O based on multivariate difference fusion:
(3a) initializing detection and re-identifying integrated network model O with weight parameter thetaJThe iteration frequency is T, the maximum iteration frequency is T, T is more than or equal to 50000, and T is made to be 0;
(3b) performing random data enhancement on bs ∈ [16, 64] training samples randomly selected from the training sample set D_train, and updating the detection box information of each training sample according to the enhancement mode, to obtain bs data-enhanced training samples with updated detection box information; a pedestrian target whose updated detection box height divided by the image frame height is greater than a threshold th_ratio is taken as a large target, and one whose ratio is smaller than th_ratio is taken as a small target; finally, according to the updated detection box information, the identity information and the large/small target division result, the small-target-preference keypoint heat map label Y_hm_s, the large-target-preference keypoint heat map label Y_hm_l, the difference fusion keypoint heat map label Y_hm, the bounding box label label_bbox, the keypoint offset label label_offset and the re-identification identity label label_id are determined;
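For illustration only (not part of the claim), a minimal sketch of the large/small target division by box-height ratio; the threshold value 0.2 and the (x1, y1, x2, y2) box format are assumptions.

def split_targets_by_height(boxes, frame_height, th_ratio=0.2):
    # boxes: iterable of (x1, y1, x2, y2); returns (large_targets, small_targets)
    large, small = [], []
    for x1, y1, x2, y2 in boxes:
        ratio = (y2 - y1) / frame_height
        (large if ratio > th_ratio else small).append((x1, y1, x2, y2))
    return large, small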
(3c) taking the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O: the backbone network Net_backbone extracts features from each training sample to obtain three feature maps of different scales, Feat_1, Feat_2 and Feat_3;
(3d) the first feature fusion sub-network A_s adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_s; the keypoint offset prediction sub-network Net_offset, the small-target-preference keypoint heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference to obtain, respectively, the keypoint offset prediction vector Vec_offset of Net_offset, the small-target-preference keypoint heat map prediction result Hm_S of Net_hm_s, and the vector Vec_dis_bbox of distances from the keypoint to the upper, lower, left and right sides of the target box of Net_bbox; at the same time, the second feature fusion sub-network A_l adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_l, and the large-target-preference keypoint heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference to obtain the large-target-preference keypoint heat map prediction result Hm_L of Net_hm_l and the re-identification feature vector Vec_reid of Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result, and the fusion module fuses the heat maps Hm_S and Hm_L to obtain the fused keypoint heat map Hm;
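For illustration only (not part of the claim), a minimal PyTorch-style sketch of the conv-ReLU-conv prediction heads described above; all channel widths and the re-identification embedding dimension are assumptions.

import torch.nn as nn

def make_head(in_channels, mid_channels, out_channels):
    # first conv -> ReLU -> second conv, as each prediction sub-network is described
    return nn.Sequential(
        nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_channels, out_channels, kernel_size=3, stride=1, padding=1),
    )

net_offset = make_head(64, 128, 2)    # keypoint offsets (dx, dy)
net_hm_s   = make_head(64, 128, 1)    # small-target-preference heat map
net_bbox   = make_head(64, 128, 4)    # distances to the upper/lower/left/right box sides
net_hm_l   = make_head(64, 128, 1)    # large-target-preference heat map
net_reid   = make_head(64, 128, 128)  # re-identification embedding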
(3e) using the L1 loss function, the loss value L_off of the keypoint offset prediction result is computed from the keypoint offset prediction value and its label label_offset, and the loss value L_bbox of the bounding box prediction result is computed from the bounding box prediction value and its label label_bbox; using the cross-entropy loss function, the loss value L_reid of the re-identification feature extraction result is computed from the pedestrian identity classification result and its label label_id; then, using the loss function L_heatmap of the keypoint heat map prediction task, Hm_S, Hm_L and Hm together with their corresponding labels Y_hm_s, Y_hm_l and Y_hm are input to compute the respective loss values L_hm_s, L_hm_l and L_hm; finally, L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm are adaptively weighted and summed to obtain the loss value L_total of the detection and re-identification integrated network model O;
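For illustration only (not part of the claim), a sketch of how the offset, bounding box and identity loss terms might be computed; tensor names, shapes and the keypoint mask are assumptions.

import torch.nn.functional as F

def task_losses(pred_offset, gt_offset, pred_bbox, gt_bbox, id_logits, id_labels, keypoint_mask):
    # L1 losses evaluated only at labelled keypoint locations; cross-entropy on identity logits
    norm = keypoint_mask.sum().clamp(min=1.0)
    l_off  = F.l1_loss(pred_offset * keypoint_mask, gt_offset * keypoint_mask, reduction='sum') / norm
    l_bbox = F.l1_loss(pred_bbox * keypoint_mask, gt_bbox * keypoint_mask, reduction='sum') / norm
    l_reid = F.cross_entropy(id_logits, id_labels, ignore_index=-1)  # -1 marks samples without identity
    return l_off, l_bbox, l_reid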
(3f) using the back-propagation method, the gradients of the weight parameters of the detection and re-identification integrated network model O are computed from the loss value L_total, and the weight parameters θ_J are then updated with these gradients using the gradient descent algorithm;
(3g) judging whether t > T: if so, the trained detection and re-identification integrated network model O' is obtained; otherwise, let t = t + 1 and return to step (3b);
(4) acquiring the pedestrian multi-target tracking results:

(4a) initializing the test sample set D_test: the k-th test sample D_test^(k) comprises P RGB image frames, P > 200, the p-th RGB image frame being f^(p); let k = 1 and initialize the historical track set Tra^(k) = {};

(4b) let p = 1;
(4c) the p-th RGB image frame f^(p) of the test sample D_test^(k) is forward-propagated as the input of the trained detection and re-identification integrated network model O' to obtain, for f^(p), the keypoint offset prediction value Vec_offset, the distances Vec_dis_bbox from the keypoint to the upper, lower, left and right sides of the target box, the keypoint heat map prediction result Hm and the re-identification feature vector Vec_reid; Vec_offset, Vec_dis_bbox and Hm are then decoded to obtain the pedestrian detection box set of f^(p), Det = {det_i | 0 ≤ i ≤ DN − 1}, where det_i is the detection box of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
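For illustration only (not part of the claim), a sketch of one possible way to decode Vec_offset, Vec_dis_bbox and Hm into detection boxes; the top-k peak extraction scheme and the tensor layouts are assumptions.

import torch

def decode_detections(hm, offset, dis_bbox, top_k=100):
    # hm: (B, 1, H, W) fused heat map; offset: (B, 2, H, W); dis_bbox: (B, 4, H, W) = (up, down, left, right)
    B, _, H, W = hm.shape
    scores, inds = hm.view(B, -1).topk(top_k)                 # keypoint response values and positions
    ys = torch.div(inds, W, rounding_mode='floor').float()
    xs = (inds % W).float()
    off = offset.view(B, 2, -1).gather(2, inds.unsqueeze(1).expand(-1, 2, -1))
    dis = dis_bbox.view(B, 4, -1).gather(2, inds.unsqueeze(1).expand(-1, 4, -1))
    cx, cy = xs + off[:, 0], ys + off[:, 1]
    boxes = torch.stack([cx - dis[:, 2], cy - dis[:, 0],      # x1 = cx - left, y1 = cy - up
                         cx + dis[:, 3], cy + dis[:, 1]], dim=2)
    return boxes, scores                                      # (B, top_k, 4), (B, top_k)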
(4d) screening out, from f^(p), the pedestrian target set Object = {Object_i | conf_i > th_conf, 0 ≤ i ≤ DN − 1} whose detected keypoint response value conf_i is greater than the response threshold th_conf, and obtaining the detection box and re-identification feature vector information corresponding to each screened pedestrian target from the Det set and the Vec_reid vector;
(4e) according to the detection box and re-identification feature vector information of the screened pedestrian targets, an online association method is used to perform data association between the screened pedestrian target set Object and the historical track set Tra^(k), obtaining the pedestrian multi-target tracking result of f^(p);
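For illustration only (not part of the claim), a sketch of one simple online association scheme between detections and history tracks; the cosine-distance cost, the Hungarian solver and the matching threshold are assumptions, as the claim does not fix the association method.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(det_feats, track_feats, max_cost=0.6):
    # cosine-distance cost between detection and track re-identification features
    det = det_feats / (np.linalg.norm(det_feats, axis=1, keepdims=True) + 1e-12)
    trk = track_feats / (np.linalg.norm(track_feats, axis=1, keepdims=True) + 1e-12)
    cost = 1.0 - det @ trk.T
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]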
(4f) judging whether p ≥ P: if so, the pedestrian multi-target tracking result of the test sample D_test^(k) is obtained; otherwise, let p = p + 1, update the historical track set Tra^(k), and return to step (4c);
(4g) judging whether k ≥ K: if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise, let k = k + 1 and return to step (4b).
2. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the preprocessing of the selected V RGB image sequences with pedestrian detection box labels and identity labels in step (1a) is implemented as follows:
(1a1) resizing each RGB image frame in each RGB image sequence by bilinear interpolation to obtain an RGB image frame sequence set S_v' in which all RGB image frames have the same size;
(1a2) in the RGB image frame sequence set S_v', synchronously updating the pedestrian detection box labels with the rescaled images, and uniformly encoding the pedestrian identity labels, i.e. setting the identity label to -1 for data samples lacking identity information and assigning codes that increase sequentially from 1 for each distinct pedestrian identity, so as to obtain the RGB image frame sequence set in which the RGB image frames have been resized and the detection box labels and identity labels have been updated, thereby completing the preprocessing of the V RGB image sequences with pedestrian detection box labels and identity labels.
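For illustration only (not part of the claim), a minimal sketch of the bilinear resize with synchronous scaling of box labels; the output resolution is an assumption.

import cv2

def resize_frame_and_boxes(frame, boxes, out_w=1088, out_h=608):
    # bilinear resize of one RGB frame and synchronous scaling of its detection-box labels
    h, w = frame.shape[:2]
    resized = cv2.resize(frame, (out_w, out_h), interpolation=cv2.INTER_LINEAR)
    sx, sy = out_w / w, out_h / h
    scaled = [(x1 * sx, y1 * sy, x2 * sx, y2 * sy) for (x1, y1, x2, y2) in boxes]
    return resized, scaled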
3. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein, in the structure of the multivariate-difference-fusion-based detection and re-identification integrated network model O constructed in step (2a):
the backbone network Net_backbone contains 27 two-dimensional convolution layers, 37 batch normalization layers, 6 two-dimensional pooling layers, 4 deformable convolution layers and 2 transposed convolution layers;
the first feature fusion sub-network A_s and the second feature fusion sub-network A_l each contain three structurally identical spatial attention sub-networks Net_sam, in which the two-dimensional convolution layer has a 3x3 convolution kernel, a stride of 1 and an output dimension of 1; each channel attention sub-network Net_cam contains 4 two-dimensional convolution layers, each with a 1x1 convolution kernel and a stride of 1;
the first and second convolution layers contained in the Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks all have 3x3 convolution kernels with a stride of 1.
4. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the random data enhancement of the bs training samples randomly selected from the training sample set D_train and the updating of the detection box information of each training sample according to the enhancement mode in step (3b) are implemented as follows:
(3b1) rotating each training sample by a random angle θ, θ ∈ [-5, 5]; applying a random scale change with coefficient s, s ∈ [0.9, 1.1], to each rotated training sample; and then applying a random image brightness change with coefficient r, r ∈ [-0.2, 0.2], to each rescaled training sample, so as to obtain bs training samples after random data enhancement;
(3b2) synchronously updating the detection box labels according to the values of θ and s to obtain bs data-enhanced training samples with updated detection box information.
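For illustration only (not part of the claim), a minimal sketch of drawing the augmentation coefficients in the ranges stated above; uniform sampling is an assumption.

import random

def sample_augmentation():
    # draws the rotation angle, scale coefficient and brightness coefficient in the stated ranges
    theta = random.uniform(-5.0, 5.0)
    s = random.uniform(0.9, 1.1)
    r = random.uniform(-0.2, 0.2)
    return theta, s, r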
5. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the loss value L_total of the detection and re-identification integrated network model O in step (3e) is calculated as:
L_total = (1/2) × (e^(-w1) × L_det + e^(-w2) × L_reid + w1 + w2)

L_det = a × (0.6 × L_hm + 0.15 × L_hm_l + 0.25 × L_hm_s) + b × L_off + c × L_bbox

where the parameters a, b and c are constant coefficients, and w1 and w2 are learnable parameters.
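For illustration only (not part of the claim), a sketch of this uncertainty-style adaptive weighting; the closed form of L_total shown above is a reconstruction, and the default values of a, b and c are assumptions.

import torch

def total_loss(l_hm, l_hm_l, l_hm_s, l_off, l_bbox, l_reid, w1, w2, a=1.0, b=1.0, c=1.0):
    # w1, w2 are learnable scalar nn.Parameter tensors; a, b, c are constant coefficients
    l_det = a * (0.6 * l_hm + 0.15 * l_hm_l + 0.25 * l_hm_s) + b * l_off + c * l_bbox
    return 0.5 * (torch.exp(-w1) * l_det + torch.exp(-w2) * l_reid + w1 + w2)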
6. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the weight parameters θ_J are updated with the weight parameter gradients of O in step (3f) according to the following formula:

θ_J^new = θ_J^old − α_J × ∇_θ L_total

where θ_J^new denotes the updated network parameters, θ_J^old denotes the network parameters before the update, α_J denotes the step size, and ∇_θ L_total denotes the weight parameter gradient of O.
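For illustration only (not part of the claim), a minimal PyTorch sketch of this plain gradient-descent update applied to every weight parameter of O; momentum, weight decay and any optimizer state are deliberately omitted.

import torch

@torch.no_grad()
def gradient_descent_step(parameters, alpha):
    # theta_new = theta_old - alpha * gradient, applied to every weight parameter of O
    for p in parameters:
        if p.grad is not None:
            p -= alpha * p.grad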


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant