CN110288627A - Online multi-object tracking method based on deep learning and data association - Google Patents

Online multi-object tracking method based on deep learning and data association Download PDF

Info

Publication number
CN110288627A
CN110288627A CN201910429444.XA CN201910429444A
Authority
CN
China
Prior art keywords
target
detection
state
frame
detection response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910429444.XA
Other languages
Chinese (zh)
Other versions
CN110288627B (en)
Inventor
陈小波
冀建宇
王彦钧
蔡英凤
王海
陈龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201910429444.XA priority Critical patent/CN110288627B/en
Publication of CN110288627A publication Critical patent/CN110288627A/en
Application granted granted Critical
Publication of CN110288627B publication Critical patent/CN110288627B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06T7/20 — GPHYSICS; G06 COMPUTING; CALCULATING OR COUNTING; G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T7/00 Image analysis; Analysis of motion
    • G06T2207/10016 — G06T2207/00 Indexing scheme for image analysis or image enhancement; G06T2207/10 Image acquisition modality; Video; Image sequence
    • G06T2207/20081 — G06T2207/20 Special algorithmic details; Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an online multi-object tracking method based on deep learning and data association, comprising the following steps: 1. input the image of the current video frame; 2. apply an object detector to obtain all detection responses in the image; 3. extract the appearance features of the detection responses with a deep cosine metric learning model; 4. initialize the target states; 5. predict the position and scale of each target in the next frame with the Kalman filtering algorithm; 6. match targets to detection responses with two-stage data association to obtain the optimal association result; 7. update the states and features of the targets according to the optimal association result of step 6; 8. input the image of the next video frame and repeat steps 2 to 7 until the video ends. Compared with the prior art, the invention achieves correct association between targets and robust, persistent multi-object tracking under complex conditions such as interaction and occlusion between targets and targets with similar appearance.

Description

Online multi-object tracking method based on deep learning and data association
Technical field
The present invention relates to an object tracking method, in particular to an online multi-object tracking method based on deep learning and data association, and belongs to the field of computer vision.
Background technique
Multi-object tracking is a particularly important branch of computer vision and is widely used in various video analysis scenarios, such as autonomous vehicles, robot navigation, intelligent traffic video surveillance, and motion analysis.
The task of online multi-object tracking is to reliably estimate the position of each target frame by frame and to track the same target across frames, estimating the trajectories of multiple targets. In recent years, thanks to the development of deep learning, the performance of object detection algorithms has improved continuously and detection responses have become more reliable. The tracking-by-detection framework has therefore received wide attention, achieved remarkable results, and become the mainstream of current multi-object tracking. Under this framework, an object detector trained offline first detects the targets in each frame independently, obtaining their number and locations; then, according to information such as target appearance and motion, the detections in adjacent frames are associated to achieve target matching and tracking. Tracking-by-detection algorithms can be divided into two classes: offline tracking and online tracking.
At present, tracking-by-detection algorithms still face many challenges. Tracking performance depends heavily on the detector; in complex scenes, when severe occlusion occurs between targets and obstacles or among targets, multi-object tracking algorithms easily lose targets or confuse target identities. In addition, detector noise and drastic changes in target scale can also cause tracking drift.
Summary of the invention
Object of the invention: aiming at the problems of existing multi-object tracking techniques, such as frequent identity switches and tracking drift when targets with similar appearance occlude each other in complex scenes, the invention proposes an online multi-object tracking method based on deep learning and data association.
The invention proposes a new multi-object tracking method that addresses the multi-object tracking problem from several angles. 1) A deep cosine metric learning model builds the appearance model of each target: a multi-layer convolutional network extracts features from the target image, and the cosine between feature vectors serves as the similarity between target appearances, enabling effective discrimination of different targets. 2) Considering the continuity of target appearance over time, a target appearance similarity measure that fuses the appearance features of multiple history frames is constructed, which effectively mitigates the influence of detector defects or mutual occlusion between targets on matching precision. 3) A two-stage data association method based on target state is proposed, which designs separate association policies according to target reliability and performs data association with the Hungarian algorithm. In complex traffic scenes that are crowded and where occlusion occurs frequently, the method achieves accurate and stable multi-object tracking.
Technical solution: an online multi-object tracking method based on deep learning and data association, characterized in that the method comprises the following steps:
Step 1: input the image of the current video frame;
Step 2: apply an object detector to obtain the set D_t = {D_1, D_2, ..., D_M} of all detection responses in the image, where t is the current frame number and D_j is the j-th detection response, expressed as D_j = (x_j, y_j, w_j, h_j), with (x_j, y_j) the center-point coordinates of detection response D_j, w_j and h_j its width and height, and M the total number of detection responses;
Step 3: use the deep cosine metric learning model to extract an appearance feature vector from every detection response in D_t, expressed as {Z_1, Z_2, ..., Z_M}, where Z_j ∈ R^p is the appearance feature of detection response D_j;
Step 4: initialize the target states. Target states are divided into 4 classes: initial state, tracking state, lost state, and deleted state. If t = 1, i.e., the first frame of the video, generate the target set T_t = {T_1, T_2, ..., T_N} with N = M, where each target T_j corresponds to detection response D_j, set the state of each target T_j to the initial state, and go to step 1; otherwise, go to step 5;
Step 5: apply the Kalman filtering algorithm to predict the position and scale of each target T_i in the target set T_{t−1} in the current frame, expressed as (x̂_i, ŷ_i, ŵ_i, ĥ_i), where (x̂_i, ŷ_i) are the predicted center-point coordinates and ŵ_i, ĥ_i are the predicted width and height;
Step 6: match targets to detection responses with two-stage data association to obtain the optimal association result;
Step 7: update the states and features of the targets according to the optimal association result of step 6;
Step 8: input the image of the next video frame and repeat steps 2 to 7 until the video ends.
Preferably, in step 6 the matching association between targets and detection responses based on two-stage data association comprises:
(a) based on the states of all targets in the previous frame, divide the target set T_{t−1} = {T_1, T_2, ..., T_N} into two classes Ω_1 and Ω_2 with Ω_1 ∪ Ω_2 = T_{t−1}, where Ω_1 consists of the targets in the initial and tracking states, Ω_2 consists of the targets in the lost state, and N is the total number of targets;
(b) compute the matching similarity between every target in Ω_1 and every detection response in D_t to obtain the similarity matrix A_1; with −A_1 as the association cost matrix, associate the targets in Ω_1 with the detection responses in D_t and solve the optimal association with the Hungarian algorithm; according to the association result, partition Ω_1 and D_t as D_t = D^A ∪ D^B, where the successfully associated targets in Ω_1 are matched with the detections in D^A, the remaining targets of Ω_1 form the set of unassociated targets, and D^B is the set of detection responses not associated in the first stage;
(c) compute the matching similarity between every target in Ω_2 and every detection response in D^B to obtain the similarity matrix A_2; with −A_2 as the association cost matrix, associate the targets in Ω_2 with the detection responses in D^B and solve the optimal association with the Hungarian algorithm; according to the association result, partition Ω_2 and D^B analogously into successfully associated targets and detections, the set of unassociated targets, and the set of detection responses not associated in the second stage.
Preferably, computing the matching similarity between every target in Ω_1 and every detection response in D_t comprises:
(a) compute the appearance similarity Λ^A(i, j) of target T_i in Ω_1 and detection response D_j in D_t as

Λ^A(i, j) = Σ_{k=1}^{K} ω_k · ⟨X_i(t−k), Z_j⟩

where ⟨·, ·⟩ is the vector inner product, X_i(t−k) denotes the appearance feature vector of target T_i in frame t−k, Z_j denotes the appearance feature vector of detection response D_j, and ω_k denotes the weight of X_i(t−k), derived from the matching cost C_i(t−k) of target T_i with its associated detection in frame t−k;
(b) compute the shape similarity Λ^S(i, j) of target T_i in Ω_1 and detection response D_j in D_t;
(c) compute the motion similarity Λ^M(i, j) of target T_i in Ω_1 and detection response D_j in D_t as the intersection over union (IoU) of the predicted region B̂_i of target T_i and the region B_j of detection response D_j:

Λ^M(i, j) = area(B̂_i ∩ B_j) / area(B̂_i ∪ B_j)

where area(·) denotes area;
(d) compute the matching similarity A_1(i, j) of target T_i in Ω_1 and detection response D_j in D_t by combining the appearance, shape, and motion similarities above.
Preferably, computing the matching similarity between every target in Ω_2 and every detection response in D^B comprises:
(a) compute, using the appearance and shape similarity formulas above, the appearance similarity Λ^A(i, j) and shape similarity Λ^S(i, j) of target T_i in Ω_2 and detection response D_j in D^B;
(b) compute the search radius r_i of target T_i, where Δt_i is the difference between the current frame number and the last frame in which target T_i was in the tracking state and α is a constant; with the predicted position of target T_i in the current frame as the center and r_i as the radius, define the search region R_i of target T_i;
(c) compute the matching similarity A_2(i, j) of target T_i in Ω_2 and detection response D_j in D^B from the appearance and shape similarities, gated by the indicator function I(R_i ∩ D_j > 0), which equals 1 when the search region R_i overlaps detection response D_j and 0 otherwise.
Preferably, step 7, updating the states and features of the targets according to the optimal association result of step 6, comprises:
(a) for each detection response left unassociated after both stages: a new target may have appeared in the video, so initialize a new target and set its state to the initial state; when a target in the initial state appears in f_init consecutive frames, assign it an ID, set its state parameters, and convert it to the tracking state;
(b) for each target in Ω_1 with an associated detection response: keep the target state unchanged, update the target's state with the Kalman filtering algorithm, and save the target's appearance feature vector of the current frame;
(c) for each target in Ω_1 without an associated detection response: convert its state from the tracking state to the lost state, and retain the target's saved appearance feature vectors;
(d) for each target in Ω_2 with an associated detection response: convert its state from the lost state to the tracking state, update its state with the Kalman filtering algorithm, and save its appearance feature vector of the current frame;
(e) for each target in Ω_2 without an associated detection response: keep its state unchanged;
(f) when a target has been in the lost state for f_del consecutive frames, convert it to the deleted state and destroy the target.
Beneficial effects: 1. By learning the appearance model of each target with a deep cosine metric learning model, the invention extracts features from target images with a multi-layer convolutional network and uses the cosine between feature vectors as the similarity between target appearances, achieving effective discrimination of different targets and effectively overcoming the ID switching problem caused by interacting targets with similar appearance in complex scenes. 2. Considering the continuity of target appearance over time, the invention constructs a target appearance similarity measure that fuses the appearance features of multiple history frames, effectively mitigating the influence of detector defects or mutual occlusion between targets on matching precision. 3. By adopting the two-stage data association method based on target state, designing separate association policies for the different target states, and performing data association with the Hungarian algorithm, the invention effectively alleviates the track fragmentation problem caused by association failures.
Description of the drawings
Fig. 1 is the flow chart of the invention;
Fig. 2 is the framework of the deep cosine metric learning model of the invention;
Fig. 3 is the target state transition diagram of the invention.
Specific embodiment
The technical solution of the invention is explained in further detail below in conjunction with the drawings and a specific embodiment, taking online pedestrian multi-object tracking as an example; the scope of the invention is not limited to the following embodiment.
Offline training stage:
Offline training of the deep cosine metric learning model:
Given a training sample set {(x_i, y_i), i = 1, 2, 3, ..., L}, where x_i ∈ R^{128×64} is a normalized pedestrian image, y_i ∈ {1, 2, 3, ..., K} is the corresponding pedestrian class label, and L is the number of training samples, the deep cosine metric learning model learns a feature extraction function f(x) from the training samples that maps an input pedestrian image into an embedding feature space, and then applies a cosine softmax classifier in the embedding space to maximize the posterior probability of the class. The cosine softmax classifier is defined as follows:

P(y = k | f(x)) = exp(τ · ω̃_k^T f(x)) / Σ_{n=1}^{K} exp(τ · ω̃_n^T f(x))
where ω̃_k = ω_k / ‖ω_k‖ is the normalized weight vector of class k, τ is a scaling parameter, and f(x) is the feature vector extracted from the image, which has unit length. Since ω̃_k and f(x) both have unit length, the term ω̃_k^T f(x) in the formula is the cosine of the angle between the two vectors, so maximizing the posterior probability P(y = k | f(x)) reduces the angle between the targets of each class and the corresponding weight vector.
The cross-entropy loss function for training the deep cosine metric learning model is:

ℓ = −(1/L) Σ_{i=1}^{L} Σ_{k=1}^{K} I(y_i = k) · log P(y = k | f(x_i))

where I(y_i = k) is an indicator function: I(y_i = k) = 1 when y_i = k, and 0 otherwise.
In this embodiment, the feature extraction function f(x) is implemented with a convolutional neural network (CNN) whose structure is shown in Fig. 2: the input image size is 128 × 64, the output feature vector length is 128, and the activation function of every layer is the exponential linear unit (ELU). The network is trained on pedestrian images from the Market-1501 database, and the network parameters are updated with the Adam optimization method.
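The cosine softmax classifier and cross-entropy loss above can be sketched in NumPy as follows. This is a minimal illustration rather than the patent's CNN training code, and τ = 5.0 in the example is an arbitrary assumed value; the patent leaves τ as a free scaling parameter.

```python
import numpy as np

def cosine_softmax(features, weights, tau=5.0):
    """P(y = k | f(x)) with L2-normalized features and class weight vectors."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = tau * f @ w.T                       # tau * cosine of the angle to each class weight
    logits -= logits.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean cross-entropy over the batch, matching the loss defined above."""
    return -float(np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))
```

In the embodiment this loss would be backpropagated through the CNN and minimized with Adam.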
Online pedestrian multi-object tracking stage:
Specifically, as shown in Fig. 1, the invention proposes an online multi-object tracking method based on deep learning and data association; the key steps of the method are as follows:
Step 1: input the image of the current video frame;
Step 2: use the detector to obtain the set D_t = {D_1, D_2, ..., D_M} of all detection responses in the image, where t is the current frame number and D_j is the j-th detection response, expressed as D_j = (x_j, y_j, w_j, h_j), with (x_j, y_j) the center-point coordinates of detection response D_j, w_j and h_j its width and height, and M the total number of detection responses;
In this embodiment, the pedestrian detector used is DPM (Deformable Parts Model).
Step 3: use the offline-trained deep cosine metric learning model to extract an appearance feature vector from every detection response in D_t, expressed as {Z_1, Z_2, ..., Z_M}, where Z_j ∈ R^p is the appearance feature extracted from detection response D_j;
Step 4: initialize the target states. Target states are divided into 4 classes: initial state, tracking state, lost state, and deleted state. If t = 1, i.e., the first frame of the video, generate the target set T_t = {T_1, T_2, ..., T_N} with N = M, where each target T_j corresponds to detection response D_j, set the state of each target T_j to the initial state, and go to step 1; otherwise, go to step 5.
Step 5: apply the Kalman filtering algorithm to predict the position and scale of each target T_i in T_{t−1} in the current frame, expressed as (x̂_i, ŷ_i, ŵ_i, ĥ_i), where (x̂_i, ŷ_i) are the predicted center-point coordinates and ŵ_i, ĥ_i are the predicted width and height;
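Step 5 names the Kalman filtering algorithm for box prediction but does not specify the motion model. A minimal constant-velocity sketch follows; the state layout, noise values, and class name are assumptions for illustration, not taken from the patent.

```python
import numpy as np

class KalmanBoxPredictor:
    """Constant-velocity Kalman prediction for one target box (cx, cy, w, h)."""

    def __init__(self, box):
        # State: box followed by its per-frame velocities.
        self.x = np.array(list(box) + [0.0, 0.0, 0.0, 0.0])
        self.P = np.eye(8)           # state covariance
        self.F = np.eye(8)           # transition matrix: position += velocity
        for i in range(4):
            self.F[i, i + 4] = 1.0
        self.Q = 0.01 * np.eye(8)    # process noise (assumed magnitude)

    def predict(self):
        """Return the predicted (cx, cy, w, h) in the next frame."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return tuple(self.x[:4])
```

The full filter would also correct the state with the associated detection in step 7; only the prediction half is sketched here.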
Step 6: match targets to detection responses with two-stage data association to obtain the optimal association result;
6.1: based on the states of all targets in the previous frame, divide the target set T_{t−1} = {T_1, T_2, ..., T_N} into two classes Ω_1 and Ω_2 with Ω_1 ∪ Ω_2 = T_{t−1}, where Ω_1 consists of the targets in the initial and tracking states, Ω_2 consists of the targets in the lost state, and N is the total number of targets;
6.2: compute the matching similarity between every target in Ω_1 and every detection response in D_t to obtain the similarity matrix A_1; with −A_1 as the association cost matrix, associate the targets in Ω_1 with the detection responses in D_t and solve the optimal association with the Hungarian algorithm; according to the association result, partition Ω_1 and D_t as D_t = D^A ∪ D^B, where the successfully associated targets in Ω_1 are matched with the detections in D^A, the remaining targets of Ω_1 form the set of unassociated targets, and D^B is the set of detection responses not associated in the first stage. The specific steps for computing the similarity matrix A_1 are as follows:
(a) compute the appearance similarity Λ^A(i, j) of target T_i in Ω_1 and detection response D_j in D_t as

Λ^A(i, j) = Σ_{k=1}^{K} ω_k · ⟨X_i(t−k), Z_j⟩

where ⟨·, ·⟩ is the vector inner product, X_i(t−k) denotes the appearance feature vector of target T_i in frame t−k, Z_j denotes the appearance feature vector of detection response D_j, and ω_k denotes the weight of X_i(t−k), derived from the matching cost C_i(t−k) of target T_i with its associated detection in frame t−k.
In this embodiment, the appearance features of each target in the most recent 6 frames are kept as history, i.e., K = 6.
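The fused multi-frame appearance similarity above can be sketched as follows. The uniform default weights stand in for the cost-derived weights ω_k, whose exact formula is not reproduced in this text, and the function name is illustrative.

```python
import numpy as np

def appearance_similarity(history, z, weights=None):
    """Weighted cosine similarity between a detection feature z and a
    target's saved feature vectors X_i(t-k) from the last K frames."""
    z = z / np.linalg.norm(z)
    cosines = [float(np.dot(x / np.linalg.norm(x), z)) for x in history]
    if weights is None:
        # Uniform weights as a placeholder for the cost-derived omega_k.
        weights = np.full(len(cosines), 1.0 / len(cosines))
    return float(np.dot(weights, cosines))
```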
(b) compute the shape similarity Λ^S(i, j) of target T_i in Ω_1 and detection response D_j in D_t;
(c) compute the motion similarity Λ^M(i, j) of target T_i in Ω_1 and detection response D_j in D_t as the intersection over union (IoU) of the predicted region B̂_i of target T_i and the region B_j of detection response D_j:

Λ^M(i, j) = area(B̂_i ∩ B_j) / area(B̂_i ∪ B_j)

where area(·) denotes area.
(d) compute the matching similarity A_1(i, j) of target T_i in Ω_1 and detection response D_j in D_t by combining the appearance, shape, and motion similarities above.
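The motion similarity of step (c), the IoU between the predicted target box and a detection box, can be sketched as follows; boxes use the (cx, cy, w, h) convention of the detection responses above, and the function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```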
6.3: compute the matching similarity between every target in Ω_2 and every detection response in D^B to obtain the similarity matrix A_2; with −A_2 as the association cost matrix, associate the targets in Ω_2 with the detection responses in D^B and solve the optimal association with the Hungarian algorithm; according to the association result, partition Ω_2 and D^B into successfully associated targets and detections, the set of unassociated targets, and the set of detection responses not associated in the second stage. The specific steps for computing the similarity matrix A_2 are as follows:
(a) compute, using the appearance and shape similarity formulas of 6.2, the appearance similarity Λ^A(i, j) and shape similarity Λ^S(i, j) of target T_i in Ω_2 and detection response D_j in D^B;
(b) compute the search radius r_i of target T_i, where Δt_i is the difference between the current frame number and the last frame in which target T_i was in the tracking state and α is a constant; in this embodiment, α = 0.15;
(c) with the predicted position of target T_i in the current frame as the center and r_i as the radius, define the search region R_i of target T_i;
(d) compute the matching similarity A_2(i, j) of target T_i in Ω_2 and detection response D_j in D^B from the appearance and shape similarities, gated by the indicator function I(R_i ∩ D_j > 0), which equals 1 when detection response D_j overlaps the search region R_i and 0 otherwise.
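Each association stage above turns the negated similarity matrix into a cost matrix and solves the optimal assignment with the Hungarian algorithm. The self-contained sketch below uses exhaustive search, which yields the same optimum as the Hungarian algorithm for the tiny matrices shown (in practice `scipy.optimize.linear_sum_assignment` on −A would be used); it assumes at least as many detections as targets, and the acceptance threshold is an assumed parameter, not taken from the patent.

```python
import itertools

def associate(similarity, threshold=0.1):
    """Return (matches, unmatched_targets, unmatched_detections) maximizing
    total similarity over one-to-one target/detection assignments."""
    n_t, n_d = len(similarity), len(similarity[0])
    best, best_score = [], float("-inf")
    # Try every injective assignment of targets to detections (n_t <= n_d).
    for cols in itertools.permutations(range(n_d), min(n_t, n_d)):
        pairs = list(zip(range(n_t), cols))
        score = sum(similarity[r][c] for r, c in pairs)
        if score > best_score:
            best_score, best = score, pairs
    # Reject pairs whose similarity falls below the acceptance threshold.
    matches = [(r, c) for r, c in best if similarity[r][c] >= threshold]
    unmatched_t = [r for r in range(n_t) if r not in {m[0] for m in matches}]
    unmatched_d = [c for c in range(n_d) if c not in {m[1] for m in matches}]
    return matches, unmatched_t, unmatched_d
```

Stage one would call this with A_1 over Ω_1 and D_t; stage two with A_2 over Ω_2 and the leftover detections D^B.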
Step 7: as shown in Fig. 3, update the states and features of the targets according to the optimal association result of step 6. The specific steps are as follows:
(a) for each detection response left unassociated after both stages: a new target may have appeared in the video, so initialize a new target and set its state to the initial state; when a target in the initial state appears in f_init consecutive frames, assign it an ID, set its state parameters, and convert it to the tracking state.
(b) for each target in Ω_1 with an associated detection response: keep the target state unchanged, update the target's state with the Kalman filtering algorithm, and save the target's appearance feature vector of the current frame.
(c) for each target in Ω_1 without an associated detection response: convert its state from the tracking state to the lost state, and retain the target's saved appearance feature vectors.
(d) for each target in Ω_2 with an associated detection response: convert its state from the lost state to the tracking state, update its state with the Kalman filtering algorithm, and save its appearance feature vector of the current frame.
(e) for each target in Ω_2 without an associated detection response: keep its state unchanged.
(f) when a target has been in the lost state for f_del consecutive frames, convert it to the deleted state and destroy the target.
In this embodiment, f_init = 3 and f_del = 20.
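The state transitions of step 7 (Fig. 3), with the embodiment's f_init = 3 and f_del = 20, can be sketched as a small state machine; the class and field names are illustrative, not from the patent, and unmatched initial-state targets are simply left unchanged in this sketch.

```python
F_INIT, F_DEL = 3, 20  # embodiment parameters: frames to confirm / to delete

class Target:
    def __init__(self):
        self.state = "initial"
        self.hits = 1     # consecutive frames with a matched detection (creation counts)
        self.misses = 0   # consecutive frames spent in the lost state

    def update(self, matched):
        """Advance one frame given whether a detection was associated."""
        if matched:
            self.hits += 1
            self.misses = 0
            if self.state == "initial" and self.hits >= F_INIT:
                self.state = "tracking"   # an ID would be assigned here
            elif self.state == "lost":
                self.state = "tracking"
        else:
            if self.state == "tracking":
                self.state = "lost"
            if self.state == "lost":
                self.misses += 1
                if self.misses >= F_DEL:
                    self.state = "deleted"
        return self.state
```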
Step 8: input the image of the next video frame and repeat steps 2 to 7 until the video ends.
Implementation results:
Following the above steps, we conducted experiments on the MOT16 dataset of the multi-object tracking challenge MOT Challenge. All experiments were run on a PC with the following main specifications: Intel Core i7 2.3 GHz CPU and 16 GB RAM. The algorithm was implemented in Python.
The results show that the technical solution can effectively track the detected pedestrians in a video and maintain persistent tracking even under pedestrian occlusion or detection noise, outputting correct target trajectories. Moreover, the program runs efficiently, processing about 10 input frames per second. The experiment shows that the multi-object tracking algorithm of this embodiment achieves accurate and fast online pedestrian tracking.
In summary, the invention proposes an online multi-object tracking method based on deep learning and data association. The method is widely applicable to target tracking in various video scenarios, such as pedestrian tracking in video surveillance, providing technical support for smart security systems, and vehicle tracking in complex traffic scenes, providing technical support for autonomous driving. The invention follows the tracking-by-detection framework and converts the online multi-object tracking problem into a data association problem: first, a trained object detector extracts all detection responses in the image; then a deep cosine metric learning model extracts an appearance feature vector from each detection response; the association costs between the targets and the detection responses are computed by combining cues such as target appearance, motion, and shape; the optimal matching of targets and detections is achieved with the Hungarian algorithm in two-stage data association; finally the target states are updated according to the association results.
The specific embodiment described above further details the background, technical solution, and beneficial effects of the invention. As will be readily appreciated by those skilled in the art, the foregoing is merely a specific embodiment of the invention and is not intended to limit the scope of protection of the invention. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the invention shall be included in the scope of protection of the invention.

Claims (5)

1. An online multi-object tracking method based on deep learning and data association, characterized in that the method comprises the following steps:
Step 1: input the image of the current video frame;
Step 2: apply an object detector to obtain the set D_t = {D_1, D_2, ..., D_M} of all detection responses in the image, where t is the current frame number and D_j is the j-th detection response, expressed as D_j = (x_j, y_j, w_j, h_j), with (x_j, y_j) the center-point coordinates of detection response D_j, w_j and h_j its width and height, and M the total number of detection responses;
Step 3: use a deep cosine metric learning model to extract an appearance feature vector from every detection response in D_t, expressed as {Z_1, Z_2, ..., Z_M}, where Z_j ∈ R^p is the appearance feature of detection response D_j;
Step 4: initialize the target states. Target states are divided into 4 classes: initial state, tracking state, lost state, and deleted state. If t = 1, i.e., the first frame of the video, generate the target set T_t = {T_1, T_2, ..., T_N} with N = M, where each target T_j corresponds to detection response D_j, set the state of each target T_j to the initial state, and go to step 1; otherwise, go to step 5;
Step 5: apply the Kalman filtering algorithm to predict the position and scale of each target T_i in the target set T_{t−1} in the current frame, expressed as (x̂_i, ŷ_i, ŵ_i, ĥ_i), where (x̂_i, ŷ_i) are the predicted center-point coordinates and ŵ_i, ĥ_i are the predicted width and height;
Step 6: match targets to detection responses with two-stage data association to obtain the optimal association result;
Step 7: update the states and features of the targets according to the optimal association result of step 6;
Step 8: input the image of the next video frame and repeat steps 2 to 7 until the video ends.
2. The online multi-object tracking method based on deep learning and data association according to claim 1, characterized in that in step 6 the matching association between targets and detection responses based on two-stage data association comprises:
(a) based on the states of all targets in the previous frame, divide the target set T_{t−1} = {T_1, T_2, ..., T_N} into two classes Ω_1 and Ω_2 with Ω_1 ∪ Ω_2 = T_{t−1}, where Ω_1 consists of the targets in the initial and tracking states, Ω_2 consists of the targets in the lost state, and N is the total number of targets;
(b) compute the matching similarity between every target in Ω_1 and every detection response in D_t to obtain the similarity matrix A_1; with −A_1 as the association cost matrix, associate the targets in Ω_1 with the detection responses in D_t and solve the optimal association with the Hungarian algorithm; according to the association result, partition Ω_1 and D_t as D_t = D^A ∪ D^B, where the successfully associated targets in Ω_1 are matched with the detections in D^A, the remaining targets of Ω_1 form the set of unassociated targets, and D^B is the set of detection responses not associated in the first stage;
(c) compute the matching similarity between every target in Ω_2 and every detection response in D^B to obtain the similarity matrix A_2; with −A_2 as the association cost matrix, associate the targets in Ω_2 with the detection responses in D^B and solve the optimal association with the Hungarian algorithm; according to the association result, partition Ω_2 and D^B analogously into successfully associated targets and detections, the set of unassociated targets, and the set of detection responses not associated in the second stage.
3. The online multi-object tracking method based on deep learning and data association according to claim 2, characterized in that computing the matching similarity between every target in Ω_1 and every detection response in D_t comprises:
(a) compute the appearance similarity Λ^A(i, j) of target T_i in Ω_1 and detection response D_j in D_t as

Λ^A(i, j) = Σ_{k=1}^{K} ω_k · ⟨X_i(t−k), Z_j⟩

where ⟨·, ·⟩ is the vector inner product, X_i(t−k) denotes the appearance feature vector of target T_i in frame t−k, Z_j denotes the appearance feature vector of detection response D_j, and ω_k denotes the weight of X_i(t−k), derived from the matching cost C_i(t−k) of target T_i with its associated detection in frame t−k;
(b) compute the shape similarity Λ^S(i, j) of target T_i in Ω_1 and detection response D_j in D_t;
(c) compute the motion similarity Λ^M(i, j) of target T_i in Ω_1 and detection response D_j in D_t as the intersection over union (IoU) of the predicted region B̂_i of target T_i and the region B_j of detection response D_j:

Λ^M(i, j) = area(B̂_i ∩ B_j) / area(B̂_i ∪ B_j)

where area(·) denotes area;
(d) compute the matching similarity A_1(i, j) of target T_i in Ω_1 and detection response D_j in D_t by combining the appearance, shape, and motion similarities above.
4. it is according to claim 2 a kind of based on deep learning and the associated online multi-object tracking method of data, it is special Sign is that the method calculates Ω2In each target and DBIn each detection response matching similarity, comprising:
(a) Ω is calculated using above-mentioned formula (1), (2), (3)2In target TiWith DBIn detection respond DjAppearance similarity degreeAnd shape similarity
(b) target T is calculatediSearch radius ri:
WhereinFor current frame number and target TiThe difference of maximum frame number when in tracking mode, α are constant.With target Ti Predicted position in the current frameCentered on, riFor radius, target T is definediRegion of search Ri
(c) Compute the matching similarity A_2(i, j) between target T_i in Ω_2 and detection response D_j in the detection response set D_B, where I(R_i ∩ D_j > 0) is an indicator function: I(R_i ∩ D_j > 0) = 1 when the search region R_i and detection response D_j overlap, and I(R_i ∩ D_j > 0) = 0 otherwise.
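A sketch of the gated second-stage similarity, under two stated assumptions: the search region R_i is modeled as a circle of radius r_i around the predicted position, and appearance and shape similarity are combined by a simple product (the patent's exact combination is not visible in this excerpt):

```python
import math

def search_region_overlaps(center, radius, box):
    """True when the circular search region (center, radius) intersects
    the detection box (x1, y1, x2, y2): clamp the center into the box
    and compare the remaining distance with the radius."""
    cx = min(max(center[0], box[0]), box[2])
    cy = min(max(center[1], box[1]), box[3])
    return math.hypot(center[0] - cx, center[1] - cy) <= radius

def second_stage_similarity(app_sim, shape_sim, center, radius, box):
    """A_2(i, j) = I(R_i ∩ D_j > 0) * f(appearance, shape); the product
    form of f is an assumption for illustration."""
    if not search_region_overlaps(center, radius, box):
        return 0.0  # indicator gate: detections outside R_i score zero
    return app_sim * shape_sim
```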
5. The online multi-object tracking method based on deep learning and data association according to claim 1, characterized in that step 7, updating the states and features of the targets according to the optimal association results of step 6, comprises:
(a) Each unassociated detection response of the second stage indicates that a new target may have appeared in the video; initialize a new target and set its state to the initial state. When a target in the initial state appears in f_init consecutive frames, assign it a target ID, set its state parameters, and convert the target to the tracking state;
(b) For each target in the tracking state with an associated detection response, keep the target state unchanged, update the target state with the Kalman filtering algorithm, and save the target's appearance feature vector in the current frame;
(c) For each target in the tracking state with no associated detection response, convert the target state from the tracking state to the lost state, and save the target's appearance feature vector in the current frame;
(d) For each target in the lost state with an associated detection response, convert the target state from the lost state to the tracking state, update the target state with the Kalman filtering algorithm, and save the target's appearance feature vector in the current frame;
(e) For each target in the lost state with no associated detection response, keep the target state unchanged;
(f) When a target remains in the lost state for f_del consecutive frames, convert it to the deleted state and destroy the target.
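The target lifecycle of claim 5 can be sketched as a small state machine. The f_init and f_del values are parameters the patent leaves open; dropping an unconfirmed target on its first missed frame is a common choice that this excerpt does not specify, and the Kalman update and feature saving are indicated only by comments:

```python
INITIAL, TRACKING, LOST, DELETED = "initial", "tracking", "lost", "deleted"

class TrackState:
    """Per-target lifecycle: confirmed after f_init consecutive detected
    frames, destroyed after f_del consecutive lost frames."""

    def __init__(self, f_init=3, f_del=10):
        self.state = INITIAL
        self.f_init, self.f_del = f_init, f_del
        self.hits = 1    # consecutive detected frames while initial
        self.misses = 0  # consecutive lost frames

    def update(self, associated):
        if self.state == INITIAL:
            if associated:
                self.hits += 1
                if self.hits >= self.f_init:
                    self.state = TRACKING  # assign ID, set state parameters
            else:
                self.state = DELETED  # assumed: drop unconfirmed targets
        elif self.state == TRACKING:
            if associated:
                self.misses = 0  # Kalman update, save appearance feature
            else:
                self.state, self.misses = LOST, 1  # save appearance feature
        elif self.state == LOST:
            if associated:
                self.state, self.misses = TRACKING, 0  # Kalman update
            else:
                self.misses += 1
                if self.misses >= self.f_del:
                    self.state = DELETED  # destroy the target
        return self.state
```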
CN201910429444.XA 2019-05-22 2019-05-22 Online multi-target tracking method based on deep learning and data association Active CN110288627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910429444.XA CN110288627B (en) 2019-05-22 2019-05-22 Online multi-target tracking method based on deep learning and data association

Publications (2)

Publication Number Publication Date
CN110288627A true CN110288627A (en) 2019-09-27
CN110288627B CN110288627B (en) 2023-03-31

Family

ID=68002271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910429444.XA Active CN110288627B (en) 2019-05-22 2019-05-22 Online multi-target tracking method based on deep learning and data association

Country Status (1)

Country Link
CN (1) CN110288627B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796687A (en) * 2019-10-30 2020-02-14 电子科技大学 Sky background infrared imaging multi-target tracking method
CN111476826A (en) * 2020-04-10 2020-07-31 电子科技大学 Multi-target vehicle tracking method based on SSD target detection
CN111932588A (en) * 2020-08-07 2020-11-13 浙江大学 Tracking method of airborne unmanned aerial vehicle multi-target tracking system based on deep learning
CN112149762A (en) * 2020-11-24 2020-12-29 北京沃东天骏信息技术有限公司 Target tracking method, target tracking apparatus, and computer-readable storage medium
CN112163473A (en) * 2020-09-15 2021-01-01 郑州金惠计算机系统工程有限公司 Multi-target tracking method and device, electronic equipment and computer storage medium
CN112581496A (en) * 2019-09-29 2021-03-30 四川大学 Multi-target pedestrian trajectory tracking method based on reinforcement learning
CN113077495A (en) * 2020-01-06 2021-07-06 广州汽车集团股份有限公司 Online multi-target tracking method, system, computer equipment and readable storage medium
CN117292327A (en) * 2023-11-23 2023-12-26 安徽启新明智科技有限公司 Method, device, equipment and medium for associating targets
CN117495917A (en) * 2024-01-03 2024-02-02 山东科技大学 Multi-target tracking method based on JDE multi-task network model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104835178A (en) * 2015-02-02 2015-08-12 郑州轻工业学院 Low SNR(Signal to Noise Ratio) motion small target tracking and identification method
CN109360226A (en) * 2018-10-17 2019-02-19 武汉大学 A kind of multi-object tracking method based on time series multiple features fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant