CN109360226B

CN109360226B - Multi-target tracking method based on time series multi-feature fusion

Info

Publication number: CN109360226B
Application number: CN201811210852.8A
Authority: CN
Inventors: 田胜; 陈丽琼; 邹炼; 范赐恩; 杨烨; 胡雨涵
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2018-10-17
Filing date: 2018-10-17
Publication date: 2021-09-24
Anticipated expiration: 2038-10-17
Also published as: CN109360226A

Abstract

The invention provides a multi-target tracking method based on time series multi-feature fusion. The method comprises the steps of obtaining the category and the candidate frame of a tracking target according to a multi-target detection algorithm; calculating a motion prediction center point and screening candidate frames by using a convolution network and a correlation filter; calculating an appearance similarity score; calculating a motion similarity score; calculating an interactive feature similarity score; converting the candidate frame in the tracking frame of the current frame image after screening, and updating the characteristic information of the tracking target; calculating a moving prediction central point of a tracking target which is not matched with the candidate frame and screening the candidate frame; associating unmatched candidate frames with the existing tracking target to construct a new tracking target; calculating the overlapping degree of each tracking target by adopting an intersection ratio; and identifying the tracking target continuously in the lost state in the multi-frame image as the lost target. Compared with the prior art, the invention improves the tracking precision.

Description

Multi-target tracking method based on time series multi-feature fusion

Technical Field

The invention relates to the technical field of computer vision and target tracking, in particular to a multi-target tracking method based on time series multi-feature fusion.

Background

The target tracking means that in an image sequence, a target which is interested in a system is detected, the target is accurately positioned, and then the motion information of the target is continuously updated in the moving process of the target, so that the target is continuously tracked. The target tracking can be divided into multi-target tracking and single-target tracking, the single-target tracking only focuses on one interested target, the task is to design a motion model or an appearance model to solve the influence of factors such as scale transformation, target shielding, illumination and the like, and the image position corresponding to the interested target is calibrated frame by frame. Compared with single target tracking, multi-target tracking also needs to solve two additional tasks: discovering and processing newly appearing and disappearing objects in the video sequence; individual target-specific identities are maintained.

Initialization of tracking targets, frequent occlusion, target leaving detection area, similar appearance of multiple targets, and interaction between multiple targets all add difficulty to multi-target tracking. In order to timely judge newly appearing targets and disappearing targets, multi-target tracking algorithms often need multi-target detection as the basis for algorithm implementation.

In recent years, with the development of deep learning, the development of the computer vision field is very rapid. The target detection algorithm is very accurate and has higher processing speed. However, in the field of multi-target tracking, because the difficulty of multi-target tracking is not completely solved, the data association algorithm based on detection still has a great promotion space. The invention has the innovation points that the positions of all targets are predicted by using a related filtering algorithm, the dependence degree of a detection algorithm is reduced, an LSTM (Long Short-Term Memory) network framework based on the positions, appearances, motions and interactive multi-features of objects is provided, the problem of multi-target shielding is solved by extracting a feature model with high discrimination degree, and the precision of multi-target tracking is improved.

At present, a popular mode in the field of multi-target tracking is a data association algorithm depending on a detector, the method well solves the problems of target initialization, extinction, scale transformation and the like, but still cannot well solve the problems of excessive dependence on the performance of the detector, mutual shielding among multiple targets, target areas with similar appearances and the like.

Disclosure of Invention

In order to solve the technical problem, the invention provides a multi-target tracking method based on time series multi-feature data association.

The technical scheme of the invention is a multi-target tracking method based on time series multi-feature data association, which specifically comprises the following steps:

step 1: detecting a tracking target in the frame image according to an SSD multi-target detection algorithm, and counting the category of the tracking target and a candidate frame of the tracking target by comparing the confidence coefficient of the SSD detected tracking target with a confidence coefficient threshold;

step 2: extracting convolution characteristics of a tracking target in a position frame of a current frame by using a convolution network, calculating a response confidence score of each position in the current frame image through a correlation filter of the tracking target, defining a point with the highest score as a mobile prediction central point of the tracking target under the current frame image, and screening candidate frames through the mobile prediction central point;

and step 3: calculating appearance similarity scores of the tracking target in the tracking state or the lost state and the screened candidate frame;

and 4, step 4: calculating the motion similarity scores of the tracking target in the tracking state or the lost state and the screened candidate frame;

and 5: calculating interactive feature similarity scores of the tracking target in the tracking state or the lost state and the screened candidate frames;

step 6: if the tracking target in the tracking state or the lost state is matched with the candidate frame, comparing the total similarity score with a matching score threshold value, when the total similarity score is greater than the matching score threshold value, converting the candidate frame into a tracking frame of the tracking target in the current frame image, and updating the appearance characteristic, the speed characteristic and the interactive characteristic information of the tracking target; if the tracking target in the tracking state or the lost state is not matched with the candidate frame, updating the state information of the tracking target through the step 2;

and 7: associating unmatched candidate frames with existing tracking targets, determining the unmatched candidate frames as new tracking targets, initializing the new tracking targets, establishing the new tracking targets, constructing a position feature model, an appearance feature model, a speed feature model and an interaction feature model of the new tracking targets, updating the states of the position feature model, the appearance feature model, the speed feature model and the interaction feature model into tracking states, and performing data association matching tracking in subsequent frame images;

and 8: retrieving all tracking targets in a tracking state of the current frame again, and calculating the overlapping degree of all tracking targets by adopting an intersection ratio;

and step 9: and recognizing the tracking target continuously in a lost state in the continuous multi-frame images as a disappeared target, storing the data information of the tracking state, and not performing data matching operation on the tracking target.

Preferably, the frame image in step 1 is an m-th image, and the number of categories of the tracking target in step 1 is N_mIn step 1, the candidate frame of the tracking target is:

D_i,m＝{x_i,m∈[l_i,m,l_i,m+lenth_i,m],y_i,m∈[w_i,m,w_i,m+width_i,m]|(x_i,m,y_i,m)},i∈[1,K_m]

wherein, K_mNumber of candidate frames for tracking target in mth frame image, l_i,mCoordinates of the start point of the X axis of the frame candidate for the i-th tracking target in the m-th frame image, w_i,mLength as the coordinates of the start point of the Y axis of the frame candidate of the ith tracking target in the mth frame image_i,mWidth for the length of the i-th tracked target candidate frame in the m-th frame image_i,mA width of a candidate frame for an ith tracking target in an mth frame image;

preferably, the convolutional network in the step 2 is a VGG16 network pre-trained in an ImageNet classification task, and a first layer of feature vectors of a tracking target position frame are extracted through a VGG16 network;

two-dimensional feature vector through channel c

The interpolation model of (2) is to calculate the two-dimensional feature vector of the channel c

Converting into a feature vector of a one-dimensional continuous space:

wherein the content of the first and second substances,

is a two-dimensional feature vector of channel c, b_cIs defined as a cubic interpolation function of three, N_cIs composed of

L is the length of the eigenvector of the one-dimensional continuous space, and Channel is the number of channels;

the convolution operator is:

wherein, y_i,mIs a response value of the tracking target i of the mth image,

is the two-dimensional feature vector of Channel c, Channel is the number of channels,

the feature vectors of the one-dimensional continuum of channels c,

the correlation filter is used for tracking the channel c of the target i in the mth frame image;

training the correlation filter through the training samples is:

given n training sample pairs { (y)_i,q,y'_i,q)}(q∈[m-n,m-1]) Training is carried out to obtain a correlation filter by optimizing a minimized objective function:

wherein, y_i,m-jIs the response value, y ', of the tracking target i of the m-j-th image'_i,m-jIs y_i,m-jThe ideal gaussian distribution of the total number of the particles,

is a heelThe correlation filter of the trace target i in the channel c of the mth frame image, and the weight value alpha_jThe influence factor of the training sample j is determined by a penalty function w, and the correlation filter of each channel is obtained through training

Response value y of tracking target i through m-th image_i,m(l) L is belonged to [0, L), find the maximum value y_i,m(l) Corresponding to l_p,i,m：

l_p,i,m＝argmax(y_i,m(l))l∈[0,L)

Wherein, L is the length of the feature vector of the one-dimensional continuous space;

will l_p,i,mPoints converted into two-dimensional feature vectors of channels

After being reduced into two-dimensional coordinates, the coordinates are mapped into coordinate points p under the current frame_i,m＝(x_p,i,m,y_p,i,m) I.e. tracking the target T for the ith frame in the mth frame image_iThe movement prediction center point of (a);

if tracking the target T_iIn a tracking state, only the candidate frames around the prediction position area are selected for subsequent target data matching:

setting tracking target T_iLength of the previous frame is length_i,m-1Width of_i,m-1I th tracking target T in m th frame image_iHas a movement prediction center point of p_i,m＝(x_p,i,m,y_p,i,m) The candidate frame center point of the ith tracking target in the mth frame image is c_i,m＝(l_i,m+lenth_i,m/2,w_i,m+width_i,m/2)i∈[1,K_m]And when the distance between the candidate frame central point and the mobile prediction central point meets the condition:

d(p_i,m,c_i,m)＝(x_p,i,m-l_i,m-lenth_i,m/2)²+(y_p,i,m-w_i,m-width_i,m/2)²＜min(lenth_i,m-1/2,width_i,m-1/2)

performing subsequent target data matching on the candidate frames meeting the conditions;

if tracking the target T_iIn the lost state, a candidate frame is selected to be screened near the position of the frame before it disappears:

taking the moving prediction central point t when the moving prediction central point disappears in the previous frame_i,m＝(x_t,i,m,y_t,i,m) Length of length_i,m-1Width of_i,m-1When the distance d (t) between the candidate frame center and the vanishing center_i,c_i,m) When the following conditions are satisfied:

d(t_i,m,c_i,m)＝(x_t,i,m-l_i,m-lenth_i,m/2)²+(y_t,i,m-w_i,m-width_i,m/2)²＜min(lenth_i,m-1/2,width_i,m-1/2)

if tracking the target T_iIn the unsuccessful matching tracking state, its candidate box center point may be updated using the moving predicted center point:

updating tracking target T_iThe candidate frame center point of (2) is a movement prediction center point p_i,m＝(x_p,i,m,y_p,i,m) The length of the candidate frame, the width of the candidate frame and the m-1 frame image are kept unchanged;

preferably, the candidate frame after screening in step 3 is a candidate frame screened according to the moving prediction center point in step 2;

the appearance similarity score in step 3 is specifically calculated as:

candidate frame D after screening of ith tracking target in mth frame image in step 2_i,mRemoving the connecting layer VGG16 network of the last layer of VGG16 to obtain the tracking target T in the mth frame image of N dimension_iAppearance feature vector of

Training in an end-to-end training mode through a training set given by the multi-target tracking public data set to respectively obtain an LSTM network with appearance characteristics and a first full connection layer FC 1;

will track the target T_iExtracting M N-dimensional appearance feature vectors by removing the VGG16 network of the last layer of the VGG16 from the data of the previous M frames of images, and then extracting N-dimensional combined historical appearance feature vectors by the LSTM network of the appearance features

Joint connection

And

through the first full connection layer FC1, the tracking target T is obtained_iAnd candidate frame D_i,m(ii) an appearance similarity score of S_A(T_i,D_i,m) If the target T is_iIf the image data of the previous frame is not generated, replacing the image data with a value of 0;

preferably, the motion similarity score in step 4 is calculated as:

step 2, the candidate frame D after screening of the ith tracking target in the mth frame image_i,mThe central point of (a) is:

(l_i,m+lenth_i,m/2,w_i,m+width_i,m/2)

target T is tracked by previous frame image_iThe center position of the candidate frame of (1) is:

(l_i,m-1+lenth_i,m-1/2,w_i,m-1+width_i,m-1/2)

the speed feature vector of the ith tracking target in the mth frame image is as follows:

training in an end-to-end training mode through a training set given by the multi-target tracking public data set to respectively obtain an LSTM network with speed characteristics and a second full connection layer FC 2;

extracting the speed characteristic vector of the ith tracking target in the M frames of images through an LSTM network of speed characteristics to obtain a motion characteristic vector of a joint history sequence

Joint connection

And

passing through the second fully-connected layer FC2, thereby tracking target T in a tracking state or a lost state_iAnd candidate frame D_i,mHas a motion similarity score of S_V(T_i,D_i,m) If the target T is_iIf the motion data of the previous frame is not generated, the motion data is replaced by a value of 0;

preferably, the interactive feature similarity score in step 5 is calculated as:

to screen the candidate frame D_i,mC of center coordinate_i,m＝(l_i,m+lenth_i,m/2,w_i,m+width_i,mAnd/2) establishing a fixed-size box with the length and the width H by taking the center as the center, and connecting the center coordinates c of the box with other candidate boxes_i',mThe coincident point is set as 1, the center of the fixed-size box is also set as 1, and the rest positions are set as 0, so that:

wherein the content of the first and second substances,

x∈[l_i,m+lenth_i,m/2-H/2,l_i,m+lenth_i,m/2+H/2]

y∈[w_i,m+width_i,m/2-H/2,w_i,m+width_i,m/2+H/2]

then will be

Conversion to length H²The one-dimensional vector of (1) to obtain an interactive feature vector of the candidate frame of

Training in an end-to-end training mode through a training set given by the multi-target tracking public data set to respectively obtain an LSTM network and a third full connection layer FC3 of interactive features;

with a target T_iEstablishing a frame with a fixed length and a fixed width H by taking the central coordinate of a certain frame of image as a center, setting a point which is superposed with the central coordinate of other tracking targets in the frame as 1, setting the center of the frame with the fixed length as 1, and setting the rest positions as 0 to obtain a target T_iIn the interactive feature vector of the frame, the target T is_iThe interactive feature vector of the previous M frames is extracted to a combined historical interactive feature vector through an LSTM network of interactive features

Association

And

through the third full connection layer FC3, T is obtained_iAnd D_i,mIs given by the interaction feature similarity score S_I(T_i,D_i,m) If the target T is_iIf the interactive feature vector of the previous frame is not generated, replacing the interactive feature vector with a value of 0;

preferably, the total similarity score in step 6 is:

S_total,i＝α₁S_A(T_i,D_i,m)+α₂S_V(T_i,D_i,m)+α₃S_I(T_i,D_i,m)

wherein alpha is₁Similarity coefficient for appearance feature，α₂Is a velocity feature similarity coefficient, alpha₃Is an interactive feature similarity coefficient;

the total similarity score is greater than the match score threshold S_total,iBeta. then candidate frame D_i,mConverting the image into a tracking frame of the tracking target in the m frames of images;

step 6, updating the state information of the tracking target through the step 2 to keep the tracking target in a tracking state, converting the tracking target in the tracking state which is not successfully matched by a plurality of continuous frames into a lost state, and not adopting the method in the step 2;

preferably, the overlapping degree between the tracking targets in step 8 is:

wherein A is a tracking target T_aArea of the tracking frame, B is the tracking target T_bFor the tracking target T with IOU > 0.8_aAnd tracking target T_bAccording to the total similarity score S obtained in the step 6_total,aAnd S_total,bComparing S with S_total,aAnd S_total,bThe lower tracking target is converted into a lost state and keeps S_total,aAnd S_total,bThe higher tracking target is in a tracking state;

preferably, the multi-frame image in step 9 is M_DAnd (5) frame.

Compared with the prior art, the invention has the following advantages and beneficial effects:

according to the method, an LSTM network frame is constructed according to the characteristic data of each target in the time sequence, so that the system can solve the problem of long-time shielding of the target, and the accuracy of target data matching is better improved by combining the characteristics of historical data;

the method combines the characteristics of the position, the appearance, the movement and the interaction of the tracking target, and adopts the convolution network to extract the appearance deep layer characteristic information and the shallow layer characteristic information of the object, so that the discrimination of the tracking target characteristics is improved; by using the direction and speed information of each frame of motion of the object, the accuracy of target matching is improved on the basis of the continuity characteristic of the motion information of the object; through the interaction characteristic information of the objects under the continuous frames, an interaction model is provided, and the acting force relation between the tracking target and other surrounding targets is analyzed, so that the matching accuracy is improved. The accuracy of target tracking is improved by using a multi-clue joint data matching mode;

and (3) calculating the moving position of the target under the current frame by adopting a rapid correlation filtering self-tracking method for each target, screening out a candidate frame conforming to a position area, and well reducing the calculation amount of a data correlation algorithm. The self-tracking algorithm can automatically track the tracking state target which is missed to be detected in the target detection, and the problem that the performance of the target detector is excessively depended on is solved.

Drawings

FIG. 1: the technical scheme of the invention is a general block diagram;

FIG. 2: a survival state diagram for a single target;

FIG. 3: an appearance characteristic model matching graph;

FIG. 4: matching graph of speed characteristic model;

FIG. 5: interactive feature model matching graphs;

FIG. 6: interactive characteristic LSTM network model matching graph;

FIG. 7: and (5) a system multi-target tracking schematic diagram.

Detailed Description

In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.

Embodiments of the present invention are described below with reference to fig. 1 to 6. The technical scheme of the embodiment is a multi-target tracking method based on time series multi-feature data association, which specifically comprises the following steps:

the frame image in the step 1 is the mth image, and the category number of the tracking target in the step 1 is N_mIn step 1, the candidate frame of the tracking target is:

in the step 2, the convolutional network is a VGG16 network pre-trained in an ImageNet classification task, and a first layer of feature vectors of a tracking target position frame are extracted through a VGG16 network;

two-dimensional feature vector through channel c

Converting into a feature vector of a one-dimensional continuous space:

wherein the content of the first and second substances,

L is the length of the eigenvector of the one-dimensional continuous space, and Channel 512 is the number of channels;

the convolution operator is:

wherein, y_i,mIs a response value of the tracking target i of the mth image,

the feature vectors of the one-dimensional continuum of channels c,

training the correlation filter through the training samples is:

wherein, y_i,m-jIs the response value of the tracking target i of the m-j-th image,y'_i,m-jis y_i,m-jThe ideal gaussian distribution of the total number of the particles,

for tracking the correlation filter of the target i in the channel c of the mth frame image, the weight value alpha_jThe influence factor of the training sample j is determined by a penalty function w, and the correlation filter of each channel is obtained through training

The number n of training samples is 30;

l_p,i,m＝argmax(y_i,m(l))l∈[0,L)

will l_p,i,mPoints converted into two-dimensional feature vectors of channels

the candidate frame after screening in the step 3 is the candidate frame screened according to the moving prediction center point in the step 2;

the appearance similarity score in step 3 is specifically calculated as:

candidate frame D of the ith tracking target in the mth frame image in the step 1_i,mBy removing VGG16One layer of connection layer VGG16 network obtains the tracking target T in the m frame image with the dimension of N being 1000_iAppearance feature vector of

Training in an end-to-end training mode through a training set given by a multi-target tracking public data set MOT17-Challenge to respectively obtain an LSTM network with appearance characteristics and a first full connection layer FC 1;

Joint connection

And

and 4, step 4: calculating the motion similarity scores of the tracking target and the candidate frame in the tracking state or the loss state;

the motion similarity score in step 4 is calculated as:

candidate frame D after screening of ith tracking target in mth frame image in step 2_i,mThe central point of (a) is:

(l_i,m+lenth_i,m/2,w_i,m+width_i,m/2)

(l_i,m-1+lenth_i,m-1/2,w_i,m-1+width_i,m-1/2)

training by a training set given by a multi-target tracking public data set MOT17-Challenge in an end-to-end training mode to respectively obtain an LSTM network and a second full connection layer FC2 of speed characteristics;

Joint connection

And

and 5: calculating the interactive feature similarity scores of the tracking target and the candidate frames in the tracking state or the loss state;

in step 5, the interactive feature similarity score is calculated as:

to screen the candidate frame D_i,mC of center coordinate_i,m＝(l_i,m+lenth_i,m/2,w_i,m+width_i,mAnd/2) establishing a fixed-size box with the length and the width H by taking the center as the center, and connecting the center coordinates c of the box with other candidate boxes_i',mThe coincident point is set to be 1, and the center of the square frame with fixed sizeAlso set to 1, the remaining positions are 0, resulting in:

wherein the content of the first and second substances,

x∈[l_i,m+lenth_i,m/2-H/2,l_i,m+lenth_i,m/2+H/2]

y∈[w_i,m+width_i,m/2-H/2,w_i,m+width_i,m/2+H/2]

then will be

Training by a training set given by a multi-target tracking public data set MOT17-Challenge in an end-to-end training mode to respectively obtain an LSTM network with interactive characteristics and a third full connection layer FC 3;

with a target T_iEstablishing a fixed-size frame with the length and width H being 300 by taking the central coordinate of a certain frame image as the center, setting the point which is superposed with the central coordinate of other tracking targets in the frame as 1, setting the center of the fixed-size frame as 1, and setting the rest positions as 0 to obtain a target T_iIn the interactive feature vector of the frame, the target T is_iThe interactive feature vector of the previous M frames is extracted to a combined historical interactive feature vector through an LSTM network of interactive features

Association

And

through the thirdFull connection layer FC3, giving T_iAnd D_i,mIs given by the interaction feature similarity score S_I(T_i,D_i,m) If the target T is_iIf the interactive feature vector of the previous frame is not generated, replacing the interactive feature vector with a value of 0;

the total similarity score in step 6 is:

S_total,i＝α₁S_A(T_i,D_i,m)+α₂S_V(T_i,D_i,m)+α₃S_I(T_i,D_i,m)

wherein alpha is₁Is an appearance feature similarity coefficient, alpha₂Is a velocity feature similarity coefficient, alpha₃Is an interactive feature similarity coefficient;

in step 8, the overlapping degree between the tracking targets is as follows:

and step 9: and recognizing the tracking target continuously in the lost state in the multi-frame image as the disappeared target, storing the data information of the tracking state, and not performing data matching operation on the tracking target.

In step 9, the multi-frame image is M_D30 frames of images.

It should be understood that parts of the specification not set forth in detail are well within the prior art.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A multi-target tracking method based on time series multi-feature fusion is characterized by comprising the following steps:

2. The multi-target tracking method based on time series multi-feature fusion according to claim 1, characterized in that: the frame image in the step 1 is the mth image, and the category number of the tracking target in the step 1 is N_mIn step 1, the candidate frame of the tracking target is:

wherein, K_mNumber of candidate frames for tracking target in mth frame image, l_i,mCoordinates of the start point of the X axis of the frame candidate for the i-th tracking target in the m-th frame image, w_i,mLength as the coordinates of the start point of the Y axis of the frame candidate of the ith tracking target in the mth frame image_i,mWidth for the length of the i-th tracked target candidate frame in the m-th frame image_i,mThe width of the candidate frame for the i-th tracking target in the m-th frame image.

3. The multi-target tracking method based on time series multi-feature fusion as claimed in claim 2, characterized in that: in the step 2, the convolutional network is a VGG16 network pre-trained in an ImageNet classification task, and a first layer of feature vectors of a tracking target position frame are extracted through a VGG16 network;

two-dimensional feature vector through channel c

Converting into a feature vector of a one-dimensional continuous space:

wherein the content of the first and second substances,

the convolution operator is:

wherein, y_i,mIs a response value of the tracking target i of the mth image,

the feature vectors of the one-dimensional continuum of channels c,

training the correlation filter through the training samples is:

given n training sample pairs { (y)_i,m-j,y′_i,m-j)},j∈[1,n]And training to obtain a correlation filter by minimizing the optimization of an objective function:

l_p,i,m＝argmax(y_i,m(l))l∈[0,L)

will l_p,i,mPoints converted into two-dimensional feature vectors of channels

updating tracking target T_iThe candidate frame center point of (2) is a movement prediction center point p_i,m＝(x_p,i,m,y_p,i,m) The length of the candidate frame, the width of the candidate frame and the m-1 frame image remain unchanged.

4. The multi-target tracking method based on time series multi-feature fusion according to claim 1, characterized in that: the candidate frame after screening in the step 3 is the candidate frame screened according to the moving prediction center point in the step 2;

the appearance similarity score in step 3 is specifically calculated as:

marking the candidate frame after calculation and screening in the step 2 as D_i,mRemoving the connecting layer VGG16 network of the last layer of VGG16 to obtain the tracking target T in the mth frame image of N dimension_iAppearance feature vector of

will track the target T_iExtracting M N-dimensional appearance feature vectors from the data of the previous M frames of images through the same VGG16 network of the last layer of the connection layer without VGG16, and then extracting N-dimensional combined historical appearance feature vectors through the LSTM network of appearance features

Joint connection

And

through the first full connection layer FC1, the tracking target T is obtained_iAnd candidate frame D_i,m(ii) an appearance similarity score of S_A(T_i,D_i,m) If the target T is tracked_iIf the image data of the previous frame is not generated, the value of 0 is substituted.

5. The multi-target tracking method based on time series multi-feature fusion as claimed in claim 2, characterized in that: the motion similarity score in step 4 is calculated as:

setting a candidate frame D after calculation and screening in the step 2_i,mThe central point of (a) is:

(l_i,m+lenth_i,m/2,w_i,m+width_i,m/2)

(l_i,m-1+lenth_i,m-1/2,w_i,m-1+width_i,m-1/2)

Joint connection

And

passing through the second fully-connected layer FC2, thereby tracking target T in a tracking state or a lost state_iAnd candidate frame D_i,mHas a motion similarity score of S_V(T_i,D_i,m) If the target T is tracked_iIf the motion data of the previous frame is not generated, the motion data is replaced by a value of 0.

6. The multi-target tracking method based on time series multi-feature fusion as claimed in claim 2, characterized in that: in step 5, the interactive feature similarity score is calculated as:

wherein the content of the first and second substances,

x∈[l_i,m+lenth_i,m/2-H/2,l_i,m+lenth_i,m/2+H/2]

y∈[w_i,m+width_i,m/2-H/2,w_i,m+width_i,m/2+H/2]

then will be

Training in an end-to-end training mode through a training set given by the multi-target tracking public data set to respectively obtain an LSTM network with interactive characteristics and a third full connection layer FC 3;

to track a target T_iEstablishing a frame with a fixed length and a fixed width H by taking the central coordinate of a certain frame of image as a center, setting a point which is superposed with the central coordinate of other tracking targets in the frame as 1, setting the center of the frame with the fixed length as 1, and setting the rest positions as 0 to obtain a tracking target T_iIn the interactive feature vector of the frame, the target T is tracked_iThe interactive feature vector of the previous M frames is extracted to a combined historical interactive feature vector through an LSTM network of interactive features

Association

And

through the third full connection layer FC3, T is obtained_iAnd D_i,mIs given by the interaction feature similarity score S_I(T_i,D_i,m) If the target T is tracked_iIf the interactive feature vector of the previous frame is not generated, the value of 0 is substituted.

7. The multi-target tracking method based on time series multi-feature fusion according to claim 1, characterized in that: the total similarity score in step 6 is:

S_total,i＝α₁S_A(T_i,D_i,m)+α₂S_V(T_i,D_i,m)+α₃S_I(T_i,D_i,m)

wherein alpha is₁Is an appearance feature similarity coefficient, alpha₂Is a velocity feature similarity coefficient, alpha₃Is the similarity coefficient of the interactive features, and S_A(T_i,D_i,m)、S_V(T_i,D_i,m)、S_I(T_i,D_i,m) Respectively obtaining an appearance similarity score, a motion similarity score and an interaction feature similarity score according to the steps 3-5;

and 6, updating the state information of the tracking target through the step 2 to keep the tracking target in a tracking state, converting the tracking target in the tracking state which is not successfully matched by the continuous multiple frames into a lost state, and not adopting the method in the step 2.

8. The multi-target tracking method based on time series multi-feature fusion according to claim 1, characterized in that: in step 8, the overlapping degree of each tracking target is as follows:

wherein A is a tracking target T_aArea of the tracking frame, B is the tracking target T_bFor the tracking target T with IOU > 0.8_aAnd tracking target T_bAccording to the total similarity score S obtained in the step 6_total,aAnd S_total,bComparing S with S_total,aAnd S_total,bThe lower tracking target is converted into a lost state and keeps S_total,aAnd S_total,bThe higher tracking target is in a tracking state.

9. The multi-target tracking method based on time series multi-feature fusion according to claim 1, characterized in that: in step 9, the multi-frame image is M_DAnd (5) frame.