CN112862858A - Multi-target tracking method based on scene motion information - Google Patents

Multi-target tracking method based on scene motion information

Info

Publication number
CN112862858A
Authority
CN
China
Prior art keywords
frame
target
detection
scene
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110047457.8A
Other languages
Chinese (zh)
Inventor
刘勇
翟光耀
孔昕
崔金浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110047457.8A
Publication of CN112862858A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G06T 2207/10044 Radar image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention relates to a multi-target tracking method based on scene motion information, comprising the following steps. S1: construct a multi-target tracking system comprising a detection front-end module, a motion estimation module and a motion tracking module. S3: the detection front-end module acquires scene information and outputs the detection results of the previous and next frames. S4: the data are input to a pre-trained motion estimation module, preprocessed, and used to compute a point-by-point motion estimate between the previous and next frames. S5: the motion tracking module computes the mean offset of each target detection frame, predicts the position of each detection frame in the next frame, performs matching based on the Hungarian algorithm, and passes the code of each successfully matched target's previous-frame detection result to the same target's next-frame detection frame, yielding the multi-target tracks between the two frames. The method is applicable to a wide range of scenes, computationally simple, accurate and fast, and requires no hyper-parameter tuning.

Description

Multi-target tracking method based on scene motion information
Technical Field
The invention belongs to the technical field of information and communication, and particularly relates to a multi-target tracking method based on scene motion information.
Background
Multi-target tracking is a long-standing and challenging problem that aims to locate objects in a video sequence and assign a consistent identity to each instance across frames. Many vision applications, such as autonomous driving, robot collision prediction and video face alignment, rely on multi-target tracking as a key component. Recently, progress in research on three-dimensional object detection has further advanced multi-target tracking.
Most end-to-end multi-target tracking methods suffer from low accuracy and poor generalization. Although traditional filter-based approaches can achieve better results, it is difficult to choose optimal hyper-parameters for them, and they often fail in challenging cases.
To have better applicability, a multi-target tracking method needs to satisfy the following conditions:
(1) it is applicable to a wide range of scenes, including challenging scenes where traditional methods cannot be used;
(2) its computation is simple, accurate and fast;
(3) it avoids hyper-parameter tuning.
At present, no multi-target tracking method satisfies all of these requirements at the same time.
Disclosure of Invention
In view of the above shortcomings of the prior art, the invention aims to provide a multi-target tracking method based on scene motion information that is applicable to a wide range of scenes, computationally simple, accurate, fast, and free of hyper-parameter tuning.
The purpose of the invention can be realized by the following technical scheme:
a multi-target tracking method based on scene motion information is characterized by comprising the following steps:
step S1: constructing a multi-target tracking system; the multi-target tracking system comprises a detection front-end module, a motion estimation module and a motion tracking module; the detection front-end module comprises a laser radar sensor;
step S2: taking the 1 st frame as the previous frame, proceeding to steps S3 to S5;
step S3: the detection front-end module acquires scene information, calculates by adopting a PointRCNN method and outputs detection results of a previous frame and a next frame; the scene information comprises original point cloud information;
step S4: inputting original point cloud information of a previous frame and a next frame obtained by a front-end detection module into a pre-trained motion estimation module, preprocessing the original point cloud information, and calculating by using a FlowNet3D method to obtain point-by-point motion information estimation of the previous frame and the next frame;
step S5: the point cloud of the previous frame preprocessed by the motion estimation module and the point-by-point motion estimate obtained by the motion estimation module are input to the motion tracking module, which computes the mean offset (Δx_n, Δy_n, Δz_n, Δθ_n) of each target detection frame;
Δθ_n is the angular offset of the nth target detection frame and is computed with a constant angular velocity model; (Δx_n, Δy_n, Δz_n) is the offset of the center three-dimensional coordinates of the nth target detection frame and is computed as:
(Δx_n, Δy_n, Δz_n) = (1/C) Σ_{i=1..C} (δx_i, δy_i, δz_i)
where C is the total number of laser points on the nth target detection frame and (δx_i, δy_i, δz_i) is the motion flow of the ith single point on the nth target detection frame;
the position of each target detection frame in the next frame is predicted as:
(X_pre, Y_pre, Z_pre) = (X, Y, Z) + (ΔX, ΔY, ΔZ)
Θ_pre = Θ + ΔΘ
where (X_pre, Y_pre, Z_pre) is the set of predicted center three-dimensional coordinates of all target detection frames in the next frame, and (X, Y, Z) is the set of center three-dimensional coordinates of all target detection frames in the previous frame; (ΔX, ΔY, ΔZ) is the set of center coordinate offsets Δx_n, Δy_n, Δz_n of all target detection frames between the two frames; Θ_pre is the set of predicted direction angles of all target detection frames in the next frame, and Θ is the set of direction angles of all target detection frames in the previous frame; ΔΘ is the set of angular offsets Δθ_n of all target detection frames;
the predicted detection frames are matched with the actual next-frame detections using the Hungarian algorithm, and the code of each successfully matched target's previous-frame actual detection result is passed to the same target's next-frame detection frame, yielding the multi-target tracks between the two frames.
Preferably, the method further comprises the following steps:
step S6: taking frames 2 to N in turn as the previous frame and repeating steps S3-S5 to obtain N sets of two-frame multi-target tracks;
step S7: connecting the N sets of two-frame multi-target tracks in order of frame number to obtain the multi-target tracks from frame 1 to frame N+1;
n is an integer of 2 or more.
Preferably, the step S4 of performing data preprocessing on the original point cloud information includes the following steps:
step B1: removing the ground by fitting a plane normal vector;
step B2: and adjusting the field angle data of the laser radar according to the calibration relation between the laser and the camera.
Preferably, the obtaining of the pre-trained motion estimation module in step S4 includes the following steps:
step A1: synthesizing a scene flow from the disparity maps and depth maps of the FlyingThings3D standard dataset, taking the scene flow as the label files, and extracting part of the dataset and the corresponding label files to obtain a FlyingThings3D-based training set;
step A2: synthesizing a scene flow by using a KITTI scene flow data segment and a disparity map, taking the scene flow as a label file, and extracting all data segments and all label files to obtain a training set based on the KITTI scene flow;
step A3: training the motion estimation module by using a training set based on Flyingthings3D, and updating iteration parameters of the module to make the output of the module converge to a first preset threshold value; and globally adjusting the motion estimation module by using a training set based on KITTI scene flow, and updating iteration parameters to make the network prediction error converge to a second preset threshold value.
Preferably, step a3 includes:
training a motion estimation module by using a training set based on Flyingthings3D, taking two adjacent frame RGB-D picture pairs corresponding to timestamps in the training set based on Flyingthings3D as input of network pre-training, updating iteration parameters of the module, and enabling the output of the module to converge to a first preset threshold; and globally adjusting the motion estimation module by using the training set based on the KITTI scene flow, taking two adjacent frames of laser point clouds corresponding to the timestamps in the training set based on the KITTI scene flow as the input of the globally adjusting training of the network, updating the iteration parameters, and converging the network prediction error to a second preset threshold value.
Preferably, the step S3 of detecting the detection result output by the front-end module includes:
B={bi|i=1…N}
b={c,x,y,z,l,w,h,θ}
where B is the set of all target detection results in one frame of point cloud, b_i is the detection result of the ith target in the frame of point cloud, N is the number of all targets detected in the frame of point cloud, b is the detection result of a single target in the frame of point cloud, c is the attribute of the target, x, y and z are the center three-dimensional coordinates of the detection frame, l, w and h are the length, width and height of the detection frame, and θ is the direction angle of the detection frame.
Preferably, the attributes c of the target include: cars, pedestrians, riders.
Preferably, the calculation process of the FlowNet3D method in the step S4 includes:
the geometric and color information of the point clouds of the previous and next frames is encoded to obtain high-dimensional per-point features of the two frames; the high-dimensional features are encoded in cascade to obtain inter-frame motion features; and the inter-frame motion features are upsampled and decoded to obtain the point-by-point motion estimate between the previous and next frames.
Preferably, the lidar sensor of step S1 includes a 64-line lidar sensor.
Compared with the prior art, the invention has the following beneficial effects:
the invention introduces scene flow estimation based on learning into a three-dimensional multi-target tracking task for the first time, and updates the prediction of a small track by utilizing the motion consistency in a three-dimensional space, thereby avoiding the trouble of adjusting hyper-parameters and the problem of a constant motion model inherent in the traditional filter-based method.
In the multi-target tracking method based on scene motion information, the raw point clouds of the previous and next frames obtained by the detection front-end module are input into a pre-trained motion estimation module, the raw point clouds are preprocessed, and the point-by-point motion estimate between the previous and next frames is then computed with the FlowNet3D method.
Before the motion estimation module computes the point-by-point motion estimate between the two frames, the method preprocesses the raw point cloud, specifically by fitting a plane normal vector to remove the ground and cropping the lidar data to the camera field of view according to the laser-camera calibration. This brings at least three benefits: (1) strong real-time performance and reduced computation cost; (2) no dependence on the size of the training data, strong generalization, and the ability to handle most data; (3) high evaluation indices in testing (especially sAMOTA and AMOTA) and good tracking performance.
Experiments on the KITTI MOT dataset show that the invention is competitive with the most advanced existing methods and can be used in challenging scenarios where traditional methods typically cannot.
Drawings
Fig. 1 is a schematic flow chart of a multi-target tracking method based on scene motion information according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and the described embodiments are only some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
A multi-target tracking method based on scene motion information comprises the following steps:
step S1: constructing a multi-target tracking system; the multi-target tracking system comprises a detection front-end module, a motion estimation module and a motion tracking module; the detection front-end module comprises a 64-line laser radar sensor;
step S2: taking the 1 st frame as the previous frame, proceeding to steps S3 to S5;
step S3: the detection front-end module acquires scene information, calculates by adopting a PointRCNN method and outputs detection results of a previous frame and a next frame; the scene information comprises original point cloud information;
the detection result output by the detection front-end module comprises the following steps:
B={bi|i=1…N}
b={c,x,y,z,l,w,h,θ}
where B is the set of all target detection results in one frame of point cloud, b_i is the detection result of the ith target in the frame of point cloud, N is the number of all targets detected in the frame of point cloud, b is the detection result of a single target in the frame of point cloud, c is the attribute of the target, x, y and z are the center three-dimensional coordinates of the detection frame, l, w and h are the length, width and height of the detection frame, and θ is the direction angle of the detection frame. The attributes c of the target include: cars, pedestrians and riders.
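For illustration only, below is a minimal sketch of one way the per-frame detection output B = {b_i | i = 1…N} and the single-target result b = {c, x, y, z, l, w, h, θ} could be held in code; the class and field names are assumptions made for readability and are not prescribed by the method, and PointRCNN itself is not shown.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    """Single-target detection b = {c, x, y, z, l, w, h, theta}."""
    c: str        # target attribute: "car", "pedestrian" or "rider"
    x: float      # detection-frame center coordinates in the lidar frame
    y: float
    z: float
    l: float      # detection-frame length
    w: float      # detection-frame width
    h: float      # detection-frame height
    theta: float  # direction angle of the detection frame

# B: the set of all detections in one point-cloud frame
FrameDetections = List[Detection]
```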
Step S4: inputting original point cloud information of a previous frame and a next frame obtained by a front-end detection module into a pre-trained motion estimation module, preprocessing the original point cloud information, and calculating by using a FlowNet3D method to obtain point-by-point motion information estimation of the previous frame and the next frame;
the method for obtaining the pre-trained motion estimation module comprises the following steps:
step A1: synthesizing a scene flow from the disparity maps and depth maps of the FlyingThings3D standard dataset, taking the scene flow as the label files, and extracting part of the dataset and the corresponding label files to obtain a FlyingThings3D-based training set;
step A2: synthesizing a scene flow by using a KITTI scene flow data segment and a disparity map, taking the scene flow as a label file, and extracting all data segments and all label files to obtain a training set based on the KITTI scene flow;
step A3: training a motion estimation module by using a training set based on Flyingthings3D, taking two adjacent frame RGB-D picture pairs corresponding to timestamps in the training set based on Flyingthings3D as input of network pre-training, updating iteration parameters of the module, and enabling the output of the module to converge to a first preset threshold; and globally adjusting the motion estimation module by using the training set based on the KITTI scene flow, taking two adjacent frames of laser point clouds corresponding to the timestamps in the training set based on the KITTI scene flow as the input of the globally adjusting training of the network, updating the iteration parameters, and converging the network prediction error to a second preset threshold value.
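A schematic sketch of the two-stage schedule of step A3 follows, assuming a generic PyTorch scene-flow network and pre-built loaders for the FlyingThings3D-based and KITTI-scene-flow-based training sets; the loader names, the MSE loss and the convergence test are illustrative assumptions rather than details given in the text.

```python
import torch

def train_until(model, loader, optimizer, loss_fn, threshold, max_epochs=100):
    """Update iteration parameters until the mean loss converges below `threshold`."""
    for _ in range(max_epochs):
        total, count = 0.0, 0
        for frame1, frame2, flow_gt in loader:     # adjacent-frame pair + ground-truth scene flow
            optimizer.zero_grad()
            flow_pred = model(frame1, frame2)      # point-wise motion estimate
            loss = loss_fn(flow_pred, flow_gt)
            loss.backward()
            optimizer.step()
            total, count = total + loss.item(), count + 1
        if total / max(count, 1) < threshold:      # "converge to the preset threshold"
            return model
    return model

# Step A3 as described: pre-train on the FlyingThings3D-based set, then fine-tune
# (global adjustment) on the KITTI-scene-flow-based set. The names flow_net,
# flyingthings_loader and kitti_sf_loader are placeholders for objects built elsewhere.
#
# optimizer = torch.optim.Adam(flow_net.parameters(), lr=1e-3)
# flow_net = train_until(flow_net, flyingthings_loader, optimizer,
#                        torch.nn.functional.mse_loss, first_threshold)
# flow_net = train_until(flow_net, kitti_sf_loader, optimizer,
#                        torch.nn.functional.mse_loss, second_threshold)
```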
The data preprocessing of the original point cloud information comprises the following steps:
step B1: removing the ground by fitting a plane normal vector;
step B2: and adjusting the field angle data of the laser radar according to the calibration relation between the laser and the camera.
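One possible implementation sketch of steps B1 and B2: a RANSAC fit of the dominant near-horizontal plane removes the ground, and a projection test against the camera model crops the lidar data to the camera field of view. The distance threshold, the horizontality test on the plane normal and the KITTI-style 4x4 extrinsics / 3x3 intrinsics convention are assumptions for illustration.

```python
import numpy as np

def remove_ground(points, dist_thresh=0.2, iters=100, seed=0):
    """Step B1 (sketch): RANSAC-fit the dominant near-horizontal plane and drop its inliers.

    points: (N, 3) lidar points of one frame (assumes the lidar z axis points up, as in KITTI).
    Returns the non-ground points."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-6:
            continue
        normal /= norm
        if abs(normal[2]) < 0.9:                    # only accept near-horizontal candidate planes
            continue
        inliers = np.abs((points - p0) @ normal) < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[~best_inliers]

def crop_to_camera_fov(points, T_cam_lidar, K, image_size):
    """Step B2 (sketch): keep only points that project into the camera image.

    T_cam_lidar: 4x4 lidar-to-camera extrinsics; K: 3x3 camera intrinsics."""
    h, w = image_size
    cam = (T_cam_lidar @ np.c_[points, np.ones(len(points))].T).T[:, :3]
    front = cam[:, 2] > 0.1                         # drop points behind the camera
    points, cam = points[front], cam[front]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    keep = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return points[keep]
```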
The calculation process of the FlowNet3D method comprises the following steps:
the geometric and color information of the point clouds of the previous and next frames is encoded to obtain high-dimensional per-point features of the two frames; the high-dimensional features are encoded in cascade to obtain inter-frame motion features; and the inter-frame motion features are upsampled and decoded to obtain the point-by-point motion estimate between the previous and next frames.
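As a purely illustrative aid (not the FlowNet3D architecture itself), the toy module below mirrors the three stages named above: per-point encoding, mixing features across frames into an inter-frame motion feature, and decoding to a point-wise flow. The real FlowNet3D uses set-conv layers, a learned flow-embedding layer and set-upconv upsampling; here simple MLPs and a nearest-neighbor correspondence stand in for them.

```python
import torch
import torch.nn as nn

class TinySceneFlow(nn.Module):
    """Toy stand-in for the encode / inter-frame mixing / decode stages described above."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))
        self.embed = nn.Sequential(nn.Linear(2 * feat_dim + 3, feat_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, 3))

    def forward(self, pc1, pc2):
        # pc1: (N, 3) previous-frame points, pc2: (M, 3) next-frame points
        f1, f2 = self.encoder(pc1), self.encoder(pc2)      # per-point high-dimensional features
        d = torch.cdist(pc1, pc2)                          # cross-frame point distances
        nn_idx = d.argmin(dim=1)                           # nearest next-frame point for each point
        mixed = torch.cat([f1, f2[nn_idx], pc2[nn_idx] - pc1], dim=1)  # inter-frame motion feature
        return self.decoder(self.embed(mixed))             # point-wise flow, shape (N, 3)
```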
Step S5: the point cloud of the previous frame preprocessed by the motion estimation module and the point-by-point motion estimate obtained by the motion estimation module are input to the motion tracking module, which computes the mean offset (Δx_n, Δy_n, Δz_n, Δθ_n) of each target detection frame;
Δθ_n is the angular offset of the nth target detection frame and is computed with a constant angular velocity model; (Δx_n, Δy_n, Δz_n) is the offset of the center three-dimensional coordinates of the nth target detection frame and is computed as:
(Δx_n, Δy_n, Δz_n) = (1/C) Σ_{i=1..C} (δx_i, δy_i, δz_i)
where C is the total number of laser points on the nth target detection frame and (δx_i, δy_i, δz_i) is the motion flow of the ith single point on the nth target detection frame;
the position of each target detection frame in the next frame is predicted as:
(X_pre, Y_pre, Z_pre) = (X, Y, Z) + (ΔX, ΔY, ΔZ)
Θ_pre = Θ + ΔΘ
where (X_pre, Y_pre, Z_pre) is the set of predicted center three-dimensional coordinates of all target detection frames in the next frame, and (X, Y, Z) is the set of center three-dimensional coordinates of all target detection frames in the previous frame; (ΔX, ΔY, ΔZ) is the set of center coordinate offsets Δx_n, Δy_n, Δz_n of all target detection frames between the two frames; Θ_pre is the set of predicted direction angles of all target detection frames in the next frame, and Θ is the set of direction angles of all target detection frames in the previous frame; ΔΘ is the set of angular offsets Δθ_n of all target detection frames;
the predicted detection frames are matched with the actual next-frame detections using the Hungarian algorithm, and the code of each successfully matched target's previous-frame actual detection result is passed to the same target's next-frame detection frame, yielding the multi-target tracks between the two frames.
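A hedged sketch of the step S5 computation follows, under simplifying assumptions: boxes are plain (x, y, z, l, w, h, theta) arrays, points are assigned to a box by an oriented-box inclusion test, the constant-angular-velocity term dtheta is supplied by the caller (e.g. the angle change observed over the previous two frames), the association cost is the center distance between predicted and detected boxes, and scipy's linear_sum_assignment serves as the Hungarian solver; the gating threshold max_dist is an illustrative value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_mean_offsets(boxes_prev, points_prev, flow):
    """Average the per-point scene flow inside each previous-frame detection box.

    boxes_prev: (N, 7) array of [x, y, z, l, w, h, theta]
    points_prev: (P, 3) preprocessed previous-frame points
    flow: (P, 3) point-wise motion from the motion estimation module
    Returns (N, 3) mean offsets (dx, dy, dz); zero when a box contains no points."""
    offsets = np.zeros((len(boxes_prev), 3))
    for n, (x, y, z, l, w, h, theta) in enumerate(boxes_prev):
        # rotate point offsets into the box frame so the inside test is axis-aligned
        c, s = np.cos(-theta), np.sin(-theta)
        local = (points_prev[:, :2] - [x, y]) @ np.array([[c, -s], [s, c]]).T
        inside = (np.abs(local[:, 0]) < l / 2) & (np.abs(local[:, 1]) < w / 2) \
                 & (np.abs(points_prev[:, 2] - z) < h / 2)
        if inside.any():
            offsets[n] = flow[inside].mean(axis=0)   # (1/C) * sum of point-wise flow
    return offsets

def predict_and_match(boxes_prev, boxes_next, offsets, dtheta, max_dist=2.0):
    """Predict previous boxes into the next frame and match them to next-frame detections."""
    pred = boxes_prev.copy()
    pred[:, :3] += offsets                           # (X, Y, Z) + (dX, dY, dZ)
    pred[:, 6] += dtheta                             # theta + dtheta (constant angular velocity)
    cost = np.linalg.norm(pred[:, None, :3] - boxes_next[None, :, :3], axis=2)
    rows, cols = linear_sum_assignment(cost)         # Hungarian matching
    # keep only plausible pairs; each pair hands the previous track code to the next-frame box
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]
```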
Step S6: taking frames 2 to N in turn as the previous frame and repeating steps S3-S5 to obtain N sets of two-frame multi-target tracks;
step S7: connecting the N sets of two-frame multi-target tracks in order of frame number to obtain the multi-target tracks from frame 1 to frame N+1;
n is an integer greater than or equal to 2;
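A small sketch of how the N two-frame matchings from step S5 could be chained into tracks spanning frames 1 to N+1 (steps S6 and S7); the data layout is an assumption made for illustration.

```python
def chain_tracks(pairwise_matches, num_frames):
    """Chain per-frame-pair matches (step S5 output) into full multi-frame tracks (steps S6-S7).

    pairwise_matches[k] lists the (prev_box_idx, next_box_idx) pairs between frame k and k+1."""
    next_id = 0
    ids = {}                                          # (frame index, box index) -> track code
    for k in range(num_frames - 1):
        for prev_idx, next_idx in pairwise_matches[k]:
            if (k, prev_idx) not in ids:              # start a new track at its first appearance
                ids[(k, prev_idx)] = next_id
                next_id += 1
            ids[(k + 1, next_idx)] = ids[(k, prev_idx)]   # hand the code to the matched next-frame box
    return ids
```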
according to the method, most points in each frame of point cloud data belong to the ground, and the ground in a KITTIMOT data set mostly belongs to a horizontal plane, so that the ground can be fitted and removed by calculating the maximum plane normal vector. The benefit of this method versus fitting the ground by a learning method is: (1) the real-time performance is strong, and the calculation cost is saved; (2) the method is not limited by the size of training data, has strong generalization and can process most of data; (3) the evaluation index (especially sAMOTA, AMOTA) obtained by the test is high, and the tracking effect is good.
TABLE 1 Comparison of the proposed method with the mmMOT, FANTrack and AB3DMOT methods on the KITTI MOT standard dataset (10 Hz)

Method            Algorithm   Data    sAMOTA   AMOTA   MOTA    AMOTP
mmMOT             Learning    2D+3D   63.91    24.91   51.91   67.32
FANTrack          Learning    2D+3D   62.72    24.71   49.19   66.06
AB3DMOT           Filtering   3D      69.81    27.26   57.06   67.00
Proposed method   Hybrid      3D      74.37    29.78   63.53   67.03
The mmMOT, FANTrack and AB3DMOT methods are among the best performers in the current multi-target tracking field. mmMOT uses an off-the-shelf detector that combines image and point cloud information, performs data association by fusing multi-modal features and learning an adjacency matrix between objects, and runs a global optimization via linear programming; it is an offline method, whereas the present method is an online method. FANTrack designs a deep association network so that a neural network replaces the traditional Hungarian algorithm for data association. AB3DMOT is a traditional Kalman-filter-based three-dimensional multi-target tracking method; although its accuracy and running speed are impressive, a significant drawback is that it only considers the bounding box of each detection and ignores the motion continuity inside the point cloud. Moreover, because of its hand-crafted motion model, the Kalman filter requires frequent hyper-parameter tuning and is sensitive to scene properties such as frame rate. Table 1 compares the present method with these methods using the evaluation indices sAMOTA, AMOTA, AMOTP and MOTA, which are widely accepted in the multi-target tracking field. When tracking the "car" category of the KITTI MOT standard dataset (10 Hz) with an evaluation threshold of 0.7, the present method outperforms both the learning-based methods and the traditional filtering method on the important indices (sAMOTA and MOTA). In addition, simulated high-speed data (5 Hz) can be obtained by downsampling the KITTI MOT standard dataset to halve its frame rate. On these data the present method maintains its tracking performance, while the existing methods fail more often. Table 2 compares the present method with AB3DMOT when tracking the "car" category of the KITTI MOT standard dataset (5 Hz) with an evaluation threshold of 0.7; the present method performs better on all four indices.
TABLE 2 Comparison of the proposed method with the AB3DMOT method on the KITTI MOT standard dataset (5 Hz)

Method            Data   sAMOTA   AMOTA   AMOTP   MOTA
AB3DMOT           3D     56.71    18.96   58.00   45.25
Proposed method   3D     72.42    27.89   64.82   60.03
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims (9)

1. A multi-target tracking method based on scene motion information is characterized by comprising the following steps:
step S1: constructing a multi-target tracking system; the multi-target tracking system comprises a detection front-end module, a motion estimation module and a motion tracking module; the detection front-end module comprises a laser radar sensor;
step S2: taking the 1 st frame as the previous frame, proceeding to steps S3 to S5;
step S3: the detection front-end module acquires scene information, calculates by adopting a PointRCNN method and outputs detection results of a previous frame and a next frame; the scene information comprises original point cloud information;
step S4: inputting original point cloud information of a previous frame and a next frame obtained by a front-end detection module into a pre-trained motion estimation module, preprocessing the original point cloud information, and calculating by using a FlowNet3D method to obtain point-by-point motion information estimation of the previous frame and the next frame;
step S5: the point cloud of the previous frame preprocessed by the motion estimation module and the point-by-point motion estimate obtained by the motion estimation module are input to the motion tracking module, which computes the mean offset (Δx_n, Δy_n, Δz_n, Δθ_n) of each target detection frame;
Δθ_n is the angular offset of the nth target detection frame and is computed with a constant angular velocity model; (Δx_n, Δy_n, Δz_n) is the offset of the center three-dimensional coordinates of the nth target detection frame and is computed as:
(Δx_n, Δy_n, Δz_n) = (1/C) Σ_{i=1..C} (δx_i, δy_i, δz_i)
where C is the total number of laser points on the nth target detection frame and (δx_i, δy_i, δz_i) is the motion flow of the ith single point on the nth target detection frame;
the position of each target detection frame in the next frame is predicted as:
(X_pre, Y_pre, Z_pre) = (X, Y, Z) + (ΔX, ΔY, ΔZ)
Θ_pre = Θ + ΔΘ
where (X_pre, Y_pre, Z_pre) is the set of predicted center three-dimensional coordinates of all target detection frames in the next frame, and (X, Y, Z) is the set of center three-dimensional coordinates of all target detection frames in the previous frame; (ΔX, ΔY, ΔZ) is the set of center coordinate offsets Δx_n, Δy_n, Δz_n of all target detection frames between the two frames; Θ_pre is the set of predicted direction angles of all target detection frames in the next frame, and Θ is the set of direction angles of all target detection frames in the previous frame; ΔΘ is the set of angular offsets Δθ_n of all target detection frames;
and the code of the previous-frame actual detection result of each successfully matched target is respectively passed to the next-frame detection frame of the same target, to obtain the multi-target tracks between the two frames.
2. The multi-target tracking method based on scene motion information as claimed in claim 1, further comprising the steps of:
step S6: taking frames 2 to N in turn as the previous frame and repeating steps S3-S5 to obtain N sets of two-frame multi-target tracks;
step S7: connecting the N sets of two-frame multi-target tracks in order of frame number to obtain the multi-target tracks from frame 1 to frame N+1;
and N is an integer greater than or equal to 2.
3. The multi-target tracking method based on scene motion information as claimed in any one of claims 1 or 2, wherein the data preprocessing of the original point cloud information in step S4 includes the following steps:
step B1: removing the ground by fitting a plane normal vector;
step B2: and adjusting the field angle data of the laser radar according to the calibration relation between the laser and the camera.
4. The multi-target tracking method based on scene motion information as claimed in claim 3, wherein the obtaining step S4 of the pre-trained motion estimation module comprises the following steps:
step A1: synthesizing a scene flow from the disparity maps and depth maps of the FlyingThings3D standard dataset, taking the scene flow as the label files, and extracting part of the dataset and the corresponding label files to obtain a FlyingThings3D-based training set;
step A2: synthesizing a Scene Flow by using a KITTI Scene Flow data segment and a disparity map, taking the Scene Flow as a label file, and extracting all data segments and all label files to obtain a training set based on the KITTI Scene Flow;
step A3: training the motion estimation module by using a training set based on Flyingthings3D, and updating iteration parameters of the module to make the output of the module converge to a first preset threshold value; and globally adjusting the motion estimation module by using a training set based on KITTI Scene Flow, and updating iteration parameters to make the network prediction error converge to a second preset threshold value.
5. The multi-target tracking method based on scene motion information as claimed in claim 4, wherein said step A3 includes:
training a motion estimation module by using a training set based on Flyingthings3D, taking two adjacent frame RGB-D picture pairs corresponding to timestamps in the training set based on Flyingthings3D as input of network pre-training, updating iteration parameters of the module, and enabling the output of the module to converge to a first preset threshold; and globally adjusting the motion estimation module by using a training set based on KITTI Scene Flow, taking two adjacent frames of laser point clouds corresponding to timestamps in the training set based on KITTI Scene Flow as the input of the global adjustment training of the network, updating iteration parameters, and converging the network prediction error to a second preset threshold value.
6. The multi-target tracking method based on scene motion information as claimed in claim 3, wherein the detection result output by the detection front-end module in step S3 includes:
B={bi|i=1…N}
b={c,x,y,z,l,w,h,θ}
where B is the set of all target detection results in one frame of point cloud, b_i is the detection result of the ith target in the frame of point cloud, N is the number of all targets detected in the frame of point cloud, b is the detection result of a single target in the frame of point cloud, c is the attribute of the target, x, y and z are the center three-dimensional coordinates of the detection frame, l, w and h are the length, width and height of the detection frame, and θ is the direction angle of the detection frame.
7. The multi-target tracking method based on scene motion information as claimed in claim 6, wherein the attributes c of the targets comprise: cars, pedestrians, riders.
8. The multi-target tracking method based on scene motion information according to claim 3, wherein the calculation process of the FlowNet3D method in the step S4 comprises:
encoding the geometric information and the color information of the upper and lower frame point clouds to obtain the high-dimensional characteristics of the upper and lower frame point clouds; the high-dimensional features are subjected to cascade coding to obtain inter-frame motion features; and calculating the motion characteristics of the frames through the processes of up-sampling and decoding to obtain the point-by-point motion information estimation of the upper frame and the lower frame.
9. The multi-target tracking method based on scene motion information as claimed in claim 3, characterized in that: the lidar sensor in step S1 includes a 64-line lidar sensor.
CN202110047457.8A 2021-01-14 2021-01-14 Multi-target tracking method based on scene motion information Pending CN112862858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110047457.8A CN112862858A (en) 2021-01-14 2021-01-14 Multi-target tracking method based on scene motion information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110047457.8A CN112862858A (en) 2021-01-14 2021-01-14 Multi-target tracking method based on scene motion information

Publications (1)

Publication Number Publication Date
CN112862858A true CN112862858A (en) 2021-05-28

Family

ID=76005719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110047457.8A Pending CN112862858A (en) 2021-01-14 2021-01-14 Multi-target tracking method based on scene motion information

Country Status (1)

Country Link
CN (1) CN112862858A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140300758A1 (en) * 2013-04-04 2014-10-09 Bao Tran Video processing systems and methods
CN110494863A (en) * 2018-03-15 2019-11-22 辉达公司 Determine autonomous vehicle drives free space
CN110717403A (en) * 2019-09-16 2020-01-21 国网江西省电力有限公司电力科学研究院 Face multi-target tracking method
CN110942449A (en) * 2019-10-30 2020-03-31 华南理工大学 Vehicle detection method based on laser and vision fusion
CN111626217A (en) * 2020-05-28 2020-09-04 宁波博登智能科技有限责任公司 Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450295A (en) * 2021-06-15 2021-09-28 浙江大学 Depth map synthesis method based on difference comparison learning
CN113450295B (en) * 2021-06-15 2022-11-15 浙江大学 Depth map synthesis method based on difference comparison learning
CN113281718A (en) * 2021-06-30 2021-08-20 江苏大学 3D multi-target tracking system and method based on laser radar scene flow estimation
CN113281718B (en) * 2021-06-30 2024-03-22 江苏大学 3D multi-target tracking system and method based on laser radar scene flow estimation
CN114025146A (en) * 2021-11-02 2022-02-08 浙江工商大学 Dynamic point cloud geometric compression method based on scene flow network and time entropy model
CN114025146B (en) * 2021-11-02 2023-11-17 浙江工商大学 Dynamic point cloud geometric compression method based on scene flow network and time entropy model
CN114137562A (en) * 2021-11-30 2022-03-04 合肥工业大学智能制造技术研究院 Multi-target tracking method based on improved global nearest neighbor
CN114137562B (en) * 2021-11-30 2024-04-12 合肥工业大学智能制造技术研究院 Multi-target tracking method based on improved global nearest neighbor
CN115127523A (en) * 2022-05-09 2022-09-30 湖南傲英创视信息科技有限公司 Heterogeneous processing panoramic detection and ranging system based on double-line cameras
CN115127523B (en) * 2022-05-09 2023-08-11 湖南傲英创视信息科技有限公司 Heterogeneous processing panoramic detection and ranging system based on double-line camera

Similar Documents

Publication Publication Date Title
CN112862858A (en) Multi-target tracking method based on scene motion information
CN111337941B (en) Dynamic obstacle tracking method based on sparse laser radar data
CN108152831B (en) Laser radar obstacle identification method and system
Jung et al. A lane departure warning system using lateral offset with uncalibrated camera
CN111932580A (en) Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm
CN109459750A (en) A kind of more wireless vehicle trackings in front that millimetre-wave radar is merged with deep learning vision
Erbs et al. Moving vehicle detection by optimal segmentation of the dynamic stixel world
CN110738690A (en) unmanned aerial vehicle video middle vehicle speed correction method based on multi-target tracking framework
CN104700414A (en) Rapid distance-measuring method for pedestrian on road ahead on the basis of on-board binocular camera
CN110197173B (en) Road edge detection method based on binocular vision
CN114998276B (en) Robot dynamic obstacle real-time detection method based on three-dimensional point cloud
CN103617636A (en) Automatic video-target detecting and tracking method based on motion information and sparse projection
CN114049382A (en) Target fusion tracking method, system and medium in intelligent network connection environment
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN115861968A (en) Dynamic obstacle removing method based on real-time point cloud data
CN110176022B (en) Tunnel panoramic monitoring system and method based on video detection
CN114842340A (en) Robot binocular stereoscopic vision obstacle sensing method and system
CN115908539A (en) Target volume automatic measurement method and device and storage medium
Qing et al. A novel particle filter implementation for a multiple-vehicle detection and tracking system using tail light segmentation
CN113487631B (en) LEGO-LOAM-based adjustable large-angle detection sensing and control method
Wang et al. Geometry constraints-based visual rail track extraction
US20210304518A1 (en) Method and system for generating an environment model for positioning
Fakhfakh et al. Weighted v-disparity approach for obstacles localization in highway environments
Lu et al. Vision-based real-time road detection in urban traffic
Vella et al. Improved detection for wami using background contextual information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210528)