CN113096156A - End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

Info

Publication number
CN113096156A
Authority
CN
China
Prior art keywords
frame
state set
state
bounding box
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110441246.2A
Other languages
Chinese (zh)
Inventor
张宇翔
张昱
张燕咏
吉建民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110441246.2A priority Critical patent/CN113096156A/en
Publication of CN113096156A publication Critical patent/CN113096156A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving. When the sensor data of the t-th frame is received, two processes are executed in parallel: predicting the tracking result of the t-th frame according to the state set of the r-th frame, and detecting the detection bounding boxes of the corresponding objects in the sensor data of the t-th frame. When the state set has been updated to the t-1-th frame, each detection bounding box is associated with the state set of the t-1-th frame to obtain the state set of the t-th frame, and the state set of the t-1-th frame is updated to the state set of the t-th frame. It can be seen that when the sensor data of the t-th frame is received, the tracking result of the t-th frame is predicted directly from the state set of the r-th frame (r < t), so that three-dimensional multi-target tracking is realized and its efficiency is improved, while the state set is updated based on the sensor data of the t-th frame, which ensures the accuracy of the three-dimensional multi-target tracking.

Description

End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving
Technical Field
The application relates to the field of automatic driving, in particular to an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving.
Background
In an automatic driving scene, the road conditions along the route of an autonomous vehicle are constantly changing. If the road conditions cannot be acquired in real time and the path adjusted accordingly in time, a collision may occur. Road conditions can be acquired through three-dimensional multi-target tracking, so three-dimensional target tracking plays a crucial role in subsequent path planning and collision avoidance.
Therefore, how to provide a technical scheme for realizing three-dimensional multi-target tracking is a technical problem which needs to be solved urgently by people in the field at present.
Disclosure of Invention
The technical problem to be solved by the application is to provide an end-to-end real-time three-dimensional multi-target tracking method for automatic driving so as to realize three-dimensional multi-target tracking.
The application also provides an end-to-end real-time three-dimensional multi-target tracking device for automatic driving, which is used for ensuring the realization and application of the method in practice.
An end-to-end real-time three-dimensional multi-target tracking method for automatic driving comprises the following steps:
in the automatic driving process, when the sensor data of the t-th frame is received, the predicting step and the state updating step are executed in parallel; wherein t is a positive integer;
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
Optionally, the predicting the tracking result of the t-th frame according to the state set of the r-th frame includes:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
In the foregoing method, optionally, the calculating a bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame includes:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
Optionally, the associating each detection enclosure frame of the t-th frame with the state set of the t-1 th frame to obtain the state set of the t-th frame includes:
calculating a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame;
constructing an affinity matrix based on each predicted bounding box and each detected bounding box of the t-th frame;
solving the affinity matrix to obtain each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched and a plurality of matching pairs; each matching pair comprises a detection bounding box and a prediction bounding box;
and obtaining the state set of the t frame based on each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched, each matching pair and the state set of the t-1 frame.
Optionally, the obtaining the state set of the t-th frame based on each unmatched detection bounding box, each unmatched prediction bounding box, each matching pair, and the state set of the t-1 th frame includes:
determining respective weights of a detection surrounding frame and a prediction surrounding frame in each matching pair, performing weighted average on the detection surrounding frame and the prediction surrounding frame in each matching pair according to the respective weights of the detection surrounding frame and the prediction surrounding frame in each matching pair to obtain a first surrounding frame, adding one to a first counting result in initial object state data to obtain a new first counting result, and combining the first surrounding frame, the new first counting result, and a motion speed, a track identifier and a second counting result in the initial object state data to form first object state data; the initial object state data is object state data corresponding to a prediction surrounding frame in the matching pair in a state set of a t-1 th frame, the first counting result is used for representing the observed times of the object, and the second counting result is used for representing the continuous unobserved times of the object;
for a detection surrounding frame which is not matched, distributing a track identifier for the detection surrounding frame, initializing the movement speed and a second counting result corresponding to the detection surrounding frame, assigning a first counting result corresponding to the detection surrounding frame to be 1, and forming second object state data by the detection surrounding frame, the track identifier corresponding to the detection surrounding frame, the movement speed, the first counting result and the second counting result;
adding one to a second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame aiming at each prediction bounding box which is not matched to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold, forming third object state data by using the prediction bounding box, the new second counting result, the motion speed, the track identifier and the first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame;
and forming a state set of the t-th frame by using all the first object state data, the second object state data and the third object state data.
In the foregoing method, optionally, the calculating a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1-th frame includes:
calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
and calculating a predicted surrounding frame of each object in the t frame according to the surrounding frame in the object state data of each object included in the state set of the t-1 frame and the second displacement of each object in each dimension.
Optionally, the method for detecting multiple objects included in the sensor data of the tth frame to obtain multiple detection bounding boxes of the tth frame includes:
and calling a three-dimensional target detector, detecting a plurality of objects contained in the sensor data of the t-th frame, and obtaining a plurality of detection surrounding frames of the t-th frame.
An end-to-end real-time three-dimensional multi-target tracking device for automatic driving comprises:
an execution unit for executing the prediction step and the state update step in parallel when the sensor data of the t-th frame is received during the automatic driving; wherein t is a positive integer;
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
a prediction unit, configured to predict a tracking result of the t-th frame according to the state set of the r-th frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
Optionally, in the above apparatus, when predicting the tracking result of the t-th frame according to the state set of the r-th frame, the execution unit is specifically configured to:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
Optionally, in the above apparatus, when calculating the bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame, the execution unit is specifically configured to:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
A storage medium comprising a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the above end-to-end real-time three-dimensional multi-target tracking method for automatic driving.
An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to perform the above end-to-end real-time three-dimensional multi-target tracking method for automatic driving.
Compared with the prior art, the method has the following advantages:
the application provides an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving, wherein in the automatic driving process, when sensor data of a t-th frame is received, a prediction step and a state updating step are executed in parallel; t is a positive integer; the predicting step comprises: acquiring a state set of an r frame; r is smaller than t, the state set of the r frame is a set obtained by latest updating, the state set of the r frame is obtained by updating the sensor data of the r frame, and the state set comprises object state data of a plurality of objects; predicting the tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame; the state updating step comprises the following steps: detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame; associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame; and updating the state set of the t-1 th frame into the state set of the t-th frame. According to the technical scheme, when the sensor data of the t-th frame is received, the tracking result of the t-th frame is predicted directly based on the state set of the r-th frame, so that three-dimensional multi-target tracking is achieved, the efficiency of the three-dimensional multi-target tracking is improved, the state set is updated based on the sensor data of the t-th frame, and the accuracy of the three-dimensional multi-target tracking is guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a method for an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 2 is a flowchart of another method of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 3 is a flowchart of another method of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 4 is an example diagram of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 5 is a diagram illustrating another example of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present disclosure;
FIG. 6 is a schematic structural diagram of an end-to-end real-time three-dimensional multi-target tracking device for automatic driving according to the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an end-to-end real-time three-dimensional multi-target tracking method for automatic driving, which can be applied to a ThunderMOT system. A flow chart of the method is shown in FIG. 1 and specifically comprises the following steps:
s101, receiving sensor data of a t-th frame in the automatic driving process.
In this embodiment, sensor data of a t-th frame is received, where t is a positive integer, and the sensor data includes, but is not limited to, RGB image data, laser point cloud data, and inertial measurement unit data.
The sensor data is sensor data within a preset observation range of the vehicle.
In this embodiment, after receiving the sensor data of the t-th frame, step S102 and step S104 are executed in parallel.
S102, acquiring a state set of the r frame.
In the embodiment, when sensor data of a t-th frame in a preset observation range of a vehicle is received, a state set of the r-th frame is obtained; and the state set of the r frame is obtained by updating the sensor data of the r frame and comprises object state data of a plurality of objects.
In this embodiment, the object state data may be denoted by s. The object state data s includes the motion state data s_m and the control state data s_c of the object. The motion state data s_m comprises a bounding box b and a motion speed v, and the control state data s_c includes a track identifier α, a first counting result β, and a second counting result γ, where the bounding box b = (h, w, l, x, y, z, θ), (h, w, l) indicates the bounding box size, (x, y, z) indicates the center of the bounding box's bottom face, θ represents the bounding box yaw angle, and v = (v_x, v_y, v_z).
In this embodiment, the first counting result β represents the number of times the object corresponding to the object state data has been observed, the second counting result γ represents the number of consecutive times that object has not been observed, and the track identifier uniquely identifies the motion track of the object corresponding to the object state data. It should be noted that different objects correspond to different track identifiers.
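For illustration only, the following minimal sketch (not part of the patent text) shows one way the object state data s described above could be represented in Python; the class and field names are assumptions introduced here.

```python
# Illustrative sketch of the object state s = (s_m, s_c); names are assumptions.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MotionState:                          # s_m
    box: Tuple[float, ...]                  # b = (h, w, l, x, y, z, theta)
    velocity: Tuple[float, float, float]    # v = (v_x, v_y, v_z)

@dataclass
class ControlState:                         # s_c
    track_id: int                           # alpha: unique trajectory identifier
    observed_count: int = 1                 # beta: times the object has been observed
    missed_count: int = 0                   # gamma: consecutive times not observed

@dataclass
class ObjectState:                          # s = (s_m, s_c)
    motion: MotionState
    control: ControlState
```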
S103, predicting the tracking result of the t frame according to the state set of the r frame.
In this embodiment, the position and posture of each object in the t-th frame are predicted according to the object state data of each object in the state set of the r-th frame, that is, the bounding box of each object in the t-th frame is predicted. The predicted bounding box of each object in the t-th frame and the track identifier in the object state data corresponding to that object form the sub-tracking result of the object in the t-th frame, and the sub-tracking results of all objects in the t-th frame constitute the tracking result of the t-th frame. The tracking result is used to represent the position and posture of each object in the real scene in the t-th frame.
Referring to fig. 2, the process of predicting the tracking result of the t-th frame according to the state set of the r-th frame includes:
s201, calculating a bounding box of each object in the t frame based on the object state data of each object in the state set of the r frame.
That is, the bounding box of each object in the t-th frame is calculated according to the object state data of each object included in the state set of the r-th frame; specifically, the bounding box of each object in the t-th frame is predicted based on the bounding box in the object state data of each object included in the state set of the r-th frame.
Specifically, the process of calculating the bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame specifically includes:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
In this embodiment, the first displacement of each object in each dimension is calculated from the motion speed in the object state data of each object included in the state set of the r-th frame and the unit time interval. Specifically, the time interval between the arrival of the sensor data of the t-th frame and the arrival of the sensor data of the r-th frame is calculated, the first displacement of each object in each dimension is calculated based on that time interval and the motion speed in the object state data of each object included in the state set of the r-th frame, and the bounding box of each object in the t-th frame is then calculated from the bounding box in the object state data of each object included in the state set of the r-th frame and the first displacement of each object in each dimension.
The above-mentioned process of calculating the bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame is exemplified as follows:
For example, suppose the state set of the r-th frame includes object state data s corresponding to object A, where s includes a bounding box b = (h, w, l, x, y, z, θ) and a motion speed v = (v_x, v_y, v_z). With a unit time interval of Δt, the first displacement of object A on the x-axis is Δx = v_x(t - r)Δt, the first displacement on the y-axis is Δy = v_y(t - r)Δt, and the first displacement on the z-axis is Δz = v_z(t - r)Δt.
Therefore, the bounding box of object A in the t-th frame is b' = (h, w, l, x + Δx, y + Δy, z + Δz, θ).
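For illustration only, the following sketch (an assumption, not the patent's implementation) expresses the constant-velocity prediction above as code; the function name and the example values are hypothetical.

```python
# Sketch of the first-displacement / bounding-box prediction described above.
def predict_box(box, velocity, r, t, dt):
    """Predict the bounding box of an object at frame t from its state at frame r.

    box      -- (h, w, l, x, y, z, theta) from the state set of frame r
    velocity -- (v_x, v_y, v_z)
    dt       -- unit time interval between two adjacent frames
    """
    h, w, l, x, y, z, theta = box
    vx, vy, vz = velocity
    dx = vx * (t - r) * dt    # first displacement along the x-axis
    dy = vy * (t - r) * dt    # first displacement along the y-axis
    dz = vz * (t - r) * dt    # first displacement along the z-axis
    return (h, w, l, x + dx, y + dy, z + dz, theta)

# Hypothetical example: object A with r = 3, t = 5 and dt = 0.1 s
box_t = predict_box((1.5, 1.6, 4.0, 10.0, 2.0, 0.0, 0.1), (5.0, 0.0, 0.0), 3, 5, 0.1)
```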
S202, aiming at each object, forming a sub-tracking result of the object in the t frame by the bounding box of the object in the t frame and the track identifier in the object state data corresponding to the object in the state set of the r frame.
For each object, the bounding box of the object in the t-th frame and the track identifier in the object state data corresponding to the object in the state set of the r-th frame form a sub-tracking result of the object in the t-th frame, that is, the sub-tracking result of the object in the t-th frame includes the bounding box of the object in the t-th frame and the track identifier corresponding to the object.
And S203, forming the tracking result of the t-th frame by the sub-tracking results.
And forming the tracking result of the t frame by using the sub-tracking results of all the objects in the t frame.
S104, detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame.
And detecting each object included in the sensor data of the t-th frame to obtain a detection surrounding frame of each object included in the sensor data of the t-th frame.
In this embodiment, the process of detecting a plurality of objects included in the sensor data of the t-th frame to obtain a plurality of detection bounding boxes of the t-th frame specifically includes:
and calling a three-dimensional target detector, detecting a plurality of objects contained in the sensor data of the t-th frame, and obtaining a plurality of detection surrounding frames of the t-th frame.
In this embodiment, the sensor data is sent to the three-dimensional target detector, and the three-dimensional target detector detects the sensor data, so that a detection bounding box of each object included in the sensor data fed back by the three-dimensional target detector is obtained.
It should be noted that, in this embodiment, a plurality of three-dimensional target detectors may run simultaneously. After a frame of sensor data is received, the three-dimensional target detector that will detect the sensor data of the current frame is selected in a polling manner, the sensor data of the current frame is sent to the selected three-dimensional target detector, and that detector performs the detection. In this way, any frame of sensor data can be detected without waiting after it reaches a three-dimensional target detector, which improves the detection efficiency of the sensor data.
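For illustration only, a round-robin dispatch over several concurrently running detectors could be sketched as follows; the Detector interface (a detect() method) is an assumption, since the patent does not prescribe a specific detector interface.

```python
# Sketch of the polling (round-robin) dispatch to multiple detectors.
import itertools
from concurrent.futures import ThreadPoolExecutor

class DetectorPool:
    def __init__(self, detectors):
        self._detectors = detectors
        self._next_index = itertools.cycle(range(len(detectors)))
        self._executor = ThreadPoolExecutor(max_workers=len(detectors))

    def submit(self, sensor_data):
        """Pick the next detector in polling order and run detection without waiting."""
        idx = next(self._next_index)
        return self._executor.submit(self._detectors[idx].detect, sensor_data)
```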
S105, when the state set is updated to the t-1 frame, the state set of the t-1 frame is obtained.
In this embodiment, it is determined whether the state set is updated to the t-1 th frame, and if not, the step of determining whether the state set is updated to the t-1 th frame is performed until the state is updated to the t-1 th frame. And when the state set is updated to the t-1 frame, acquiring the state set of the t-1 frame.
In this embodiment, when the state set is updated each time, the old state set is updated to the new state set, and the frame number corresponding to the old state set is updated to the frame number corresponding to the new state set.
In this embodiment, since the state set and the frame number corresponding thereto are stored, whether the state set is updated to the t-1 frame can be determined by determining whether the frame number is t-1.
In this embodiment, this step blocks and waits until the state set has been updated to the t-1-th frame, and only then acquires the state set of the t-1-th frame.
S106, associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame.
In this embodiment, the state set of the t-1-th frame and each detection bounding box of the t-th frame are associated to determine which detection bounding boxes match object state data in the state set of the t-1-th frame and which detection bounding boxes and object state data remain unmatched; the matched and unmatched detection bounding boxes and object state data are then processed to obtain the state set of the t-th frame.
Referring to fig. 3, the process of associating each detection bounding box of the t-th frame with the state set of the t-1 th frame to obtain the state set of the t-th frame specifically includes the following steps:
s301, calculating a predicted surrounding frame of each object in the t frame based on the object state data of each object in the state set of the t-1 frame.
That is, the predicted bounding box of each object in the t-th frame is calculated according to the object state data of each object included in the state set of the t-1-th frame; specifically, the bounding box of each object in the t-th frame is predicted based on the bounding box in the object state data of each object included in the state set of the t-1-th frame.
Specifically, the process of calculating the predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1-th frame specifically includes:
calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
and calculating a predicted surrounding frame of each object in the t frame according to the surrounding frame in the object state data of each object included in the state set of the t-1 frame and the second displacement of each object in each dimension.
In this embodiment, the second displacement of each object in each dimension is calculated according to the motion speed in the object state data of each object included in the state set of the t-1-th frame and the unit time interval, where the unit time interval is the time interval at which the sensor data of two adjacent frames arrive. Specifically, the time interval between the arrival of the sensor data of the t-th frame and the sensor data of the t-1-th frame, namely the unit time interval, is calculated; the second displacement of each object in each dimension is calculated based on the unit time interval and the motion speed in the object state data of each object included in the state set of the t-1-th frame; and the predicted bounding box of each object in the t-th frame is then calculated from the bounding box in the object state data of each object included in the state set of the t-1-th frame and the second displacement of each object in each dimension.
The above process of calculating the predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1-th frame is exemplified as follows:
For example, suppose the state set of the t-1-th frame includes object state data s_1 corresponding to object B, where s_1 comprises a bounding box b_1 = (h, w, l, x, y, z, θ) and a motion speed v_1 = (v_x, v_y, v_z). With a time interval of Δt, the second displacement of object B on the x-axis is Δx_1 = v_xΔt, the second displacement on the y-axis is Δy_1 = v_yΔt, and the second displacement on the z-axis is Δz_1 = v_zΔt. Therefore, the predicted bounding box of object B in the t-th frame is b_1' = (h, w, l, x + Δx_1, y + Δy_1, z + Δz_1, θ).
S302, constructing an affinity matrix based on each prediction bounding box and each detection bounding box of the t-th frame.
In this embodiment, for each detection bounding box of the t-th frame, an intersection ratio of the detection bounding box and each prediction bounding box in a three-dimensional space is calculated; and constructing an affinity matrix based on each intersection ratio obtained by calculation.
The affinity matrix is defined as A_ij = iou3d(b_i^d, b_j^p), where b_i^d ∈ D_t and b_j^p ∈ P_t.
Here, b_i^d denotes the i-th detection bounding box, b_j^p denotes the j-th prediction bounding box, and iou3d(·) denotes the intersection-over-union of two bounding boxes in three-dimensional space.
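For illustration only, the affinity matrix could be built as sketched below. The overlap computation here ignores the yaw angle and treats the boxes as axis-aligned (assuming l extends along x, w along y, and h vertically from the bottom-face center), which is only an approximation of the true three-dimensional intersection-over-union iou3d used by the method.

```python
# Sketch: affinity matrix A[i, j] = overlap(detection i, prediction j).
import numpy as np

def iou3d_axis_aligned(bd, bp):
    hd, wd, ld, xd, yd, zd, _ = bd
    hp, wp, lp, xp, yp, zp, _ = bp
    # overlap along each axis; (x, y, z) is assumed to be the bottom-face center
    ox = max(0.0, min(xd + ld / 2, xp + lp / 2) - max(xd - ld / 2, xp - lp / 2))
    oy = max(0.0, min(yd + wd / 2, yp + wp / 2) - max(yd - wd / 2, yp - wp / 2))
    oz = max(0.0, min(zd + hd, zp + hp) - max(zd, zp))
    inter = ox * oy * oz
    union = hd * wd * ld + hp * wp * lp - inter
    return inter / union if union > 0 else 0.0

def build_affinity(det_boxes, pred_boxes):
    A = np.zeros((len(det_boxes), len(pred_boxes)))
    for i, bd in enumerate(det_boxes):
        for j, bp in enumerate(pred_boxes):
            A[i, j] = iou3d_axis_aligned(bd, bp)
    return A
```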
S303, solving the affinity matrix to obtain each unmatched detection surrounding frame, each unmatched prediction surrounding frame and a plurality of matched pairs.
In this embodiment, the affinity matrix is solved to obtain each unmatched detection enclosure frame, each unmatched prediction enclosure frame, and a plurality of matching pairs, where each matching pair includes one detection enclosure frame and one prediction enclosure frame, and it should be noted that the detection enclosure frame and the prediction enclosure frame in each matching pair are matched with each other.
It should be noted that each detection bounding box has a prediction bounding box matching with it, or has no prediction bounding box matching with it.
In this embodiment, the Hungarian algorithm is adopted to solve the affinity matrix, obtaining each unmatched detection bounding box, each unmatched prediction bounding box, and a plurality of matching pairs.
In this embodiment, the matching problem of the detection bounding box and the previous state set is abstracted to be the maximum bipartite graph matching problem, that is, an affinity matrix is constructed, and the affinity matrix is solved by using the hungarian algorithm, so that an unmatched detection bounding box, an unmatched prediction bounding box and a plurality of matching pairs are obtained.
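For illustration only, the bipartite matching could be solved with SciPy's implementation of the Hungarian algorithm as sketched below; the minimum-overlap threshold iou_min is an assumed hyperparameter, not a value given in the patent.

```python
# Sketch: solve the affinity matrix and split the result into three sets.
from scipy.optimize import linear_sum_assignment

def associate(A, iou_min=0.1):
    """Return (matching pairs, unmatched detection indices, unmatched prediction indices)."""
    rows, cols = linear_sum_assignment(-A)                 # maximize the total overlap
    matches = [(i, j) for i, j in zip(rows, cols) if A[i, j] >= iou_min]
    matched_det = {i for i, _ in matches}
    matched_pred = {j for _, j in matches}
    unmatched_det = [i for i in range(A.shape[0]) if i not in matched_det]
    unmatched_pred = [j for j in range(A.shape[1]) if j not in matched_pred]
    return matches, unmatched_det, unmatched_pred
```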
S304, obtaining a state set of the t frame based on the unmatched detection surrounding frames, the unmatched prediction surrounding frames, the matched pairs and the state set of the t-1 frame.
In this embodiment, the state set of the t-th frame is obtained by processing according to each detection bounding box that is not matched, each prediction bounding box that is not matched, each matching pair, and the previous state set.
In this embodiment, different processing methods are used for processing each detection bounding box that is not matched, each prediction bounding box that is not matched, and each matching pair, so as to obtain a state set of the t-th frame.
Specifically, the process of obtaining the state set of the t-th frame based on the unmatched detection bounding boxes, the unmatched prediction bounding boxes, the matched pairs and the state set of the t-1 th frame includes:
determining respective weights of a detection surrounding frame and a prediction surrounding frame in each matching pair, performing weighted average on the detection surrounding frame and the prediction surrounding frame in each matching pair according to the respective weights of the detection surrounding frame and the prediction surrounding frame in each matching pair to obtain a first surrounding frame, adding one to a first counting result in initial object state data to obtain a new first counting result, and forming the first object state data by using the first surrounding frame, the new first counting result, and a motion speed, a track identifier and a second counting result in the initial object state data; the initial object state data is object state data corresponding to a prediction surrounding frame in a matching pair in a state set of a t-1 th frame, a first counting result is used for representing the observed times of the object, and a second counting result is used for representing the continuous unobserved times of the object;
for the detection surrounding frames which are not matched, distributing track identifiers for the detection surrounding frames, initializing the movement speed and the second counting result corresponding to the detection surrounding frames, assigning the first counting result corresponding to the detection surrounding frames to be 1, and forming second object state data by the detection surrounding frames, the track identifiers corresponding to the detection surrounding frames, the movement speed, the first counting result and the second counting result;
adding one to a second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame aiming at each prediction bounding box which is not matched to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold, forming third object state data by the prediction bounding box, the new second counting result, the motion speed, the track identifier and the first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame;
and forming a state set of the t-th frame by using all the first object state data, the second object state data and the third object state data.
In this embodiment, since the automatic driving scene is highly dynamic, both the observation point and the observed objects move, so a tracked object may disappear from the observation range and a new object may enter the observation range and start to be tracked. This embodiment employs the state machine shown in fig. 4 to manage the appearance and disappearance of objects in the observation range. A threshold N1 is introduced to evaluate whether an object has been observed stably over a period of time: each time the object is observed, its counter is incremented by 1. When the observed count β of the object is less than N1, the object is said to be in an unstable state, and it cannot yet be determined whether the object really exists or is a short-lived false positive caused by prediction errors; when β ≥ N1, the object is in a stable state and can be considered to exist stably within the observation range. A threshold N2 is further introduced to assess whether a previously observed object has disappeared from the observation range: when an observed object is not observed in a certain frame, its unobserved count γ is set to 1, and if the object continues to go unobserved, γ is incremented by 1 for each subsequent frame in which it is not observed; when γ ≥ N2, the object is considered to have completely disappeared from the observation range. Both N1 and N2 are positive integers.
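For illustration only, the life-cycle management above can be sketched with two counters and the thresholds N1 and N2; whether γ is reset when the object is observed again is one interpretation of the text, and the class and state names are illustrative.

```python
# Sketch of the object life-cycle state machine (beta / gamma counters, thresholds N1 / N2).
class TrackLifecycle:
    def __init__(self, n1, n2):
        self.n1, self.n2 = n1, n2
        self.beta, self.gamma = 1, 0      # created the first time the object is observed

    def on_observed(self):
        self.beta += 1
        self.gamma = 0                    # interpretation: a hit clears the miss streak

    def on_missed(self):
        self.gamma += 1

    @property
    def state(self):
        if self.gamma >= self.n2:
            return "disappeared"          # remove the object from the state set
        return "stable" if self.beta >= self.n1 else "unstable"
```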
It should be noted that the unmatched detection bounding box corresponds to a new object entering the observation range. The unmatched predicted bounding box corresponds to the original object that is not currently observed.
In this embodiment, for each matching pair, the weight of the detection bounding box is determined by the uncertainty of the detection bounding box, and the weight of the prediction bounding box is determined by the uncertainty of the prediction bounding box; these weights are computed with a filter, which may optionally be a Kalman filter. Specifically, the detection bounding box and the prediction bounding box in the matching pair are input into the Kalman filter to obtain their respective weights, and the detection bounding box and the prediction bounding box are then weighted-averaged according to those weights to obtain the first bounding box corresponding to the matching pair. In this embodiment, the update of the control state data corresponding to the matching pair corresponds to the directed edge in fig. 5 whose transition condition is "observed", that is, the object is observed: the first counting result in the initial object state data is incremented by one to obtain a new first counting result; the new first counting result, the track identifier and the second counting result contained in the initial object state data form the first control state data; the first bounding box and the motion speed in the initial object state data form the first motion state data; and the first control state data and the first motion state data form the first object state data. The first object state data is the object state data obtained by updating the initial object state data, and the initial object state data is the object state data corresponding to the prediction bounding box of the matching pair in the state set of the t-1-th frame.
In this embodiment, for each unmatched detection bounding box, new object state data needs to be created in the state set of the t-th frame. Specifically, a track identifier is assigned to the detection bounding box, and the motion speed v and the second counting result γ are initialized: the motion speed may be initialized to 0, and the initialization of the second counting result corresponds to the directed edge from the "start" state to the "unstable" state in fig. 4, that is, the second counting result γ may be initialized to 0; the first counting result β is assigned the value 1. The detection bounding box and the motion speed form the second motion state data, the track identifier, the first counting result and the second counting result form the second control state data, and the second motion state data and the second control state data form the second object state data corresponding to the detection bounding box; the second object state data is the newly created object state data.
In this embodiment, for each unmatched prediction bounding box, only the prediction bounding box can be trusted because of the lack of an observation, and the update of the control state data corresponds to the directed edge in fig. 4 whose transfer condition is "not observed". Specifically, the second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1-th frame is incremented by one to obtain a new second counting result, and it is judged whether the new second counting result is smaller than the disappearance threshold, which corresponds to N2 in fig. 4. If the new second counting result is not smaller than the disappearance threshold, the object is considered to have completely disappeared from the observation range and no new object state data needs to be computed for it. If the new second counting result is smaller than the disappearance threshold, the prediction bounding box and the motion speed in the object state data corresponding to the prediction bounding box in the state set of the t-1-th frame form the third motion state data, and the new second counting result together with the track identifier and the first counting result in that object state data form the third control state data; the third motion state data and the third control state data form the third object state data, which is the object state data obtained by updating the object state data corresponding to the prediction bounding box in the state set of the t-1-th frame.
In this embodiment, all the first object state data, all the second object state data, and all the third object state data form a state set of the t-th frame.
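For illustration only, the three update branches above could be condensed as sketched below. The element-wise weighted average stands in for the Kalman-filter fusion, the fixed kalman_gain and VANISH_THRESHOLD values are assumptions, and prev_states is assumed to be ordered so that index j matches the j-th prediction bounding box.

```python
# Sketch of the state-update step: matched pairs, new detections, unmatched predictions.
import itertools

VANISH_THRESHOLD = 3                      # N2 in the text (assumed value)
_track_ids = itertools.count()

def fuse_boxes(det_box, pred_box, gain):
    # stand-in for the Kalman update: element-wise weighted average of the two boxes
    return tuple(gain * d + (1.0 - gain) * p for d, p in zip(det_box, pred_box))

def update_state_set(matches, unmatched_det, unmatched_pred,
                     det_boxes, pred_boxes, prev_states, kalman_gain=0.6):
    new_states = []
    # 1) matched pairs: fuse detection and prediction, increment beta
    for i, j in matches:
        s = dict(prev_states[j])
        s["box"] = fuse_boxes(det_boxes[i], pred_boxes[j], kalman_gain)
        s["beta"] += 1
        new_states.append(s)
    # 2) unmatched detections: create new object states (beta = 1, velocity = 0)
    for i in unmatched_det:
        new_states.append({"box": det_boxes[i], "velocity": (0.0, 0.0, 0.0),
                           "track_id": next(_track_ids), "beta": 1, "gamma": 0})
    # 3) unmatched predictions: trust the prediction, increment gamma, drop if vanished
    for j in unmatched_pred:
        s = dict(prev_states[j])
        s["box"] = pred_boxes[j]
        s["gamma"] += 1
        if s["gamma"] < VANISH_THRESHOLD:
            new_states.append(s)
    return new_states
```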
S107, updating the state set of the t-1 th frame into the state set of the t-th frame.
In this embodiment, the state set is updated, and the state set of the t-1 th frame is updated to the state set of the t-th frame.
According to the end-to-end real-time three-dimensional multi-target tracking method for automatic driving provided above, during automatic driving, when the sensor data of the t-th frame is received, the following are executed in parallel: acquiring the state set of the r-th frame and predicting the tracking result of the t-th frame according to it, and detecting the plurality of objects contained in the sensor data of the t-th frame to obtain the plurality of detection bounding boxes of the t-th frame; when the state set has been updated to the t-1-th frame, each detection bounding box of the t-th frame is associated with the state set of the t-1-th frame to obtain the state set of the t-th frame, and the state set of the t-1-th frame is updated to the state set of the t-th frame. By applying this method, when the sensor data of the t-th frame is received, the tracking result of the t-th frame is predicted directly based on the state set of the r-th frame, without blocking and waiting for the state set to be updated to the state set of the t-th frame before predicting the tracking result. Three-dimensional multi-target tracking is thus realized and its efficiency improved, while the state set is still updated based on the sensor data of the t-th frame, which ensures the accuracy of the three-dimensional multi-target tracking.
Referring to fig. 5, a specific implementation process of the end-to-end real-time three-dimensional multi-target tracking method for automatic driving is illustrated as follows:
and (3) state definition:
given an object o, its state s is represented by a motion state smAnd a control state scThe two parts are as follows: smIncluding the bounding box b ═ (h, w, l, x, y, z, θ) and the motion velocity v ═ v (v, vx,vy,vz) Wherein (h, w, l) is the size of the bounding box, (x, y, z) is the central point of the bottom surface of the bounding box, and theta is the yaw angle of the bounding box; scIncluding the trajectory identifier alpha, the counter beta where the object o is observed and the counter gamma where it is not observed. The states of all tracked objects in the t-th frame constitute a set St={si|i=1,...,ntIn which n istRepresenting the number of tracked objects in the t-th frame.
Specifically, after the sensor data I_t of the t-th frame is received, the fast prediction module and the slow update module are started and executed in parallel.
The slow detection module first calls the three-dimensional target detector on I_t to obtain the object bounding box set D_t; the slow tracking module then associates D_t with the previous state set S_{t-1} and updates it to obtain the t-th frame state set S_t. The fast prediction module directly predicts the bounding boxes of the tracked objects in the t-th frame according to the current state set S_r and outputs them, together with the corresponding trajectory identifiers α, as the final result, where r ≤ t because the state update rate may be slower than the data arrival rate.
Wherein, for the fast prediction module:
since the update rate of the state set by the slow tracking module is likely to be slower than the data arrival rate, when the fast prediction task of the t-th frame is received, the globally shared state set, i.e., the pool, may be updated only to the r-th frame (r)<t), is denoted as Sr
At this time, the set of wait states S is blocked differently from the conventional detection tracking paradigmrIs updated to StThen, the fast prediction module is not blocked but outputs the tracking result of the t-th framerApplying constant speed model to predict the bounding box of the object in the t-th frame and quickly giving the t-th frameAnd tracking the result.
Specifically, assume the state s ∈ S_r of a certain object and that the arrival time interval of each frame of data is fixed at Δt. According to the constant-velocity model, the displacement of the object from the r-th frame to the t-th frame in the observation coordinate system is estimated as:
Δx = v_x(t - r)Δt, Δy = v_y(t - r)Δt, Δz = v_z(t - r)Δt (1)
The bounding box of the object in the t-th frame is then given as b' = (h, w, l, x + Δx, y + Δy, z + Δz, θ).
For the slow detection module:
and in the target detection step, a three-dimensional target detector is called to obtain an object bounding box set in the input data. The specific implementation of the detection step is not limited, and the detection step can be integrated into a ThunderMOT system as long as the definition of the input and output interfaces of the detection step is met. The thunderMOT can flexibly access different detectors according to scenes, and hot plug of the detection model is achieved.
For the slow tracking module:
the slow tracking module comprises a data association step and a state updating step.
The data association step specifically comprises:
the bounding box detected by the slow detection module is collected DtAnd state set St-1And (6) matching. It should be noted that this step requires blocking the wait for the slow trace module to update to the state set St-1And then can continue to execute.
Specifically, matching the bounding box set D_t of the t-th frame against the state set S_{t-1} of the t-1-th frame is abstracted as a maximum bipartite graph matching problem and solved with the Hungarian algorithm, where the affinity matrix A (of size m_t × n_{t-1}) is defined as formula (2), m_t is the size of the bounding box set D_t, and n_{t-1} is the size of the state set S_{t-1}. P_t is the t-th frame bounding box set predicted according to formula (1), and the function iou3d(·) denotes the intersection-over-union of two bounding boxes in three-dimensional space.
A_ij = iou3d(b_i^d, b_j^p), where b_i^d ∈ D_t, b_j^p ∈ P_t (2)
The output of the data association step is three sets: the matching set M_t, the set of unmatched bounding boxes D_t', and the unmatched state set S_{t-1}'.
And a state updating step:
for each matching tuple (b)t d,bt p,st-1)∈Mt,st-1、bt d、bt pRespectively representing the state of an object o in the t-1 th frame and the observation surrounding frame and the prediction surrounding frame of the object corresponding to the t-th frame. Observation enclosure frame bt dAs state st-1Calling a Kalman filter to carry out state estimation according to the observation result of the object in the t-th frame to obtain the motion state s of the object in the t-th framet m. According to Bayes rule, updated motion state st mIs bt dAnd bt pWeighted average of the state space, the weight (i.e. Kalman gain) is represented by bt dAnd bt pIs determined. Control state from
Figure BDA0003035139920000171
To
Figure BDA0003035139920000172
The update of (b) corresponds to the directed edge in fig. 5 with the transfer condition "observed".
Bounding boxes for each unmatched detection
Figure BDA0003035139920000173
Set of states S at the t-th frametTo create a new object state st
Figure BDA0003035139920000174
B in (1) is initialized to
Figure BDA0003035139920000175
V in (1) is initialized to 0. The control state initialization corresponds to a directed edge from the "start" state to the "unstable" state in fig. 5.
For each unmatched tracked object state
Figure BDA0003035139920000176
Due to lack of observation results, only the bounding box predicted according to formula (1) can be completely trusted
Figure BDA0003035139920000177
Is directly to
Figure BDA0003035139920000178
B in (1) is set as
Figure BDA0003035139920000179
Thereby obtaining the motion state of the t-th frame
Figure BDA00030351399200001710
Control state from
Figure BDA00030351399200001711
To
Figure BDA00030351399200001712
The update of (b) corresponds to a directed edge in fig. 5 for which the transfer condition is "not observed".
In this embodiment, apart from avoiding read-write conflicts on the state set, the slow tracking module and the fast prediction module have no explicit synchronization, so the execution time of the fast prediction module is the response delay of each frame of data. The fast prediction module is implemented based on motion-model prediction, and compared with the traditional detect-then-track approach, the motion-prediction cost per object is very small.
In this embodiment, in order to facilitate plugging multiple deep-learning-based 3D target detection models that depend on different software environments (e.g., different Python versions, different deep learning frameworks, and different versions of the same framework) into the system, slow target detection is implemented as a local server, with HTTP chosen as the application-layer communication protocol. When each frame of data arrives, the fast prediction task and the slow tracking task are submitted to the thread pool to be executed in parallel. The slow tracking module, acting as a client, sends a request to the detection server, and after obtaining the detection result of the frame, it calls the associate() and update() methods in sequence to associate and update the object states.
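For illustration only, the per-frame dispatch could be sketched as below. The server URL, the JSON payload and the state-pool methods latest()/wait_for()/publish() are hypothetical names introduced here; fast_predict and associate_and_update are passed in as the prediction and tracking callbacks.

```python
# Sketch: submit the fast prediction task and the slow tracking task for each frame.
import requests
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
DETECT_URL = "http://localhost:8080/detect"       # assumed local detection server

def on_frame(t, data_path, state_pool, fast_predict, associate_and_update):
    """Dispatch both tasks for frame t; the returned future holds the tracking output."""
    def slow_track():
        # the detection server reads the raw sensor data from the shared file system
        boxes = requests.post(DETECT_URL, json={"path": data_path}).json()["boxes"]
        prev_states = state_pool.wait_for(t - 1)  # block until the frame t-1 state set
        state_pool.publish(t, associate_and_update(boxes, prev_states))

    def fast_task():
        states, r = state_pool.latest()           # may lag behind: r <= t
        return fast_predict(states, r, t)         # constant-velocity prediction

    pool.submit(slow_track)
    return pool.submit(fast_task)
```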
In this embodiment, the object state is shared by two types of tasks, the fast prediction task and the slow tracking task: the fast prediction task calls the object state's prediction method prediction(), and the slow tracking task calls its motion-state update method update(). ThunderMOT ensures consistency of the motion state under concurrency by introducing a read-copy-update (RCU) lock at the object level, while ensuring that fast prediction tasks do not time out because they are blocked by a slow tracking task's write to the object state. Under this mechanism, the fast prediction task acts as a reader and can access the motion state of an object without acquiring any lock, while the slow update task acts as a writer: it first copies the state, then modifies the copy, and finally, at an appropriate moment, changes the pointer to the historical state so that it points to the updated state. The prediction and update of different objects are packaged as independent tasks submitted to the thread pool, and they do not affect one another.
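For illustration only, the object-level read-copy-update idea can be sketched as below: readers dereference the current state without locking, while the writer modifies a copy and then publishes it by swapping the reference (atomic under CPython's reference assignment). The class is an assumption, not the system's actual implementation.

```python
# Sketch of a per-object read-copy-update cell.
import copy

class RcuCell:
    def __init__(self, state):
        self._state = state

    def read(self):
        # reader (fast prediction task): no lock; may briefly see the previous version
        return self._state

    def update(self, mutate):
        # writer (slow tracking task): copy, modify the copy, then swap the reference
        new_state = copy.deepcopy(self._state)
        mutate(new_state)
        self._state = new_state
```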
In this embodiment, the ThunderMOT system, the detection server, the tracking server, and the data sensor share a file system. Raw sensor data is passed among the components by passing a storage path in the file system rather than the data itself, which avoids the large communication overhead of explicitly transferring the sensor data as a byte stream.
In order to better illustrate the effect of the end-to-end real-time three-dimensional multi-target tracking method for automatic driving provided by the embodiment of the application, the inventors evaluated the ThunderMOT system experimentally in terms of both tracking speed and tracking precision.
The experimental environment of the application is a server configured with 2 Intel Xeon E5-2690 v3 CPUs (each with 12 physical cores and hyper-threading enabled), 4 GeForce RTX 2080 Ti GPUs (each with 4352 cores and 12 GB of video memory), and 256 GB of memory. Server software configuration: the operating system is Ubuntu 18.04, the Python version is 3.7.7, and the CUDA version is 10.2. The KITTI multi-target tracking dataset is used for the tracking accuracy evaluation.
The evaluation results show that, on the KITTI multi-object tracking dataset, the average delay is 2.0 milliseconds, the worst-case delay is 8.6 milliseconds, and the multi-object tracking accuracy MOTA (Multiple Object Tracking Accuracy) reaches 83.71 percent, demonstrating excellent tracking speed and tracking accuracy.
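For reference, MOTA on such benchmarks is conventionally computed as MOTA = 1 - Σ_t(FN_t + FP_t + IDSW_t) / Σ_t GT_t, where FN_t, FP_t, IDSW_t and GT_t are respectively the false negatives, false positives, identity switches and ground-truth objects in frame t; higher values are better, and all three error types are penalized.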
Corresponding to the method shown in fig. 1, an embodiment of the present application further provides an end-to-end real-time three-dimensional multi-target tracking apparatus for automatic driving, which is used to specifically implement the method shown in fig. 1; a schematic structural diagram of the apparatus is shown in fig. 6, and the apparatus specifically includes:
an execution unit 601 for executing the prediction step and the state update step in parallel when the sensor data of the t-th frame is received during the automatic driving; wherein t is a positive integer;
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
With the end-to-end real-time three-dimensional multi-target tracking apparatus for automatic driving described above, when the sensor data of the t-th frame is received, the tracking result of the t-th frame is predicted directly from the state set of the r-th frame, without blocking to wait for the state set to be updated to the state set of the t-th frame; three-dimensional multi-target tracking is thereby achieved and its efficiency improved, while the state set is still updated based on the sensor data of the t-th frame, which guarantees the accuracy of three-dimensional multi-target tracking.
In an embodiment of the application, based on the foregoing solution, the execution unit 601 is configured to predict the tracking result of the t-th frame according to the state set of the r-th frame, and includes the execution unit 601 specifically configured to:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
In an embodiment of the application, based on the foregoing solution, the executing unit 601 is configured to calculate a bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame, and includes the executing unit 601 specifically configured to:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
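A minimal sketch of the prediction step described in the two preceding paragraphs follows; the bounding-box parameterization (x, y, z, l, w, h, yaw), the per-frame velocity (vx, vy, vz) and the constant-velocity motion model are assumptions of this sketch, not fixed by the text above.

```python
# Sketch of the fast prediction step: constant-velocity displacement applied to
# the bounding box, paired with the object's track identifier. The box layout
# and attribute names are illustrative assumptions.
def predict_bbox(bbox, velocity, n_frames):
    x, y, z, l, w, h, yaw = bbox
    vx, vy, vz = velocity
    # First displacement of the object in each dimension over n_frames frames.
    dx, dy, dz = vx * n_frames, vy * n_frames, vz * n_frames
    return (x + dx, y + dy, z + dz, l, w, h, yaw)

def predict_tracking_result(state_set_r, t, r):
    # Each sub-tracking result pairs the predicted box with the track identifier.
    return [(predict_bbox(s.bbox, s.velocity, t - r), s.track_id)
            for s in state_set_r]
```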
In an embodiment of the present application, based on the foregoing scheme, the executing unit 601 is configured to associate each detection enclosure frame of the t-th frame with the state set of the t-1 th frame to obtain the state set of the t-th frame, and includes the executing unit 601 specifically configured to:
calculating a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame;
constructing an affinity matrix based on each predicted bounding box and each detected bounding box of the t-th frame;
solving the affinity matrix to obtain each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched and a plurality of matching pairs; each matching pair comprises a detection bounding box and a prediction bounding box;
and obtaining the state set of the t frame based on each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched, each matching pair and the state set of the t-1 frame.
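A minimal sketch of this association step is given below; the affinity measure (3D center distance expressed as a cost), the gating threshold, and the use of the Hungarian solver from scipy are assumptions, since the text above does not fix them.

```python
# Sketch of associating predicted boxes with detected boxes; metric, threshold
# and solver choice are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pred_boxes, det_boxes, max_dist=2.0):
    if len(pred_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(det_boxes))), list(range(len(pred_boxes)))
    # Cost matrix: Euclidean distance between predicted and detected box centers.
    cost = np.linalg.norm(
        np.asarray(pred_boxes)[:, None, :3] - np.asarray(det_boxes)[None, :, :3],
        axis=-1)
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_preds = {r for r, _ in matches}
    matched_dets = {c for _, c in matches}
    unmatched_dets = [c for c in range(len(det_boxes)) if c not in matched_dets]
    unmatched_preds = [r for r in range(len(pred_boxes)) if r not in matched_preds]
    return matches, unmatched_dets, unmatched_preds
```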
In an embodiment of the present application, based on the foregoing scheme, the executing unit 601 is configured to obtain the state set of the t-th frame based on each detection bounding box that is not matched, each prediction bounding box that is not matched, each matching pair, and the state set of the t-1 th frame, and includes the executing unit 601 specifically configured to:
determining respective weights of the detection surrounding frame and the prediction surrounding frame in each matching pair, performing a weighted average of the detection surrounding frame and the prediction surrounding frame in the matching pair according to these weights to obtain a first surrounding frame, adding one to a first counting result in initial object state data to obtain a new first counting result, and combining the first surrounding frame, the new first counting result, and the motion speed, the track identifier and the second counting result in the initial object state data to form first object state data; the initial object state data is the object state data corresponding to the prediction surrounding frame in the matching pair in the state set of the t-1-th frame, the first counting result is used for representing the number of times the object has been observed, and the second counting result is used for representing the number of consecutive times the object has not been observed;
for a detection surrounding frame which is not matched, distributing a track identifier for the detection surrounding frame, initializing the movement speed and a second counting result corresponding to the detection surrounding frame, assigning a first counting result corresponding to the detection surrounding frame to be 1, and forming second object state data by the detection surrounding frame, the track identifier corresponding to the detection surrounding frame, the movement speed, the first counting result and the second counting result;
adding one to a second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame aiming at each prediction bounding box which is not matched to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold, forming third object state data by using the prediction bounding box, the new second counting result, the motion speed, the track identifier and the first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame;
and forming a state set of the t-th frame by using all the first object state data, the second object state data and the third object state data.
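The three update rules above could be sketched as follows; the fixed fusion weights, the vanish-threshold value, and the data layout are illustrative assumptions of this sketch.

```python
# Sketch of building the frame-t state set from matches, unmatched detections
# and unmatched predictions; weights, threshold and layout are assumptions.
from dataclasses import dataclass, replace
from itertools import count

_next_track_id = count(1)

@dataclass
class ObjState:
    bbox: tuple      # (x, y, z, l, w, h, yaw)
    velocity: tuple  # (vx, vy, vz)
    track_id: int
    hits: int        # first count: number of frames the object was observed
    misses: int      # second count: consecutive frames the object was unobserved

def build_state_set(prev_states, pred_boxes, det_boxes, matches,
                    unmatched_dets, unmatched_preds,
                    w_det=0.6, w_pred=0.4, vanish_thresh=3):
    new_states = []
    # Matched pairs: fuse detection and prediction by weighted average
    # (angle averaging is simplified here) and increment the observed count.
    for p, d in matches:
        fused = tuple(w_det * dv + w_pred * pv
                      for dv, pv in zip(det_boxes[d], pred_boxes[p]))
        s = prev_states[p]
        new_states.append(replace(s, bbox=fused, hits=s.hits + 1))
    # Unmatched detections: start new tracks with initialized velocity and counts.
    for d in unmatched_dets:
        new_states.append(ObjState(det_boxes[d], (0.0, 0.0, 0.0),
                                   next(_next_track_id), hits=1, misses=0))
    # Unmatched predictions: keep coasting tracks until the vanish threshold.
    for p in unmatched_preds:
        s = prev_states[p]
        if s.misses + 1 < vanish_thresh:
            new_states.append(replace(s, bbox=pred_boxes[p], misses=s.misses + 1))
    return new_states
```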
In an embodiment of the application, based on the foregoing solution, the execution unit 601 is configured to calculate a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1-th frame, and includes the execution unit specifically configured to:
calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
and calculating a predicted surrounding frame of each object in the t frame according to the surrounding frame in the object state data of each object included in the state set of the t-1 frame and the second displacement of each object in each dimension.
In an embodiment of the application, based on the foregoing scheme, the execution unit is configured to detect a plurality of objects included in the sensor data of the t-th frame, and obtain a plurality of detection bounding boxes of the t-th frame, and the execution unit is specifically configured to:
and calling a three-dimensional target detector, detecting a plurality of objects contained in the sensor data of the t-th frame, and obtaining a plurality of detection surrounding frames of the t-th frame.
An embodiment of the present application further provides a storage medium comprising stored instructions, wherein, when the instructions are run, a device on which the storage medium is located is controlled to execute the end-to-end real-time three-dimensional multi-target tracking method for automatic driving described above.
An electronic device is provided in an embodiment of the present application, and its structural schematic diagram is shown in fig. 7, which specifically includes a memory 701 and one or more instructions 702, where the one or more instructions 702 are stored in the memory 701, and are configured to be executed by one or more processors 703 to perform the following operations according to the one or more instructions 702:
in the automatic driving process, when the sensor data of the t-th frame is received, the predicting step and the state updating step are executed in parallel; wherein, t is a positive integer,
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The end-to-end real-time three-dimensional multi-target tracking method and apparatus for automatic driving provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the embodiments is only intended to help in understanding the method and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An end-to-end real-time three-dimensional multi-target tracking method for automatic driving is characterized by comprising the following steps:
in the automatic driving process, when the sensor data of the t-th frame is received, the predicting step and the state updating step are executed in parallel; wherein, t is a positive integer,
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
2. The method according to claim 1, wherein said predicting the tracking result of the t frame according to the state set of the r frame comprises:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
3. The method according to claim 2, wherein the calculating the bounding box of each object at the t-th frame based on the object state data of each object included in the state set of the r-th frame comprises:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
4. The method of claim 1, wherein associating each detection bounding box of the tth frame with the state set of the t-1 frame to obtain the state set of the tth frame comprises:
calculating a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame;
constructing an affinity matrix based on each predicted bounding box and each detected bounding box of the t-th frame;
solving the affinity matrix to obtain each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched and a plurality of matching pairs; each matching pair comprises a detection bounding box and a prediction bounding box;
and obtaining the state set of the t frame based on each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched, each matching pair and the state set of the t-1 frame.
5. The method of claim 4, wherein the deriving the state set of the t-th frame based on the unmatched detection bounding boxes, the unmatched prediction bounding boxes, the matched pairs, and the state set of the t-1-th frame comprises:
determining respective weights of a detection surrounding frame and a prediction surrounding frame in each matching pair, performing weighted average on the detection surrounding frame and the prediction surrounding frame in each matching pair according to the respective weights of the detection surrounding frame and the prediction surrounding frame in each matching pair to obtain a first surrounding frame, adding one to a first counting result in initial object state data to obtain a new first counting result, and combining the first surrounding frame, the new first counting result, and a motion speed, a track identifier and a second counting result in the initial object state data to form first object state data; the initial object state data is object state data corresponding to a prediction surrounding frame in the matching pair in a state set of a t-1 th frame, the first counting result is used for representing the observed times of the object, and the second counting result is used for representing the continuous unobserved times of the object;
for a detection surrounding frame which is not matched, distributing a track identifier for the detection surrounding frame, initializing the movement speed and a second counting result corresponding to the detection surrounding frame, assigning a first counting result corresponding to the detection surrounding frame to be 1, and forming second object state data by the detection surrounding frame, the track identifier corresponding to the detection surrounding frame, the movement speed, the first counting result and the second counting result;
adding one to a second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame aiming at each prediction bounding box which is not matched to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold, forming third object state data by using the prediction bounding box, the new second counting result, the motion speed, the track identifier and the first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame;
and forming a state set of the t-th frame by using all the first object state data, the second object state data and the third object state data.
6. The method according to claim 4, wherein calculating the predicted bounding box of each object at the t-th frame based on the object state data of each object included in the state set of the t-1-th frame comprises:
calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
and calculating a predicted surrounding frame of each object in the t frame according to the surrounding frame in the object state data of each object included in the state set of the t-1 frame and the second displacement of each object in each dimension.
7. The method of claim 1, wherein detecting the plurality of objects contained in the sensor data of the tth frame, resulting in a plurality of detection bounding boxes of the tth frame, comprises:
and calling a three-dimensional target detector, detecting a plurality of objects included in the sensor data of the t-th frame, and obtaining a plurality of detection surrounding frames of the t-th frame.
8. An end-to-end real-time three-dimensional multi-target tracking device for automatic driving is characterized by comprising:
an execution unit for executing the prediction step and the state update step in parallel when the sensor data of the t-th frame is received during the automatic driving; wherein t is a positive integer;
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
9. The apparatus as claimed in claim 8, wherein the execution unit is configured to predict the tracking result of the t frame according to the state set of the r frame, and the execution unit is specifically configured to:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
10. The apparatus according to claim 9, wherein the execution unit is configured to calculate a bounding box of each object at a t-th frame based on the object state data of each object included in the state set of the r-th frame, and the execution unit is specifically configured to:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
CN202110441246.2A 2021-04-23 2021-04-23 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving Pending CN113096156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110441246.2A CN113096156A (en) 2021-04-23 2021-04-23 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110441246.2A CN113096156A (en) 2021-04-23 2021-04-23 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

Publications (1)

Publication Number Publication Date
CN113096156A true CN113096156A (en) 2021-07-09

Family

ID=76679720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110441246.2A Pending CN113096156A (en) 2021-04-23 2021-04-23 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

Country Status (1)

Country Link
CN (1) CN113096156A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185503A1 (en) * 2016-04-29 2017-11-02 高鹏 Target tracking method and apparatus
CN106709938A (en) * 2016-11-18 2017-05-24 电子科技大学 Multi-target tracking method based on improved TLD (tracking-learning-detected)
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN111932580A (en) * 2020-07-03 2020-11-13 江苏大学 Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘玉杰; 窦长红; 赵其鲁; 李宗民: "Online Multi-Object Tracking Based on State Prediction and Motion Structure", Journal of Computer-Aided Design & Computer Graphics, no. 02 *
郑晓萌; 张德海: "Moving Target Matching and Tracking Algorithm Based on Effective Feature Points", Electronic Design Engineering, no. 20 *

Similar Documents

Publication Publication Date Title
US20210190497A1 (en) Simultaneous location and mapping (slam) using dual event cameras
US11024041B2 (en) Depth and motion estimations in machine learning environments
US20210090327A1 (en) Neural network processing for multi-object 3d modeling
CN108805898B (en) Video image processing method and device
Collins et al. Infinitesimal plane-based pose estimation
WO2020232174A1 (en) Distributed pose estimation
US20190371024A1 (en) Methods and Systems For Exploiting Per-Pixel Motion Conflicts to Extract Primary and Secondary Motions in Augmented Reality Systems
Xu et al. FollowUpAR: Enabling follow-up effects in mobile AR applications
CN111127584A (en) Method and device for establishing visual map, electronic equipment and storage medium
CN112540609A (en) Path planning method and device, terminal equipment and storage medium
JP2020052484A (en) Object recognition camera system, relearning system, and object recognition program
CN115147683A (en) Pose estimation network model training method, pose estimation method and device
US20200164508A1 (en) System and Method for Probabilistic Multi-Robot Positioning
CN113091736A (en) Robot positioning method, device, robot and storage medium
CN113096156A (en) End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving
CN117437348A (en) Computing device and model generation method
WO2020103495A1 (en) Exposure duration adjustment method and device, electronic apparatus, and storage medium
Ayush et al. Real time visual SLAM using cloud computing
Nicolai et al. Model-free (human) tracking based on ground truth with time delay: A 3D camera based approach for minimizing tracking latency and increasing tracking quality
US20180001821A1 (en) Environment perception using a surrounding monitoring system
US11921824B1 (en) Sensor data fusion using cross-modal transformer
CN109410304A (en) A kind of determining method, device and equipment of projection
CN115937383B (en) Method, device, electronic equipment and storage medium for rendering image
US10636205B2 (en) Systems and methods for outlier edge rejection
US20230154026A1 (en) Method and apparatus with pose estimation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination