CN113096156A - End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

Info

Publication number
CN113096156A
Authority
CN
China
Prior art keywords
frame
state set
state
bounding box
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110441246.2A
Other languages
Chinese (zh)
Inventor
张宇翔
张昱
张燕咏
吉建民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110441246.2A priority Critical patent/CN113096156A/en
Publication of CN113096156A publication Critical patent/CN113096156A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving. When the sensor data of the t-th frame is received, two processes are executed in parallel: predicting the tracking result of the t-th frame according to the state set of the r-th frame, and detecting the detection bounding boxes of the corresponding objects in the sensor data of the t-th frame. When the state set has been updated to the t-1-th frame, each detection bounding box is associated with the state set of the t-1-th frame to obtain the state set of the t-th frame, and the state set of the t-1-th frame is updated to the state set of the t-th frame. It can be seen that when the sensor data of the t-th frame is received, the tracking result of the t-th frame is predicted directly from the state set of the r-th frame (r < t), so that three-dimensional multi-target tracking is realized and its efficiency is improved, while the state set is updated based on the sensor data of the t-th frame, which ensures the accuracy of the three-dimensional multi-target tracking.

Description

End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving
Technical Field
The application relates to the field of automatic driving, in particular to an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving.
Background
In an automatic driving scene, the road conditions along the route of an autonomous vehicle are constantly changing. If the road conditions cannot be acquired in real time and the path adjusted accordingly in time, a collision may occur. Road conditions can be acquired through three-dimensional multi-target tracking, so three-dimensional target tracking plays a crucial role in subsequent path planning and collision avoidance.
Therefore, how to provide a technical scheme for realizing three-dimensional multi-target tracking is a technical problem which needs to be solved urgently by people in the field at present.
Disclosure of Invention
The technical problem to be solved by the application is to provide an end-to-end real-time three-dimensional multi-target tracking method for automatic driving so as to realize three-dimensional multi-target tracking.
The application also provides an end-to-end real-time three-dimensional multi-target tracking device for automatic driving, which is used for ensuring the realization and application of the method in practice.
An end-to-end real-time three-dimensional multi-target tracking method for automatic driving comprises the following steps:
in the automatic driving process, when the sensor data of the t-th frame is received, the predicting step and the state updating step are executed in parallel; wherein t is a positive integer;
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
Optionally, the predicting the tracking result of the t-th frame according to the state set of the r-th frame includes:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
In the foregoing method, optionally, the calculating a bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame includes:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
Optionally, the associating each detection enclosure frame of the t-th frame with the state set of the t-1 th frame to obtain the state set of the t-th frame includes:
calculating a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame;
constructing an affinity matrix based on each predicted bounding box and each detected bounding box of the t-th frame;
solving the affinity matrix to obtain each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched and a plurality of matching pairs; each matching pair comprises a detection bounding box and a prediction bounding box;
and obtaining the state set of the t frame based on each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched, each matching pair and the state set of the t-1 frame.
Optionally, the obtaining the state set of the t-th frame based on each unmatched detection bounding box, each unmatched prediction bounding box, each matching pair, and the state set of the t-1 th frame includes:
determining respective weights of a detection surrounding frame and a prediction surrounding frame in each matching pair, performing weighted average on the detection surrounding frame and the prediction surrounding frame in each matching pair according to the respective weights of the detection surrounding frame and the prediction surrounding frame in each matching pair to obtain a first surrounding frame, adding one to a first counting result in initial object state data to obtain a new first counting result, and combining the first surrounding frame, the new first counting result, and a motion speed, a track identifier and a second counting result in the initial object state data to form first object state data; the initial object state data is object state data corresponding to a prediction surrounding frame in the matching pair in a state set of a t-1 th frame, the first counting result is used for representing the observed times of the object, and the second counting result is used for representing the continuous unobserved times of the object;
for a detection surrounding frame which is not matched, distributing a track identifier for the detection surrounding frame, initializing the movement speed and a second counting result corresponding to the detection surrounding frame, assigning a first counting result corresponding to the detection surrounding frame to be 1, and forming second object state data by the detection surrounding frame, the track identifier corresponding to the detection surrounding frame, the movement speed, the first counting result and the second counting result;
adding one to a second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame aiming at each prediction bounding box which is not matched to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold, forming third object state data by using the prediction bounding box, the new second counting result, the motion speed, the track identifier and the first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame;
and forming a state set of the t-th frame by using all the first object state data, the second object state data and the third object state data.
In the foregoing method, optionally, the calculating a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1-th frame includes:
calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
and calculating a predicted surrounding frame of each object in the t frame according to the surrounding frame in the object state data of each object included in the state set of the t-1 frame and the second displacement of each object in each dimension.
Optionally, the method for detecting multiple objects included in the sensor data of the tth frame to obtain multiple detection bounding boxes of the tth frame includes:
and calling a three-dimensional target detector, detecting a plurality of objects contained in the sensor data of the t-th frame, and obtaining a plurality of detection surrounding frames of the t-th frame.
An end-to-end real-time three-dimensional multi-target tracking device for automatic driving comprises:
an execution unit for executing the prediction step and the state update step in parallel when the sensor data of the t-th frame is received during the automatic driving; wherein t is a positive integer;
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
a prediction unit, configured to predict a tracking result of the t-th frame according to the state set of the r-th frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
Optionally, in the above apparatus, when predicting the tracking result of the t-th frame according to the state set of the r-th frame, the execution unit is specifically configured to:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
Optionally, in the above apparatus, when calculating the bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame, the execution unit is specifically configured to:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
A storage medium comprising a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the above end-to-end real-time three-dimensional multi-target tracking method for automatic driving.
An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to perform the above end-to-end real-time three-dimensional multi-target tracking method for automatic driving.
Compared with the prior art, the method has the following advantages:
the application provides an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving, wherein in the automatic driving process, when sensor data of a t-th frame is received, a prediction step and a state updating step are executed in parallel; t is a positive integer; the predicting step comprises: acquiring a state set of an r frame; r is smaller than t, the state set of the r frame is a set obtained by latest updating, the state set of the r frame is obtained by updating the sensor data of the r frame, and the state set comprises object state data of a plurality of objects; predicting the tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame; the state updating step comprises the following steps: detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame; associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame; and updating the state set of the t-1 th frame into the state set of the t-th frame. According to the technical scheme, when the sensor data of the t-th frame is received, the tracking result of the t-th frame is predicted directly based on the state set of the r-th frame, so that three-dimensional multi-target tracking is achieved, the efficiency of the three-dimensional multi-target tracking is improved, the state set is updated based on the sensor data of the t-th frame, and the accuracy of the three-dimensional multi-target tracking is guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a method for an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 2 is a flowchart of another method of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 3 is a flowchart of another method of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 4 is an example diagram of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 5 is a diagram illustrating another example of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present disclosure;
FIG. 6 is a schematic structural diagram of an end-to-end real-time three-dimensional multi-target tracking device for automatic driving according to the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an end-to-end real-time three-dimensional multi-target tracking method for automatic driving, which can be applied to a ThunderMOT system. A flow chart of the method is shown in FIG. 1 and specifically comprises the following steps:
s101, receiving sensor data of a t-th frame in the automatic driving process.
In this embodiment, sensor data of a t-th frame is received, where t is a positive integer, and the sensor data includes, but is not limited to, RGB image data, laser point cloud data, and inertial measurement unit data.
The sensor data is sensor data within a preset observation range of the vehicle.
In this embodiment, after receiving the sensor data of the t-th frame, step S102 and step S104 are executed in parallel.
S102, acquiring a state set of the r frame.
In the embodiment, when sensor data of a t-th frame in a preset observation range of a vehicle is received, a state set of the r-th frame is obtained; and the state set of the r frame is obtained by updating the sensor data of the r frame and comprises object state data of a plurality of objects.
In this embodiment, the object state data may be denoted by s. The object state data s includes the motion state data s_m and the control state data s_c of the object. The motion state data s_m comprises a bounding box b and a motion speed v, and the control state data s_c includes a track identifier α, a first counting result β, and a second counting result γ, where the bounding box b = (h, w, l, x, y, z, θ), (h, w, l) indicates the bounding box size, (x, y, z) indicates the center of the bounding box's bottom face, θ represents the bounding box yaw angle, and v = (v_x, v_y, v_z).
In this embodiment, the first counting result β represents the number of times the object corresponding to the object state data has been observed, the second counting result γ represents the number of consecutive times that object has not been observed, and the track identifier uniquely identifies the motion track of the object corresponding to the object state data. It should be noted that different objects correspond to different track identifiers.
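For illustration only, the following minimal sketch (not part of the patent text) shows one way the object state data s described above could be represented in Python; the class and field names are assumptions introduced here.

```python
# Illustrative sketch of the object state s = (s_m, s_c); names are assumptions.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MotionState:                          # s_m
    box: Tuple[float, ...]                  # b = (h, w, l, x, y, z, theta)
    velocity: Tuple[float, float, float]    # v = (v_x, v_y, v_z)

@dataclass
class ControlState:                         # s_c
    track_id: int                           # alpha: unique trajectory identifier
    observed_count: int = 1                 # beta: times the object has been observed
    missed_count: int = 0                   # gamma: consecutive times not observed

@dataclass
class ObjectState:                          # s = (s_m, s_c)
    motion: MotionState
    control: ControlState
```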
S103, predicting the tracking result of the t frame according to the state set of the r frame.
In this embodiment, the position and posture of each object in the t-th frame are predicted according to the object state data of each object in the state set of the r-th frame, that is, the bounding box of each object in the t-th frame is predicted. The predicted bounding box of each object in the t-th frame and the track identifier in the object state data corresponding to that object form the sub-tracking result of the object in the t-th frame, and the sub-tracking results of all objects in the t-th frame constitute the tracking result of the t-th frame. The tracking result is used to represent the position and posture of each object in the real scene in the t-th frame.
Referring to fig. 2, the process of predicting the tracking result of the t-th frame according to the state set of the r-th frame includes:
s201, calculating a bounding box of each object in the t frame based on the object state data of each object in the state set of the r frame.
That is, the bounding box of each object in the t-th frame is calculated according to the object state data of each object included in the state set of the r-th frame; specifically, the bounding box of each object in the t-th frame is predicted based on the bounding box in the object state data of each object included in the state set of the r-th frame.
Specifically, the process of calculating the bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame specifically includes:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
In this embodiment, the first displacement of each object in each dimension is calculated from the motion speed in the object state data of each object included in the state set of the r-th frame and the unit time interval. Specifically, the time interval between the arrival of the sensor data of the t-th frame and the arrival of the sensor data of the r-th frame is calculated, the first displacement of each object in each dimension is calculated based on that time interval and the motion speed in the object state data of each object included in the state set of the r-th frame, and the bounding box of each object in the t-th frame is then calculated from the bounding box in the object state data of each object included in the state set of the r-th frame and the first displacement of each object in each dimension.
The above-mentioned process of calculating the bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame is exemplified as follows:
For example, suppose the state set of the r-th frame includes object state data s corresponding to object A, where s includes a bounding box b = (h, w, l, x, y, z, θ) and a motion speed v = (v_x, v_y, v_z). With a unit time interval of Δt, the first displacement of object A on the x-axis is Δx = v_x(t - r)Δt, the first displacement on the y-axis is Δy = v_y(t - r)Δt, and the first displacement on the z-axis is Δz = v_z(t - r)Δt.
Therefore, the bounding box of object A in the t-th frame is b' = (h, w, l, x + Δx, y + Δy, z + Δz, θ).
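For illustration only, the following sketch (an assumption, not the patent's implementation) expresses the constant-velocity prediction above as code; the function name and the example values are hypothetical.

```python
# Sketch of the first-displacement / bounding-box prediction described above.
def predict_box(box, velocity, r, t, dt):
    """Predict the bounding box of an object at frame t from its state at frame r.

    box      -- (h, w, l, x, y, z, theta) from the state set of frame r
    velocity -- (v_x, v_y, v_z)
    dt       -- unit time interval between two adjacent frames
    """
    h, w, l, x, y, z, theta = box
    vx, vy, vz = velocity
    dx = vx * (t - r) * dt    # first displacement along the x-axis
    dy = vy * (t - r) * dt    # first displacement along the y-axis
    dz = vz * (t - r) * dt    # first displacement along the z-axis
    return (h, w, l, x + dx, y + dy, z + dz, theta)

# Hypothetical example: object A with r = 3, t = 5 and dt = 0.1 s
box_t = predict_box((1.5, 1.6, 4.0, 10.0, 2.0, 0.0, 0.1), (5.0, 0.0, 0.0), 3, 5, 0.1)
```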
S202, aiming at each object, forming a sub-tracking result of the object in the t frame by the bounding box of the object in the t frame and the track identifier in the object state data corresponding to the object in the state set of the r frame.
For each object, the bounding box of the object in the t-th frame and the track identifier in the object state data corresponding to the object in the state set of the r-th frame form a sub-tracking result of the object in the t-th frame, that is, the sub-tracking result of the object in the t-th frame includes the bounding box of the object in the t-th frame and the track identifier corresponding to the object.
And S203, forming the tracking result of the t-th frame by the sub-tracking results.
And forming the tracking result of the t frame by using the sub-tracking results of all the objects in the t frame.
S104, detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame.
And detecting each object included in the sensor data of the t-th frame to obtain a detection surrounding frame of each object included in the sensor data of the t-th frame.
In this embodiment, the process of detecting a plurality of objects included in the sensor data of the t-th frame to obtain a plurality of detection bounding boxes of the t-th frame specifically includes:
and calling a three-dimensional target detector, detecting a plurality of objects contained in the sensor data of the t-th frame, and obtaining a plurality of detection surrounding frames of the t-th frame.
In this embodiment, the sensor data is sent to the three-dimensional target detector, and the three-dimensional target detector detects the sensor data, so that a detection bounding box of each object included in the sensor data fed back by the three-dimensional target detector is obtained.
It should be noted that, in this embodiment, a plurality of three-dimensional target detectors may run simultaneously. After a frame of sensor data is received, the three-dimensional target detector that will detect the sensor data of the current frame is selected in a polling manner, the sensor data of the current frame is sent to the selected three-dimensional target detector, and that detector performs the detection. In this way, any frame of sensor data can be detected without waiting after it reaches a three-dimensional target detector, which improves the detection efficiency of the sensor data.
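For illustration only, a round-robin dispatch over several concurrently running detectors could be sketched as follows; the Detector interface (a detect() method) is an assumption, since the patent does not prescribe a specific detector interface.

```python
# Sketch of the polling (round-robin) dispatch to multiple detectors.
import itertools
from concurrent.futures import ThreadPoolExecutor

class DetectorPool:
    def __init__(self, detectors):
        self._detectors = detectors
        self._next_index = itertools.cycle(range(len(detectors)))
        self._executor = ThreadPoolExecutor(max_workers=len(detectors))

    def submit(self, sensor_data):
        """Pick the next detector in polling order and run detection without waiting."""
        idx = next(self._next_index)
        return self._executor.submit(self._detectors[idx].detect, sensor_data)
```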
S105, when the state set is updated to the t-1 frame, the state set of the t-1 frame is obtained.
In this embodiment, it is determined whether the state set is updated to the t-1 th frame, and if not, the step of determining whether the state set is updated to the t-1 th frame is performed until the state is updated to the t-1 th frame. And when the state set is updated to the t-1 frame, acquiring the state set of the t-1 frame.
In this embodiment, when the state set is updated each time, the old state set is updated to the new state set, and the frame number corresponding to the old state set is updated to the frame number corresponding to the new state set.
In this embodiment, since the state set and the frame number corresponding thereto are stored, whether the state set is updated to the t-1 frame can be determined by determining whether the frame number is t-1.
In this embodiment, this step blocks and waits until the state set has been updated to the t-1-th frame, and only then acquires the state set of the t-1-th frame.
S106, associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame.
In this embodiment, the state set of the t-1-th frame and each detection bounding box of the t-th frame are associated to determine which detection bounding boxes match object state data in the state set of the t-1-th frame and which detection bounding boxes and object state data remain unmatched; the matched and unmatched detection bounding boxes and object state data are then processed to obtain the state set of the t-th frame.
Referring to fig. 3, the process of associating each detection bounding box of the t-th frame with the state set of the t-1 th frame to obtain the state set of the t-th frame specifically includes the following steps:
s301, calculating a predicted surrounding frame of each object in the t frame based on the object state data of each object in the state set of the t-1 frame.
That is, the predicted bounding box of each object in the t-th frame is calculated according to the object state data of each object included in the state set of the t-1-th frame; specifically, the bounding box of each object in the t-th frame is predicted based on the bounding box in the object state data of each object included in the state set of the t-1-th frame.
Specifically, the process of calculating the predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1-th frame specifically includes:
calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
and calculating a predicted surrounding frame of each object in the t frame according to the surrounding frame in the object state data of each object included in the state set of the t-1 frame and the second displacement of each object in each dimension.
In this embodiment, the second displacement of each object in each dimension is calculated according to the motion speed in the object state data of each object included in the state set of the t-1-th frame and the unit time interval, where the unit time interval is the time interval at which the sensor data of two adjacent frames arrive. Specifically, the time interval between the arrival of the sensor data of the t-th frame and the sensor data of the t-1-th frame, namely the unit time interval, is calculated; the second displacement of each object in each dimension is calculated based on the unit time interval and the motion speed in the object state data of each object included in the state set of the t-1-th frame; and the predicted bounding box of each object in the t-th frame is then calculated from the bounding box in the object state data of each object included in the state set of the t-1-th frame and the second displacement of each object in each dimension.
The above process of calculating the predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1-th frame is exemplified as follows:
For example, suppose the state set of the t-1-th frame includes object state data s_1 corresponding to object B, where s_1 comprises a bounding box b_1 = (h, w, l, x, y, z, θ) and a motion speed v_1 = (v_x, v_y, v_z). With a time interval of Δt, the second displacement of object B on the x-axis is Δx_1 = v_xΔt, the second displacement on the y-axis is Δy_1 = v_yΔt, and the second displacement on the z-axis is Δz_1 = v_zΔt. Therefore, the predicted bounding box of object B in the t-th frame is b_1' = (h, w, l, x + Δx_1, y + Δy_1, z + Δz_1, θ).
S302, constructing an affinity matrix based on each prediction bounding box and each detection bounding box of the t-th frame.
In this embodiment, for each detection bounding box of the t-th frame, an intersection ratio of the detection bounding box and each prediction bounding box in a three-dimensional space is calculated; and constructing an affinity matrix based on each intersection ratio obtained by calculation.
The affinity matrix is defined as A_ij = iou3d(b_i^d, b_j^p), where b_i^d ∈ D_t and b_j^p ∈ P_t.
Here, b_i^d denotes the i-th detection bounding box, b_j^p denotes the j-th prediction bounding box, and iou3d(·) denotes the intersection-over-union of two bounding boxes in three-dimensional space.
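For illustration only, the affinity matrix could be built as sketched below. The overlap computation here ignores the yaw angle and treats the boxes as axis-aligned (assuming l extends along x, w along y, and h vertically from the bottom-face center), which is only an approximation of the true three-dimensional intersection-over-union iou3d used by the method.

```python
# Sketch: affinity matrix A[i, j] = overlap(detection i, prediction j).
import numpy as np

def iou3d_axis_aligned(bd, bp):
    hd, wd, ld, xd, yd, zd, _ = bd
    hp, wp, lp, xp, yp, zp, _ = bp
    # overlap along each axis; (x, y, z) is assumed to be the bottom-face center
    ox = max(0.0, min(xd + ld / 2, xp + lp / 2) - max(xd - ld / 2, xp - lp / 2))
    oy = max(0.0, min(yd + wd / 2, yp + wp / 2) - max(yd - wd / 2, yp - wp / 2))
    oz = max(0.0, min(zd + hd, zp + hp) - max(zd, zp))
    inter = ox * oy * oz
    union = hd * wd * ld + hp * wp * lp - inter
    return inter / union if union > 0 else 0.0

def build_affinity(det_boxes, pred_boxes):
    A = np.zeros((len(det_boxes), len(pred_boxes)))
    for i, bd in enumerate(det_boxes):
        for j, bp in enumerate(pred_boxes):
            A[i, j] = iou3d_axis_aligned(bd, bp)
    return A
```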
S303, solving the affinity matrix to obtain each unmatched detection surrounding frame, each unmatched prediction surrounding frame and a plurality of matched pairs.
In this embodiment, the affinity matrix is solved to obtain each unmatched detection enclosure frame, each unmatched prediction enclosure frame, and a plurality of matching pairs, where each matching pair includes one detection enclosure frame and one prediction enclosure frame, and it should be noted that the detection enclosure frame and the prediction enclosure frame in each matching pair are matched with each other.
It should be noted that each detection bounding box has a prediction bounding box matching with it, or has no prediction bounding box matching with it.
In this embodiment, the Hungarian algorithm is adopted to solve the affinity matrix, obtaining each unmatched detection bounding box, each unmatched prediction bounding box, and a plurality of matching pairs.
In this embodiment, the matching problem of the detection bounding box and the previous state set is abstracted to be the maximum bipartite graph matching problem, that is, an affinity matrix is constructed, and the affinity matrix is solved by using the hungarian algorithm, so that an unmatched detection bounding box, an unmatched prediction bounding box and a plurality of matching pairs are obtained.
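For illustration only, the bipartite matching could be solved with SciPy's implementation of the Hungarian algorithm as sketched below; the minimum-overlap threshold iou_min is an assumed hyperparameter, not a value given in the patent.

```python
# Sketch: solve the affinity matrix and split the result into three sets.
from scipy.optimize import linear_sum_assignment

def associate(A, iou_min=0.1):
    """Return (matching pairs, unmatched detection indices, unmatched prediction indices)."""
    rows, cols = linear_sum_assignment(-A)                 # maximize the total overlap
    matches = [(i, j) for i, j in zip(rows, cols) if A[i, j] >= iou_min]
    matched_det = {i for i, _ in matches}
    matched_pred = {j for _, j in matches}
    unmatched_det = [i for i in range(A.shape[0]) if i not in matched_det]
    unmatched_pred = [j for j in range(A.shape[1]) if j not in matched_pred]
    return matches, unmatched_det, unmatched_pred
```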
S304, obtaining a state set of the t frame based on the unmatched detection surrounding frames, the unmatched prediction surrounding frames, the matched pairs and the state set of the t-1 frame.
In this embodiment, the state set of the t-th frame is obtained by processing according to each detection bounding box that is not matched, each prediction bounding box that is not matched, each matching pair, and the previous state set.
In this embodiment, different processing methods are used for processing each detection bounding box that is not matched, each prediction bounding box that is not matched, and each matching pair, so as to obtain a state set of the t-th frame.
Specifically, the process of obtaining the state set of the t-th frame based on the unmatched detection bounding boxes, the unmatched prediction bounding boxes, the matched pairs and the state set of the t-1 th frame includes:
determining respective weights of a detection surrounding frame and a prediction surrounding frame in each matching pair, performing weighted average on the detection surrounding frame and the prediction surrounding frame in each matching pair according to the respective weights of the detection surrounding frame and the prediction surrounding frame in each matching pair to obtain a first surrounding frame, adding one to a first counting result in initial object state data to obtain a new first counting result, and forming the first object state data by using the first surrounding frame, the new first counting result, and a motion speed, a track identifier and a second counting result in the initial object state data; the initial object state data is object state data corresponding to a prediction surrounding frame in a matching pair in a state set of a t-1 th frame, a first counting result is used for representing the observed times of the object, and a second counting result is used for representing the continuous unobserved times of the object;
for the detection surrounding frames which are not matched, distributing track identifiers for the detection surrounding frames, initializing the movement speed and the second counting result corresponding to the detection surrounding frames, assigning the first counting result corresponding to the detection surrounding frames to be 1, and forming second object state data by the detection surrounding frames, the track identifiers corresponding to the detection surrounding frames, the movement speed, the first counting result and the second counting result;
adding one to a second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame aiming at each prediction bounding box which is not matched to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold, forming third object state data by the prediction bounding box, the new second counting result, the motion speed, the track identifier and the first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame;
and forming a state set of the t-th frame by using all the first object state data, the second object state data and the third object state data.
In this embodiment, since the automatic driving scene is highly dynamic, both the observation point and the observed objects move, so a tracked object may disappear from the observation range and a new object may enter the observation range and start to be tracked. This embodiment employs the state machine shown in fig. 4 to manage the appearance and disappearance of objects in the observation range. A threshold N1 is introduced to evaluate whether an object has been observed stably over a period of time: each time the object is observed, its counter is incremented by 1. When the observed count β of the object is less than N1, the object is said to be in an unstable state, and it cannot yet be determined whether the object really exists or is a short-lived false positive caused by prediction errors; when β ≥ N1, the object is in a stable state and can be considered to exist stably within the observation range. A threshold N2 is further introduced to assess whether a previously observed object has disappeared from the observation range: when an observed object is not observed in a certain frame, its unobserved count γ is set to 1, and if the object continues to go unobserved, γ is incremented by 1 for each subsequent frame in which it is not observed; when γ ≥ N2, the object is considered to have completely disappeared from the observation range. Both N1 and N2 are positive integers.
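For illustration only, the life-cycle management above can be sketched with two counters and the thresholds N1 and N2; whether γ is reset when the object is observed again is one interpretation of the text, and the class and state names are illustrative.

```python
# Sketch of the object life-cycle state machine (beta / gamma counters, thresholds N1 / N2).
class TrackLifecycle:
    def __init__(self, n1, n2):
        self.n1, self.n2 = n1, n2
        self.beta, self.gamma = 1, 0      # created the first time the object is observed

    def on_observed(self):
        self.beta += 1
        self.gamma = 0                    # interpretation: a hit clears the miss streak

    def on_missed(self):
        self.gamma += 1

    @property
    def state(self):
        if self.gamma >= self.n2:
            return "disappeared"          # remove the object from the state set
        return "stable" if self.beta >= self.n1 else "unstable"
```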
It should be noted that the unmatched detection bounding box corresponds to a new object entering the observation range. The unmatched predicted bounding box corresponds to the original object that is not currently observed.
In this embodiment, for each matching pair, the weight of the detection bounding box is determined by the uncertainty of the detection bounding box, and the weight of the prediction bounding box is determined by the uncertainty of the prediction bounding box; these weights are computed with a filter, which may optionally be a Kalman filter. Specifically, the detection bounding box and the prediction bounding box in the matching pair are input into the Kalman filter to obtain their respective weights, and the detection bounding box and the prediction bounding box are then weighted-averaged according to those weights to obtain the first bounding box corresponding to the matching pair. In this embodiment, the update of the control state data corresponding to the matching pair corresponds to the directed edge in fig. 5 whose transition condition is "observed", that is, the object is observed: the first counting result in the initial object state data is incremented by one to obtain a new first counting result; the new first counting result, the track identifier and the second counting result contained in the initial object state data form the first control state data; the first bounding box and the motion speed in the initial object state data form the first motion state data; and the first control state data and the first motion state data form the first object state data. The first object state data is the object state data obtained by updating the initial object state data, and the initial object state data is the object state data corresponding to the prediction bounding box of the matching pair in the state set of the t-1-th frame.
In this embodiment, for each unmatched detection bounding box, new object state data needs to be created in the state set of the t-th frame. Specifically, a track identifier is assigned to the detection bounding box, and the motion speed v and the second counting result γ are initialized: the motion speed may be initialized to 0, and the initialization of the second counting result corresponds to the directed edge from the "start" state to the "unstable" state in fig. 4, that is, the second counting result γ may be initialized to 0; the first counting result β is assigned the value 1. The detection bounding box and the motion speed form the second motion state data, the track identifier, the first counting result and the second counting result form the second control state data, and the second motion state data and the second control state data form the second object state data corresponding to the detection bounding box; the second object state data is the newly created object state data.
In this embodiment, for each unmatched prediction bounding box, only the prediction bounding box can be trusted because of the lack of an observation, and the update of the control state data corresponds to the directed edge in fig. 4 whose transfer condition is "not observed". Specifically, the second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1-th frame is incremented by one to obtain a new second counting result, and it is judged whether the new second counting result is smaller than the disappearance threshold, which corresponds to N2 in fig. 4. If the new second counting result is not smaller than the disappearance threshold, the object is considered to have completely disappeared from the observation range and no new object state data needs to be computed for it. If the new second counting result is smaller than the disappearance threshold, the prediction bounding box and the motion speed in the object state data corresponding to the prediction bounding box in the state set of the t-1-th frame form the third motion state data, and the new second counting result together with the track identifier and the first counting result in that object state data form the third control state data; the third motion state data and the third control state data form the third object state data, which is the object state data obtained by updating the object state data corresponding to the prediction bounding box in the state set of the t-1-th frame.
In this embodiment, all the first object state data, all the second object state data, and all the third object state data form a state set of the t-th frame.
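For illustration only, the three update branches above could be condensed as sketched below. The element-wise weighted average stands in for the Kalman-filter fusion, the fixed kalman_gain and VANISH_THRESHOLD values are assumptions, and prev_states is assumed to be ordered so that index j matches the j-th prediction bounding box.

```python
# Sketch of the state-update step: matched pairs, new detections, unmatched predictions.
import itertools

VANISH_THRESHOLD = 3                      # N2 in the text (assumed value)
_track_ids = itertools.count()

def fuse_boxes(det_box, pred_box, gain):
    # stand-in for the Kalman update: element-wise weighted average of the two boxes
    return tuple(gain * d + (1.0 - gain) * p for d, p in zip(det_box, pred_box))

def update_state_set(matches, unmatched_det, unmatched_pred,
                     det_boxes, pred_boxes, prev_states, kalman_gain=0.6):
    new_states = []
    # 1) matched pairs: fuse detection and prediction, increment beta
    for i, j in matches:
        s = dict(prev_states[j])
        s["box"] = fuse_boxes(det_boxes[i], pred_boxes[j], kalman_gain)
        s["beta"] += 1
        new_states.append(s)
    # 2) unmatched detections: create new object states (beta = 1, velocity = 0)
    for i in unmatched_det:
        new_states.append({"box": det_boxes[i], "velocity": (0.0, 0.0, 0.0),
                           "track_id": next(_track_ids), "beta": 1, "gamma": 0})
    # 3) unmatched predictions: trust the prediction, increment gamma, drop if vanished
    for j in unmatched_pred:
        s = dict(prev_states[j])
        s["box"] = pred_boxes[j]
        s["gamma"] += 1
        if s["gamma"] < VANISH_THRESHOLD:
            new_states.append(s)
    return new_states
```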
S107, updating the state set of the t-1 th frame into the state set of the t-th frame.
In this embodiment, the state set is updated, and the state set of the t-1 th frame is updated to the state set of the t-th frame.
According to the end-to-end real-time three-dimensional multi-target tracking method for automatic driving provided above, during automatic driving, when the sensor data of the t-th frame is received, the following are executed in parallel: acquiring the state set of the r-th frame and predicting the tracking result of the t-th frame according to it, and detecting the plurality of objects contained in the sensor data of the t-th frame to obtain the plurality of detection bounding boxes of the t-th frame; when the state set has been updated to the t-1-th frame, each detection bounding box of the t-th frame is associated with the state set of the t-1-th frame to obtain the state set of the t-th frame, and the state set of the t-1-th frame is updated to the state set of the t-th frame. By applying this method, when the sensor data of the t-th frame is received, the tracking result of the t-th frame is predicted directly based on the state set of the r-th frame, without blocking and waiting for the state set to be updated to the state set of the t-th frame before predicting the tracking result. Three-dimensional multi-target tracking is thus realized and its efficiency improved, while the state set is still updated based on the sensor data of the t-th frame, which ensures the accuracy of the three-dimensional multi-target tracking.
Referring to fig. 5, a specific implementation process of the end-to-end real-time three-dimensional multi-target tracking method for automatic driving is illustrated as follows:
and (3) state definition:
given an object o, its state s is represented by a motion state smAnd a control state scThe two parts are as follows: smIncluding the bounding box b ═ (h, w, l, x, y, z, θ) and the motion velocity v ═ v (v, vx,vy,vz) Wherein (h, w, l) is the size of the bounding box, (x, y, z) is the central point of the bottom surface of the bounding box, and theta is the yaw angle of the bounding box; scIncluding the trajectory identifier alpha, the counter beta where the object o is observed and the counter gamma where it is not observed. The states of all tracked objects in the t-th frame constitute a set St={si|i=1,...,ntIn which n istRepresenting the number of tracked objects in the t-th frame.
Specifically, after the sensor data I_t of the t-th frame is received, the fast prediction module and the slow update module are started and executed in parallel.
The slow detection module first calls the three-dimensional target detector on I_t to obtain the object bounding box set D_t; the slow tracking module then associates D_t with the previous state set S_{t-1} and updates it to obtain the t-th frame state set S_t. The fast prediction module directly predicts the bounding boxes of the tracked objects in the t-th frame according to the current state set S_r and outputs them, together with the corresponding trajectory identifiers α, as the final result, where r ≤ t because the state update rate may be slower than the data arrival rate.
Wherein, for the fast prediction module:
since the update rate of the state set by the slow tracking module is likely to be slower than the data arrival rate, when the fast prediction task of the t-th frame is received, the globally shared state set, i.e., the pool, may be updated only to the r-th frame (r)<t), is denoted as Sr
At this time, the set of wait states S is blocked differently from the conventional detection tracking paradigmrIs updated to StThen, the fast prediction module is not blocked but outputs the tracking result of the t-th framerApplying constant speed model to predict the bounding box of the object in the t-th frame and quickly giving the t-th frameAnd tracking the result.
Specifically, assume the state s ∈ S_r of a certain object and that the arrival time interval of each frame of data is fixed at Δt. According to the constant-velocity model, the displacement of the object from the r-th frame to the t-th frame in the observation coordinate system is estimated as:
Δx = v_x(t - r)Δt, Δy = v_y(t - r)Δt, Δz = v_z(t - r)Δt (1)
The bounding box of the object in the t-th frame is then given as b' = (h, w, l, x + Δx, y + Δy, z + Δz, θ).
For the slow detection module:
and in the target detection step, a three-dimensional target detector is called to obtain an object bounding box set in the input data. The specific implementation of the detection step is not limited, and the detection step can be integrated into a ThunderMOT system as long as the definition of the input and output interfaces of the detection step is met. The thunderMOT can flexibly access different detectors according to scenes, and hot plug of the detection model is achieved.
For the slow tracking module:
the slow tracking module comprises a data association step and a state updating step.
The data association step specifically comprises:
the bounding box detected by the slow detection module is collected DtAnd state set St-1And (6) matching. It should be noted that this step requires blocking the wait for the slow trace module to update to the state set St-1And then can continue to execute.
Specifically, matching the bounding box set D_t of the t-th frame against the state set S_{t-1} of the t-1-th frame is abstracted as a maximum bipartite graph matching problem and solved with the Hungarian algorithm, where the affinity matrix A (of size m_t × n_{t-1}) is defined as formula (2), m_t is the size of the bounding box set D_t, and n_{t-1} is the size of the state set S_{t-1}. P_t is the t-th frame bounding box set predicted according to formula (1), and the function iou3d(·) denotes the intersection-over-union of two bounding boxes in three-dimensional space.
A_ij = iou3d(b_i^d, b_j^p), where b_i^d ∈ D_t, b_j^p ∈ P_t (2)
The output of the data association step is three sets: the matching set M_t, the set of unmatched bounding boxes D_t', and the unmatched state set S_{t-1}'.
And a state updating step:
for each matching tuple (b)t d,bt p,st-1)∈Mt,st-1、bt d、bt pRespectively representing the state of an object o in the t-1 th frame and the observation surrounding frame and the prediction surrounding frame of the object corresponding to the t-th frame. Observation enclosure frame bt dAs state st-1Calling a Kalman filter to carry out state estimation according to the observation result of the object in the t-th frame to obtain the motion state s of the object in the t-th framet m. According to Bayes rule, updated motion state st mIs bt dAnd bt pWeighted average of the state space, the weight (i.e. Kalman gain) is represented by bt dAnd bt pIs determined. Control state from
Figure BDA0003035139920000171
To
Figure BDA0003035139920000172
The update of (b) corresponds to the directed edge in fig. 5 with the transfer condition "observed".
Bounding boxes for each unmatched detection
Figure BDA0003035139920000173
Set of states S at the t-th frametTo create a new object state st
Figure BDA0003035139920000174
B in (1) is initialized to
Figure BDA0003035139920000175
V in (1) is initialized to 0. The control state initialization corresponds to a directed edge from the "start" state to the "unstable" state in fig. 5.
For each unmatched tracked object state
Figure BDA0003035139920000176
Due to lack of observation results, only the bounding box predicted according to formula (1) can be completely trusted
Figure BDA0003035139920000177
Is directly to
Figure BDA0003035139920000178
B in (1) is set as
Figure BDA0003035139920000179
Thereby obtaining the motion state of the t-th frame
Figure BDA00030351399200001710
Control state from
Figure BDA00030351399200001711
To
Figure BDA00030351399200001712
The update of (b) corresponds to a directed edge in fig. 5 for which the transfer condition is "not observed".
In this embodiment, apart from avoiding read-write conflicts on the state set, the slow tracking module and the fast prediction module have no explicit synchronization, so the execution time of the fast prediction module is the response delay of each frame of data. The fast prediction module is implemented based on motion-model prediction, and compared with the traditional detect-then-track approach, the motion-prediction cost per object is very small.
In this embodiment, in order to facilitate plugging multiple deep-learning-based 3D target detection models that depend on different software environments (e.g., different Python versions, different deep learning frameworks, and different versions of the same framework) into the system, slow target detection is implemented as a local server, with HTTP chosen as the application-layer communication protocol. When each frame of data arrives, the fast prediction task and the slow tracking task are submitted to the thread pool to be executed in parallel. The slow tracking module, acting as a client, sends a request to the detection server, and after obtaining the detection result of the frame, it calls the associate() and update() methods in sequence to associate and update the object states.
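For illustration only, the per-frame dispatch could be sketched as below. The server URL, the JSON payload and the state-pool methods latest()/wait_for()/publish() are hypothetical names introduced here; fast_predict and associate_and_update are passed in as the prediction and tracking callbacks.

```python
# Sketch: submit the fast prediction task and the slow tracking task for each frame.
import requests
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
DETECT_URL = "http://localhost:8080/detect"       # assumed local detection server

def on_frame(t, data_path, state_pool, fast_predict, associate_and_update):
    """Dispatch both tasks for frame t; the returned future holds the tracking output."""
    def slow_track():
        # the detection server reads the raw sensor data from the shared file system
        boxes = requests.post(DETECT_URL, json={"path": data_path}).json()["boxes"]
        prev_states = state_pool.wait_for(t - 1)  # block until the frame t-1 state set
        state_pool.publish(t, associate_and_update(boxes, prev_states))

    def fast_task():
        states, r = state_pool.latest()           # may lag behind: r <= t
        return fast_predict(states, r, t)         # constant-velocity prediction

    pool.submit(slow_track)
    return pool.submit(fast_task)
```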
In this embodiment, the object state is shared by two types of tasks, the fast prediction task and the slow tracking task: the fast prediction task calls the object state's prediction method prediction(), and the slow tracking task calls its motion-state update method update(). ThunderMOT ensures consistency of the motion state under concurrency by introducing a read-copy-update (RCU) lock at the object level, while ensuring that fast prediction tasks do not time out because they are blocked by a slow tracking task's write to the object state. Under this mechanism, the fast prediction task acts as a reader and can access the motion state of an object without acquiring any lock, while the slow update task acts as a writer: it first copies the state, then modifies the copy, and finally, at an appropriate moment, changes the pointer to the historical state so that it points to the updated state. The prediction and update of different objects are packaged as independent tasks submitted to the thread pool, and they do not affect one another.
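For illustration only, the object-level read-copy-update idea can be sketched as below: readers dereference the current state without locking, while the writer modifies a copy and then publishes it by swapping the reference (atomic under CPython's reference assignment). The class is an assumption, not the system's actual implementation.

```python
# Sketch of a per-object read-copy-update cell.
import copy

class RcuCell:
    def __init__(self, state):
        self._state = state

    def read(self):
        # reader (fast prediction task): no lock; may briefly see the previous version
        return self._state

    def update(self, mutate):
        # writer (slow tracking task): copy, modify the copy, then swap the reference
        new_state = copy.deepcopy(self._state)
        mutate(new_state)
        self._state = new_state
```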
In this embodiment, the ThunderMOT system, the detection server, the tracking server, and the data sensor share a file system. Raw sensor data is passed among the components by passing a storage path in the file system rather than the data itself, which avoids the large communication overhead of explicitly transferring the sensor data as a byte stream.
In order to better illustrate the effect of the end-to-end real-time three-dimensional multi-target tracking method for automatic driving provided by the embodiment of the application, the inventors evaluated the ThunderMOT system experimentally in terms of both tracking speed and tracking precision.
The experimental environment of the application is a server configured with 2 Intel Xeon E5-2690 v3 CPUs (each with 12 physical cores and hyper-threading enabled), 4 GeForce RTX 2080 Ti GPUs (each with 4352 cores and 12 GB of video memory), and 256 GB of memory. Server software configuration: the operating system is Ubuntu 18.04, the Python version is 3.7.7, and the CUDA version is 10.2. The KITTI multi-target tracking dataset is used for the tracking accuracy evaluation.
The evaluation results show that, on the KITTI multi-object tracking dataset, the average delay is 2.0 milliseconds, the worst-case delay is 8.6 milliseconds, and the multi-object tracking accuracy MOTA (Multiple Object Tracking Accuracy) reaches 83.71 percent, demonstrating excellent tracking speed and tracking accuracy.
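For reference, MOTA on such benchmarks is conventionally computed as MOTA = 1 - Σ_t(FN_t + FP_t + IDSW_t) / Σ_t GT_t, where FN_t, FP_t, IDSW_t and GT_t are respectively the false negatives, false positives, identity switches and ground-truth objects in frame t; higher values are better, and all three error types are penalized.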
Corresponding to the method shown in fig. 1, an embodiment of the present application further provides an end-to-end real-time three-dimensional multi-target tracking apparatus for automatic driving, which is used to specifically implement the method shown in fig. 1; a schematic structural diagram of the apparatus is shown in fig. 6, and the apparatus specifically includes:
an execution unit 601 for executing the prediction step and the state update step in parallel when the sensor data of the t-th frame is received during the automatic driving; wherein t is a positive integer;
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
With the end-to-end real-time three-dimensional multi-target tracking apparatus for automatic driving described above, when the sensor data of the t-th frame is received, the tracking result of the t-th frame is predicted directly from the state set of the r-th frame, without blocking to wait for the state set to be updated to the state set of the t-th frame; three-dimensional multi-target tracking is thereby achieved and its efficiency improved, while the state set is still updated based on the sensor data of the t-th frame, which guarantees the accuracy of three-dimensional multi-target tracking.
In an embodiment of the application, based on the foregoing solution, the execution unit 601 is configured to predict the tracking result of the t-th frame according to the state set of the r-th frame, and includes the execution unit 601 specifically configured to:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
In an embodiment of the application, based on the foregoing solution, the executing unit 601 is configured to calculate a bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame, and includes the executing unit 601 specifically configured to:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
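A minimal sketch of the prediction step described in the two preceding paragraphs follows; the bounding-box parameterization (x, y, z, l, w, h, yaw), the per-frame velocity (vx, vy, vz) and the constant-velocity motion model are assumptions of this sketch, not fixed by the text above.

```python
# Sketch of the fast prediction step: constant-velocity displacement applied to
# the bounding box, paired with the object's track identifier. The box layout
# and attribute names are illustrative assumptions.
def predict_bbox(bbox, velocity, n_frames):
    x, y, z, l, w, h, yaw = bbox
    vx, vy, vz = velocity
    # First displacement of the object in each dimension over n_frames frames.
    dx, dy, dz = vx * n_frames, vy * n_frames, vz * n_frames
    return (x + dx, y + dy, z + dz, l, w, h, yaw)

def predict_tracking_result(state_set_r, t, r):
    # Each sub-tracking result pairs the predicted box with the track identifier.
    return [(predict_bbox(s.bbox, s.velocity, t - r), s.track_id)
            for s in state_set_r]
```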
In an embodiment of the present application, based on the foregoing scheme, the executing unit 601 is configured to associate each detection enclosure frame of the t-th frame with the state set of the t-1 th frame to obtain the state set of the t-th frame, and includes the executing unit 601 specifically configured to:
calculating a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame;
constructing an affinity matrix based on each predicted bounding box and each detected bounding box of the t-th frame;
solving the affinity matrix to obtain each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched and a plurality of matching pairs; each matching pair comprises a detection bounding box and a prediction bounding box;
and obtaining the state set of the t frame based on each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched, each matching pair and the state set of the t-1 frame.
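A minimal sketch of this association step is given below; the affinity measure (3D center distance expressed as a cost), the gating threshold, and the use of the Hungarian solver from scipy are assumptions, since the text above does not fix them.

```python
# Sketch of associating predicted boxes with detected boxes; metric, threshold
# and solver choice are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pred_boxes, det_boxes, max_dist=2.0):
    if len(pred_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(det_boxes))), list(range(len(pred_boxes)))
    # Cost matrix: Euclidean distance between predicted and detected box centers.
    cost = np.linalg.norm(
        np.asarray(pred_boxes)[:, None, :3] - np.asarray(det_boxes)[None, :, :3],
        axis=-1)
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_preds = {r for r, _ in matches}
    matched_dets = {c for _, c in matches}
    unmatched_dets = [c for c in range(len(det_boxes)) if c not in matched_dets]
    unmatched_preds = [r for r in range(len(pred_boxes)) if r not in matched_preds]
    return matches, unmatched_dets, unmatched_preds
```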
In an embodiment of the present application, based on the foregoing scheme, the executing unit 601 is configured to obtain the state set of the t-th frame based on each detection bounding box that is not matched, each prediction bounding box that is not matched, each matching pair, and the state set of the t-1 th frame, and includes the executing unit 601 specifically configured to:
determining respective weights of the detection surrounding frame and the prediction surrounding frame in each matching pair, performing a weighted average of the detection surrounding frame and the prediction surrounding frame in the matching pair according to these weights to obtain a first surrounding frame, adding one to a first counting result in initial object state data to obtain a new first counting result, and combining the first surrounding frame, the new first counting result, and the motion speed, the track identifier and the second counting result in the initial object state data to form first object state data; the initial object state data is the object state data corresponding to the prediction surrounding frame in the matching pair in the state set of the t-1-th frame, the first counting result is used for representing the number of times the object has been observed, and the second counting result is used for representing the number of consecutive times the object has not been observed;
for a detection surrounding frame which is not matched, distributing a track identifier for the detection surrounding frame, initializing the movement speed and a second counting result corresponding to the detection surrounding frame, assigning a first counting result corresponding to the detection surrounding frame to be 1, and forming second object state data by the detection surrounding frame, the track identifier corresponding to the detection surrounding frame, the movement speed, the first counting result and the second counting result;
adding one to a second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame aiming at each prediction bounding box which is not matched to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold, forming third object state data by using the prediction bounding box, the new second counting result, the motion speed, the track identifier and the first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame;
and forming a state set of the t-th frame by using all the first object state data, the second object state data and the third object state data.
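The three update rules above could be sketched as follows; the fixed fusion weights, the vanish-threshold value, and the data layout are illustrative assumptions of this sketch.

```python
# Sketch of building the frame-t state set from matches, unmatched detections
# and unmatched predictions; weights, threshold and layout are assumptions.
from dataclasses import dataclass, replace
from itertools import count

_next_track_id = count(1)

@dataclass
class ObjState:
    bbox: tuple      # (x, y, z, l, w, h, yaw)
    velocity: tuple  # (vx, vy, vz)
    track_id: int
    hits: int        # first count: number of frames the object was observed
    misses: int      # second count: consecutive frames the object was unobserved

def build_state_set(prev_states, pred_boxes, det_boxes, matches,
                    unmatched_dets, unmatched_preds,
                    w_det=0.6, w_pred=0.4, vanish_thresh=3):
    new_states = []
    # Matched pairs: fuse detection and prediction by weighted average
    # (angle averaging is simplified here) and increment the observed count.
    for p, d in matches:
        fused = tuple(w_det * dv + w_pred * pv
                      for dv, pv in zip(det_boxes[d], pred_boxes[p]))
        s = prev_states[p]
        new_states.append(replace(s, bbox=fused, hits=s.hits + 1))
    # Unmatched detections: start new tracks with initialized velocity and counts.
    for d in unmatched_dets:
        new_states.append(ObjState(det_boxes[d], (0.0, 0.0, 0.0),
                                   next(_next_track_id), hits=1, misses=0))
    # Unmatched predictions: keep coasting tracks until the vanish threshold.
    for p in unmatched_preds:
        s = prev_states[p]
        if s.misses + 1 < vanish_thresh:
            new_states.append(replace(s, bbox=pred_boxes[p], misses=s.misses + 1))
    return new_states
```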
In an embodiment of the application, based on the foregoing solution, the execution unit 601 is configured to calculate a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1-th frame, and includes the execution unit specifically configured to:
calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
and calculating a predicted surrounding frame of each object in the t frame according to the surrounding frame in the object state data of each object included in the state set of the t-1 frame and the second displacement of each object in each dimension.
In an embodiment of the application, based on the foregoing scheme, the execution unit is configured to detect a plurality of objects included in the sensor data of the t-th frame, and obtain a plurality of detection bounding boxes of the t-th frame, and the execution unit is specifically configured to:
and calling a three-dimensional target detector, detecting a plurality of objects contained in the sensor data of the t-th frame, and obtaining a plurality of detection surrounding frames of the t-th frame.
An embodiment of the present application further provides a storage medium comprising stored instructions, wherein, when the instructions are run, a device on which the storage medium is located is controlled to execute the end-to-end real-time three-dimensional multi-target tracking method for automatic driving described above.
An electronic device is provided in an embodiment of the present application, and its structural schematic diagram is shown in fig. 7, which specifically includes a memory 701 and one or more instructions 702, where the one or more instructions 702 are stored in the memory 701, and are configured to be executed by one or more processors 703 to perform the following operations according to the one or more instructions 702:
in the automatic driving process, when the sensor data of the t-th frame is received, the predicting step and the state updating step are executed in parallel; wherein, t is a positive integer,
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The end-to-end real-time three-dimensional multi-target tracking method and apparatus for automatic driving provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the embodiments is only intended to help in understanding the method and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An end-to-end real-time three-dimensional multi-target tracking method for automatic driving is characterized by comprising the following steps:
in the automatic driving process, when the sensor data of the t-th frame is received, the predicting step and the state updating step are executed in parallel; wherein, t is a positive integer,
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
2. The method according to claim 1, wherein said predicting the tracking result of the t frame according to the state set of the r frame comprises:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
3. The method according to claim 2, wherein the calculating the bounding box of each object at the t-th frame based on the object state data of each object included in the state set of the r-th frame comprises:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
4. The method of claim 1, wherein associating each detection bounding box of the tth frame with the state set of the t-1 frame to obtain the state set of the tth frame comprises:
calculating a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame;
constructing an affinity matrix based on each predicted bounding box and each detected bounding box of the t-th frame;
solving the affinity matrix to obtain each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched and a plurality of matching pairs; each matching pair comprises a detection bounding box and a prediction bounding box;
and obtaining the state set of the t frame based on each detection surrounding frame which is not matched, each prediction surrounding frame which is not matched, each matching pair and the state set of the t-1 frame.
5. The method of claim 4, wherein the deriving the state set of the t-th frame based on the unmatched detection bounding boxes, the unmatched prediction bounding boxes, the matched pairs, and the state set of the t-1-th frame comprises:
determining respective weights of a detection surrounding frame and a prediction surrounding frame in each matching pair, performing weighted average on the detection surrounding frame and the prediction surrounding frame in each matching pair according to the respective weights of the detection surrounding frame and the prediction surrounding frame in each matching pair to obtain a first surrounding frame, adding one to a first counting result in initial object state data to obtain a new first counting result, and combining the first surrounding frame, the new first counting result, and a motion speed, a track identifier and a second counting result in the initial object state data to form first object state data; the initial object state data is object state data corresponding to a prediction surrounding frame in the matching pair in a state set of a t-1 th frame, the first counting result is used for representing the observed times of the object, and the second counting result is used for representing the continuous unobserved times of the object;
for a detection surrounding frame which is not matched, distributing a track identifier for the detection surrounding frame, initializing the movement speed and a second counting result corresponding to the detection surrounding frame, assigning a first counting result corresponding to the detection surrounding frame to be 1, and forming second object state data by the detection surrounding frame, the track identifier corresponding to the detection surrounding frame, the movement speed, the first counting result and the second counting result;
adding one to a second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame aiming at each prediction bounding box which is not matched to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold, forming third object state data by using the prediction bounding box, the new second counting result, the motion speed, the track identifier and the first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 frame;
and forming a state set of the t-th frame by using all the first object state data, the second object state data and the third object state data.
6. The method according to claim 4, wherein calculating the predicted bounding box of each object at the t-th frame based on the object state data of each object included in the state set of the t-1-th frame comprises:
calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
and calculating a predicted surrounding frame of each object in the t frame according to the surrounding frame in the object state data of each object included in the state set of the t-1 frame and the second displacement of each object in each dimension.
7. The method of claim 1, wherein detecting the plurality of objects contained in the sensor data of the tth frame, resulting in a plurality of detection bounding boxes of the tth frame, comprises:
and calling a three-dimensional target detector, detecting a plurality of objects included in the sensor data of the t-th frame, and obtaining a plurality of detection surrounding frames of the t-th frame.
8. An end-to-end real-time three-dimensional multi-target tracking device for automatic driving is characterized by comprising:
an execution unit for executing the prediction step and the state update step in parallel when the sensor data of the t-th frame is received during the automatic driving; wherein t is a positive integer;
the predicting step includes:
acquiring a state set of an r frame; wherein r is smaller than t, the state set of the r-th frame is a set obtained by latest updating, the state set of the r-th frame is obtained by updating the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and the posture of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection surrounding frames of the t-th frame, and acquiring a state set of the t-1-th frame when the state set is updated to the t-1-th frame;
associating each detection surrounding frame of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame into the state set of the t-th frame.
9. The apparatus as claimed in claim 8, wherein the execution unit is configured to predict the tracking result of the t frame according to the state set of the r frame, and the execution unit is specifically configured to:
calculating a bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame;
for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in the state set of the r frame;
and forming the tracking result of the t-th frame by using each sub-tracking result.
10. The apparatus according to claim 9, wherein the execution unit is configured to calculate a bounding box of each object at a t-th frame based on the object state data of each object included in the state set of the r-th frame, and the execution unit is specifically configured to:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
and calculating the bounding box of each object in the t frame according to the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
CN202110441246.2A 2021-04-23 2021-04-23 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving Pending CN113096156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110441246.2A CN113096156A (en) 2021-04-23 2021-04-23 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110441246.2A CN113096156A (en) 2021-04-23 2021-04-23 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

Publications (1)

Publication Number Publication Date
CN113096156A true CN113096156A (en) 2021-07-09

Family

ID=76679720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110441246.2A Pending CN113096156A (en) 2021-04-23 2021-04-23 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving

Country Status (1)

Country Link
CN (1) CN113096156A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185503A1 (en) * 2016-04-29 2017-11-02 高鹏 Target tracking method and apparatus
CN106709938A (en) * 2016-11-18 2017-05-24 电子科技大学 Multi-target tracking method based on improved TLD (tracking-learning-detected)
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN111932580A (en) * 2020-07-03 2020-11-13 江苏大学 Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘玉杰; 窦长红; 赵其鲁; 李宗民: "Online Multi-Object Tracking Based on State Prediction and Motion Structure", Journal of Computer-Aided Design & Computer Graphics, no. 02 *
郑晓萌; 张德海: "Moving Target Matching and Tracking Algorithm Based on Effective Feature Points", Electronic Design Engineering, no. 20 *

Similar Documents

Publication Publication Date Title
US20210190497A1 (en) Simultaneous location and mapping (slam) using dual event cameras
US11024041B2 (en) Depth and motion estimations in machine learning environments
US20210090327A1 (en) Neural network processing for multi-object 3d modeling
CN108805898B (en) Video image processing method and device
Collins et al. Infinitesimal plane-based pose estimation
WO2020232174A1 (en) Distributed pose estimation
US20190371024A1 (en) Methods and Systems For Exploiting Per-Pixel Motion Conflicts to Extract Primary and Secondary Motions in Augmented Reality Systems
Xu et al. FollowUpAR: Enabling follow-up effects in mobile AR applications
CN111127584A (en) Method and device for establishing visual map, electronic equipment and storage medium
CN112540609A (en) Path planning method and device, terminal equipment and storage medium
JP2020052484A (en) Object recognition camera system, relearning system, and object recognition program
CN115147683A (en) Pose estimation network model training method, pose estimation method and device
US20200164508A1 (en) System and Method for Probabilistic Multi-Robot Positioning
CN113091736A (en) Robot positioning method, device, robot and storage medium
CN113096156A (en) End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving
CN117437348A (en) Computing device and model generation method
WO2020103495A1 (en) Exposure duration adjustment method and device, electronic apparatus, and storage medium
Ayush et al. Real time visual SLAM using cloud computing
Nicolai et al. Model-free (human) tracking based on ground truth with time delay: A 3D camera based approach for minimizing tracking latency and increasing tracking quality
US20180001821A1 (en) Environment perception using a surrounding monitoring system
US11921824B1 (en) Sensor data fusion using cross-modal transformer
CN109410304A (en) A kind of determining method, device and equipment of projection
CN115937383B (en) Method, device, electronic equipment and storage medium for rendering image
US10636205B2 (en) Systems and methods for outlier edge rejection
US20230154026A1 (en) Method and apparatus with pose estimation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination