CN114549578A - Target tracking method, device and storage medium

Info

Publication number
CN114549578A
CN114549578A
Authority
CN
China
Prior art keywords
image
target
detection
tracking
frame
Prior art date
Legal status
Pending
Application number
CN202111308752.0A
Other languages
Chinese (zh)
Inventor
梁浩
武鹏
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202111308752.0A
Priority to PCT/CN2022/090574 (WO2023077754A1)
Publication of CN114549578A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a target tracking method, apparatus, and storage medium, the method comprising: acquiring an image acquisition sequence; performing object detection on a first image in the image acquisition sequence to obtain a 3D detection frame of an object on the first image in a target three-dimensional space and a 2D detection frame on the first image, wherein the first image is any image in the image acquisition sequence other than the initial image of the sequence; predicting a 3D prediction frame of a target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object with respect to a second image, wherein the second image is the previous image of the first image in the image acquisition sequence; and determining a tracking result of tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame. By the target tracking method, the tracking accuracy of the target object can be improved.

Description

Target tracking method, device and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a target tracking method, an apparatus, and a storage medium.
Background
Target tracking means that, given an image sequence, targets are found in the images of the sequence, the same target is identified across different frames, and the same ID is assigned to that target in the different frames. In the related art, target tracking is generally performed based on 2D information; however, it is difficult to perform accurate motion estimation on a target using only 2D information, which leads to tracking errors.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a target tracking method, apparatus, and storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a target tracking method, including:
acquiring an image acquisition sequence, wherein the image acquisition sequence is obtained according to acquired images of image acquisition equipment at a plurality of acquisition moments;
performing object detection on a first image in the image acquisition sequence to obtain a 3D detection frame of an object on the first image in a target three-dimensional space and a 2D detection frame on the first image, wherein the first image is any image in the image acquisition sequence other than the initial image of the sequence;
predicting a 3D prediction frame of a target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for a second image, wherein the second image is a previous image of the first image in the image acquisition sequence;
determining a tracking result for tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
In some embodiments, the determining, from the 3D detection block, the 2D detection block, the 3D prediction block, and the 2D prediction block, a tracking result of tracking the target object for the first image comprises:
determining a target 3D detection frame matched with the 3D prediction frame from each 3D detection frame according to a first intersection ratio and/or a distance value between the 3D prediction frame and each 3D detection frame;
and taking an object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, and taking the position information of the target 3D detection frame in the target three-dimensional space as the 3D tracking position information of the target object, wherein the tracking result comprises the 3D tracking position information.
In some embodiments, the determining, according to the 3D detection block, the 2D detection block, the 3D prediction block, and the 2D prediction block, a tracking result of tracking the target object for the first image further includes:
for a 3D detection frame which is not matched with the 3D prediction frame in each 3D detection frame, determining a 2D detection frame of the same object corresponding to the 3D detection frame, and determining a target 2D detection frame matched with the 2D prediction frame according to a second intersection ratio between the 2D detection frame and the 2D prediction frame;
and taking an object corresponding to the target 2D detection frame on the first image as the target object, and taking position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, wherein the tracking result comprises the 2D tracking position information.
In some embodiments, the tracking result includes 3D tracking position information of the target object corresponding to the second image, 2D tracking position information, and motion data of the target object;
predicting a 3D prediction frame of a target object in the three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for the second image, including:
updating a tracker according to the motion data;
and inputting the 3D tracking position information and the 2D tracking position information into an updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
In some embodiments, the motion data includes a rate of change of position of the target object on the image, and a velocity and an acceleration of the target object in the target three-dimensional space;
the tracker is capable of outputting the 2D prediction frame based on the position change rate and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration, and the 3D tracking position information.
In some embodiments, the first image comprises a plurality of captured images, the plurality of captured images being images captured by a plurality of image capture devices at a same capture time;
the object detection of the first image in the image acquisition sequence to obtain a 3D detection frame of the object on the first image in the target three-dimensional space and a 2D detection frame on the first image includes:
carrying out object detection on the plurality of collected images to obtain a 3D detection frame of an object on each collected image in a three-dimensional space of each image collecting device and a 2D detection frame on each collected image;
mapping 3D detection frames positioned in three-dimensional spaces of different image acquisition devices to the same target coordinate system according to the external parameters of each image acquisition device, wherein the target three-dimensional space is a space limited by the target coordinate system;
and splicing the plurality of collected images to obtain a spliced image, mapping the 2D detection frames on the plurality of collected images to the spliced image, wherein the 2D detection frame on the first image is the 2D detection frame on the spliced image.
In some embodiments, before the determining a tracking result of tracking the target object for the first image according to the 3D detection block, the 2D detection block, the 3D prediction block, and the 2D prediction block, the method further comprises:
performing non-maximum suppression processing on the 3D detection box, the 2D detection box, the 3D prediction box, and the 2D prediction box.
According to a second aspect of the embodiments of the present disclosure, there is provided a target tracking apparatus including:
an acquisition module configured to acquire an image acquisition sequence, the image acquisition sequence being derived from acquired images of an image acquisition device at a plurality of acquisition moments;
the detection module is configured to perform object detection on a first image in the image acquisition sequence to obtain a 3D detection frame of an object on the first image in a target three-dimensional space and a 2D detection frame on the first image, wherein the first image is any image in the image acquisition sequence other than the initial image of the sequence;
a prediction module configured to predict a 3D prediction frame of a target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for a second image, the second image being a previous image of the first image in the image acquisition sequence;
a determination module configured to determine a tracking result of tracking the target object for the first image according to the 3D detection box, the 2D detection box, the 3D prediction box, and the 2D prediction box.
In some embodiments, the determination module is further configured to:
determining a target 3D detection frame matched with the 3D prediction frame from each 3D detection frame according to a first intersection ratio and/or a distance value between the 3D prediction frame and each 3D detection frame;
and taking an object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, and taking the position information of the target 3D detection frame in the target three-dimensional space as the 3D tracking position information of the target object, wherein the tracking result comprises the 3D tracking position information.
In some embodiments, the determination module is further configured to:
for a 3D detection frame which is not matched with the 3D prediction frame in each 3D detection frame, determining a 2D detection frame of the same object corresponding to the 3D detection frame, and determining a target 2D detection frame matched with the 2D prediction frame according to a second intersection ratio between the 2D detection frame and the 2D prediction frame;
and taking an object corresponding to the target 2D detection frame on the first image as the target object, and taking position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, wherein the tracking result comprises the 2D tracking position information.
In some embodiments, the tracking result includes 3D tracking position information of the target object corresponding to the second image, 2D tracking position information, and motion data of the target object;
the prediction module is further configured to:
updating a tracker according to the motion data;
and inputting the 3D tracking position information and the 2D tracking position information into an updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
In some embodiments, the motion data includes a rate of change of position of the target object on the image, and a velocity and an acceleration of the target object in the target three-dimensional space;
the tracker is capable of outputting the 2D prediction frame based on the position change rate and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration, and the 3D tracking position information.
In some embodiments, the first image comprises a plurality of captured images, the plurality of captured images being images captured by a plurality of image capture devices at a same capture time;
the detection module is further configured to:
carrying out object detection on the plurality of collected images to obtain a 3D detection frame of an object on each collected image in a three-dimensional space of each image collecting device and a 2D detection frame on each collected image;
mapping 3D detection frames positioned in three-dimensional spaces of different image acquisition devices to the same target coordinate system according to the external parameters of each image acquisition device, wherein the target three-dimensional space is a space limited by the target coordinate system;
and splicing the plurality of collected images to obtain a spliced image, mapping the 2D detection frames on the plurality of collected images to the spliced image, wherein the 2D detection frame on the first image is the 2D detection frame on the spliced image.
In some embodiments, the determination module is further configured to:
performing non-maximum suppression processing on the 3D detection box, the 2D detection box, the 3D prediction box, and the 2D prediction box.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the target tracking method provided by the first aspect of the present disclosure.
The technical solution provided by the embodiments of the present disclosure can have the following beneficial effects: the tracking result of the target object is determined from the 3D frame information (for example, the 3D detection frame and the 3D prediction frame) and the 2D frame information (for example, the 2D detection frame and the 2D prediction frame). By introducing the 3D frame information, the disclosed target tracking method can continue to perform motion estimation on the target object in the target three-dimensional space for a period of time after the target object is lost, which improves the probability of a successful match after the target object reappears, reduces ID switching caused by situations such as missed detection of the target or the target leaving the field of view, and thus reduces mistaken tracking of the target. Meanwhile, since the tracking result of the target object is determined by combining the 3D frame information and the 2D frame information, the tracking accuracy of the target object can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of target tracking according to an exemplary embodiment.
Fig. 2 is a flow diagram illustrating the determination of a 3D detection box and a 2D detection box according to an example embodiment.
FIG. 3 is a block diagram illustrating a target tracking device according to an exemplary embodiment.
FIG. 4 is a block diagram illustrating an apparatus for target tracking in accordance with an exemplary embodiment.
FIG. 5 is a block diagram illustrating an apparatus for target tracking in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In some embodiments, the target tracking methods of the present disclosure may be applied to different scenarios. For example, they may be applied to an autonomous driving scenario, tracking a target in an image captured by an image capture device on a vehicle. For another example, they may be applied to a traffic monitoring scenario, tracking a target in an image acquired by an image acquisition device in a traffic monitoring system. It should be understood that the application scenarios mentioned in this disclosure are only some examples or embodiments, and those skilled in the art can apply the target tracking method to other similar scenarios without creative effort, for example, target tracking for a mobile robot; the disclosure is not limited thereto.
In the related art, 2D information is generally used to track a target in an image captured by one or more image capturing devices. However, since the projection of the target onto the 2D image undergoes affine transformation, it is difficult to perform accurate motion estimation on the target on the 2D image, and thus the target cannot be accurately tracked using only 2D information, resulting in target matching errors, i.e., ID switching. Moreover, when a target is tracked using only 2D information, it is difficult to recover the target once it is lost. In addition, in the related art, the images acquired by a plurality of image acquisition devices are generally tracked separately, which is not only inefficient but also unable to handle targets that overlap in the images of adjacent image acquisition devices.
FIG. 1 is a flow chart illustrating a method of target tracking, as shown in FIG. 1, including the following steps, according to an exemplary embodiment.
Step 110, an image acquisition sequence is obtained, wherein the image acquisition sequence is obtained according to the acquired images of the image acquisition device at a plurality of acquisition moments.
In some embodiments, the image acquisition sequence may be derived from the acquisition of images by one or more image acquisition devices at multiple acquisition times. For an image acquisition device, the acquired image at each acquisition time in the image acquisition sequence may be the acquired image of the image acquisition device at the acquisition time; for a plurality of image acquisition devices, the acquired image at each acquisition instant in the image acquisition sequence may be the acquired image at that acquisition instant of the plurality of image acquisition devices.
Illustratively, take one image capturing apparatus, image capturing apparatus 1, whose captured images at the capturing times t1, t2 and t3 are P1, P2 and P3; in this case the image acquisition sequence 1 may be (P1, P2, P3). Take a plurality of image capturing devices, image capturing devices 1-3, as another example: if the captured images of image capturing device 1 at the capturing times t1, t2 and t3 are P11, P12 and P13, the captured images of image capturing device 2 at t1, t2 and t3 are P21, P22 and P23, and the captured images of image capturing device 3 at t1, t2 and t3 are P31, P32 and P33, then the image acquisition sequence 2 may be (P11 P21 P31, P12 P22 P32, P13 P23 P33).
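As an illustrative, non-limiting sketch of the two sequence layouts above (the names P1-P3 and P11-P33 simply mirror the example notation, and the list-of-tuples layout is an assumption, not the data structure used by the disclosed method):

```python
# Single image acquisition device: one captured image per acquisition time.
sequence_1 = ["P1", "P2", "P3"]

# Three image acquisition devices: each element groups the images captured
# by devices 1-3 at the same acquisition time (t1, t2, t3).
sequence_2 = [
    ("P11", "P21", "P31"),  # images of devices 1-3 at t1
    ("P12", "P22", "P32"),  # images of devices 1-3 at t2
    ("P13", "P23", "P33"),  # images of devices 1-3 at t3
]
```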
In some embodiments, the image capture device may include, but is not limited to, a video camera and a camera. The image acquisition equipment can be arranged on a preset fixed position or in the mobile equipment, and the preset fixed position and the mobile equipment can be specifically arranged according to actual requirements. For example, the mobile device may be an autonomous vehicle. In some embodiments, the image capture device may be one or more cameras included in an autonomous vehicle.
In some embodiments, when the image capturing apparatus is plural, the capturing directions of the plural image capturing apparatuses may be different. Illustratively, still taking a plurality of image capturing apparatuses as the image capturing apparatuses 1 to 3 as an example, the capturing directions of the image capturing apparatuses 1 to 3 may be a left direction, a forward direction, a right direction, and the like, respectively. It should be noted that the capturing directions of the plurality of image capturing devices may be specifically set according to actual situations, and the present disclosure does not set any limitation thereto.
In some embodiments, an image acquisition sequence may be acquired from video acquired by one or more image acquisition devices. Illustratively, the captured images in the image capture sequence may be image frames included in a video.
Step 120, performing object detection on a first image in the image acquisition sequence to obtain a 3D detection frame of the object on the first image in the target three-dimensional space and a 2D detection frame on the first image, wherein the first image is any image in the image acquisition sequence other than the initial image of the sequence.
In some embodiments, an object on the first image may refer to one or more objects included in the first image, which may include different classes of objects. For example, taking the first image as the road condition image as an example, the object on the first image may include an object in a pedestrian category, an object in a vehicle category, and the like.
In some embodiments, the object detection may be performed on the first image according to a monocular 3D detection algorithm. In some embodiments, monocular 3D detection algorithms may include, but are not limited to, the fully convolutional one-stage monocular 3D object detection method (FCOS3D) and the real-time monocular 3D object detection algorithm (RTM3D).
A 3D detection frame of an object included in an image in a three-dimensional space (e.g., a camera coordinate system) of an image capturing device and a 2D detection frame in an image coordinate system of the image can be simultaneously obtained by a monocular 3D detection algorithm. In some embodiments, the 3D detection frame in the three-dimensional space of the image capturing device obtained by the detection algorithm may be represented by (x, y, z, rot, w, h, l), where (x, y, z) may represent coordinates of a central point of the 3D detection frame in the three-dimensional space of the image capturing device, rot may represent a heading angle of the 3D detection frame, and (w, h, l) may represent a width, a height, and a length of the 3D detection frame, respectively. The 2D detection box obtained by the detection algorithm may be represented by (x1, y1, x2, y2), wherein (x1, y1) may represent the coordinates of the upper left corner of the 2D detection box in the image coordinate system, and (x2, y2) may represent the coordinates of the lower right corner of the 2D detection box in the image coordinate system.
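For concreteness, the two box parameterizations described above can be represented as simple records, as in the following Python sketch; this is only an illustrative assumption about the data layout, not the data structures used by the disclosed method.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """3D detection frame (x, y, z, rot, w, h, l) in a three-dimensional space."""
    x: float    # center point coordinates
    y: float
    z: float
    rot: float  # heading angle of the 3D detection frame
    w: float    # width
    h: float    # height
    l: float    # length

@dataclass
class Box2D:
    """2D detection frame (x1, y1, x2, y2) in an image coordinate system."""
    x1: float   # upper left corner
    y1: float
    x2: float   # lower right corner
    y2: float
```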
As mentioned above, there may be one or more image capturing devices. For one image capturing device, the first image may be a single captured image; for example, still taking the above-mentioned image acquisition sequence 1 as an example, the first image may be P2 or P3. In some embodiments, when the first image is a single captured image, object detection may be performed on the captured image, resulting in a 3D detection frame of the object on the captured image in the three-dimensional space of the image capturing device and a 2D detection frame on the captured image, i.e. a 2D detection frame in the image coordinate system of the captured image. In some embodiments, when the first image is a single captured image, the 3D detection frame in the three-dimensional space of the image capturing device may be determined as the 3D detection frame in the target three-dimensional space, or the 3D detection frame in the three-dimensional space of the image capturing device may be mapped to the target coordinate system and the 3D detection frame in the target coordinate system determined as the 3D detection frame in the target three-dimensional space. For specific details of the target coordinate system, reference may be made to fig. 2 and the related description thereof, which are not repeated herein.
For a plurality of image capturing devices, the first image may include a plurality of captured images, and the plurality of captured images may be images captured by the plurality of image capturing devices at the same capturing time. Illustratively, still taking the above-described image acquisition sequence 2 as an example, the first image may comprise P12, P22 and P32, or P13, P23 and P33. In some embodiments, when the first image is a plurality of captured images, object detection may be performed on each of the plurality of captured images, resulting in a 3D detection frame of the object on each captured image in the three-dimensional space of each image capture device and a 2D detection frame on the captured image. In some embodiments, the 3D detection frame in the three-dimensional space of each image capturing device and the 2D detection frame on each captured image may be processed separately, resulting in a 3D detection frame in the target three-dimensional space and a 2D detection frame on the first image. For specific details of obtaining the 3D detection frame and the 2D detection frame when the first image is a plurality of acquired images, reference may be made to fig. 2 and the related description thereof, which are not repeated herein.
Step 130, predicting a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object with respect to a second image, wherein the second image is the previous image of the first image in the image acquisition sequence.
In some embodiments, the first image and the second image may be images acquired by the image acquisition device at different acquisition moments, wherein the second image is an image obtained by the image acquisition device at an acquisition moment immediately preceding the acquisition moment of the first image, i.e. the second image is an image immediately preceding the first image in the image acquisition sequence.
In some embodiments, the target object may be one or more targets included in the second image, and the target object may include different categories of targets. For example, an object of a pedestrian category, an object of a vehicle category, and the like. In some embodiments, the tracking result of tracking the target object with respect to the second image may include 3D tracking position information of the target object corresponding to the second image, 2D tracking position information, and motion data of the target object. For determining the 3D tracking position information and the 2D tracking position information, reference may be made to step 140 and the related description thereof, which are not described herein again.
In some embodiments, the motion data may include a rate of change of position of the target object on the image and a velocity and acceleration of the target object in the target three-dimensional space. The position change rate of the target object on the image may refer to a position change rate between 2D tracking information of the target object on the second image and 2D tracking information of the target object on the first image. For details of determining the rate of change of position, reference may be made to the following description of the tracker, which is not repeated here.
In some embodiments, the velocity and acceleration of the target object in the target three-dimensional space may refer to the velocity and acceleration of the target object at the acquisition time corresponding to the second image. For example, taking the case where the target object is a pedestrian and the second image corresponds to the acquisition time t2 as an example, the velocity and acceleration of the target object in the target three-dimensional space may be the velocity and acceleration of the pedestrian at the time t2, which can be understood as the walking speed and acceleration of the pedestrian in real space. In some embodiments, when the image capturing device is disposed in a mobile device, the velocity and acceleration of the mobile device at the capturing time of the second image may be determined as the velocity and acceleration of the target object in the target three-dimensional space. For example, taking the mobile device as an autonomous vehicle, the motion data may be the velocity and acceleration of the vehicle at the time t2, which, as will be understood, are the velocity and acceleration at which the vehicle is traveling in real space.
In some embodiments, a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image may be predicted according to a tracking result of the tracker that tracks the target object for the second image. In some embodiments, predicting a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for the second image includes: updating the tracker based on the motion data; and inputting the 3D tracking position information and the 2D tracking position information into the updated tracker to obtain a 3D prediction frame and a 2D prediction frame output by the tracker.
In some embodiments, the tracker can output a 2D prediction box based on the rate of change of position and the 2D tracking position information, and a 3D prediction box based on the velocity, acceleration and 3D tracking position. In some embodiments, the value of the rotation angle of the 3D prediction box set by the tracker remains unchanged.
In some embodiments, the 2D tracking information of the target object may be a 2D detection box to which the 2D tracking information corresponds. In some embodiments, the 2D tracking information may be characterized by (cx, cy, w, h), where (cx, cy) characterizes coordinates of a center point of the 2D detection frame corresponding to the 2D tracking information in an image coordinate system of the first image, and (w, h) may characterize a width and a height of the 2D detection frame. In some embodiments, the 3D tracking information of the target object may be a 3D detection box to which the 3D tracking information corresponds. In some embodiments, the 3D tracking information may be characterized by (x, y, rot), where (x, y) may characterize coordinates of a center point of a 3D detection frame corresponding to the 3D tracking information in the target three-dimensional space, and rot characterizes a rotation angle of the 3D detection frame.
In some embodiments, a 3D prediction frame of each target object in the target three-dimensional space and a 2D prediction frame on the first image may be predicted according to a tracking result of the plurality of trackers tracking each target object for the second image. In some embodiments, for any target object, the tracker corresponding to the target object may include a state transition function of the target object, and predict a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image according to the state transition function of the target object and a tracking result of tracking the target object for the second image.
In some embodiments, for any target object, the state transition function of the target object included in the tracker corresponding to the target object is the following formula (1):
[Formula (1): the state transition function of the tracker; in the original publication this formula is given as image BDA0003341148070000131.]
for the meanings of (cx, cy, w, h) and (x, y, rot), the above description can be referred to, and the details are not repeated herein. (V)CX,VCy,Vw,Vh) The position change rate of the target object on the image is represented, corresponding to (cx, cy, w, h), VxAnd VyRespectively representing the speed of the target object along the X-axis and Y-axis directions of the target three-dimensional space, axAnd ayRespectively representing the acceleration of the target object along the X-axis direction and the Y-axis direction of the target three-dimensional space.
In some embodiments, when the second image is the initial image in the image acquisition sequence, the position change rate of the target object on the image takes an initial value of 0. Illustratively, still taking the above-described image acquisition sequence 1 (P1, P2, P3) as an example, assume the first image is P2, which includes the target objects 1-5, and the second image is P1, which includes the objects 1-5; the process by which the tracker 1 corresponding to the target object 1 predicts the 3D prediction frame and the 2D prediction frame of the target object 1 for each captured image in the image acquisition sequence 1 is described below. First, object detection is performed on the second image P1 to obtain the 2D detection frame and the 3D detection frame corresponding to the target object 1 on P1. Since the second image P1 is the initial image, the position change rate 1 corresponding to the target object 1 is 0, and the 3D tracking position information and the 2D tracking position information of the target object 1 corresponding to the second image P1 are the 3D detection frame and the 2D detection frame obtained by object detection (i.e., the 3D detection frame and the 2D detection frame obtained through the above-mentioned step 120). At the same time, the speed 1 and the acceleration 1 of the target object 1 at the acquisition time of the second image P1 can be detected. The tracker 1 is updated according to the position change rate 1, the speed 1 and the acceleration 1, namely, formula (1) is updated, and the 3D tracking position information and the 2D tracking position information are input into the updated tracker 1, so that the 3D prediction frame and the 2D prediction frame of the target object 1 corresponding to the image at the next time (namely, the first image P2) can be obtained. Further, the 3D detection frames and the 2D detection frames of the objects on the first image P2 can be obtained by object detection, and the object on the first image P2 belonging to the same target as the target object 1 can be determined by the following method of step 140; this object is assumed to be the object 1. The 2D detection frame corresponding to the object 1 and the 2D detection frame of the target object 1 can then be used to obtain the position change rate 2 of the target object 1 between the second image P1 and the first image P2, and from the position change rate 2 and the speed and acceleration of the target object 1 at the acquisition time of the first image P2, the 3D prediction frame and the 2D prediction frame of the target object 1 corresponding to the third image P3 can be obtained. In this way, the tracker 1 corresponding to the target object 1 can obtain the position information of the target object 1 on each acquired image in the image acquisition sequence, thereby realizing the tracking of the target object 1 over the image sequence.
In the embodiment of the present specification, the tracker outputs the 2D prediction frame based on the position change rate and the 2D tracking position information, that is, the tracker uses a uniform-velocity model to perform prediction or motion estimation of the 2D frame, so that the amount of computation for predicting the 2D frame can be reduced. The tracker outputs the 3D prediction frame based on the velocity, the acceleration and the 3D tracking position information, that is, the tracker uses a uniform-acceleration model to perform prediction or motion estimation of the 3D frame, so that the prediction accuracy of the 3D frame can be improved. Moreover, since the uncertainty of the angle of the 3D detection frame obtained by object detection is large, the rotation angle of the 3D prediction frame in the tracker is kept unchanged, i.e., the rotation angle is smoothed by a static model, which further improves the prediction accuracy of the 3D frame.
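As a minimal sketch of the prediction step implied by formula (1), assuming a known time step dt between adjacent acquisition times: the 2D state is advanced with a uniform-velocity model, the (x, y) position in the target three-dimensional space with a uniform-acceleration model, and the rotation angle rot is kept static. The function name and argument layout are illustrative assumptions, not the tracker's actual interface.

```python
import numpy as np

def predict_boxes(state_2d, rate_2d, pos_3d, vel_3d, acc_3d, dt=1.0):
    """Advance the tracked state of one target object by one acquisition step.

    state_2d: (cx, cy, w, h)      2D tracking position information
    rate_2d:  (Vcx, Vcy, Vw, Vh)  position change rate of the 2D state
    pos_3d:   (x, y, rot)         3D tracking position information
    vel_3d:   (Vx, Vy)            velocity in the target three-dimensional space
    acc_3d:   (ax, ay)            acceleration in the target three-dimensional space
    """
    state_2d = np.asarray(state_2d, dtype=float)
    rate_2d = np.asarray(rate_2d, dtype=float)
    x, y, rot = pos_3d
    vx, vy = vel_3d
    ax, ay = acc_3d

    # 2D prediction frame: uniform-velocity model.
    pred_2d = state_2d + rate_2d * dt

    # 3D prediction frame: uniform-acceleration model for (x, y), static rot.
    pred_x = x + vx * dt + 0.5 * ax * dt ** 2
    pred_y = y + vy * dt + 0.5 * ay * dt ** 2
    pred_3d = (pred_x, pred_y, rot)

    return pred_2d, pred_3d
```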
Step 140, determining a tracking result of tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame.
In some embodiments, before determining a tracking result of tracking the target object for the first image according to the 3D detection block, the 2D detection block, the 3D prediction block, and the 2D prediction block, the method further comprises: non-maximum suppression processing is performed on the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame. By executing the non-maximum suppression processing, overlapped 3D frames (i.e., 3D detection frames or 3D prediction frames) and 2D frames (i.e., 2D detection frames or 2D prediction frames) can be filtered out, so that the overlapped frames are prevented from influencing the matching between subsequent 3D frames and the matching between 2D frames, the accuracy of subsequent matching is improved, and the accuracy of target tracking is improved.
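A minimal sketch of score-ordered non-maximum suppression over 2D boxes is given below; the (x1, y1, x2, y2) box layout, the score input and the threshold are illustrative assumptions, and the same idea applies to 3D boxes (for example, on their bird's-eye-view projections).

```python
import numpy as np

def iou_2d(a, b):
    """Intersection ratio (IoU) of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them too much."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou_2d(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```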
In some embodiments, determining a tracking result of tracking the target object for the first image according to the 3D detection box, the 2D detection box, the 3D prediction box, and the 2D prediction box includes: determining a target 3D detection frame matched with the 3D prediction frame from each 3D detection frame according to a first intersection ratio and/or a distance value between the 3D prediction frame and each 3D detection frame; and taking an object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, taking the position information of the target 3D detection frame in the target three-dimensional space as the 3D tracking position information of the target object, wherein the tracking result comprises the 3D tracking position information.
In some embodiments, the first intersection ratio may refer to an overlap ratio between the 3D prediction box and the 3D detection box, i.e., the ratio of the intersection to the union of the 3D prediction box and the 3D detection box. In some embodiments, there may be one or more target objects, and correspondingly, the 3D prediction boxes of the target objects in the target three-dimensional space may also include one or more. In some embodiments, for a plurality of 3D prediction boxes, a first intersection ratio matrix may be determined based on the first intersection ratio value between each 3D prediction box and each 3D detection box. In some embodiments, the distance value may be a distance between the center points of the 3D prediction box and the 3D detection box in the target three-dimensional space. The distance may include, but is not limited to, a Manhattan distance or a Euclidean distance. In some embodiments, for a plurality of 3D prediction boxes, a distance matrix may be determined based on the distance values between each 3D prediction box and each 3D detection box. In the embodiment of the present specification, the first intersection ratio matrix or the distance matrix can be solved according to the Hungarian algorithm to determine the matching result between the 3D prediction boxes and the 3D detection boxes, so that the target 3D detection box matched with each 3D prediction box is determined from the 3D detection boxes.
As previously mentioned, the target objects or objects may be targets of different categories. In some embodiments, for target objects and objects of the vehicle category, a first intersection ratio matrix may be determined based on the first intersection ratio value between each 3D prediction box and each 3D detection box. For target objects and objects of the pedestrian category, a distance matrix may be determined based on the distance values between each 3D prediction box and each 3D detection box. In the embodiment of the present specification, the first intersection ratio matrix and the distance matrix can be solved respectively according to the Hungarian algorithm to determine the matching result between the 3D prediction boxes and the 3D detection boxes, so that the target 3D detection box matched with each 3D prediction box is determined from the 3D detection boxes.
Illustratively, take the case where the second image includes the target objects 1-5, whose 3D prediction frames are respectively N3D-1, N3D-2, N3D-3, N3D-4 and N3D-5, and the first image includes the objects 1-5, whose 3D detection frames are respectively M3D-1, M3D-2, M3D-3, M3D-4 and M3D-5. If solving the first intersection ratio matrix and the distance matrix according to the Hungarian algorithm yields the matching results (N3D-1, M3D-2), (N3D-2, M3D-1) and (N3D-3, M3D-3), then the target 3D detection frame matched with the 3D prediction frame N3D-1 is the 3D detection frame M3D-2, the target 3D detection frame matched with the 3D prediction frame N3D-2 is the 3D detection frame M3D-1, and the target 3D detection frame matched with the 3D prediction frame N3D-3 is the 3D detection frame M3D-3.
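The first-stage association described above can be sketched as follows, using the center distance as the matching cost (for vehicle-category targets, 1 minus the first intersection ratio of the 3D boxes could be used instead) and solving the assignment with the Hungarian algorithm via scipy; the gating threshold and function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_3d(pred_centers, det_centers, max_dist=2.0):
    """Match 3D prediction frames to 3D detection frames by center distance.

    pred_centers: (N, 2) array of (x, y) centers of the 3D prediction frames
    det_centers:  (M, 2) array of (x, y) centers of the 3D detection frames
    Returns the matched (prediction index, detection index) pairs and the
    indices of the detections that remain unmatched.
    """
    pred_centers = np.asarray(pred_centers, dtype=float)
    det_centers = np.asarray(det_centers, dtype=float)
    if len(pred_centers) == 0 or len(det_centers) == 0:
        return [], list(range(len(det_centers)))

    # Cost matrix: Euclidean distance between every prediction/detection pair.
    cost = np.linalg.norm(pred_centers[:, None, :] - det_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)

    # Gate the assignment: reject pairs whose distance exceeds the threshold.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched = {c for _, c in matches}
    unmatched_dets = [j for j in range(len(det_centers)) if j not in matched]
    return matches, unmatched_dets
```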
In some embodiments, determining a tracking result of tracking the target object for the first image according to the 3D detection block, the 2D detection block, the 3D prediction block, and the 2D prediction block further comprises: aiming at a 3D detection frame which is not matched with a 3D prediction frame in each 3D detection frame, determining a 2D detection frame of the same object corresponding to the 3D detection frame, and determining a target 2D detection frame matched with the 2D prediction frame according to a second intersection ratio between the 2D detection frame and the 2D prediction frame; and taking an object corresponding to the target 2D detection frame on the first image as a target object, and taking the position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, wherein the tracking result comprises the 2D tracking position information.
The determination of the second intersection ratio between the 2D detection frame and the 2D prediction frame is similar to that of the first intersection ratio and is not repeated herein. In some embodiments, a second intersection ratio matrix may be determined according to the second intersection ratio value between the 2D detection box and the 2D prediction box, and the second intersection ratio matrix is solved according to the Hungarian algorithm to determine the matching result between the 2D detection box and the 2D prediction box, so that the target 2D detection box matched with the 2D prediction box is determined.
Illustratively, continuing the foregoing example, the 3D detection frames that are not matched with any 3D prediction frame are M3D-4 and M3D-5. The 2D detection frames M2D-4 and M2D-5 corresponding to the same objects as the 3D detection frames M3D-4 and M3D-5 are determined, a second intersection ratio matrix is determined according to the second intersection ratio values between the 2D detection frames M2D-4, M2D-5 and the 2D prediction frames N2D-1, N2D-2, N2D-3, N2D-4 and N2D-5 (namely the 2D prediction frames of the target objects 1-5), and the second intersection ratio matrix is solved according to the Hungarian algorithm to obtain the matching results (M2D-4, N2D-4) and (M2D-5, N2D-5) between the 2D detection frames and the 2D prediction frames.
The 3D prediction frame and the 2D prediction frame are the 3D frame and the 2D frame corresponding to a target object in the second image, and the 3D detection frame and the 2D detection frame are the 3D frame and the 2D frame corresponding to an object in the first image; by matching the prediction frames and the detection frames, the target object in the second image and the object in the first image that belong to the same target can be determined, thereby realizing target tracking. For example, taking the above matching result (M2D-4, N2D-4) as an example, it may be determined that the target object 4 in the second image at the previous time and the object 4 in the first image at the subsequent time belong to the same target. In some embodiments, a target object and an object belonging to the same target may be given the same ID.
In the embodiment of the present specification, the 3D detection frames and the 3D prediction frames are matched first, and then the 2D detection frames and the 2D prediction frames are matched, i.e., two-stage matching is adopted. Matching with the 3D frames (i.e., the 3D detection frames and the 3D prediction frames) has high accuracy and a small probability of mismatching, while matching with the 2D frames (i.e., the 2D detection frames and the 2D prediction frames) has a low probability of missed matching. Therefore, high-precision matching can be realized through the two-stage matching while missed matches are reduced; that is, the matching precision between the target object in the image at the previous moment and the object in the image at the next moment is improved, missed matches between target objects and objects are reduced, and the tracking accuracy of the same target across images at different moments is further improved.
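Putting the two stages together, the association cascade described above can be sketched as follows, reusing the hypothetical iou_2d and match_3d helpers from the earlier sketches; the greedy 2D matching in stage two is a simplified stand-in for solving the second intersection ratio matrix with the Hungarian algorithm, and the threshold is an assumption.

```python
def two_stage_match(pred_centers_3d, det_centers_3d, pred_boxes_2d, det_boxes_2d,
                    iou_thresh=0.3):
    """Two-stage association: match 3D frames first, then 2D frames of leftovers.

    Trackers (prediction frames) and detections are indexed consistently across
    their 3D and 2D representations.
    """
    # Stage 1: match 3D prediction frames with 3D detection frames.
    matches_3d, unmatched_dets = match_3d(pred_centers_3d, det_centers_3d)
    matched_trackers = {t for t, _ in matches_3d}
    remaining_trackers = [t for t in range(len(pred_boxes_2d))
                          if t not in matched_trackers]

    # Stage 2: for detections left over from stage 1, match their 2D detection
    # frames against the 2D prediction frames of the remaining trackers.
    matches_2d = []
    for d in unmatched_dets:
        best, best_iou = None, iou_thresh
        for t in remaining_trackers:
            iou = iou_2d(pred_boxes_2d[t], det_boxes_2d[d])
            if iou > best_iou:
                best, best_iou = t, iou
        if best is not None:
            matches_2d.append((best, d))
            remaining_trackers.remove(best)

    return matches_3d, matches_2d
```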
In some embodiments, a new tracker may be created for a target object for which neither the 3D prediction box nor the 2D prediction box match successfully. In some embodiments, trackers corresponding to target objects for which the 3D prediction box and the 2D prediction box are not successfully matched for a preset number of times may be discarded. The preset times can be specifically set according to actual conditions. As can be seen from the above, the 3D prediction frame and the 2D prediction frame correspond to two-stage matching, and for a target object that has no matching result in the two-stage matching for a preset number of times, it may be considered that the tracker has failed to track the target object, and the tracker of the target object is discarded.
In the embodiment of the disclosure, the tracking result of the target object is determined through the 3D frame and 2D frame information, and the 3D information is introduced, so that the target tracking method of the disclosure can continue to perform motion estimation in the target three-dimensional space on the target object in a period of time after the target object is lost, thereby improving the successful matching probability after the target object reappears, and reducing ID switching caused by situations such as missing detection or visual field exit of the target, i.e., reducing the situation of wrong tracking of the target.
Fig. 2 is a flowchart illustrating the determination of a 3D detection box and a 2D detection box according to an exemplary embodiment, and as shown in fig. 2, the method includes the following steps.
Step 210, performing object detection on the plurality of collected images to obtain a 3D detection frame of an object on each collected image in a three-dimensional space of each image collection device and a 2D detection frame on the collected image.
The details of step 210 are similar to those of step 120; reference may be made to step 120 and the related description thereof, which are not repeated herein.
Illustratively, still taking the aforementioned plurality of captured images P1, P2 and P3 as an example, object detection may be performed on the captured image P1 to obtain a 3D detection frame of the object on the captured image P1 in the three-dimensional space of the image acquisition device 1 and a 2D detection frame in the image coordinate system of the captured image P1; object detection for the captured images P2 and P3 is similar to that for the captured image P1 and is not repeated herein.
Step 220, mapping the 3D detection frames located in the three-dimensional spaces of different image acquisition devices to the same target coordinate system according to the external parameters of each image acquisition device, wherein the target three-dimensional space is the space defined by the target coordinate system.
In some embodiments, the target coordinate system may be determined according to the position at which the image acquisition device is arranged. For example, if the image capture device is disposed in an autonomous vehicle, the target coordinate system may be the ego-vehicle coordinate system corresponding to the first image in the image capture sequence. For another example, if the image capturing device is disposed at a predetermined fixed position, the target coordinate system may be a coordinate system determined based on the predetermined fixed position, and the origin, the X axis, the Y axis, and the Z axis of the coordinate system may be specifically set according to actual conditions.
In some embodiments, the external parameters of each image acquisition device may reflect the pose relationship between the coordinate system of that image acquisition device and the target coordinate system. The external parameters may include translation parameters and rotation parameters. In some embodiments, the external parameters of each image acquisition device may be obtained by calibrating the image acquisition device. For the calibration of the image capturing device, reference may be made to the related art, which is not described herein in detail.
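A minimal sketch of mapping a 3D detection frame center from the coordinate system of one image acquisition device into the target coordinate system using that device's external parameters (rotation R and translation t); the transform convention shown here is an assumption for illustration.

```python
import numpy as np

def camera_to_target(center_cam, R, t):
    """Map a 3D point from a camera coordinate system to the target coordinate system.

    center_cam: (3,) point in the camera (image acquisition device) coordinate system
    R: (3, 3) rotation from camera coordinates to target coordinates
    t: (3,) position of the camera origin in the target coordinate system
    """
    center_cam = np.asarray(center_cam, dtype=float)
    return np.asarray(R, dtype=float) @ center_cam + np.asarray(t, dtype=float)

def heading_to_target(rot_cam, yaw_cam_to_target):
    """Adjust a box heading angle by the camera-to-target yaw offset, wrapped to [-pi, pi)."""
    return (rot_cam + yaw_cam_to_target + np.pi) % (2 * np.pi) - np.pi
```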
Step 230, stitching the plurality of collected images to obtain a stitched image, and mapping the 2D detection frames on the plurality of collected images onto the stitched image, wherein the 2D detection frame on the first image is the 2D detection frame on the stitched image.
In some embodiments, the 2D detection frame on the stitched image may refer to a 2D detection frame in an image coordinate system of the stitched image. By mapping the 2D detection frames on the plurality of collected images onto the stitched image, the 2D detection frame in the image coordinate system of each collected image can be converted into the image coordinate system of the stitched image, that is, the 2D detection frames corresponding to different image coordinate systems are converted into the same image coordinate system.
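Assuming, for illustration, that the stitched image is formed by simple horizontal concatenation of the captured images, mapping each 2D detection frame onto the stitched image reduces to adding the horizontal offset of its source image, as in the following sketch; real stitching may involve additional alignment not covered here.

```python
import numpy as np

def stitch_and_map(images, boxes_per_image):
    """Concatenate images horizontally and shift each 2D detection frame accordingly.

    images: list of HxWx3 arrays with equal height
    boxes_per_image: one list of (x1, y1, x2, y2) boxes per captured image
    Returns the stitched image and the 2D boxes in its image coordinate system.
    """
    stitched = np.concatenate(images, axis=1)
    mapped = []
    offset = 0
    for img, boxes in zip(images, boxes_per_image):
        for (x1, y1, x2, y2) in boxes:
            mapped.append((x1 + offset, y1, x2 + offset, y2))
        offset += img.shape[1]
    return stitched, mapped
```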
In the embodiment of the present specification, by mapping the 3D detection frames located in the three-dimensional spaces of different image acquisition devices to the same target coordinate system and mapping the 2D detection frames on the plurality of acquired images onto the stitched image, that is, by fusing the detection results of the plurality of acquired images of the plurality of image acquisition devices, the targets in the plurality of acquired images can be tracked simultaneously. Tracking of the same target across the images of different image acquisition devices can thus be realized by executing the tracking algorithm only once, which avoids the inefficiency of tracking the target in the acquired image of each image acquisition device independently and reduces ID switching of the same target across different image acquisition devices.
FIG. 3 is a block diagram illustrating a target tracking device 300 according to an example embodiment. Referring to fig. 3, the apparatus includes an acquisition module 310, a detection module 320, a prediction module 330, and a determination module 340.
An obtaining module 310 configured to obtain an image acquisition sequence, wherein the image acquisition sequence is obtained according to acquired images of an image acquisition device at a plurality of acquisition moments;
a detection module 320 configured to perform object detection on a first image in the image acquisition sequence, so as to obtain a 3D detection frame of an object on the first image in a target three-dimensional space and a 2D detection frame on the first image, where the first image is any image in the image acquisition sequence other than the initial image of the sequence;
a prediction module 330 configured to predict a 3D prediction frame of a target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for a second image, the second image being a previous image of the first image in the image acquisition sequence;
a determining module 340 configured to determine a tracking result of tracking the target object for the first image according to the 3D detection box, the 2D detection box, the 3D prediction box, and the 2D prediction box.
In some embodiments, the determination module 340 is further configured to:
determining a target 3D detection frame matched with the 3D prediction frame from each 3D detection frame according to a first intersection ratio and/or a distance value between the 3D prediction frame and each 3D detection frame;
and taking an object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, and taking the position information of the target 3D detection frame in the target three-dimensional space as the 3D tracking position information of the target object, wherein the tracking result comprises the 3D tracking position information.
In some embodiments, the determination module 340 is further configured to:
for a 3D detection frame which is not matched with the 3D prediction frame in each 3D detection frame, determining a 2D detection frame of the same object corresponding to the 3D detection frame, and determining a target 2D detection frame matched with the 2D prediction frame according to a second intersection ratio between the 2D detection frame and the 2D prediction frame;
and taking an object corresponding to the target 2D detection frame on the first image as the target object, and taking position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, wherein the tracking result comprises the 2D tracking position information.
In some embodiments, the tracking result includes 3D tracking position information and 2D tracking position information of the target object corresponding to the second image, and motion data of the target object;
the prediction module 330 is further configured to:
updating a tracker according to the motion data;
and inputting the 3D tracking position information and the 2D tracking position information into an updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
In some embodiments, the motion data includes a rate of change of position of the target object on the image, and a velocity and an acceleration of the target object in the target three-dimensional space;
the tracker is capable of outputting the 2D prediction box based on the position change rate and the 2D tracking position information, and outputting the 3D prediction box based on the velocity, the acceleration, and the 3D tracking position information.
In some embodiments, the first image comprises a plurality of captured images, the plurality of captured images being images captured by a plurality of image acquisition devices at the same capture time;
the detection module 320 is further configured to:
performing object detection on the plurality of captured images to obtain a 3D detection frame of an object on each captured image in the three-dimensional space of the corresponding image acquisition device and a 2D detection frame on each captured image;
mapping the 3D detection frames located in the three-dimensional spaces of different image acquisition devices to the same target coordinate system according to the extrinsic parameters of each image acquisition device, wherein the target three-dimensional space is the space defined by the target coordinate system;
and stitching the plurality of captured images to obtain a stitched image, and mapping the 2D detection frames on the plurality of captured images to the stitched image, wherein the 2D detection frame on the first image is the 2D detection frame on the stitched image.
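For the 2D part of this step, a minimal sketch is given below, assuming the captured images are stitched side by side horizontally so that each device's 2D frames only need a horizontal offset; this layout is an assumption, since the disclosure does not prescribe a particular stitching scheme.

```python
def map_boxes_to_stitched(per_camera_boxes_2d, image_widths):
    """Shift per-camera 2D frames (x1, y1, x2, y2) into the coordinates of a
    horizontally stitched image whose columns follow the camera order."""
    stitched_boxes, x_offset = [], 0
    for boxes, width in zip(per_camera_boxes_2d, image_widths):
        for (x1, y1, x2, y2) in boxes:
            stitched_boxes.append((x1 + x_offset, y1, x2 + x_offset, y2))
        x_offset += width  # the next device's image starts after this one
    return stitched_boxes
```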
In some embodiments, the determination module 340 is further configured to:
performing non-maximum suppression processing on the 3D detection box, the 2D detection box, the 3D prediction box, and the 2D prediction box.
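A standard 2D non-maximum suppression routine of the kind this step relies on is sketched below; the 3D case follows the same pattern with volumes in place of areas. The score source and the IoU threshold are illustrative assumptions.

```python
import numpy as np

def nms_2d(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression on (N, 4) frames (x1, y1, x2, y2); returns kept indices."""
    order = np.argsort(scores)[::-1]          # highest-scoring frames first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = boxes[order[1:]]
        x1 = np.maximum(boxes[i, 0], rest[:, 0])
        y1 = np.maximum(boxes[i, 1], rest[:, 1])
        x2 = np.minimum(boxes[i, 2], rest[:, 2])
        y2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop frames that overlap the kept one too much
    return keep
```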
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the target tracking method provided by the present disclosure.
FIG. 4 is a block diagram illustrating an apparatus 400 for target tracking according to an example embodiment. For example, the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, the apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an interface for input/output (I/O) 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or a portion of the steps of the above-described method of target tracking. Further, processing component 402 may include one or more modules that facilitate interaction between processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operations at the apparatus 400. Examples of such data include instructions for any application or method operating on the device 400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 406 provide power to the various components of device 400. Power components 406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the apparatus 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a Microphone (MIC) configured to receive external audio signals when apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing various aspects of status assessment for the apparatus 400. For example, the sensor component 414 may detect an open/closed state of the apparatus 400 and the relative positioning of components, such as the display and keypad of the apparatus 400. The sensor component 414 may also detect a change in the position of the apparatus 400 or of a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described target tracking methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 404 comprising instructions, executable by the processor 420 of the apparatus 400 to perform the above-described target tracking method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described target tracking method when executed by the programmable apparatus.
FIG. 5 is a block diagram illustrating an apparatus 500 for target tracking, according to an example embodiment. For example, the apparatus 500 may be provided as a server. Referring to fig. 5, the apparatus 500 includes a processing component 522 that further includes one or more processors and memory resources, represented by memory 532, for storing instructions, such as applications, that are executable by the processing component 522. The application programs stored in memory 532 may include one or more modules that each correspond to a set of instructions. Further, the processing component 522 is configured to execute instructions to perform the above-described target tracking method.
The apparatus 500 may also include a power component 526 configured to perform power management of the apparatus 500, a wired or wireless network interface 550 configured to connect the apparatus 500 to a network, and an input/output (I/O) interface 558. The apparatus 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A target tracking method, comprising:
acquiring an image acquisition sequence, wherein the image acquisition sequence is obtained according to acquired images of image acquisition equipment at a plurality of acquisition moments;
carrying out object detection on a first image in the image acquisition sequence to obtain a 3D detection frame of an object on the first image in a target three-dimensional space and a 2D detection frame on the first image, wherein the first image is any image in the image acquisition sequence other than the initial image of the sequence;
predicting a 3D prediction frame of a target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for a second image, wherein the second image is a previous image of the first image in the image acquisition sequence;
determining a tracking result for tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
2. The target tracking method according to claim 1, wherein determining a tracking result of tracking the target object for the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame comprises:
determining, from among the 3D detection frames, a target 3D detection frame matched with the 3D prediction frame according to a first intersection-over-union (IoU) ratio and/or a distance value between the 3D prediction frame and each 3D detection frame;
and taking an object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, and taking the position information of the target 3D detection frame in the target three-dimensional space as the 3D tracking position information of the target object, wherein the tracking result comprises the 3D tracking position information.
3. The target tracking method according to claim 2, wherein the determining a tracking result of tracking the target object for the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame further comprises:
for a 3D detection frame, among the 3D detection frames, that is not matched with the 3D prediction frame, determining the 2D detection frame corresponding to the same object as that 3D detection frame, and determining a target 2D detection frame matched with the 2D prediction frame according to a second intersection-over-union (IoU) ratio between the 2D detection frame and the 2D prediction frame;
and taking an object corresponding to the target 2D detection frame on the first image as the target object, and taking position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, wherein the tracking result comprises the 2D tracking position information.
4. The target tracking method according to claim 1, wherein the tracking result includes 3D tracking position information and 2D tracking position information of the target object corresponding to the second image, and motion data of the target object;
predicting a 3D prediction frame of a target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for the second image comprises:
updating a tracker according to the motion data;
and inputting the 3D tracking position information and the 2D tracking position information into an updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
5. The target tracking method of claim 4, wherein the motion data comprises a rate of change of a position of the target object on an image, and a velocity and an acceleration of the target object in the target three-dimensional space;
the tracker is capable of outputting the 2D prediction frame based on the position change rate and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration, and the 3D tracking position information.
6. The target tracking method according to claim 1, wherein the first image includes a plurality of captured images, the plurality of captured images being images captured by a plurality of image acquisition devices at the same capture time;
the object detection of the first image in the image acquisition sequence to obtain a 3D detection frame of the object on the first image in the target three-dimensional space and a 2D detection frame on the first image includes:
performing object detection on the plurality of captured images to obtain a 3D detection frame of an object on each captured image in the three-dimensional space of the corresponding image acquisition device and a 2D detection frame on each captured image;
mapping the 3D detection frames located in the three-dimensional spaces of different image acquisition devices to the same target coordinate system according to the extrinsic parameters of each image acquisition device, wherein the target three-dimensional space is the space defined by the target coordinate system;
and stitching the plurality of captured images to obtain a stitched image, and mapping the 2D detection frames on the plurality of captured images to the stitched image, wherein the 2D detection frame on the first image is the 2D detection frame on the stitched image.
7. The target tracking method of claim 1, wherein, prior to the determining a tracking result of tracking the target object for the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame, the method further comprises:
performing non-maximum suppression processing on the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
8. A target tracking device, comprising:
an acquisition module configured to acquire an image acquisition sequence, the image acquisition sequence being derived from acquired images of an image acquisition device at a plurality of acquisition moments;
a detection module configured to perform object detection on a first image in the image acquisition sequence to obtain a 3D detection frame of an object on the first image in a target three-dimensional space and a 2D detection frame on the first image, wherein the first image is any image in the image acquisition sequence other than the initial image of the sequence;
a prediction module configured to predict a 3D prediction frame of a target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for a second image, the second image being a previous image of the first image in the image acquisition sequence;
a determination module configured to determine a tracking result of tracking the target object for the first image according to the 3D detection box, the 2D detection box, the 3D prediction box, and the 2D prediction box.
9. A target tracking device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring an image acquisition sequence, wherein the image acquisition sequence is obtained according to acquired images of image acquisition equipment at a plurality of acquisition moments;
carrying out object detection on a first image in the image acquisition sequence to obtain a 3D detection frame of an object on the first image in a target three-dimensional space and a 2D detection frame on the first image, wherein the first image is any image in the image acquisition sequence other than the initial image of the sequence;
predicting a 3D prediction frame of a target object in the target three-dimensional space and a 2D prediction frame on the first image according to a tracking result of tracking the target object for a second image, wherein the second image is a previous image of the first image in the image acquisition sequence;
determining a tracking result for tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.
CN202111308752.0A 2021-11-05 2021-11-05 Target tracking method, device and storage medium Pending CN114549578A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111308752.0A CN114549578A (en) 2021-11-05 2021-11-05 Target tracking method, device and storage medium
PCT/CN2022/090574 WO2023077754A1 (en) 2021-11-05 2022-04-29 Target tracking method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111308752.0A CN114549578A (en) 2021-11-05 2021-11-05 Target tracking method, device and storage medium

Publications (1)

Publication Number Publication Date
CN114549578A true CN114549578A (en) 2022-05-27

Family

ID=81668543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111308752.0A Pending CN114549578A (en) 2021-11-05 2021-11-05 Target tracking method, device and storage medium

Country Status (2)

Country Link
CN (1) CN114549578A (en)
WO (1) WO2023077754A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765031B (en) * 2024-02-21 2024-05-03 四川盎芯科技有限公司 Image multi-target pre-tracking method and system for edge intelligent equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276783B (en) * 2019-04-23 2021-01-08 上海高重信息科技有限公司 Multi-target tracking method and device and computer system
CN110378259A (en) * 2019-07-05 2019-10-25 桂林电子科技大学 A kind of multiple target Activity recognition method and system towards monitor video
CN112184772A (en) * 2020-09-30 2021-01-05 深兰人工智能(深圳)有限公司 Target tracking method and device
US20210112238A1 (en) * 2020-12-22 2021-04-15 Intel Corporation Method and system of image processing with multi-object multi-view association
CN112883819B (en) * 2021-01-26 2023-12-08 恒睿(重庆)人工智能技术研究院有限公司 Multi-target tracking method, device, system and computer readable storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015184773A (en) * 2014-03-20 2015-10-22 日本ユニシス株式会社 Image processor and three-dimensional object tracking method
US20160109566A1 (en) * 2014-10-21 2016-04-21 Texas Instruments Incorporated Camera Assisted Tracking of Objects in a Radar System
WO2019157924A1 (en) * 2018-02-13 2019-08-22 视辰信息科技(上海)有限公司 Real-time detection method and system for three-dimensional object
CN110163889A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 Method for tracking target, target tracker, target following equipment
KR20210042011A (en) * 2019-10-08 2021-04-16 삼성전자주식회사 Posture tracking method and apparatus performing the same
CN113228103A (en) * 2020-07-27 2021-08-06 深圳市大疆创新科技有限公司 Target tracking method, device, unmanned aerial vehicle, system and readable storage medium
CN112507949A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Target tracking method and device, road side equipment and cloud control platform
CN113468950A (en) * 2021-05-12 2021-10-01 东风汽车股份有限公司 Multi-target tracking method based on deep learning in unmanned driving scene

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468967A (en) * 2023-04-18 2023-07-21 北京百度网讯科技有限公司 Sample image screening method and device, electronic equipment and storage medium
CN116468967B (en) * 2023-04-18 2024-04-16 北京百度网讯科技有限公司 Sample image screening method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023077754A1 (en) 2023-05-11

Similar Documents

Publication Publication Date Title
CN106651955B (en) Method and device for positioning target object in picture
EP3174285A1 (en) Camera shooting angle adjusting method and apparatus, computer program and recording medium
US11176687B2 (en) Method and apparatus for detecting moving target, and electronic equipment
CN111105454B (en) Method, device and medium for obtaining positioning information
US11288531B2 (en) Image processing method and apparatus, electronic device, and storage medium
CN109145679B (en) Method, device and system for sending out early warning information
WO2023077754A1 (en) Target tracking method and apparatus, and storage medium
CN110853095B (en) Camera positioning method and device, electronic equipment and storage medium
WO2019006769A1 (en) Following-photographing method and device for unmanned aerial vehicle
CN107194968B (en) Image identification tracking method and device, intelligent terminal and readable storage medium
CN112432637B (en) Positioning method and device, electronic equipment and storage medium
CN114267041A (en) Method and device for identifying object in scene
CN112330717B (en) Target tracking method and device, electronic equipment and storage medium
CN108154090B (en) Face recognition method and device
CN115407355B (en) Library position map verification method and device and terminal equipment
CN111832338A (en) Object detection method and device, electronic equipment and storage medium
US11252341B2 (en) Method and device for shooting image, and storage medium
CN114430457B (en) Shooting method, shooting device, electronic equipment and storage medium
CN113315903B (en) Image acquisition method and device, electronic equipment and storage medium
CN114693702B (en) Image processing method, image processing device, electronic equipment and storage medium
CN110276841B (en) Motion trail determination method and device applied to augmented reality equipment and terminal
CN114898074A (en) Three-dimensional information determination method and device, electronic equipment and storage medium
CN118097040A (en) Multi-camera joint mapping method and device and terminal equipment
CN117115244A (en) Cloud repositioning method, device and storage medium
CN118118783A (en) Camera anti-shake detection method, camera anti-shake detection device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination