WO2023077754A1 - Target tracking method and apparatus, and storage medium - Google Patents


Info

Publication number: WO2023077754A1
Authority: WIPO (PCT)
Prior art keywords: image, target, frame, tracking, detection
Application number: PCT/CN2022/090574
Other languages: French (fr), Chinese (zh)
Inventors: 梁浩 (Liang Hao), 武鹏 (Wu Peng)
Original assignee: 北京小米移动软件有限公司 (Beijing Xiaomi Mobile Software Co., Ltd.)
Application filed by Beijing Xiaomi Mobile Software Co., Ltd. (北京小米移动软件有限公司)
Publication of WO2023077754A1

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T3/00 Geometric image transformations in the plane of the image
            • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
              • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
          • G06T7/00 Image analysis
            • G06T7/20 Analysis of motion
              • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
              • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
            • G06T7/70 Determining position or orientation of objects or cameras
          • G06T2200/00 Indexing scheme for image data processing or generation, in general
            • G06T2200/32 Indexing scheme involving image mosaicing

Definitions

  • the present disclosure relates to the technical field of computer vision, and in particular to a target tracking method, device, and storage medium.
  • target tracking means, given an image sequence, finding the targets in the sequence, identifying the same target across different frames, and assigning the same ID to that target in each frame.
  • in related technologies, target tracking is usually performed based on 2D information alone.
  • the present disclosure provides a target tracking method, device and storage medium.
  • a target tracking method including:
  • the image acquisition sequence is obtained according to the images acquired by the image acquisition device at multiple acquisition moments;
  • the first image is any image in the image acquisition sequence except the initial image;
  • the second image is the previous image of the first image in the image acquisition sequence;
  • determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image includes:
  • the tracking result includes the 3D tracking location information.
  • determining a tracking result of tracking the target object with respect to the first image further includes:
  • the tracking result includes the 2D tracking location information.
  • the tracking result includes 3D tracking position information of the target object corresponding to the second image, 2D tracking position information, and motion data of the target object;
  • the predicting the 3D prediction frame of the target object in the three-dimensional space and the 2D prediction frame on the first image according to the tracking result of tracking the target object with respect to the second image includes:
  • the 3D tracking position information and the 2D tracking position information are input into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
  • the motion data includes the rate of change of the target object's position on the image, and the velocity and acceleration of the target object in the target three-dimensional space;
  • the tracker is capable of outputting the 2D predicted frame based on the position change rate and the 2D tracked position information, and outputting the 3D predicted frame based on the velocity, the acceleration, and the 3D tracked position information.
  • the first image includes multiple captured images, and the multiple captured images are images captured by multiple image capture devices at the same capture moment;
  • performing object detection on the first image in the image acquisition sequence to obtain the 3D detection frame of the object on the first image in the target three-dimensional space and the 2D detection frame on the first image includes:
  • before determining the tracking result of tracking the target object with respect to the first image, the method further includes:
  • performing non-maximum suppression processing on the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
  • a target tracking device including:
  • the acquisition module is configured to acquire an image acquisition sequence, the image acquisition sequence is obtained according to the images acquired by the image acquisition device at multiple acquisition moments;
  • the detection module is configured to perform object detection on the first image in the image acquisition sequence to obtain the 3D detection frame of the object on the first image in the target three-dimensional space and the 2D detection frame on the first image, where the first image is any image in the image acquisition sequence except the initial image;
  • the prediction module is configured to predict a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image according to the tracking result of the target object being tracked with respect to the second image,
  • the second image is a previous image of the first image in the image acquisition sequence;
  • the determination module is configured to determine a tracking result of tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
  • the determining module is further configured to:
  • the tracking result includes the 3D tracking location information.
  • the determining module is further configured to:
  • the tracking result includes the 2D tracking location information.
  • the tracking result includes 3D tracking position information of the target object corresponding to the second image, 2D tracking position information, and motion data of the target object;
  • the prediction module is further configured to:
  • the 3D tracking position information and the 2D tracking position information are input into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
  • the motion data includes the rate of change of the target object's position on the image, and the velocity and acceleration of the target object in the target three-dimensional space;
  • the tracker is capable of outputting the 2D predicted frame based on the position change rate and the 2D tracked position information, and outputting the 3D predicted frame based on the velocity, the acceleration, and the 3D tracked position information.
  • the first image includes multiple captured images, and the multiple captured images are images captured by multiple image capture devices at the same capture moment;
  • the detection module is further configured to:
  • the determining module is further configured to:
  • a non-maximum suppression process is performed on the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
  • a computer-readable storage medium on which computer program instructions are stored, and when the program instructions are executed by a processor, the steps of the target tracking method provided in the first aspect of the present disclosure are implemented.
  • a target tracking device including:
  • memory for storing processor-executable instructions
  • the processor is configured as:
  • the image acquisition sequence is obtained according to the images acquired by the image acquisition device at multiple acquisition moments;
  • the first image is any image in the image acquisition sequence except the initial image;
  • the second image is the previous image of the first image in the image acquisition sequence;
  • a computer program product including a computer program executable by a programmable device, the computer program having code portions that, when executed by the programmable device, implement the steps of the target tracking method provided in the first aspect of the present disclosure.
  • the technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: the tracking result of the target object is determined through both 3D frame information (for example, the 3D detection frame and the 3D prediction frame) and 2D frame information (for example, the 2D detection frame and the 2D prediction frame).
  • by introducing the 3D frame information, the target tracking method of the present disclosure can continue to perform motion estimation on the target object in the target three-dimensional space for a period of time after the target object is lost, which improves the probability of a successful match after the target object reappears and reduces the ID switching caused by the target being occluded or leaving the field of view, that is, reduces erroneous tracking of the target.
  • combining the 3D frame information and the 2D frame information to determine the tracking result of the target object can improve the tracking accuracy of the target object.
  • Fig. 1 is a flow chart showing a method for tracking a target according to an exemplary embodiment.
  • Fig. 2 is a flow chart of determining a 3D detection frame and a 2D detection frame according to an exemplary embodiment.
  • Fig. 3 is a block diagram of an object tracking device according to an exemplary embodiment.
  • Fig. 4 is a block diagram of a device for target tracking according to an exemplary embodiment.
  • Fig. 5 is a block diagram of a device for target tracking according to an exemplary embodiment.
  • the object tracking method of the present disclosure can be applied to different scenarios. For example, it can be applied to automatic driving scenarios to track targets in images collected by image acquisition devices on vehicles. For another example, it can be applied to a traffic monitoring scene to track a target in an image captured by an image acquisition device in a traffic monitoring system.
  • the application scenarios of the target tracking method mentioned in the present disclosure are only some examples or embodiments of the present disclosure; those of ordinary skill in the art can, without creative effort, also apply the method to other similar scenarios, for example, target tracking for a mobile robot, which is not limited in the present disclosure.
  • 2D information is usually used to track targets in images captured by one or more image acquisition devices.
  • due to the affine transformation involved in projecting targets onto 2D images, it is difficult to accurately track targets on 2D images using only 2D information, which can result in matching the wrong ID to a target; moreover, when 2D information alone is used for target tracking, once a target is lost it is difficult to recover.
  • in addition, targets in images acquired by multiple image acquisition devices are usually tracked separately, which is not only inefficient but also cannot handle targets that overlap across images of adjacent image acquisition devices.
  • Fig. 1 is a flowchart of a target tracking method according to an exemplary embodiment. As shown in Fig. 1, the method includes the following steps.
  • Step 110: acquire an image acquisition sequence, which is obtained according to the images acquired by the image acquisition device at multiple acquisition moments.
  • the image acquisition sequence may be obtained according to images acquired by one or more image acquisition devices at multiple acquisition moments.
  • for a single image acquisition device, the acquired image at each acquisition moment in the image acquisition sequence may be the image captured by that device at the acquisition moment; for multiple image acquisition devices, the acquired image at each acquisition moment may be the set of images captured by the multiple devices at that acquisition moment.
  • taking as an example that the images acquired by image acquisition device 1 at acquisition times t1, t2, and t3 are P1, P2, and P3, the image acquisition sequence 1 may be (P1, P2, P3).
  • taking as an example that the images acquired by image acquisition device 1 at acquisition times t1, t2, and t3 are P11, P12, and P13, the images acquired by image acquisition device 2 at those times are P21, P22, and P23, and the images acquired by image acquisition device 3 at those times are P31, P32, and P33, the image acquisition sequence 2 may be ((P11, P21, P31), (P12, P22, P32), (P13, P23, P33)).
  • image capture devices may include, but are not limited to, video cameras and cameras.
  • the image acquisition device can be set at a preset fixed position or in a mobile device, and the preset fixed position and the mobile device can be specifically set according to actual needs.
  • a mobile device could be an autonomous vehicle.
  • the image capture device may be one or more cameras included in the autonomous vehicle.
  • the capture directions of the multiple image capture devices may be different.
  • the acquisition directions of the image acquisition devices 1-3 may be left direction, forward direction, right direction, etc. respectively.
  • the collection directions of multiple image collection devices may be specifically set according to actual conditions, and this disclosure does not impose any limitation on this.
  • an image capture sequence may be acquired based on video captured by one or more image capture devices.
  • the captured images in the image capturing sequence may be image frames included in the video.
  • Step 120: perform object detection on the first image in the image acquisition sequence to obtain the 3D detection frame of the object on the first image in the target three-dimensional space and the 2D detection frame on the first image, where the first image is any image in the image acquisition sequence except the initial image.
  • the object on the first image may refer to one or more objects included in the first image, and the objects may include different types of objects.
  • the objects on the first image may include objects of the pedestrian category and objects of the vehicle category.
  • object detection may be performed on the first image according to a monocular 3D detection algorithm.
  • the monocular 3D detection algorithm may include, but is not limited to, the fully convolutional one-stage monocular 3D object detection method (Fully Convolutional One-Stage Monocular 3D Object Detection, FCOS3D) and the real-time monocular 3D object detection algorithm (Real-time Monocular 3D Object Detection, RTM3D).
  • the 3D detection frame of the object included in the image in the three-dimensional space (for example, the camera coordinate system) of the image acquisition device and the 2D detection frame in the image coordinate system of the image can be simultaneously obtained through the monocular 3D detection algorithm.
  • the 3D detection frame in the three-dimensional space of the image acquisition device obtained by the detection algorithm may be represented by (x, y, z, rot, w, h, l), where (x, y, z) represents the coordinates of the center point of the 3D detection frame in the three-dimensional space of the image acquisition device, rot represents the heading angle of the 3D detection frame, and (w, h, l) represent the width, height, and length of the 3D detection frame, respectively.
  • the 2D detection frame obtained by the detection algorithm may be represented by (x1, y1, x2, y2), where (x1, y1) represents the coordinates of the upper-left corner of the 2D detection frame in the image coordinate system, and (x2, y2) represents the coordinates of the lower-right corner of the 2D detection frame in the image coordinate system.
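  • to make the two parameterizations concrete, the following is a minimal sketch; the class and field names are illustrative and not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """3D detection frame in a camera (or target) coordinate system."""
    x: float    # center point coordinates
    y: float
    z: float
    rot: float  # heading angle
    w: float    # width
    h: float    # height
    l: float    # length

@dataclass
class Box2D:
    """2D detection frame in an image coordinate system."""
    x1: float   # upper-left corner
    y1: float
    x2: float   # lower-right corner
    y2: float
```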
  • there may be one or more image acquisition devices.
  • the first image may be an acquired image.
  • the first image may be P2 or P3.
  • object detection may be performed on the captured image to obtain the 3D detection frame of the object on the captured image in the three-dimensional space of the image capture device and the 2D detection frame of the object on the captured image, where the 2D detection frame on the captured image is the 2D detection frame in the image coordinate system of the captured image.
  • the 3D detection frame in the three-dimensional space of the image capture device may be determined as the 3D detection frame in the target three-dimensional space, or the 3D detection frame in the three-dimensional space of the image capture device may be determined as The 3D detection frame is mapped to the target coordinate system, and the 3D detection frame in the target coordinate system is determined as the 3D detection frame in the target three-dimensional space.
  • for specific details about the target coordinate system, reference may be made to FIG. 2 and its related descriptions, which will not be repeated here.
  • the first image may include multiple capture images, and the multiple capture images may be images captured by multiple image capture devices at the same capture time.
  • the first image may include P 12 , P 22 and P 32 , or include P 13 , P 23 and P 33 .
  • object detection may be performed on the multiple captured images to obtain a 3D detection frame of the object on each captured image in the three-dimensional space of each image capture device And a 2D detection frame on the captured image.
  • the 3D detection frames in the three-dimensional spaces of the respective image acquisition devices and the 2D detection frames on the respective captured images can then be processed to obtain the 3D detection frames in the target three-dimensional space and the 2D detection frames on the first image.
  • for details about determining the 3D detection frame and the 2D detection frame when the first image includes multiple captured images, refer to FIG. 2 and its related descriptions, which will not be repeated here.
  • Step 130: predict the 3D prediction frame of the target object in the target three-dimensional space and the 2D prediction frame on the first image according to the tracking result of tracking the target object with respect to the second image, where the second image is the previous image of the first image in the image acquisition sequence.
  • the first image and the second image may be images acquired by the image acquisition device at different acquisition moments, where the second image is the image acquired at the acquisition moment immediately preceding that of the first image, that is, the second image is the previous image of the first image in the image acquisition sequence.
  • the target object may be one or more objects included in the second image, and the target objects may include objects of different categories. For example, objects of pedestrian class and objects of vehicle class etc.
  • the tracking result of tracking the target object with respect to the second image may include 3D tracking position information, 2D tracking position information and motion data of the target object corresponding to the second image.
  • the manner of determining the 3D tracking position information and the 2D tracking position information reference may be made to the following step 140 and related descriptions, and details are not repeated here.
  • the motion data may include the rate of change of the target object's position on the image and the velocity and acceleration of the target object in the target three-dimensional space.
  • the position change rate of the target object on the image may refer to the position change rate between the 2D tracking information of the target object on the second image and the 2D tracking information of the target object on the first image. For details about determining the position change rate, reference may be made to the relevant description of the tracker below, and details are not repeated here.
  • the velocity and acceleration of the target object in the target three-dimensional space may refer to the velocity and acceleration of the target object at the acquisition moment corresponding to the second image.
  • for example, when the target object is a pedestrian and the second image is acquired at time t2, the velocity and acceleration of the target object in the target three-dimensional space may be the velocity and acceleration of the pedestrian at time t2; it can be understood that these are the walking speed and acceleration of the pedestrian in real space.
  • the velocity and acceleration of the mobile device at the moment of capturing the second image may be determined as the velocity and acceleration of the target object in the target three-dimensional space.
  • the motion data may be the speed and acceleration of the vehicle at time t2. It can be understood that the speed and acceleration are the driving speed and acceleration of the vehicle in real space.
  • the 3D prediction frame of the target object in the target three-dimensional space and the 2D prediction frame on the first image may be predicted according to the tracking result of the target object tracked by the tracker on the second image.
  • predicting the 3D prediction frame of the target object in the target three-dimensional space and the 2D prediction frame on the first image according to the tracking result of tracking the target object with respect to the second image includes: updating the tracker according to the motion data, and inputting the 3D tracking position information and the 2D tracking position information into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
  • the tracker is capable of outputting the 2D prediction frame based on the position change rate and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration, and the 3D tracking position information. In some embodiments, the rotation angle of the 3D prediction frame set by the tracker remains unchanged.
  • the 2D tracking information of the target object may be a 2D detection frame corresponding to the 2D tracking information.
  • the 2D tracking information can be represented by (cx, cy, w, h), where (cx, cy) represents the coordinates of the center point of the corresponding 2D detection frame in the image coordinate system of the first image, and (w, h) represent the width and height of the 2D detection frame.
  • the 3D tracking information of the target object may be a 3D detection frame corresponding to the 3D tracking information.
  • the 3D tracking information can be represented by (x, y, rot), where (x, y) represents the coordinates of the center point of the corresponding 3D detection frame in the target three-dimensional space, and rot represents the rotation angle of the 3D detection frame.
  • the 3D prediction frame of each target object in the target three-dimensional space and the 2D prediction frame on the first image may be predicted according to the tracking results obtained by multiple trackers, each tracking one target object with respect to the second image.
  • the tracker corresponding to the target object may include the state transition function of the target object, and the 3D prediction frame of the target object in the target three-dimensional space and the 2D prediction frame on the first image are predicted according to the state transition function and the tracking result of tracking the target object with respect to the second image.
  • the state transition function of the target object included in the corresponding tracker is given by formula (1) (not reproduced in this text), in which (Vcx, Vcy, Vw, Vh) represent the rate of change of the target object's position on the image, corresponding to (cx, cy, w, h); Vx and Vy represent the velocities of the target object along the X-axis and Y-axis directions of the target three-dimensional space, respectively; and ax and ay represent the accelerations of the target object along the X-axis and Y-axis directions of the target three-dimensional space, respectively.
  • the initial value of the rate of change of the target object's position on the image is 0.
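  • since formula (1) is not reproduced here, the following sketch shows one plausible reading of the described state transition, assuming a time step dt between acquisition moments: a constant-velocity model for the 2D state (cx, cy, w, h) and a constant-acceleration model for the planar 3D state (x, y), with the rotation angle rot held fixed. The state layouts are assumptions for illustration:

```python
import numpy as np

def predict_2d(state_2d: np.ndarray, dt: float) -> np.ndarray:
    """Constant-velocity prediction of the 2D state
    [cx, cy, w, h, Vcx, Vcy, Vw, Vh]."""
    cx, cy, w, h, vcx, vcy, vw, vh = state_2d
    return np.array([cx + vcx * dt, cy + vcy * dt,
                     w + vw * dt, h + vh * dt,
                     vcx, vcy, vw, vh])

def predict_3d(state_3d: np.ndarray, dt: float) -> np.ndarray:
    """Constant-acceleration prediction of the 3D state
    [x, y, rot, Vx, Vy, ax, ay]; rot is kept unchanged because the
    detected heading angle has a large uncertainty."""
    x, y, rot, vx, vy, ax, ay = state_3d
    return np.array([x + vx * dt + 0.5 * ax * dt ** 2,
                     y + vy * dt + 0.5 * ay * dt ** 2,
                     rot,
                     vx + ax * dt,
                     vy + ay * dt,
                     ax, ay])
```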
  • take as an example that the second image is P1, which includes target objects 1-5, and the first image is P2, which includes objects 1-5.
  • the following will describe the process of the tracker 1 corresponding to the target object 1 predicting the 3D prediction frame and the 2D prediction frame of the target object 1 for each captured image in the image capture sequence 1 with reference to an example.
  • the 2D detection frame and the 3D detection frame of the target object 1 corresponding to the second image P1 can be obtained.
  • since P1 is the initial image, the position change rate 1 corresponding to target object 1 is 0, and the 3D tracking position information and 2D tracking position information of target object 1 corresponding to the second image P1 are the 3D detection frame and 2D detection frame obtained by object detection (that is, obtained through the above step 120).
  • the velocity 1 and acceleration 1 of target object 1 at the acquisition moment corresponding to the second image P1 can be detected, and tracker 1 is updated according to position change rate 1, velocity 1, and acceleration 1, that is, the above formula (1) is updated; the 3D tracking position information and the 2D tracking position information are then input into the updated tracker 1 to obtain the 3D prediction frame and 2D prediction frame of target object 1 corresponding to the image at the next moment (that is, the first image P2).
  • the 3D detection frames and the 2D detection frames of the objects on the first image P2 can be obtained through object detection, and the object in the first image P2 that belongs to the same target as target object 1 (denoted object 1) can be determined by means of the following step 140.
  • the position change rate 2 of target object 1 between the second image P1 and the first image P2 can then be obtained from the 2D detection frame corresponding to object 1 and the 2D detection frame of target object 1; using position change rate 2 and the velocity and acceleration of target object 1 at the acquisition moment of the first image P2, the 3D prediction frame and the 2D prediction frame of target object 1 corresponding to the third image P3 can be obtained.
  • the position information of the target object 1 on each captured image in the image capture sequence can be obtained, so as to realize the tracking of the target object 1 on the image sequence.
  • the tracker uses a constant velocity model to realize the prediction or motion estimation of the 2D frame, which can reduce the calculation amount of predicting the 2D frame.
  • the tracker uses a uniform acceleration model to realize prediction or motion estimation of the 3D frame, which can improve the prediction accuracy of the 3D frame.
  • the angle of the 3D detection frame obtained through object detection has a large uncertainty, which is why the tracker keeps the rotation angle of the 3D prediction frame unchanged.
  • Step 140: determine a tracking result of tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
  • in some embodiments, before determining the tracking result of tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame, the method further includes: performing non-maximum suppression processing on the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
  • by performing non-maximum suppression processing, overlapping 3D frames (that is, 3D detection frames or 3D prediction frames) and 2D frames (that is, 2D detection frames or 2D prediction frames) can be filtered out, preventing overlapping frames from affecting the subsequent matching between 3D frames and between 2D frames; this improves the accuracy of the subsequent matching and thus the accuracy of target tracking.
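  • the patent does not specify the non-maximum suppression variant; the following is a minimal sketch of standard greedy NMS for 2D frames (for 3D frames, the same procedure would apply with a 3D or bird's-eye-view IoU):

```python
import numpy as np

def iou_2d(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two 2D frames given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms_2d(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.5):
    """Greedy non-maximum suppression; returns indices of kept frames."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = np.array([iou_2d(boxes[i], boxes[j]) for j in rest])
        order = rest[ious <= thresh]  # drop frames overlapping the kept one
    return keep
```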
  • determining the tracking result of tracking the target object with respect to the first image includes: determining, from the 3D detection frames, a target 3D detection frame matching the 3D prediction frame according to the first intersection ratio and/or distance value between the 3D prediction frame and each 3D detection frame; taking the object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, and taking the position information of the target 3D detection frame in the target three-dimensional space as the 3D tracking position information of the target object, where the tracking result includes the 3D tracking position information.
  • the first intersection ratio may refer to the overlap rate between the 3D prediction frame and the 3D detection frame, that is, the ratio of the intersection to the union of the 3D prediction frame and the 3D detection frame.
  • there may be one or more target objects, and correspondingly, the 3D prediction frame of the target object in the target three-dimensional space may also include one or more.
  • the first intersection ratio matrix may be determined based on the first intersection ratio between each 3D prediction frame and each 3D detection frame.
  • the distance value may be the distance between the center points of the 3D prediction frame and the 3D detection frame in the target three-dimensional space. The distance may include, but is not limited to, Manhattan distance or Euclidean distance, among others.
  • a distance matrix may be determined based on the distance value between each 3D prediction frame and each 3D detection frame.
  • the first intersection ratio matrix or the distance matrix can be solved according to the Hungarian algorithm to determine the matching result between the 3D prediction frames and the 3D detection frames, that is, to determine from the 3D detection frames the target 3D detection frame matching each 3D prediction frame.
  • target objects or objects can be different classes of objects.
  • alternatively, the first intersection ratio matrix and the distance matrix may each be determined and solved separately according to the Hungarian algorithm to determine the matching result between the 3D prediction frames and the 3D detection frames, and thereby the target 3D detection frame matching each 3D prediction frame.
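  • as a sketch of the matching step, SciPy's linear_sum_assignment implements the Hungarian algorithm; the IoU matrix could be built with a routine like iou_2d above (or a 3D/BEV IoU for 3D frames), and the gating threshold min_iou below is an assumption, since the patent only names the algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frames(iou_matrix: np.ndarray, min_iou: float = 0.1):
    """Match prediction frames (rows) to detection frames (columns)
    by maximizing the total IoU of the assignment."""
    rows, cols = linear_sum_assignment(-iou_matrix)  # negate to maximize
    matches = []
    unmatched_preds = set(range(iou_matrix.shape[0]))
    unmatched_dets = set(range(iou_matrix.shape[1]))
    for r, c in zip(rows, cols):
        if iou_matrix[r, c] >= min_iou:  # reject weak assignments
            matches.append((r, c))
            unmatched_preds.discard(r)
            unmatched_dets.discard(c)
    return matches, sorted(unmatched_preds), sorted(unmatched_dets)
```

  • when a distance matrix is used instead, it would be passed to the solver without negation, since distances are minimized rather than maximized.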
  • take as an example that the 3D prediction frames of target objects 1-5 are N3D-1, N3D-2, N3D-3, N3D-4, and N3D-5, the first image includes objects 1-5, and the 3D detection frames of objects 1-5 are M3D-1, M3D-2, M3D-3, M3D-4, and M3D-5; if solving the first intersection ratio matrix and the distance matrix according to the Hungarian algorithm yields the matching results (N3D-1, M3D-2), (N3D-2, M3D-1), and (N3D-3, M3D-3), then the target 3D detection frame matching 3D prediction frame N3D-1 is 3D detection frame M3D-2, the target 3D detection frame matching 3D prediction frame N3D-2 is 3D detection frame M3D-1, and the target 3D detection frame matching 3D prediction frame N3D-3 is 3D detection frame M3D-3.
  • determining the tracking result of tracking the target object with respect to the first image further includes: for each 3D detection frame that does not match any 3D prediction frame, determining the 2D detection frame corresponding to the same object as that 3D detection frame, and determining the target 2D detection frame matching the 2D prediction frame according to the second intersection ratio between that 2D detection frame and the 2D prediction frame; taking the object corresponding to the target 2D detection frame on the first image as the target object, and taking the position information of the target 2D detection frame on the first image as the 2D tracking position information of the target object, where the tracking result includes the 2D tracking position information.
  • the method of determining the second intersection ratio between the 2D detection frame and the 2D prediction frame is similar to that of the first intersection ratio, and will not be repeated here.
  • the second intersection ratio matrix can be determined according to the second intersection ratios between the 2D detection frames and the 2D prediction frames, and solved according to the Hungarian algorithm to determine the matching result between the 2D detection frames and the 2D prediction frames, so as to determine the target 2D detection frame matching each 2D prediction frame.
  • continuing the above example, the 3D detection frames that do not match any 3D prediction frame are M3D-4 and M3D-5; the 2D detection frames M2D-4 and M2D-5 corresponding to the same objects as M3D-4 and M3D-5 are determined, the second intersection ratio matrix between M2D-4, M2D-5 and the 2D prediction frames N2D-1, N2D-2, N2D-3, N2D-4, and N2D-5 (that is, the 2D prediction frames of target objects 1-5) is computed, and solving this matrix according to the Hungarian algorithm yields the matching results (M2D-4, N2D-4) and (M2D-5, N2D-5) between the 2D detection frames and the 2D prediction frames.
  • the 3D prediction frame and the 2D prediction frame are the 3D frame and the 2D frame corresponding to the target object in the second image, while the 3D detection frame and the 2D detection frame are the 3D frame and the 2D frame corresponding to the object in the first image.
  • by matching the prediction frames with the detection frames, the target object and the object belonging to the same target in the second image and the first image can be determined, thereby realizing target tracking. For example, taking the above matching result (M2D-4, N2D-4), it can be determined that target object 4 in the second image at the previous moment and object 4 in the first image at the next moment belong to the same target.
  • target objects and objects belonging to the same target may be assigned the same ID.
  • in the above embodiment, the 3D detection frames are first matched with the 3D prediction frames, and then the unmatched 2D detection frames are matched with the 2D prediction frames; that is, a two-stage matching of 3D frames (that is, 3D detection frames and 3D prediction frames) and 2D frames (that is, 2D detection frames and 2D prediction frames) is adopted.
  • in this way, missed matches can be reduced; that is, the matching accuracy between target objects in the image at the previous moment and objects in the image at the next moment is improved, missed matches between target objects and objects are reduced, and the tracking accuracy for the same target across images at different moments is thereby improved.
  • in some embodiments, a new tracker may be created for an object whose 3D detection frame and 2D detection frame are not successfully matched to any prediction frame.
  • in some embodiments, the tracker corresponding to a target object whose 3D prediction frame and 2D prediction frame fail to match for a preset number of times may be discarded.
  • the preset number of times can be set according to actual conditions. As described above, the 3D prediction frame and the 2D prediction frame correspond to two-stage matching; if a target object obtains no matching result in the two-stage matching for the preset number of times, the tracker is considered to have lost the target object, and the tracker corresponding to that target object is discarded.
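  • a minimal sketch of this tracker lifecycle, assuming trackers are kept in a dict keyed by ID and that max_misses stands in for the preset number of times (all names are illustrative, not from the patent):

```python
def update_trackers(trackers: dict, matches: dict, unmatched_detections: list,
                    max_misses: int = 3) -> dict:
    """Tracker lifecycle after two-stage matching.
    `matches` maps tracker ID -> matched detection frame."""
    next_id = max(trackers, default=-1) + 1
    survivors = {}
    for tid, trk in trackers.items():
        if tid in matches:
            trk["misses"] = 0
            trk["box"] = matches[tid]        # take over the matched detection
            survivors[tid] = trk
        else:
            trk["misses"] += 1
            if trk["misses"] < max_misses:   # keep estimating for a while so
                survivors[tid] = trk         # the target can be re-acquired
    for det in unmatched_detections:         # new target: create a new tracker
        survivors[next_id] = {"box": det, "misses": 0}
        next_id += 1
    return survivors
```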
  • in the embodiments of the present disclosure, the tracking result of the target object is determined through both 3D frame and 2D frame information; by introducing the 3D information, the target tracking method of the present disclosure can continue to perform motion estimation on the target object in the target three-dimensional space for a period of time after the target object is lost, which improves the probability of a successful match after the target object reappears and reduces the ID switching caused by the target being occluded or leaving the field of view, that is, reduces erroneous tracking of the target.
  • Fig. 2 is a flowchart of determining a 3D detection frame and a 2D detection frame according to an exemplary embodiment. As shown in Fig. 2, the method includes the following steps.
  • Step 210: perform object detection on the multiple captured images to obtain the 3D detection frame of the object on each captured image in the three-dimensional space of the corresponding image capture device and the 2D detection frame on that captured image.
  • the specific details of step 210 are similar to those of step 120; for details, refer to the above step 120 and its related descriptions, which will not be repeated here.
  • for example, object detection can be performed on captured image P1 to obtain the 3D detection frame of the object on captured image P1 in the three-dimensional space of image capture device 1 and the 2D detection frame in the image coordinate system of captured image P1; the object detection on captured images P2 and P3 is similar and will not be repeated here.
  • Step 220: map the 3D detection frames located in the three-dimensional spaces of different image capture devices to the same target coordinate system according to the extrinsic parameters of each image capture device, where the target three-dimensional space is the space defined by the target coordinate system.
  • the target coordinate system can be determined according to the position set by the image acquisition device.
  • the target coordinate system may be the ego-vehicle coordinate system corresponding to the first image in the image acquisition sequence.
  • the target coordinate system may be a coordinate system determined based on the preset fixed position, and the origin, X axis, Y axis, and Z axis of the coordinate system may be specifically set according to actual conditions .
  • the extrinsic parameters of each image acquisition coordinate system may reflect the pose relationship between the image acquisition coordinate system and the target coordinate system. Extrinsic parameters can include translation parameters and rotation parameters.
  • the external parameters of each image acquisition coordinate system can be obtained by calibrating the image acquisition device. Regarding the calibration of the image acquisition device, reference may be made to related technologies, which will not be repeated here.
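  • a minimal sketch of the mapping, assuming the extrinsics of a camera are given as a rotation matrix R and a translation vector t from that camera's coordinate system to the target coordinate system:

```python
import numpy as np

def camera_to_target(points_cam: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Map 3D points (N, 3) from a camera coordinate system into the
    common target coordinate system: p_target = R @ p_cam + t."""
    return points_cam @ R.T + t
```

  • the heading angle of a 3D detection frame must be rotated as well; for a yaw-only extrinsic rotation by angle yaw_cam, the heading would become rot + yaw_cam.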
  • Step 230: stitch the multiple captured images to obtain a stitched image, and map the 2D detection frames on the multiple captured images onto the stitched image, where the 2D detection frame on the first image is the 2D detection frame on the stitched image.
  • the 2D detection frame on the stitched image may refer to the 2D detection frame in the image coordinate system of the stitched image.
  • in some embodiments, the 2D detection frames in the image coordinate systems of the respective captured images can be converted to the image coordinate system of the stitched image, that is, 2D detection frames corresponding to different image coordinate systems are transformed into the same image coordinate system.
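  • the patent does not specify the stitching layout; assuming the captured images are simply concatenated side by side from left to right, mapping a 2D detection frame into the stitched image reduces to a horizontal offset:

```python
def to_stitched(box_2d, image_index: int, widths: list):
    """Shift a 2D frame (x1, y1, x2, y2) from the coordinate system of
    captured image `image_index` into the stitched image."""
    x_off = sum(widths[:image_index])  # total width of images to the left
    x1, y1, x2, y2 = box_2d
    return (x1 + x_off, y1, x2 + x_off, y2)
```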
  • by mapping the 3D detection frames in the three-dimensional spaces of different image acquisition devices to the same target coordinate system and mapping the 2D detection frames on the multiple captured images onto the stitched image, the detection results of the multiple image acquisition devices are fused, and the targets in the multiple captured images can be tracked at the same time, so that only one tracking algorithm is needed to track the same target across the images of different image acquisition devices.
  • this avoids the inefficiency caused by separately tracking the target in the captured image of each image capture device, and reduces ID switching of the same target across different image capture devices.
  • Fig. 3 is a block diagram of an object tracking device 300 according to an exemplary embodiment.
  • the device includes an acquisition module 310 , a detection module 320 , a prediction module 330 and a determination module 340 .
  • the acquisition module 310 is configured to acquire an image acquisition sequence, the image acquisition sequence is obtained according to the acquired images of the image acquisition device at multiple acquisition moments;
  • the detection module 320 is configured to perform object detection on the first image in the image acquisition sequence to obtain the 3D detection frame of the object on the first image in the target three-dimensional space and the 2D detection frame on the first image, where the first image is any image in the image acquisition sequence except the initial image;
  • the prediction module 330 is configured to predict a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image according to the tracking result of the target object being tracked on the second image,
  • the second image is a previous image in the image acquisition sequence of the first image;
  • the determining module 340 is configured to determine a tracking result of tracking the target object with respect to the first image according to the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
  • the determining module 340 is further configured to:
  • the tracking result includes the 3D tracking location information.
  • the determining module 340 is further configured to:
  • the tracking result includes the 2D tracking location information.
  • the tracking result includes 3D tracking position information of the target object corresponding to the second image, 2D tracking position information, and motion data of the target object;
  • the prediction module 330 is further configured to:
  • the 3D tracking position information and the 2D tracking position information are input into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
  • the motion data includes the rate of change of the target object's position on the image, and the velocity and acceleration of the target object in the target three-dimensional space;
  • the tracker is capable of outputting the 2D predicted frame based on the position change rate and the 2D tracked position information, and outputting the 3D predicted frame based on the velocity, the acceleration, and the 3D tracked position information.
  • the first image includes multiple captured images, and the multiple captured images are images captured by multiple image capture devices at the same capture moment;
  • the detection module 320 is further configured to:
  • the determining module 340 is further configured to:
  • a non-maximum suppression process is performed on the 3D detection frame, the 2D detection frame, the 3D prediction frame, and the 2D prediction frame.
  • the present disclosure also provides a computer-readable storage medium, on which computer program instructions are stored, and when the program instructions are executed by a processor, the steps of the target tracking method provided in the present disclosure are realized.
  • Fig. 4 is a block diagram of an apparatus 400 for object tracking according to an exemplary embodiment.
  • the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • the device 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and communication component 416 .
  • the processing component 402 generally controls the overall operations of the device 400, such as those associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 402 may include one or more processors 420 to execute instructions to complete all or part of the steps of the above method for object tracking.
  • processing component 402 may include one or more modules that facilitate interaction between processing component 402 and other components.
  • processing component 402 may include a multimedia module to facilitate interaction between multimedia component 408 and processing component 402 .
  • the memory 404 is configured to store various types of data to support operations at the device 400 . Examples of such data include instructions for any application or method operating on device 400, contact data, phonebook data, messages, pictures, videos, etc.
  • the memory 404 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Power component 406 provides power to various components of device 400 .
  • Power components 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 400 .
  • the multimedia component 408 includes a screen that provides an output interface between the device 400 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.
  • the multimedia component 408 includes a front camera and/or a rear camera. When the device 400 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • the audio component 410 is configured to output and/or input audio signals.
  • the audio component 410 includes a microphone (MIC), which is configured to receive external audio signals when the device 400 is in operation modes, such as call mode, recording mode and voice recognition mode. Received audio signals may be further stored in memory 404 or sent via communication component 416 .
  • the audio component 410 also includes a speaker for outputting audio signals.
  • the I/O interface 412 provides an interface between the processing component 402 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
  • Sensor assembly 414 includes one or more sensors for providing status assessments of various aspects of device 400 .
  • for example, the sensor component 414 can detect the open/closed state of the device 400 and the relative positioning of components (such as the display and keypad of the device 400), and can also detect a change in the position of the device 400 or of a component of the device 400, the presence or absence of user contact with the device 400, the orientation or acceleration/deceleration of the device 400, and a temperature change of the device 400.
  • the sensor assembly 414 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 414 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices.
  • the device 400 can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 416 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.
  • in an exemplary embodiment, the apparatus 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above target tracking method.
  • in an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 404 including instructions, which can be executed by the processor 420 of the device 400 to implement the above target tracking method.
  • the non-transitory computer readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • in another exemplary embodiment, there is also provided a computer program product comprising a computer program executable by a programmable device, the computer program having code portions for performing the above target tracking method when executed by the programmable device.
  • Fig. 5 is a block diagram of an apparatus 500 for object tracking according to an exemplary embodiment.
  • the apparatus 500 may be provided as a server.
  • apparatus 500 includes processing component 522, which further includes one or more processors, and memory resources represented by memory 532 for storing instructions executable by processing component 522, such as application programs.
  • the application program stored in memory 532 may include one or more modules each corresponding to a set of instructions.
  • the processing component 522 is configured to execute instructions to perform the above object tracking method.
  • Device 500 may also include a power component 526 configured to perform power management of device 500 , a wired or wireless network interface 550 configured to connect device 500 to a network, and an input-output (I/O) interface 558 .
  • the apparatus 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a target tracking method and apparatus, and a storage medium. The method comprises: acquiring an image collection sequence; performing object detection on a first image in the image collection sequence to obtain a 3D detection box, in a target three-dimensional space, of an object on the first image, and a 2D detection box of the same on the first image, wherein the first image is any image in the image collection sequence other than the initial image; predicting, according to a tracking result of tracking a target object on the basis of a second image, a 3D prediction box of the target object in the target three-dimensional space and a 2D prediction box of the same on the first image, wherein the second image is the previous image of the first image in the image collection sequence; and determining, according to the 3D detection box, the 2D detection box, the 3D prediction box, and the 2D prediction box, a tracking result of tracking the target object on the basis of the first image. By means of the target tracking method of the present disclosure, the accuracy of tracking a target object can be improved.

Description

目标跟踪方法、装置及存储介质Target tracking method, device and storage medium 技术领域technical field
本公开涉及计算机视觉技术领域,尤其涉及一种目标跟踪方法、装置及存储介质。The present disclosure relates to the technical field of computer vision, and in particular to an object tracking method, device and storage medium.
背景技术Background technique
目标跟踪是通过给定一个图像序列,找到图像序列中的目标,将不同帧的同一目标进行识别,并为不同帧的同一目标赋予ID。相关技术中,通常根据2D信息进行目标跟踪,然而,仅利用2D信息难以对目标进行准确的运动估计,导致出现错误跟踪的情况。Target tracking is to find the target in the image sequence by giving an image sequence, identify the same target in different frames, and assign ID to the same target in different frames. In related technologies, target tracking is usually performed based on 2D information. However, it is difficult to accurately estimate the target's motion only by using 2D information, resulting in false tracking.
发明内容Contents of the invention
为克服相关技术中存在的问题,本公开提供一种目标跟踪方法、装置及存储介质。In order to overcome the problems existing in related technologies, the present disclosure provides a target tracking method, device and storage medium.
According to a first aspect of the embodiments of the present disclosure, a target tracking method is provided, including:

acquiring an image acquisition sequence, where the image acquisition sequence is obtained from images captured by an image capture device at multiple capture moments;

performing object detection on a first image in the image acquisition sequence to obtain a 3D detection frame, in a target three-dimensional space, of each object on the first image and a 2D detection frame of the object on the first image, where the first image is any image in the image acquisition sequence other than the initial image;

predicting, according to a tracking result of tracking a target object with respect to a second image, a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame of the target object on the first image, where the second image is the image preceding the first image in the image acquisition sequence;

determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image.

In some embodiments, determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, the tracking result of tracking the target object with respect to the first image includes:

determining, from the 3D detection frames, a target 3D detection frame matching the 3D prediction frame according to a first intersection-over-union value and/or a distance value between the 3D prediction frame and each 3D detection frame;

taking the object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, and taking the position information of the target 3D detection frame in the target three-dimensional space as 3D tracking position information of the target object, where the tracking result includes the 3D tracking position information.
In some embodiments, determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, the tracking result of tracking the target object with respect to the first image further includes:

for each 3D detection frame not matched to the 3D prediction frame, determining a 2D detection frame corresponding to the same object as that 3D detection frame, and determining a target 2D detection frame matching the 2D prediction frame according to a second intersection-over-union value between that 2D detection frame and the 2D prediction frame;

taking the object corresponding to the target 2D detection frame on the first image as the target object, and taking the position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, where the tracking result includes the 2D tracking position information.

In some embodiments, the tracking result includes 3D tracking position information and 2D tracking position information of the target object corresponding to the second image, as well as motion data of the target object;

predicting, according to the tracking result of tracking the target object with respect to the second image, the 3D prediction frame of the target object in the three-dimensional space and the 2D prediction frame on the first image includes:

updating a tracker according to the motion data;

inputting the 3D tracking position information and the 2D tracking position information into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.

In some embodiments, the motion data includes a rate of change of the target object's position on the image, and the velocity and acceleration of the target object in the target three-dimensional space;

the tracker is capable of outputting the 2D prediction frame based on the position change rate and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration and the 3D tracking position information.
In some embodiments, the first image includes multiple captured images, the multiple captured images being images captured by multiple image capture devices at the same capture moment;

performing object detection on the first image in the image acquisition sequence to obtain the 3D detection frame of each object on the first image in the target three-dimensional space and the 2D detection frame on the first image includes:

performing object detection on each of the multiple captured images to obtain, for each object on each captured image, a 3D detection frame in the three-dimensional space of each image capture device and a 2D detection frame on that captured image;

mapping the 3D detection frames located in the three-dimensional spaces of the different image capture devices into the same target coordinate system according to the extrinsic parameters of each image capture device, where the target three-dimensional space is the space defined by the target coordinate system;

stitching the multiple captured images to obtain a stitched image, and mapping the 2D detection frames on the multiple captured images onto the stitched image, where the 2D detection frames on the first image are the 2D detection frames on the stitched image.
In some embodiments, before determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, the tracking result of tracking the target object with respect to the first image, the method further includes:

performing non-maximum suppression on the 3D detection frames, the 2D detection frames, the 3D prediction frames and the 2D prediction frames.
According to a second aspect of the embodiments of the present disclosure, a target tracking device is provided, including:

an acquisition module configured to acquire an image acquisition sequence, where the image acquisition sequence is obtained from images captured by an image capture device at multiple capture moments;

a detection module configured to perform object detection on a first image in the image acquisition sequence to obtain a 3D detection frame, in a target three-dimensional space, of each object on the first image and a 2D detection frame on the first image, where the first image is any image in the image acquisition sequence other than the initial image;

a prediction module configured to predict, according to a tracking result of tracking a target object with respect to a second image, a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image, where the second image is the image preceding the first image in the image acquisition sequence;

a determination module configured to determine, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image.
In some embodiments, the determination module is further configured to:

determine, from the 3D detection frames, a target 3D detection frame matching the 3D prediction frame according to a first intersection-over-union value and/or a distance value between the 3D prediction frame and each 3D detection frame;

take the object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, and take the position information of the target 3D detection frame in the target three-dimensional space as 3D tracking position information of the target object, where the tracking result includes the 3D tracking position information.

In some embodiments, the determination module is further configured to:

for each 3D detection frame not matched to the 3D prediction frame, determine a 2D detection frame corresponding to the same object as that 3D detection frame, and determine a target 2D detection frame matching the 2D prediction frame according to a second intersection-over-union value between that 2D detection frame and the 2D prediction frame;

take the object corresponding to the target 2D detection frame on the first image as the target object, and take the position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, where the tracking result includes the 2D tracking position information.

In some embodiments, the tracking result includes 3D tracking position information and 2D tracking position information of the target object corresponding to the second image, as well as motion data of the target object;

the prediction module is further configured to:

update a tracker according to the motion data;

input the 3D tracking position information and the 2D tracking position information into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.

In some embodiments, the motion data includes a rate of change of the target object's position on the image, and the velocity and acceleration of the target object in the target three-dimensional space;

the tracker is capable of outputting the 2D prediction frame based on the position change rate and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration and the 3D tracking position information.
In some embodiments, the first image includes multiple captured images, the multiple captured images being images captured by multiple image capture devices at the same capture moment;

the detection module is further configured to:

perform object detection on each of the multiple captured images to obtain, for each object on each captured image, a 3D detection frame in the three-dimensional space of each image capture device and a 2D detection frame on that captured image;

map the 3D detection frames located in the three-dimensional spaces of the different image capture devices into the same target coordinate system according to the extrinsic parameters of each image capture device, where the target three-dimensional space is the space defined by the target coordinate system;

stitch the multiple captured images to obtain a stitched image, and map the 2D detection frames on the multiple captured images onto the stitched image, where the 2D detection frames on the first image are the 2D detection frames on the stitched image.

In some embodiments, the determination module is further configured to:

perform non-maximum suppression on the 3D detection frames, the 2D detection frames, the 3D prediction frames and the 2D prediction frames.
According to a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, where the program instructions, when executed by a processor, implement the steps of the target tracking method provided in the first aspect of the present disclosure.

According to a fourth aspect of the embodiments of the present disclosure, a target tracking device is provided, including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

acquire an image acquisition sequence, where the image acquisition sequence is obtained from images captured by an image capture device at multiple capture moments;

perform object detection on a first image in the image acquisition sequence to obtain a 3D detection frame, in a target three-dimensional space, of each object on the first image and a 2D detection frame on the first image, where the first image is any image in the image acquisition sequence other than the initial image;

predict, according to a tracking result of tracking a target object with respect to a second image, a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image, where the second image is the image preceding the first image in the image acquisition sequence;

determine, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image.

According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided. The computer program product contains a computer program executable by a programmable device, and the computer program, when executed by the programmable device, implements the steps of the target tracking method provided in the first aspect of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: the tracking result of the target object is determined from both 3D frame information (for example, the 3D detection frame and the 3D prediction frame) and 2D frame information (for example, the 2D detection frame and the 2D prediction frame). Introducing 3D frame information allows the target tracking method of the present disclosure to continue estimating the target object's motion in the target three-dimensional space for a period of time after the target object is lost, which increases the probability of a successful match after the target object reappears and reduces ID switches caused by missed detections or the target leaving the field of view, that is, it reduces erroneous tracking. At the same time, combining 3D frame information and 2D frame information to determine the tracking result of the target object improves the tracking accuracy of the target object.

It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Description of the Drawings

The above and/or additional aspects and advantages of the present disclosure will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of a target tracking method according to an exemplary embodiment.

Fig. 2 is a flowchart of determining 3D detection frames and 2D detection frames according to an exemplary embodiment.

Fig. 3 is a block diagram of a target tracking device according to an exemplary embodiment.

Fig. 4 is a block diagram of a device for target tracking according to an exemplary embodiment.

Fig. 5 is a block diagram of a device for target tracking according to an exemplary embodiment.
Detailed Description

Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
In some embodiments, the target tracking method of the present disclosure can be applied to different scenarios. For example, it can be applied to autonomous driving scenarios to track targets in images captured by image capture devices on a vehicle. As another example, it can be applied to traffic monitoring scenarios to track targets in images captured by image capture devices in a traffic monitoring system. It should be understood that the application scenarios of the target tracking method mentioned in the present disclosure are merely some examples or embodiments of the present disclosure; those of ordinary skill in the art can, without creative effort, apply the target tracking method to other similar scenarios, for example, target tracking for mobile robots, and the present disclosure places no limitation on this.
In the related art, 2D information is usually used to track targets in images captured by one or more image capture devices. However, because projecting a target onto a 2D image involves an affine transformation, it is difficult to accurately estimate the target's motion on the 2D image; as a result, the target cannot be accurately tracked using 2D information alone, and the target may be matched to a wrong ID. Moreover, when tracking with 2D information, once a target is lost it is difficult to recover. In addition, the related art usually runs tracking separately on the images captured by each of multiple image capture devices, which is not only inefficient but also cannot handle targets that overlap across images from adjacent image capture devices.
Fig. 1 is a flowchart of a target tracking method according to an exemplary embodiment. As shown in Fig. 1, the method includes the following steps.
Step 110: acquire an image acquisition sequence, where the image acquisition sequence is obtained from images captured by an image capture device at multiple capture moments.

In some embodiments, the image acquisition sequence may be obtained from images captured by one or more image capture devices at multiple capture moments. For a single image capture device, the captured image at each capture moment in the image acquisition sequence may be the image captured by that device at that moment; for multiple image capture devices, the captured images at each capture moment in the image acquisition sequence may be the images captured by the multiple devices at that moment.
For example, take a single image capture device, image capture device 1, whose captured images at capture moments t1, t2 and t3 are P1, P2 and P3; then image acquisition sequence 1 may be (P1, P2, P3). Take multiple image capture devices, image capture devices 1-3, as an example: if the images captured by image capture device 1 at capture moments t1, t2 and t3 are P11, P12 and P13, the images captured by image capture device 2 at t1, t2 and t3 are P21, P22 and P23, and the images captured by image capture device 3 at t1, t2 and t3 are P31, P32 and P33, then image acquisition sequence 2 may be (P11 P21 P31, P12 P22 P32, P13 P23 P33).
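Purely as an illustration of this grouping (not part of the disclosed method), the following minimal Python sketch shows one way such an image acquisition sequence could be organized in code; the class and field names are hypothetical assumptions:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Frame:
    device_id: int      # which image capture device produced the image
    timestamp: float    # capture moment (e.g., t1, t2, t3)
    image_path: str     # placeholder for the actual image data

def build_acquisition_sequence(frames: List[Frame]) -> List[List[Frame]]:
    """Group frames by capture moment so that each element of the sequence
    holds all images captured at the same moment (one per device)."""
    by_time: Dict[float, List[Frame]] = {}
    for f in frames:
        by_time.setdefault(f.timestamp, []).append(f)
    # Sort the groups by capture moment; sort each group by device id.
    return [sorted(by_time[t], key=lambda f: f.device_id)
            for t in sorted(by_time)]

# Example: three devices, three capture moments -> image acquisition sequence 2.
frames = [Frame(d, t, f"P{d}{i + 1}.png")
          for d in (1, 2, 3) for i, t in enumerate((1.0, 2.0, 3.0))]
sequence = build_acquisition_sequence(frames)
assert [f.image_path for f in sequence[0]] == ["P11.png", "P21.png", "P31.png"]
```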
In some embodiments, the image capture device may include, but is not limited to, a video camera or a still camera. The image capture device may be installed at a preset fixed position or on a mobile device; the preset fixed position and the mobile device may be chosen according to actual needs. For example, the mobile device may be an autonomous vehicle. In some embodiments, the image capture device may be one or more cameras included in an autonomous vehicle.

In some embodiments, when there are multiple image capture devices, their capture directions may differ. For example, still taking image capture devices 1-3 as an example, their capture directions may be leftward, forward and rightward, respectively. It is worth noting that the capture directions of the multiple image capture devices may be set according to the actual situation, and the present disclosure places no limitation on this.
In some embodiments, the image acquisition sequence may be obtained from video captured by one or more image capture devices. For example, the captured images in the image acquisition sequence may be image frames included in the video.

Step 120: perform object detection on the first image in the image acquisition sequence to obtain the 3D detection frame, in the target three-dimensional space, of each object on the first image and the 2D detection frame on the first image, where the first image is any image in the image acquisition sequence other than the initial image.
In some embodiments, the objects on the first image may refer to one or more targets included in the first image, and the objects may include targets of different categories. For example, if the first image is a road-scene image, the objects on the first image may include targets of the pedestrian category and targets of the vehicle category.

In some embodiments, object detection may be performed on the first image using a monocular 3D detection algorithm. In some embodiments, the monocular 3D detection algorithm may include, but is not limited to, Fully Convolutional One-Stage Monocular 3D Object Detection (FCOS3D) and Real-time Monocular 3D Object Detection (RTM3D).

A monocular 3D detection algorithm can simultaneously produce, for each object in an image, a 3D detection frame in the three-dimensional space of the image capture device (for example, the camera coordinate system) and a 2D detection frame in the image coordinate system of that image. In some embodiments, the 3D detection frame in the three-dimensional space of the image capture device may be represented as (x, y, z, rot, w, h, l), where (x, y, z) denotes the coordinates of the center point of the 3D detection frame in the three-dimensional space of the image capture device, rot denotes the heading angle of the 3D detection frame, and (w, h, l) denote the width, height and length of the 3D detection frame, respectively. The 2D detection frame may be represented as (x1, y1, x2, y2), where (x1, y1) denotes the coordinates of the top-left corner of the 2D detection frame in the image coordinate system and (x2, y2) denotes the coordinates of the bottom-right corner.
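For concreteness, here is a minimal sketch of how the two box parameterizations described above could be represented; the type and method names are illustrative assumptions rather than part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    # Center of the 3D detection frame in the capture device's 3D space.
    x: float
    y: float
    z: float
    rot: float  # heading angle of the 3D frame
    w: float    # width
    h: float    # height
    l: float    # length

@dataclass
class Box2D:
    # Top-left and bottom-right corners in the image coordinate system.
    x1: float
    y1: float
    x2: float
    y2: float

    def center_size(self):
        """Convert to the (cx, cy, w, h) form used by the tracker state."""
        w, h = self.x2 - self.x1, self.y2 - self.y1
        return (self.x1 + w / 2, self.y1 + h / 2, w, h)
```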
As mentioned above, there may be one or more image capture devices. For a single image capture device, the first image may be a single captured image; for example, still taking the above image acquisition sequence 1 as an example, the first image may be P2 or P3. In some embodiments, when the first image is a single captured image, object detection may be performed on that captured image to obtain the 3D detection frame, in the three-dimensional space of the image capture device, of each object on the captured image, and the 2D detection frame on the captured image, i.e., the 2D detection frame in the image coordinate system of the captured image. In some embodiments, when the first image is a single captured image, the 3D detection frame in the three-dimensional space of the image capture device may be taken directly as the 3D detection frame in the target three-dimensional space, or the 3D detection frame in the three-dimensional space of the image capture device may be mapped into the target coordinate system and the 3D detection frame in the target coordinate system taken as the 3D detection frame in the target three-dimensional space. For details of the target coordinate system, see Fig. 2 and its related description, which are not repeated here.

For multiple image capture devices, the first image may include multiple captured images, which may be images captured by the multiple image capture devices at the same capture moment. For example, still taking the above image acquisition sequence 2 as an example, the first image may include P12, P22 and P32, or P13, P23 and P33. In some embodiments, when the first image consists of multiple captured images, object detection may be performed on each of them to obtain the 3D detection frame, in the three-dimensional space of each image capture device, of each object on each captured image, and the 2D detection frame on that captured image. In some embodiments, the 3D detection frames in the three-dimensional space of each image capture device and the 2D detection frames on each captured image may be processed separately to obtain the 3D detection frames in the target three-dimensional space and the 2D detection frames on the first image. For details of obtaining the 3D detection frames and 2D detection frames when the first image consists of multiple captured images, see Fig. 2 and its related description, which are not repeated here.
Step 130: predict, according to the tracking result of tracking the target object with respect to the second image, the 3D prediction frame of the target object in the target three-dimensional space and the 2D prediction frame on the first image, where the second image is the image preceding the first image in the image acquisition sequence.

In some embodiments, the first image and the second image may be images captured by the image capture device at different capture moments, where the second image is the image obtained at the capture moment immediately preceding that of the first image, i.e., the second image is the image preceding the first image in the image acquisition sequence.

In some embodiments, the target object may be one or more targets included in the second image, and the target objects may include targets of different categories, for example, targets of the pedestrian category and targets of the vehicle category. In some embodiments, the tracking result of tracking the target object with respect to the second image may include the 3D tracking position information and 2D tracking position information of the target object corresponding to the second image, as well as motion data of the target object. For how the 3D tracking position information and 2D tracking position information are determined, see step 140 below and its related description, which are not repeated here.

In some embodiments, the motion data may include the rate of change of the target object's position on the image, as well as the velocity and acceleration of the target object in the target three-dimensional space. The rate of change of the target object's position on the image may refer to the rate of change between the target object's 2D tracking information on the second image and its 2D tracking information on the first image. For details of determining the position change rate, see the description of the tracker below, which is not repeated here.
In some embodiments, the velocity and acceleration of the target object in the target three-dimensional space may refer to the velocity and acceleration of the target object at the capture moment corresponding to the second image. For example, if the target object is a pedestrian and the capture moment corresponding to the second image is t2, the velocity and acceleration of the target object in the target three-dimensional space may be the pedestrian's velocity and acceleration at time t2; understandably, these are the pedestrian's walking velocity and acceleration in real space. In some embodiments, when the image capture device is installed on a mobile device, the velocity and acceleration of the mobile device at the capture moment of the second image may be taken as the velocity and acceleration of the target object in the target three-dimensional space. For example, if the mobile device is an autonomous vehicle, the motion data may be the vehicle's velocity and acceleration at time t2, i.e., the vehicle's driving velocity and acceleration in real space.

In some embodiments, the 3D prediction frame of the target object in the target three-dimensional space and the 2D prediction frame on the first image may be predicted according to the tracking result produced by a tracker when tracking the target object with respect to the second image. In some embodiments, predicting the 3D prediction frame of the target object in the target three-dimensional space and the 2D prediction frame on the first image according to the tracking result of tracking the target object with respect to the second image includes: updating the tracker according to the motion data; and inputting the 3D tracking position information and the 2D tracking position information into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
In some embodiments, the tracker is capable of outputting the 2D prediction frame based on the position change rate and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration and the 3D tracking position information. In some embodiments, the value of the rotation angle of the 3D prediction frame set by the tracker remains unchanged.

In some embodiments, the 2D tracking information of the target object may be the 2D detection frame corresponding to that 2D tracking information. In some embodiments, the 2D tracking information may be represented as (cx, cy, w, h), where (cx, cy) denotes the coordinates of the center point of the corresponding 2D detection frame in the image coordinate system of the first image, and (w, h) denote the width and height of that 2D detection frame. In some embodiments, the 3D tracking information of the target object may be the 3D detection frame corresponding to that 3D tracking information. In some embodiments, the 3D tracking information may be represented as (x, y, rot), where (x, y) denotes the coordinates of the center point of the corresponding 3D detection frame in the target three-dimensional space and rot denotes the rotation angle of that 3D detection frame.

In some embodiments, the 3D prediction frame of each target object in the target three-dimensional space and its 2D prediction frame on the first image may be predicted according to the tracking results of multiple trackers, each tracking one target object with respect to the second image. In some embodiments, for any target object, the tracker corresponding to that target object may include the state transition function of that target object, and the 3D prediction frame of that target object in the target three-dimensional space and its 2D prediction frame on the first image are predicted according to that state transition function and the tracking result of tracking that target object with respect to the second image.
In some embodiments, for any target object, the state transition function of that target object included in its corresponding tracker is the following formula (1):
cx(t+1) = cx(t) + V_cx    cy(t+1) = cy(t) + V_cy    w(t+1) = w(t) + V_w    h(t+1) = h(t) + V_h
x(t+1) = x(t) + V_x·Δt + (1/2)·a_x·Δt²    y(t+1) = y(t) + V_y·Δt + (1/2)·a_y·Δt²
rot(t+1) = rot(t)    (1)
where the meanings of (cx, cy, w, h) and (x, y, rot) are as described above and are not repeated here; (V_cx, V_cy, V_w, V_h) denote the rate of change of the target object's position on the image, corresponding to (cx, cy, w, h); V_x and V_y denote the target object's velocity along the X-axis and Y-axis of the target three-dimensional space, respectively; a_x and a_y denote the target object's acceleration along the X-axis and Y-axis of the target three-dimensional space, respectively; and Δt denotes the interval between adjacent capture moments.
In some embodiments, when the second image is the initial image in the image acquisition sequence, the target object's position change rate on the image takes the initial value 0. As an example, still taking the above image acquisition sequence 1, (P1, P2, P3), suppose the first image is P2, which includes target objects 1-5, and the second image is P1, which includes objects 1-5. The following describes, with reference to this example, how tracker 1 corresponding to target object 1 predicts the 3D prediction frames and 2D prediction frames of target object 1 for each captured image in image acquisition sequence 1. First, by performing object detection on the second image P1, the 2D detection frame and 3D detection frame of target object 1 corresponding to the second image P1 can be obtained. Since the second image P1 is the initial image, position change rate 1 of target object 1 is 0, and the 3D tracking position information and 2D tracking position information of target object 1 corresponding to the second image P1 are the 3D detection frame and 2D detection frame obtained by object detection (i.e., obtained through step 120 above). Meanwhile, velocity 1 and acceleration 1 of target object 1 at the capture moment corresponding to the second image P1 can be measured. Tracker 1 is updated according to position change rate 1, velocity 1 and acceleration 1, i.e., formula (1) above is updated; inputting the 3D tracking position information and 2D tracking position information into updated tracker 1 yields the 3D prediction frame and 2D prediction frame of target object 1 for the image at the next moment (i.e., the first image P2). Further, the 3D detection frames and 2D detection frames of the objects on the first image P2 can be obtained by object detection, and through step 140 described below, the object in the first image P2 that belongs to the same target as target object 1 can be found, say object 1. From the 2D detection frame corresponding to object 1 and the 2D detection frame of target object 1, position change rate 2 of target object 1 between the second image P1 and the first image P2 can be obtained; from position change rate 2 and the velocity and acceleration of target object 1 at the capture moment of the first image P2, the 3D prediction frame and 2D prediction frame of target object 1 for the third image P3 can be obtained. It can thus be seen that, through tracker 1 corresponding to target object 1, the position information of target object 1 on each captured image in the image acquisition sequence can be obtained, realizing tracking of target object 1 over the image sequence.
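The per-object prediction step can be sketched as follows. This is a minimal illustration of a constant-velocity 2D / uniform-acceleration 3D motion model consistent with formula (1), not the patented implementation itself; in practice a Kalman filter with this state transition could play the role of the tracker, and all names here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class TrackerState:
    # 2D tracking position information and its per-frame rate of change.
    cx: float; cy: float; w: float; h: float
    v_cx: float = 0.0; v_cy: float = 0.0; v_w: float = 0.0; v_h: float = 0.0
    # 3D tracking position information (center and rotation angle).
    x: float = 0.0; y: float = 0.0; rot: float = 0.0
    # Velocity and acceleration in the target three-dimensional space.
    v_x: float = 0.0; v_y: float = 0.0; a_x: float = 0.0; a_y: float = 0.0

def predict(s: TrackerState, dt: float) -> TrackerState:
    """One prediction step: constant velocity for the 2D frame,
    uniform acceleration for the 3D frame, static rotation angle."""
    return TrackerState(
        # 2D: constant-velocity update of the (cx, cy, w, h) box.
        cx=s.cx + s.v_cx, cy=s.cy + s.v_cy,
        w=s.w + s.v_w, h=s.h + s.v_h,
        v_cx=s.v_cx, v_cy=s.v_cy, v_w=s.v_w, v_h=s.v_h,
        # 3D: uniform-acceleration update of the (x, y) center.
        x=s.x + s.v_x * dt + 0.5 * s.a_x * dt * dt,
        y=s.y + s.v_y * dt + 0.5 * s.a_y * dt * dt,
        rot=s.rot,  # rotation angle held constant (static model)
        v_x=s.v_x + s.a_x * dt, v_y=s.v_y + s.a_y * dt,
        a_x=s.a_x, a_y=s.a_y,
    )
```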
In the embodiments of this specification, having the tracker output the 2D prediction frame based on the position change rate and the 2D tracking position information, i.e., using a constant-velocity model for 2D frame prediction or motion estimation, reduces the computation required to predict the 2D frame. Having the tracker output the 3D prediction frame based on the velocity, acceleration and 3D tracking position information, i.e., using a uniform-acceleration model for 3D frame prediction or motion estimation, improves the prediction accuracy of the 3D frame. Moreover, the angle of a 3D detection frame obtained by object detection carries considerable uncertainty; by keeping the rotation angle of the 3D prediction frame in the tracker unchanged, a static model is used to smooth the rotation angle, further improving the prediction accuracy of the 3D frame.

Step 140: determine, according to the 3D detection frames, the 2D detection frames, the 3D prediction frames and the 2D prediction frames, the tracking result of tracking the target object with respect to the first image.
In some embodiments, before determining, according to the 3D detection frames, the 2D detection frames, the 3D prediction frames and the 2D prediction frames, the tracking result of tracking the target object with respect to the first image, the method further includes: performing non-maximum suppression on the 3D detection frames, the 2D detection frames, the 3D prediction frames and the 2D prediction frames. By performing non-maximum suppression, overlapping 3D frames (i.e., 3D detection frames or 3D prediction frames) and 2D frames (i.e., 2D detection frames or 2D prediction frames) can be filtered out, preventing overlapping frames from interfering with the subsequent matching between 3D frames and between 2D frames, which improves the accuracy of the subsequent matching and thus the accuracy of target tracking.
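As a minimal sketch of this non-maximum suppression step for 2D boxes (the 3D case is analogous with a 3D IoU), where the threshold value and box format are illustrative assumptions:

```python
def iou_2d(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou_2d(boxes[best], boxes[i]) < iou_thresh]
    return keep
```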
In some embodiments, determining, according to the 3D detection frames, the 2D detection frames, the 3D prediction frames and the 2D prediction frames, the tracking result of tracking the target object with respect to the first image includes: determining, from the 3D detection frames, a target 3D detection frame matching the 3D prediction frame according to a first intersection-over-union value and/or a distance value between the 3D prediction frame and each 3D detection frame; taking the object corresponding to the target 3D detection frame in the target three-dimensional space as the target object, and taking the position information of the target 3D detection frame in the target three-dimensional space as the 3D tracking position information of the target object, where the tracking result includes the 3D tracking position information.

In some embodiments, the first intersection-over-union value may refer to the overlap ratio between the 3D prediction frame and the 3D detection frame, i.e., the ratio of the intersection to the union of the 3D prediction frame and the 3D detection frame. In some embodiments, there may be one or more target objects, and correspondingly there may be one or more 3D prediction frames of the target objects in the target three-dimensional space. In some embodiments, for multiple 3D prediction frames, a first intersection-over-union matrix may be determined based on the first intersection-over-union value between each 3D prediction frame and each 3D detection frame. In some embodiments, the distance value may be the distance between the center points of the 3D prediction frame and the 3D detection frame in the target three-dimensional space; the distance may include, but is not limited to, the Manhattan distance or the Euclidean distance. In some embodiments, for multiple 3D prediction frames, a distance matrix may be determined based on the distance value between each 3D prediction frame and each 3D detection frame. In the embodiments of this specification, the first intersection-over-union matrix or the distance matrix may be solved by the Hungarian algorithm to determine the matching result between the 3D prediction frames and the 3D detection frames, thereby determining from the 3D detection frames the target 3D detection frame matching each 3D prediction frame.
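A minimal sketch of this first-stage association: build a cost matrix from the center distance (a 3D IoU matrix could be used instead, as described above) between each 3D prediction frame and each 3D detection frame, then solve it with the Hungarian algorithm, here via scipy. The gating threshold is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_3d(pred_centers, det_centers, max_dist=2.0):
    """First-stage matching on 3D frames using center distance as cost.
    pred_centers, det_centers: arrays of shape (P, 2) and (D, 2) holding
    (x, y) centers in the target 3D space. Returns (matches,
    unmatched_pred, unmatched_det)."""
    if len(pred_centers) == 0 or len(det_centers) == 0:
        return [], list(range(len(pred_centers))), list(range(len(det_centers)))
    # Euclidean distance matrix between every prediction/detection pair.
    cost = np.linalg.norm(
        pred_centers[:, None, :] - det_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_p = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_pred = [i for i in range(len(pred_centers)) if i not in matched_p]
    unmatched_det = [j for j in range(len(det_centers)) if j not in matched_d]
    return matches, unmatched_pred, unmatched_det
```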
As mentioned above, the target objects and objects may be targets of different categories. In some embodiments, for target objects and objects of the vehicle category, the first intersection-over-union matrix may be determined based on the first intersection-over-union value between each 3D prediction frame and each 3D detection frame; for target objects and objects of the pedestrian category, the distance matrix may be determined based on the distance value between each 3D prediction frame and each 3D detection frame. In the embodiments of this specification, the first intersection-over-union matrix and the distance matrix may each be solved by the Hungarian algorithm to determine the matching results between the 3D prediction frames and the 3D detection frames, thereby determining from the 3D detection frames the target 3D detection frame matching each 3D prediction frame.

As an example, suppose the second image includes target objects 1-5, whose 3D prediction frames are N3D-1, N3D-2, N3D-3, N3D-4 and N3D-5, and the first image includes objects 1-5, whose 3D detection frames are M3D-1, M3D-2, M3D-3, M3D-4 and M3D-5. If solving the first intersection-over-union matrix and the distance matrix with the Hungarian algorithm yields the matching results N3D-1-M3D-2, N3D-2-M3D-1 and N3D-3-M3D-3, then the target 3D detection frame matching 3D prediction frame N3D-1 is 3D detection frame M3D-2, the target 3D detection frame matching 3D prediction frame N3D-2 is 3D detection frame M3D-1, and the target 3D detection frame matching 3D prediction frame N3D-3 is 3D detection frame M3D-3.
In some embodiments, determining, according to the 3D detection frames, the 2D detection frames, the 3D prediction frames and the 2D prediction frames, the tracking result of tracking the target object with respect to the first image further includes: for each 3D detection frame not matched to any 3D prediction frame, determining the 2D detection frame corresponding to the same object as that 3D detection frame, and determining, according to the second intersection-over-union value between that 2D detection frame and the 2D prediction frame, the target 2D detection frame matching the 2D prediction frame; taking the object corresponding to the target 2D detection frame on the first image as the target object, and taking the position information of the target 2D detection frame on the first image as the 2D tracking position information of the target object, where the tracking result includes the 2D tracking position information.

The second intersection-over-union value between a 2D detection frame and a 2D prediction frame is determined in a manner similar to the first intersection-over-union value and is not repeated here. In some embodiments, a second intersection-over-union matrix may be determined from the second intersection-over-union values between the 2D detection frames and the 2D prediction frames, and solved by the Hungarian algorithm to determine the matching results between the 2D detection frames and the 2D prediction frames, thereby determining the target 2D detection frame matching each 2D prediction frame.
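The second-stage fallback can be sketched in the same style: only the 3D detection frames left unmatched by the first stage take part, represented by their corresponding 2D detection frames, and the cost is 1 − IoU so that the Hungarian solver maximizes overlap. The function name and IoU gate are illustrative assumptions, and `iou_2d` is the helper sketched above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_2d(pred_boxes, det_boxes, min_iou=0.3):
    """Second-stage matching on 2D frames: pred_boxes are the 2D prediction
    frames of still-unmatched target objects, det_boxes the 2D detection
    frames corresponding to 3D detection frames left over from stage one."""
    if not pred_boxes or not det_boxes:
        return []
    # Second intersection-over-union matrix, turned into a cost matrix.
    cost = np.array([[1.0 - iou_2d(p, d) for d in det_boxes]
                     for p in pred_boxes])
    rows, cols = linear_sum_assignment(cost)
    # Keep only the pairs whose IoU clears the gate.
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
```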
As an example, continuing the previous example: the 3D detection frames not matched to any 3D prediction frame include M3D-4 and M3D-5. The 2D detection frames M2D-4 and M2D-5 corresponding to the same objects as M3D-4 and M3D-5 are determined, and a second intersection-over-union matrix is built between the 2D detection frames M2D-4 and M2D-5 and the 2D prediction frames N2D-1, N2D-2, N2D-3, N2D-4 and N2D-5 (i.e., the 2D prediction frames of target objects 1-5). Solving this second intersection-over-union matrix with the Hungarian algorithm yields the matching results M2D-4-N2D-4 and M2D-5-N2D-5 between the 2D detection frames and the 2D prediction frames.

The 3D prediction frames and 2D prediction frames are the 3D and 2D frames corresponding to the target objects in the second image, and the 3D detection frames and 2D detection frames are the 3D and 2D frames corresponding to the objects in the first image. By matching the prediction frames against the detection frames, the target objects in the second image and the objects in the first image that belong to the same targets can be determined, realizing target tracking. For example, from the matching result M2D-4-N2D-4 above, it can be determined that target object 4 in the second image at the previous moment and object 4 in the first image at the next moment belong to the same target. In some embodiments, target objects and objects belonging to the same target may be assigned the same ID.

In the embodiments of this specification, the 3D detection frames are matched against the 3D prediction frames first, and then the 2D detection frames against the 2D prediction frames: a two-stage matching. Matching with 3D frames (i.e., 3D detection frames and 3D prediction frames) is highly accurate, with a small probability of false matches, while matching with 2D frames (i.e., 2D detection frames and 2D prediction frames) keeps the probability of missed matches low. Therefore, the two-stage matching achieves high-accuracy matching while reducing missed matches; that is, it improves the matching accuracy between the target objects in the image at the previous moment and the objects in the image at the next moment and reduces missed matches between them, thereby improving the tracking accuracy of the same target across images at different moments.
In some embodiments, a new tracker may be created for a target object for which neither the 3D prediction frame nor the 2D prediction frame is successfully matched. In some embodiments, a tracker corresponding to a target object whose 3D prediction frame and 2D prediction frame have failed to match for a preset number of times may be discarded; the preset number may be set according to the actual situation. As can be seen from the above, the 3D prediction frames and 2D prediction frames correspond to a two-stage matching; if a target object obtains no matching result in the two-stage matching for the preset number of times, the tracker can be considered to have lost the target object, and the tracker of that target object is discarded.
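A minimal sketch of the tracker lifecycle described above: each track counts consecutive failures across the two-stage matching, new trackers are started for unmatched objects, and a tracker is discarded once it misses a preset number of times. The counter names and the preset value are illustrative assumptions:

```python
class Track:
    _next_id = 0

    def __init__(self, state):
        self.id = Track._next_id      # the same target keeps the same ID
        Track._next_id += 1
        self.state = state            # tracker state (e.g., TrackerState above)
        self.misses = 0               # consecutive two-stage match failures

def update_tracks(tracks, matched_ids, new_states, max_misses=3):
    """matched_ids: ids matched in either matching stage this frame;
    new_states: states of objects matched by no prediction frame."""
    survivors = []
    for t in tracks:
        t.misses = 0 if t.id in matched_ids else t.misses + 1
        if t.misses < max_misses:     # keep tracks within the preset budget
            survivors.append(t)       # otherwise the tracker is discarded
    # Create a new tracker for each object left unmatched by both stages.
    survivors.extend(Track(s) for s in new_states)
    return survivors
```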
In the embodiments of the present disclosure, the tracking result of a target object is determined from both the 3D frame and 2D frame information. Introducing the 3D information allows the target tracking method of the present disclosure to continue estimating the target object's motion in three-dimensional space for a period of time after the target object is lost, which raises the probability of a successful match once the target object reappears and reduces ID switches caused by missed detections or by the target leaving the field of view, i.e., reduces erroneous tracking of the target.
Fig. 2 is a flowchart of determining 3D detection frames and 2D detection frames according to an exemplary embodiment. As shown in Fig. 2, the method includes the following steps.
Step 210: perform object detection on each of the plurality of captured images to obtain, for the objects on each captured image, a 3D detection frame in the three-dimensional space of the corresponding image capture device and a 2D detection frame on that captured image.

The specific details of step 210 are similar to those of step 120; for details, refer to step 120 above and its related description, which will not be repeated here.

As an example, still taking the aforementioned captured images P1, P2 and P3, object detection can be performed on captured image P1 to obtain the 3D detection frames of the objects on P1 in the three-dimensional space of image capture device 1 and the 2D detection frames in the image coordinate system of P1; the object detection on captured images P2 and P3 is similar to that on P1 and will not be repeated here.
Step 220: map the 3D detection frames located in the three-dimensional spaces of the different image capture devices into the same target coordinate system according to the extrinsic parameters of each image capture device, the target three-dimensional space being the space defined by the target coordinate system.

In some embodiments, the target coordinate system may be determined according to where the image capture devices are installed. For example, if the image capture devices are mounted on an autonomous vehicle, the target coordinate system may be the ego-vehicle coordinate system corresponding to the first image of the image acquisition sequence. As another example, if the image capture devices are installed at preset fixed positions, the target coordinate system may be a coordinate system determined based on a preset fixed position, whose origin and X, Y and Z axes may be set according to the actual situation.

In some embodiments, the extrinsic parameters of each image capture device reflect the pose relationship between that device's coordinate system and the target coordinate system. The extrinsic parameters may include translation parameters and rotation parameters. In some embodiments, the extrinsic parameters of each image capture device may be obtained by calibrating the device; for the calibration of image capture devices, refer to the related art, which will not be repeated here.
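As an illustration of applying such extrinsics, the sketch below assumes the calibration of each device is given as a rotation matrix R and translation vector t such that p_target = R p_camera + t. The disclosure does not fix this parameterization; the variable names and numeric values are placeholders, and yaw and box dimensions are omitted.

```python
# Illustrative mapping of a 3D box center from a camera's coordinate
# system into the shared target coordinate system.
import numpy as np

def camera_to_target(center_cam, R, t):
    """Transform a 3D point from camera coordinates to target coordinates."""
    return R @ np.asarray(center_cam) + np.asarray(t)

# Example: a box center 10 m in front of camera 1, expressed in the
# target frame using that camera's calibrated extrinsics (placeholders).
R1 = np.eye(3)
t1 = np.array([1.5, 0.0, 1.2])
center_target = camera_to_target([0.0, 0.0, 10.0], R1, t1)
```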
Step 230: stitch the plurality of captured images to obtain a stitched image, and map the 2D detection frames on the plurality of captured images onto the stitched image, the 2D detection frames on the first image being the 2D detection frames on the stitched image.

In some embodiments, the 2D detection frames on the stitched image refer to the 2D detection frames in the image coordinate system of the stitched image. By mapping the 2D detection frames from the multiple captured images onto the stitched image, the 2D detection frames in the image coordinate system of each captured image are converted into the image coordinate system of the stitched image; that is, 2D detection frames originally expressed in different image coordinate systems are converted into the same image coordinate system.
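For a simple left-to-right stitching layout, mapping a 2D detection frame into the stitched image's coordinate system reduces to shifting its x-coordinates by the summed widths of the images to its left. The layout is an assumption; the disclosure does not fix a stitching scheme.

```python
# Illustrative box remapping for a horizontal stitch.
def box_to_stitched(box, image_index, image_widths):
    """box: (x1, y1, x2, y2) in the source image's coordinate system."""
    x_offset = sum(image_widths[:image_index])
    x1, y1, x2, y2 = box
    return (x1 + x_offset, y1, x2 + x_offset, y2)

# A box in the second image (index 1) of three 1920-pixel-wide images:
print(box_to_stitched((100, 50, 300, 400), 1, [1920, 1920, 1920]))
# -> (2020, 50, 2220, 400)
```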
In the embodiments of the present specification, by mapping the 3D detection frames located in the three-dimensional spaces of different image capture devices into the same target coordinate system, and by mapping the 2D detection frames on the multiple captured images onto the stitched image, the detection results of the multiple captured images from the multiple image capture devices are fused, so that the targets in the multiple captured images can be tracked simultaneously. The tracking algorithm therefore needs to be executed only once to track the same target across images captured by different devices, which avoids the inefficiency of tracking targets in each device's images separately and reduces ID switches of the same target across different image capture devices.
Fig. 3 is a block diagram of a target tracking device 300 according to an exemplary embodiment. Referring to Fig. 3, the device includes an acquisition module 310, a detection module 320, a prediction module 330 and a determination module 340.

The acquisition module 310 is configured to acquire an image acquisition sequence, the image acquisition sequence being obtained from images captured by an image capture device at a plurality of capture moments.

The detection module 320 is configured to perform object detection on a first image in the image acquisition sequence to obtain a 3D detection frame, in a target three-dimensional space, of an object on the first image and a 2D detection frame on the first image, the first image being any image in the image acquisition sequence other than the initial image of the sequence.

The prediction module 330 is configured to predict, according to a tracking result of tracking a target object with respect to a second image, a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image, the second image being the previous image of the first image in the image acquisition sequence.

The determination module 340 is configured to determine, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image.
In some embodiments, the determination module 340 is further configured to:

determine, from the 3D detection frames, a target 3D detection frame matching the 3D prediction frame according to a first intersection-over-union value and/or a distance value between the 3D prediction frame and each 3D detection frame;

take the object in the target three-dimensional space corresponding to the target 3D detection frame as the target object, and take the position information of the target 3D detection frame in the target three-dimensional space as 3D tracking position information of the target object, the tracking result including the 3D tracking position information.

In some embodiments, the determination module 340 is further configured to:

for a 3D detection frame among the 3D detection frames that is not matched to the 3D prediction frame, determine a 2D detection frame corresponding to the same object as that 3D detection frame, and determine a target 2D detection frame matching the 2D prediction frame according to a second intersection-over-union value between that 2D detection frame and the 2D prediction frame;

take the object on the first image corresponding to the target 2D detection frame as the target object, and take the position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, the tracking result including the 2D tracking position information.
In some embodiments, the tracking result includes 3D tracking position information and 2D tracking position information of the target object corresponding to the second image, and motion data of the target object;

the prediction module 330 is further configured to:

update a tracker according to the motion data;

input the 3D tracking position information and the 2D tracking position information into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.

In some embodiments, the motion data includes a rate of change of the target object's position on the image, and a velocity and an acceleration of the target object in the target three-dimensional space;

the tracker is capable of outputting the 2D prediction frame based on the rate of change of position and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration and the 3D tracking position information.
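The prediction just described (2D from a position change rate, 3D from velocity and acceleration) can be made concrete with simple kinematics over the inter-frame interval dt. This is a minimal sketch under those assumptions, not the disclosure's exact tracker; a practical tracker (for example a Kalman filter) would also maintain state uncertainty.

```python
# Minimal constant-rate / constant-acceleration prediction step.
import numpy as np

def predict_2d(pos_2d, rate_2d, dt):
    """pos_2d: (x1, y1, x2, y2); rate_2d: its per-second rate of change."""
    return np.asarray(pos_2d) + np.asarray(rate_2d) * dt

def predict_3d(center_3d, velocity, acceleration, dt):
    """Second-order motion update for the 3D box center."""
    c = np.asarray(center_3d)
    v = np.asarray(velocity)
    a = np.asarray(acceleration)
    return c + v * dt + 0.5 * a * dt ** 2
```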
In some embodiments, the first image includes a plurality of captured images, the plurality of captured images being images captured by a plurality of image capture devices at the same capture moment;

the detection module 320 is further configured to:

perform object detection on each of the plurality of captured images to obtain, for the objects on each captured image, a 3D detection frame in the three-dimensional space of the corresponding image capture device and a 2D detection frame on that captured image;

map the 3D detection frames located in the three-dimensional spaces of different image capture devices into the same target coordinate system according to the extrinsic parameters of each image capture device, the target three-dimensional space being the space defined by the target coordinate system;

stitch the plurality of captured images to obtain a stitched image, and map the 2D detection frames on the plurality of captured images onto the stitched image, the 2D detection frames on the first image being the 2D detection frames on the stitched image.
In some embodiments, the determination module 340 is further configured to:

perform non-maximum suppression processing on the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame.
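As an illustration of this suppression step, the sketch below applies greedy non-maximum suppression to scored 2D frames, reusing the iou_2d helper sketched earlier. The (box, score) input format and the threshold are assumptions; the 3D case is analogous with a 3D overlap measure.

```python
# Illustrative greedy non-maximum suppression over scored 2D boxes.
def nms(boxes_scores, iou_threshold=0.5):
    keep = []
    remaining = sorted(boxes_scores, key=lambda bs: bs[1], reverse=True)
    while remaining:
        best = remaining.pop(0)          # highest-scoring box survives
        keep.append(best)
        # Drop every lower-scoring box that overlaps the survivor too much.
        remaining = [bs for bs in remaining
                     if iou_2d(best[0], bs[0]) < iou_threshold]
    return keep
```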
With regard to the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method, and will not be elaborated here.

The present disclosure also provides a computer-readable storage medium on which computer program instructions are stored, the program instructions, when executed by a processor, implementing the steps of the target tracking method provided by the present disclosure.
Fig. 4 is a block diagram of a device 400 for target tracking according to an exemplary embodiment. For example, the device 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.

Referring to Fig. 4, the device 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.

The processing component 402 generally controls the overall operation of the device 400, such as operations associated with display, telephone calls, data communication, camera operation and recording operation. The processing component 402 may include one or more processors 420 to execute instructions so as to complete all or part of the steps of the above target tracking method. In addition, the processing component 402 may include one or more modules to facilitate interaction between the processing component 402 and other components; for example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.

The memory 404 is configured to store various types of data to support operation at the device 400. Examples of such data include instructions for any application or method operated on the device 400, contact data, phonebook data, messages, pictures, videos, and so on. The memory 404 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk.

The power component 406 provides power to the various components of the device 400. The power component 406 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the device 400.

The multimedia component 408 includes a screen providing an output interface between the device 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. When the device 400 is in an operating mode such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a microphone (MIC) configured to receive external audio signals when the device 400 is in an operating mode such as a call mode, a recording mode or a voice recognition mode. The received audio signals may be further stored in the memory 404 or sent via the communication component 416. In some embodiments, the audio component 410 also includes a speaker for outputting audio signals.

The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.

The sensor component 414 includes one or more sensors for providing status assessments of various aspects of the device 400. For example, the sensor component 414 may detect the open/closed state of the device 400 and the relative positioning of components, such as the display and keypad of the device 400; the sensor component 414 may also detect a change in position of the device 400 or a component of the device 400, the presence or absence of user contact with the device 400, the orientation or acceleration/deceleration of the device 400, and a change in temperature of the device 400. The sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.

The communication component 416 is configured to facilitate wired or wireless communication between the device 400 and other devices. The device 400 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the device 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic elements, for performing the above target tracking method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 404 including instructions executable by the processor 420 of the device 400 to complete the above target tracking method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

In another exemplary embodiment, a computer program product is also provided, the computer program product including a computer program executable by a programmable device, the computer program having code portions for performing the above target tracking method when executed by the programmable device.
Fig. 5 is a block diagram of a device 500 for target tracking according to an exemplary embodiment. For example, the device 500 may be provided as a server. Referring to Fig. 5, the device 500 includes a processing component 522, which further includes one or more processors, and memory resources represented by a memory 532 for storing instructions executable by the processing component 522, such as applications. The application stored in the memory 532 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 522 is configured to execute the instructions to perform the above target tracking method.

The device 500 may also include a power component 526 configured to perform power management of the device 500, a wired or wireless network interface 550 configured to connect the device 500 to a network, and an input/output (I/O) interface 558. The device 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
Other embodiments of the present disclosure will readily occur to those skilled in the art from consideration of the specification and practice of the disclosure. The present application is intended to cover any variations, uses or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

It should be understood that the present disclosure is not limited to the precise constructions described above and shown in the accompanying drawings, and various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A target tracking method, comprising:
    acquiring an image acquisition sequence, the image acquisition sequence being obtained from images captured by an image capture device at a plurality of capture moments;
    performing object detection on a first image in the image acquisition sequence to obtain a 3D detection frame, in a target three-dimensional space, of an object on the first image and a 2D detection frame on the first image, the first image being any image in the image acquisition sequence other than the initial image of the sequence;
    predicting, according to a tracking result of tracking a target object with respect to a second image, a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image, the second image being the previous image of the first image in the image acquisition sequence;
    determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image.
2. The target tracking method according to claim 1, wherein the determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image comprises:
    determining, from the 3D detection frames, a target 3D detection frame matching the 3D prediction frame according to a first intersection-over-union value and/or a distance value between the 3D prediction frame and each 3D detection frame;
    taking the object in the target three-dimensional space corresponding to the target 3D detection frame as the target object, and taking position information of the target 3D detection frame in the target three-dimensional space as 3D tracking position information of the target object, the tracking result including the 3D tracking position information.
3. The target tracking method according to claim 2, wherein the determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image further comprises:
    for a 3D detection frame among the 3D detection frames that is not matched to the 3D prediction frame, determining a 2D detection frame corresponding to the same object as that 3D detection frame, and determining a target 2D detection frame matching the 2D prediction frame according to a second intersection-over-union value between that 2D detection frame and the 2D prediction frame;
    taking the object on the first image corresponding to the target 2D detection frame as the target object, and taking position information of the target 2D detection frame on the first image as 2D tracking position information of the target object, the tracking result including the 2D tracking position information.
4. The target tracking method according to claim 1, wherein the tracking result includes 3D tracking position information and 2D tracking position information of the target object corresponding to the second image, and motion data of the target object;
    the predicting, according to the tracking result of tracking the target object with respect to the second image, the 3D prediction frame of the target object in the target three-dimensional space and the 2D prediction frame on the first image comprises:
    updating a tracker according to the motion data;
    inputting the 3D tracking position information and the 2D tracking position information into the updated tracker to obtain the 3D prediction frame and the 2D prediction frame output by the tracker.
5. The target tracking method according to claim 4, wherein the motion data includes a rate of change of the target object's position on the image, and a velocity and an acceleration of the target object in the target three-dimensional space;
    the tracker is capable of outputting the 2D prediction frame based on the rate of change of position and the 2D tracking position information, and outputting the 3D prediction frame based on the velocity, the acceleration and the 3D tracking position information.
6. The target tracking method according to claim 1, wherein the first image includes a plurality of captured images, the plurality of captured images being images captured by a plurality of image capture devices at the same capture moment;
    the performing object detection on the first image in the image acquisition sequence to obtain the 3D detection frame, in the target three-dimensional space, of the object on the first image and the 2D detection frame on the first image comprises:
    performing object detection on each of the plurality of captured images to obtain, for the objects on each captured image, a 3D detection frame in the three-dimensional space of the corresponding image capture device and a 2D detection frame on that captured image;
    mapping the 3D detection frames located in the three-dimensional spaces of different image capture devices into the same target coordinate system according to extrinsic parameters of each image capture device, the target three-dimensional space being the space defined by the target coordinate system;
    stitching the plurality of captured images to obtain a stitched image, and mapping the 2D detection frames on the plurality of captured images onto the stitched image, the 2D detection frames on the first image being the 2D detection frames on the stitched image.
7. The target tracking method according to claim 1, wherein before the determining, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, the tracking result of tracking the target object with respect to the first image, the method further comprises:
    performing non-maximum suppression processing on the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame.
8. A target tracking device, comprising:
    an acquisition module configured to acquire an image acquisition sequence, the image acquisition sequence being obtained from images captured by an image capture device at a plurality of capture moments;
    a detection module configured to perform object detection on a first image in the image acquisition sequence to obtain a 3D detection frame, in a target three-dimensional space, of an object on the first image and a 2D detection frame on the first image, the first image being any image in the image acquisition sequence other than the initial image of the sequence;
    a prediction module configured to predict, according to a tracking result of tracking a target object with respect to a second image, a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image, the second image being the previous image of the first image in the image acquisition sequence;
    a determination module configured to determine, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image.
9. A target tracking device, comprising:
    a processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to:
    acquire an image acquisition sequence, the image acquisition sequence being obtained from images captured by an image capture device at a plurality of capture moments;
    perform object detection on a first image in the image acquisition sequence to obtain a 3D detection frame, in a target three-dimensional space, of an object on the first image and a 2D detection frame on the first image, the first image being any image in the image acquisition sequence other than the initial image of the sequence;
    predict, according to a tracking result of tracking a target object with respect to a second image, a 3D prediction frame of the target object in the target three-dimensional space and a 2D prediction frame on the first image, the second image being the previous image of the first image in the image acquisition sequence;
    determine, according to the 3D detection frame, the 2D detection frame, the 3D prediction frame and the 2D prediction frame, a tracking result of tracking the target object with respect to the first image.
10. A computer-readable storage medium on which computer program instructions are stored, wherein the program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-7.
11. A computer program product comprising a computer program executable by a programmable device, wherein the computer program has code portions for implementing the steps of the method according to any one of claims 1-7 when executed by the programmable device.
PCT/CN2022/090574 2021-11-05 2022-04-29 Target tracking method and apparatus, and storage medium WO2023077754A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111308752.0 2021-11-05
CN202111308752.0A CN114549578A (en) 2021-11-05 2021-11-05 Target tracking method, device and storage medium

Publications (1)

Publication Number Publication Date
WO2023077754A1 true WO2023077754A1 (en) 2023-05-11

Family

ID=81668543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090574 WO2023077754A1 (en) 2021-11-05 2022-04-29 Target tracking method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN114549578A (en)
WO (1) WO2023077754A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765031A (en) * 2024-02-21 2024-03-26 四川盎芯科技有限公司 image multi-target pre-tracking method and system for edge intelligent equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468967B (en) * 2023-04-18 2024-04-16 北京百度网讯科技有限公司 Sample image screening method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276783A (en) * 2019-04-23 2019-09-24 上海高重信息科技有限公司 A kind of multi-object tracking method, device and computer system
CN110378259A (en) * 2019-07-05 2019-10-25 桂林电子科技大学 A kind of multiple target Activity recognition method and system towards monitor video
CN112184772A (en) * 2020-09-30 2021-01-05 深兰人工智能(深圳)有限公司 Target tracking method and device
US20210112238A1 (en) * 2020-12-22 2021-04-15 Intel Corporation Method and system of image processing with multi-object multi-view association
CN112883819A (en) * 2021-01-26 2021-06-01 恒睿(重庆)人工智能技术研究院有限公司 Multi-target tracking method, device, system and computer readable storage medium
CN113468950A (en) * 2021-05-12 2021-10-01 东风汽车股份有限公司 Multi-target tracking method based on deep learning in unmanned driving scene

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6310288B2 (en) * 2014-03-20 2018-04-11 日本ユニシス株式会社 Image processing apparatus and three-dimensional object tracking method
US10634778B2 (en) * 2014-10-21 2020-04-28 Texas Instruments Incorporated Camera assisted tracking of objects in a radar system
CN108509848B (en) * 2018-02-13 2019-03-05 视辰信息科技(上海)有限公司 The real-time detection method and system of three-dimension object
CN110163889A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 Method for tracking target, target tracker, target following equipment
CN110782492B (en) * 2019-10-08 2023-03-28 三星(中国)半导体有限公司 Pose tracking method and device
CN113228103A (en) * 2020-07-27 2021-08-06 深圳市大疆创新科技有限公司 Target tracking method, device, unmanned aerial vehicle, system and readable storage medium
CN112507949A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Target tracking method and device, road side equipment and cloud control platform

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276783A (en) * 2019-04-23 2019-09-24 上海高重信息科技有限公司 A kind of multi-object tracking method, device and computer system
CN110378259A (en) * 2019-07-05 2019-10-25 桂林电子科技大学 A kind of multiple target Activity recognition method and system towards monitor video
CN112184772A (en) * 2020-09-30 2021-01-05 深兰人工智能(深圳)有限公司 Target tracking method and device
US20210112238A1 (en) * 2020-12-22 2021-04-15 Intel Corporation Method and system of image processing with multi-object multi-view association
CN112883819A (en) * 2021-01-26 2021-06-01 恒睿(重庆)人工智能技术研究院有限公司 Multi-target tracking method, device, system and computer readable storage medium
CN113468950A (en) * 2021-05-12 2021-10-01 东风汽车股份有限公司 Multi-target tracking method based on deep learning in unmanned driving scene

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765031A (en) * 2024-02-21 2024-03-26 四川盎芯科技有限公司 image multi-target pre-tracking method and system for edge intelligent equipment
CN117765031B (en) * 2024-02-21 2024-05-03 四川盎芯科技有限公司 Image multi-target pre-tracking method and system for edge intelligent equipment

Also Published As

Publication number Publication date
CN114549578A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
US9953506B2 (en) Alarming method and device
CN106651955B (en) Method and device for positioning target object in picture
CN111105454B (en) Method, device and medium for obtaining positioning information
WO2020156341A1 (en) Method and apparatus for detecting moving target, and electronic device and storage medium
WO2023077754A1 (en) Target tracking method and apparatus, and storage medium
KR101712301B1 (en) Method and device for shooting a picture
JP2017538300A (en) Unmanned aircraft shooting control method, shooting control apparatus, electronic device, computer program, and computer-readable storage medium
WO2019006769A1 (en) Following-photographing method and device for unmanned aerial vehicle
CN110853095B (en) Camera positioning method and device, electronic equipment and storage medium
WO2022021872A1 (en) Target detection method and apparatus, electronic device, and storage medium
CN114267041B (en) Method and device for identifying object in scene
CN113643356A (en) Camera pose determination method, camera pose determination device, virtual object display method, virtual object display device and electronic equipment
WO2022099988A1 (en) Object tracking method and apparatus, electronic device, and storage medium
US20230048952A1 (en) Image registration method and electronic device
CN110012208B (en) Photographing focusing method and device, storage medium and electronic equipment
US20220345621A1 (en) Scene lock mode for capturing camera images
CN114430457A (en) Shooting method, shooting device, electronic equipment and storage medium
WO2019233299A1 (en) Mapping method and apparatus, and computer readable storage medium
WO2023240401A1 (en) Camera calibration method and apparatus, and readable storage medium
CN114898074A (en) Three-dimensional information determination method and device, electronic equipment and storage medium
CN117974772A (en) Visual repositioning method, device and storage medium
CN117115244A (en) Cloud repositioning method, device and storage medium
CN118154678A (en) Image processing method, device, medium, equipment and chip
CN117710779A (en) Stability coefficient determination method and device, electronic equipment and storage medium
CN115861431A (en) Camera registration method and device, communication equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22888793

Country of ref document: EP

Kind code of ref document: A1