CN117152204A - Target tracking method, device, equipment and storage medium - Google Patents

Target tracking method, device, equipment and storage medium

Info

Publication number
CN117152204A
Authority
CN
China
Prior art keywords
motion
reference position
maps
frames
dimensional
Prior art date
Legal status
Pending
Application number
CN202310967443.7A
Other languages
Chinese (zh)
Inventor
鲍慊
刘武
孙宇
梅涛
Current Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Jingdong Technology Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd, Jingdong Technology Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202310967443.7A
Publication of CN117152204A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30241 Trajectory

Abstract

Embodiments of the present disclosure provide a target tracking method, apparatus, device, and medium. The method described herein comprises: extracting a plurality of temporal image feature maps and a plurality of optical flow feature maps between a plurality of frames of a target video; determining a three-dimensional reference position of at least one object in each of the plurality of frames based on the plurality of temporal image feature maps, the at least one object comprising a target object; determining at least a plurality of motion bias maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each motion bias map indicating a motion bias of the at least one object in a corresponding frame relative to a previous frame in three-dimensional space; and determining a motion trajectory of the target object in the three-dimensional space based at least on the plurality of motion bias maps by referencing the three-dimensional reference position of the target object in each of the plurality of frames. Accurate and efficient target tracking can thereby be realized based on the motion bias maps.

Description

Target tracking method, device, equipment and storage medium
Technical Field
Example embodiments of the present disclosure relate generally to the field of computer vision and, more particularly, relate to object tracking methods, apparatuses, devices, and computer-readable storage media.
Background
With the development of computer technology, image processing technology has also advanced. By means of computer vision and image processing methods, a target object in a video or image sequence can be automatically identified, tracked, and positioned. For example, one or more target objects of interest may be tracked across successive frames to provide information such as their 3D shape, pose, position, velocity, or trajectory. However, target tracking still faces challenges in practical applications, such as occlusion, illumination changes, and scale changes. It is therefore both important and urgent to select an appropriate target tracking technique for a specific scene and requirement.
Disclosure of Invention
In a first aspect of the present disclosure, a target tracking method is provided. The method comprises the following steps: extracting a plurality of temporal image feature maps and a plurality of optical flow feature maps among a plurality of frames of a target video; determining a three-dimensional reference position of at least one object in each of a plurality of frames based on the plurality of temporal image feature maps, the at least one object comprising a target object; determining at least a plurality of motion bias maps for a plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each motion bias map indicating a motion bias of at least one object in a corresponding frame relative to a previous frame in three-dimensional space; and determining a motion trajectory of the target object in the three-dimensional space based at least on the plurality of motion bias maps by referencing the three-dimensional reference position of the target object in each of the plurality of frames.
In a second aspect of the present disclosure, a target tracking device is provided. The device comprises: an extraction module configured to extract a plurality of temporal image feature maps and a plurality of optical flow feature maps between a plurality of frames of a target video; a three-dimensional reference position determination module configured to determine a three-dimensional reference position of at least one object in each of a plurality of frames, the at least one object comprising a target object, based on the plurality of time-domain image feature maps; a motion bias map determination module configured to determine at least a plurality of motion bias maps for a plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each motion bias map indicating a motion bias of at least one object in a corresponding frame relative to a previous frame in three-dimensional space; and a motion trajectory determination module configured to determine a motion trajectory of the target object in a three-dimensional space based at least on the plurality of motion bias maps by referencing a three-dimensional reference position of the target object in each of the plurality of frames.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device comprises at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit cause the electronic device to perform the method of the first aspect of the disclosure.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon a computer program executable by a processor to perform the method according to the first aspect of the present disclosure.
It should be understood that the content described in this summary is not intended to identify key or essential features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages, and aspects of various implementations of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which like or similar reference numerals denote like or similar elements:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a flow chart for a target tracking process according to some embodiments of the present disclosure;
FIG. 3A illustrates a schematic diagram of an example model architecture for target tracking, according to some embodiments of the present disclosure;
FIG. 3B illustrates a schematic diagram of an example architecture of a detection model, according to some embodiments of the present disclosure;
FIG. 3C illustrates a schematic diagram of an example architecture of a first tracking model, according to some embodiments of the present disclosure;
FIG. 3D illustrates a schematic diagram of an example architecture of a second tracking model, according to some embodiments of the present disclosure;
FIG. 3E illustrates a schematic diagram of an example architecture of a morphology determination model, according to some embodiments of the present disclosure;
FIG. 4 illustrates a block diagram of an apparatus for target tracking according to some embodiments of the present disclosure; and
fig. 5 illustrates a block diagram of an electronic device in which one or more embodiments of the disclosure may be implemented.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, and application of any personal information involved comply with applicable laws and regulations and do not violate public order and good customs.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". Other explicit and implicit definitions are also possible below.
As used herein, a "model" can learn associations between inputs and outputs from training data, so that after training is completed a corresponding output can be generated for a given input. The generation of the model may be based on machine learning techniques. Deep learning is a class of machine learning algorithms that processes inputs and provides corresponding outputs through the use of multiple layers of processing units. A "model" may also be referred to herein as a "machine learning model," "machine learning network," or "network," and these terms are used interchangeably herein.
Machine learning generally includes three phases: a training phase, a testing phase, and an application phase (also referred to as an inference phase). In the training phase, a given model may be trained using a large amount of training data, iteratively updating parameter values until the model can obtain, from the training data, consistent inferences that meet the desired goal. Through training, the model may be considered able to learn the association between input and output (also referred to as an input-to-output mapping) from the training data. The parameter values of the trained model are then fixed. In the testing phase, test inputs are applied to the trained model to test whether the model can provide correct outputs, thereby determining the performance of the model. In the application phase, the model may be used to process actual inputs based on the trained parameter values to determine the corresponding outputs.
As briefly mentioned above, with target tracking techniques, the 3D shape, pose, trajectory, and so on of a target object can be tracked and recovered from monocular video captured by a camera. However, target tracking techniques can face challenges. For example, in some scenes, the target object interacts with and is occluded by other objects during its motion, which makes tracking the target object in complex scenes very difficult; in other scenarios, the camera itself is also moving while shooting the target object. Because monocular video has depth ambiguity, the 3D motion of the camera and of the target object needs to be accurately inferred from ambiguous image information in order to compute the 3D motion trajectory of the target object in the world coordinate system.
In some schemes, the target object is a human body, and an image containing the human body is input into a trained single-step network so that more complete information about the relative relationships between human bodies can be obtained, thereby enabling 3D shape and pose estimation of human bodies in multi-person scenes.
However, such a scheme neither explicitly tracks human motion trajectories in the time domain in its model design nor models the 3D motion of the camera, and it estimates the 3D pose and shape of each person from image features alone. Such methods are essentially based on 2D single-image representations and cannot model temporal information. When facing occlusion and similar conditions, their representations cannot exploit temporal motion consistency to achieve accurate and stable estimates. Furthermore, because camera motion is not modeled, the motion trajectory of the target object in the world coordinate system cannot be recovered.
In other approaches, the motion trajectory of the target object in the world coordinate system is inferred from motion priors by modeling the local motion of the target object. For example, using a multi-step network, each human body in the image is first detected, and a 2D trajectory of the target human body on the image is estimated using a human body tracking model. On this basis, the 3D shape and pose of each human body within each detection box are estimated, and the motion trajectory of each human body in the world coordinate system is then back-calculated by analyzing the motion of local body parts. For example, the trajectory of a running person in the world coordinate system can be calculated by analyzing the 3D motion of the running action.
However, in real scenes, the motion trajectory of a human body in the world coordinate system is not only related to the body's own motion but is also influenced by external objects or forces. For example, when rowing, riding, or skating, the human body itself may move only slightly or not at all, yet its 3D motion in the world coordinate system is often very pronounced. Such a scheme therefore lacks generality.
Still other schemes estimate the motion trajectory of the target object in camera space through a multi-step network, and then estimate the motion trajectory of the camera from the monocular video using a Structure from Motion (SfM) algorithm, so as to derive the motion trajectory of the target object in the world coordinate system.
However, such SfM-based camera motion estimation is designed for static scenes without moving objects and relies on static associated key points across multiple frames to solve for camera motion. In dynamic scenes where moving targets are tracked, stable static associated key points are difficult to find, so in practice the camera motion estimates are rather poor.
To at least partially address the above problems, as well as other potential problems in conventional approaches, the present disclosure provides an improved approach for target tracking. According to various embodiments of the present disclosure, a plurality of temporal image feature maps and a plurality of optical flow feature maps between a plurality of frames of a target video are extracted. Based on the plurality of temporal image feature maps, a three-dimensional reference position of at least one object in each of the plurality of frames is determined. The at least one object includes a target object. Based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, at least a plurality of motion bias maps for the plurality of frames are determined. Each motion bias map indicates a motion bias of the at least one object, in three-dimensional space, in a corresponding frame relative to a previous frame. A motion trajectory of the target object in three-dimensional space is determined based at least on the plurality of motion bias maps by referencing the three-dimensional reference positions of the target object in respective ones of the plurality of frames. Accurate and efficient target tracking can thereby be realized based on the motion bias maps. Moreover, no additional model design such as post-processing is needed, the algorithmic complexity is low, and the method is easy to apply in practice.
Example embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure may be implemented. In the environment 100, various client applications, such as image processing applications and three-dimensional modeling applications, may be installed on the terminal device 110. A model 130 is configured to process the target video, for example, to perform target tracking.
In some embodiments of the present disclosure, the model 130 may be deployed at the remote device 120. The terminal device 110 may communicate (e.g., via a network) with the remote device 120, on which the model 130 is stored, to perform model inference tasks, i.e., target tracking tasks for the target video. The target video may be stored directly at the remote device 120 or may be transmitted to the remote device 120 after being acquired by the terminal device 110. In some embodiments, the model 130 may also be deployed partially or wholly locally at the terminal device 110 and run by the terminal device 110 to perform the target tracking task for the target video. Embodiments of the present disclosure are not particularly limited in this regard.
The terminal device 110 may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile handset, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, multimedia tablet, Personal Communication System (PCS) device, personal navigation device, Personal Digital Assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, or game device, or any combination of the preceding, including accessories and peripherals of these devices, or any combination thereof. The remote device 120 may include, for example, a computing system/server, such as a mainframe, an edge computing node, a computing device in a cloud environment, a virtual machine, and so on. Although shown as a single device, the remote device 120 may include multiple physical devices. Further, while only a single terminal device 110 is shown, the remote device 120 or the model 130 deployed therein may be accessed by multiple terminal devices 110 to provide the inference capabilities of the model 130.
It should be understood that the structure and function of environment 100 are described for illustrative purposes only and are not meant to suggest any limitation as to the scope of the disclosure.
For the purpose of target tracking, the present disclosure adopts motion bias maps to characterize the temporal motion of a target object in camera space and/or in three-dimensional space (for example, the space corresponding to the world coordinate system), and tracks and recovers the three-dimensional motion trajectory of the target object with reference to the reference position of the target object. In some embodiments, three-dimensional reconstruction may also be used to recover the morphology of the target object.
According to various embodiments of the present disclosure, a single-step, end-to-end, target tracking network may be implemented. In some embodiments, remote device 120 may implement such a target tracking network using model 130, and training, testing, and application of model 130 are all end-to-end.
Fig. 2 illustrates a flow chart of a target tracking process 200 according to some embodiments of the present disclosure. Process 200 may be implemented in environment 100. Process 200 may be implemented at terminal device 110. In some embodiments, the terminal device 110 may utilize a model 130 running locally or at the remote device 120 to enable target tracking.
The terminal device 110 acquires a target video containing a target object. Such target video may include video or a sequence of images. The target object may include a human body, an animal, a vehicle, and the like. For example, terminal device 110 may obtain video containing the target object from a local or other storage device (e.g., other terminal devices 110 or a third party data platform, etc.). In some embodiments, the target video comprises monocular video captured by a dynamic camera.
The number of target objects to be tracked in the target video may be one or more. In practical scenarios, such video often includes multiple objects. The terminal device 110 may receive a designation of the target object (e.g., receive a user selection) and determine one or more of the multiple objects present in the target video as the target object. In some embodiments, the terminal device 110 may also automatically determine all objects in a reference frame (e.g., the first frame of the target video, or the first frame in which an object appears) as target objects.
At block 210, the terminal device 110 extracts feature maps between frames of the target video. Such feature maps include temporal image feature maps and optical flow feature maps. A temporal image feature map is a feature representation obtained by analyzing successive frames along the temporal dimension. Based on such a temporal image feature map, dynamic information about how pixels or local regions change over time can be obtained. For example, a frame difference method may express the difference between the current frame and the previous frame as a feature in order to detect changes caused by a moving object or scene. An optical flow feature map may be used to compute optical flow vectors and map them into a color or grayscale image, so as to more intuitively understand and analyze the motion pattern of the target object between successive frames.
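As a rough illustration of the two kinds of features mentioned above, the following sketch computes a frame-difference map and a dense optical flow field mapped to a color image with OpenCV. It is only a conceptual stand-in, not the learned feature extractors described below.

```python
# Illustrative sketch only: a crude frame-difference "temporal feature" and a dense
# optical-flow field mapped to an HSV color image. The patent's learned extractors
# are not reproduced here.
import cv2
import numpy as np

def temporal_and_flow_features(prev_bgr: np.ndarray, curr_bgr: np.ndarray):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)

    # Frame difference: highlights pixels that changed between frames.
    frame_diff = cv2.absdiff(curr_gray, prev_gray)

    # Dense optical flow (Farneback): per-pixel (dx, dy) motion vectors.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    # Map flow vectors to a color image: hue = direction, value = magnitude.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    flow_vis = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
    return frame_diff, flow, flow_vis
```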
In some embodiments, the terminal device 110 may utilize the model 130 for feature extraction. FIG. 3A illustrates a schematic diagram of an architecture of an example model 130 for target tracking, according to some embodiments of the present disclosure. The model 130 is generally divided into a feature extraction module 310 and a target tracking module 320.
In some embodiments, feature extraction module 310 includes image feature extractor 302. Terminal device 110 may extract a temporal image feature map between frames of the target video using image feature extractor 302.
As one example, the image feature extractor 302 may include an image backbone network and a temporal feature transfer module. For example, suppose object tracking is to be performed on consecutive frames f_1 through f_N of the target video 301. The terminal device 110 inputs the target video 301 into the image backbone network to extract image features of each single frame. Such an image backbone network includes, for example, a high-performance deep neural network such as ResNet or HRNet. Such an image backbone network may offer multi-resolution feature fusion, the ability to handle multi-scale variations, and sufficient network depth. It should be appreciated that any suitable image feature extractor 302 may be selected based on the overall requirements of the model 130; this is not limited in the present disclosure.
The temporal feature transfer module is configured to receive the image features and convert them into temporal image features 303, so that temporal information can be modeled. Such a temporal feature transfer module consists, for example, of a two-layer convolutional gated recurrent unit (GRU). A convolutional GRU combines a convolutional neural network (CNN) with a gated recurrent unit and can process temporal features across multiple frames. For example, the image backbone network receives the (i-1)-th frame and the i-th frame of the target video 301, extracts the image features of each frame, and outputs the image feature maps F_{i-1} and F_i. The temporal feature transfer module receives the image feature maps F_{i-1} and F_i and outputs a temporal image feature map F_i'.
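A minimal sketch of a convolutional GRU cell of the kind the temporal feature transfer module is described as using (two stacked layers); the channel sizes and kernel size here are illustrative assumptions, not values disclosed in the patent.

```python
# Minimal convolutional GRU cell; sizes are assumptions for illustration.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        p = k // 2
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update/reset gates
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)       # candidate state

    def forward(self, x, h):
        # x: image features F_i of the current frame; h: temporal state carried from frame i-1.
        zr = torch.sigmoid(self.conv_zr(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # temporal image features F_i'

# Two stacked cells, as the description suggests.
cells = nn.ModuleList([ConvGRUCell(256, 256), ConvGRUCell(256, 256)])
```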
In some embodiments, the feature extraction module 310 also includes an optical flow network 304. The terminal device 110 utilizes the optical flow network 304 to extract an optical flow feature map 305 between adjacent frames in the target video. Such an optical flow network 304 extracts an optical flow map from the previous frame to the current frame, for example using the RAFT algorithm. For example, the optical flow network 304 receives the (i-1)-th and i-th frames of the target video 301 and outputs an optical flow feature map O_i.
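For orientation only, a pre-trained RAFT model such as the one shipped with torchvision (assuming torchvision 0.12 or later) can produce such a frame-to-frame flow map; this is a readily available stand-in, not the optical flow network 304 itself.

```python
# Sketch: off-the-shelf RAFT from torchvision as a stand-in optical flow extractor.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()

@torch.no_grad()
def flow_between(frame_prev: torch.Tensor, frame_curr: torch.Tensor) -> torch.Tensor:
    # frame_*: (N, 3, H, W) RGB batches; H and W should be divisible by 8 for RAFT.
    batch = weights.transforms()(frame_prev, frame_curr)  # normalizes inputs as RAFT expects
    flow_predictions = model(*batch)                      # list of iterative refinements
    return flow_predictions[-1]                           # final flow map O_i, shape (N, 2, H, W)
```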
With continued reference to fig. 2, at block 220, terminal device 110 determines a three-dimensional reference position of an object in each of a plurality of frames based on the plurality of temporal image feature maps.
In some embodiments, the target tracking module 320 includes a detection model 330. FIG. 3B illustrates a schematic diagram of an example architecture 300B of the detection model, according to some embodiments of the present disclosure. The detection model 330 receives the temporal image feature map 303 and processes it to output a three-dimensional reference position 336. The three-dimensional reference position 336 may be a center position of the target object, for example represented by center point coordinates. As one example, when the target object is a human body, the three-dimensional reference position 336 may be, for example, the center of gravity, a shoulder position, or the torso center of the human body. The three-dimensional reference position 336 may also be represented by a bounding box describing the spatial extent occupied by the target object, for example by the coordinates of the upper-left and lower-right corners of the bounding box, or by its center point coordinates, width, and height. Additionally or alternatively, the detection model 330 may also output a confidence 333 corresponding to the three-dimensional reference position 336. A higher confidence indicates a higher probability that a pixel belongs to the three-dimensional reference position of an object.
In some embodiments, the terminal device 110 may determine, through a three-dimensional reorganization process based on the temporal image feature map 303, a plurality of reference position probability maps 332 and a plurality of reference position offset vector maps 335 for the plurality of frames. Each reference position probability map 332 indicates the confidence 333 that each pixel in the corresponding frame belongs to a reference position of an object. Each reference position offset vector map 335 indicates the offset of each pixel in the corresponding frame relative to a reference position of an object. Further, the terminal device 110 determines candidate reference positions of the object in each frame based on the confidences 333 in the plurality of reference position probability maps, and determines the three-dimensional reference position 336 of the object based on the candidate reference positions and their corresponding offsets.
As one example, the target video 301 includes consecutive frames f_1 through f_N, and the target object is a human body. The detection model 330 receives the temporal image feature map F_i' output by the feature extraction module 310 for the (i-1)-th and i-th frames of the target video 301, and, after the three-dimensional reorganization process, outputs a reference position probability map 332 (e.g., represented as a tensor) and a reference position offset vector map 335 (e.g., represented as a tensor) for the i-th frame. Such a reference position probability map 332 includes, for example, a rough position of the center of gravity of the human body. Such a reference position offset vector map 335 includes, for example, a precise localization offset vector Δt_i of the human body. By combining the rough position of the center of gravity of the human body with the precise localization offset vector Δt_i, the detection model 330 outputs the three-dimensional reference position t_i of the human body and its confidence c_i for the i-th frame.
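A sketch of how a three-dimensional reference position t_i and confidence c_i could be decoded from a reference position probability map and a reference position offset vector map; the tensor shapes and the local-maximum peak picking are assumptions for illustration.

```python
# Sketch: decode 3D reference positions from a probability map and an offset vector map.
import torch

def decode_reference_positions(prob_map: torch.Tensor,   # (D, H, W) confidences in [0, 1]
                               offset_map: torch.Tensor, # (3, D, H, W) per-cell offsets Δt
                               conf_thresh: float = 0.3):
    # Keep confident local maxima as candidate (coarse) reference positions.
    pooled = torch.nn.functional.max_pool3d(
        prob_map[None, None], kernel_size=3, stride=1, padding=1)[0, 0]
    peaks = (prob_map == pooled) & (prob_map > conf_thresh)

    coords = peaks.nonzero(as_tuple=False).float()        # (K, 3) coarse grid positions
    conf = prob_map[peaks]                                # (K,) confidences c_i
    # Refine each coarse position with its predicted offset vector Δt_i.
    d, h, w = coords[:, 0].long(), coords[:, 1].long(), coords[:, 2].long()
    refined = coords + offset_map[:, d, h, w].t()         # (K, 3) reference positions t_i
    return refined, conf
```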
To solve the problem of mutual occlusion between objects, the terminal device 110 may determine the three-dimensional reference position 336 based on the reference position probability map 332 and the reference position offset vector map 335 at two perspectives.
In some embodiments, the terminal device 110 may determine, through a three-dimensional reorganization process based on the temporal image feature map 303, a first reference position probability map and a first reference position offset vector map at a first view angle, and a second reference position probability map and a second reference position offset vector map at a second view angle, respectively. The reference position probability map 332 of a given frame is then determined by combining the first and second reference position probability maps of that frame, and the reference position offset vector map 335 of that frame is determined by combining the first and second reference position offset vector maps.
In some embodiments, the terminal device 110 may determine the reference position probability map 332 and the reference position offset vector map 335 through a bird's-eye-view (BEV) based three-dimensional reorganization process applied to the temporal image feature map 303. In this case, the first view angle comprises the main view angle and the second view angle comprises the bird's eye view angle.
Continuing with the example above, the detection model 330 receives the temporal image feature map F_i' output by the feature extraction module 310 for the (i-1)-th and i-th frames of the target video 301, and outputs a reference position probability map 332 and a reference position offset vector map 335 after the BEV-based three-dimensional reorganization process. Such a reference position probability map 332 includes, for example, a rough position of the center of gravity of the human body at the main view angle and a rough position at the bird's eye view angle. Such a reference position offset vector map 335 includes, for example, a precise localization offset vector of the center of gravity of the human body at the main view angle and one at the bird's eye view angle. By combining these rough positions and localization offset vectors from the main view angle and the bird's eye view angle, the detection model 330 outputs the three-dimensional reference position t_i of the target object in the i-th frame and its confidence c_i.
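One simple way to combine a main-view map with a bird's-eye-view map into a single three-dimensional volume is sketched below; the multiplicative fusion rule is an assumption, as the patent text does not disclose the exact combination operator.

```python
# Sketch: fuse main-view and bird's-eye-view likelihoods into a 3D volume (assumed rule).
import torch

def fuse_views(front_prob: torch.Tensor,  # (H, W): main-view likelihood over the image plane
               bev_prob: torch.Tensor     # (D, W): bird's-eye-view likelihood over depth x width
               ) -> torch.Tensor:
    # Broadcast the two 2D maps into a shared (D, H, W) volume and combine multiplicatively,
    # so a cell is confident only if both views agree; feed the result to peak picking.
    return bev_prob[:, None, :] * front_prob[None, :, :]
```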
With continued reference to fig. 2, at block 230, terminal device 110 may determine at least a plurality of motion bias maps for a plurality of frames based on the plurality of temporal image feature maps 303 and the plurality of optical flow feature maps 305. In block 240, the terminal device 110 may determine a motion trajectory of the target object in three-dimensional space based at least on the plurality of motion bias maps by referencing three-dimensional reference positions of the target object in respective ones of the plurality of frames.
In the example of FIG. 3A, the target tracking module 320 may determine motion bias maps for all objects in each frame based on the temporal image feature map 303 and the optical flow feature map 305, and then sample the motion bias maps at the three-dimensional reference position 336 of the target object to determine the motion trajectory 306 of the target object. In some embodiments, the target tracking module 320 further includes a first tracking model 340. The terminal device 110 may utilize such a first tracking model to track the motion trajectory 306 of the target object. In some embodiments, the target tracking module 320 further includes a second tracking model 350. The terminal device 110 may determine the target object from all objects using such a second tracking model. In some embodiments, the target tracking module 320 further includes a morphology determination model 360. The terminal device 110 may perform three-dimensional reconstruction of the shape and pose of the target object using such a morphology determination model 360, and then combine the motion trajectory of the target object with the three-dimensional reconstruction result to output the motion trajectory 306 of the target object in three-dimensional space.
It should be appreciated that the functional partitioning of the detection model 330, the first tracking model 340, the second tracking model 350, and the morphology determination model 360 included by the target tracking module 320 is merely exemplary. These models may be further split or combined and the extracted feature maps may also be processed in parallel or at least partially in parallel. The target tracking module 320 may also include any suitable model to implement an end-to-end network of motion trajectories from receiving target video to outputting target objects.
In some embodiments, the first tracking model 340 determines a plurality of second motion bias maps based on the temporal image feature map 303 and the optical flow feature map 305. The first tracking model 340 determines a motion trajectory of the target object in three-dimensional space based at least on such a second motion bias map. Such a three-dimensional space is, for example, a space in a world coordinate system with the initial position of the target object in the reference frame as the origin. Such a three-dimensional space may be, for example, a space in a world coordinate system with the center position of one of the target objects in the reference frame as the origin, if there are a plurality of target objects.
In some embodiments, the second tracking model 350 determines the plurality of first motion bias maps based on the temporal image feature map 303 and the optical flow feature map 305. The second tracking model 350 determines a motion trajectory of the target object in three-dimensional space based at least on such a first motion bias map. Such a three-dimensional space is, for example, a three-dimensional space based on a camera coordinate system. Thus, the first motion bias map output by the second tracking model 350 and the second motion bias map output by the first tracking model 340 include motion biases in different coordinate systems.
The first tracking model 340, the second tracking model 350, and the morphology determination model 360 will be described below with reference to fig. 3C to 3E, respectively.
FIG. 3C illustrates a schematic diagram of an example architecture 300C of the first tracking model, according to some embodiments of the present disclosure. The first tracking model 340 receives the temporal image feature map 303 and the optical flow feature map 305 of each frame, and outputs the motion trajectory 346 of the target object in the world coordinate system.
In some embodiments, the first tracking model 340 may determine the second motion bias map 344 for each frame based on the temporal image feature map 303 and the optical flow feature map 305. The second motion bias map 344 indicates the three-dimensional motion bias 345 of all objects in the corresponding frame relative to the previous frame in the world coordinate system. Further, the first tracking model 340 determines a three-dimensional orientation map 342 for each frame based on the plurality of temporal image feature maps 303 and the plurality of optical flow feature maps 305. The three-dimensional orientation map 342 indicates the three-dimensional orientation 343 of all objects in the corresponding frame in the world coordinate system.
In some embodiments, the first tracking model 340 may determine the second motion bias map 344 through a BEV-based three-dimensional reorganization process.
In some embodiments, the first tracking model 340 may refer to a three-dimensional reference position of the target object, determine a motion trajectory 346 of the target object in the world coordinate system based on the second motion bias map 344 and the three-dimensional orientation map 342.
As one example, the first tracking model 340 includes, for example, a deep convolutional neural network composed of two ResNets. The first ResNet receives the temporal image feature map F_i' and the optical flow feature map O_i output by the feature extraction module 310 for the (i-1)-th and i-th frames of the target video 301, and estimates the second motion bias map 344 in the world coordinate system (e.g., represented as a tensor). The second ResNet receives the temporal image feature map F_i' and the optical flow feature map O_i and estimates the three-dimensional orientation map 342 in the world coordinate system. The first tracking model 340 samples the three-dimensional orientation map 342 at the three-dimensional reference positions whose confidence exceeds a confidence threshold and outputs the three-dimensional orientation τ_i of the target object. The first tracking model 340 likewise samples the second motion bias map 344 at those three-dimensional reference positions and outputs the three-dimensional motion bias ΔT_i of the target object. The first tracking model 340 then determines the motion trajectory 346 in the world coordinate system based on the three-dimensional orientation τ_i and the three-dimensional motion bias ΔT_i of the target object. For example, the motion trajectory of the k-th target object in the i-th frame may be characterized by its three-dimensional motion bias ΔT_i^k and three-dimensional orientation τ_i^k.
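The sampling-and-accumulation logic of the first tracking model 340 might look roughly as follows; the tensor layouts, the integer grid indexing, and the simple additive trajectory update are assumptions for illustration.

```python
# Sketch: sample bias/orientation maps at the target's reference positions and
# accumulate the biases into a world-frame trajectory.
import torch

def accumulate_world_trajectory(bias_maps: list,      # per frame: (3, D, H, W) ΔT map
                                orient_maps: list,    # per frame: (3, D, H, W) τ map
                                ref_positions: list): # per frame: (3,) grid index of the target
    position = torch.zeros(3)   # world origin at the target's initial position
    trajectory, orientations = [], []
    for bias_map, orient_map, ref in zip(bias_maps, orient_maps, ref_positions):
        d, h, w = ref.long().tolist()
        delta_T = bias_map[:, d, h, w]   # ΔT_i sampled at the target's reference position
        tau = orient_map[:, d, h, w]     # τ_i sampled at the same cell
        position = position + delta_T    # world-frame position update
        trajectory.append(position.clone())
        orientations.append(tau)
    return torch.stack(trajectory), torch.stack(orientations)
```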
FIG. 3D illustrates a schematic diagram of an example architecture 300D of the second tracking model, according to some embodiments of the present disclosure. The second tracking model 350 receives the temporal image feature map 303 and the optical flow feature map 305 of each frame, and outputs the motion trajectory 355 of the target object in the camera coordinate system.
In some embodiments, the second tracking model 350 may determine the first motion bias map for each frame based on the temporal image feature map and the optical flow feature map. The first motion bias map indicates the motion bias of all objects in the corresponding frame relative to the previous frame in the camera coordinate system. Based on such a first motion bias map 351, the second tracking model 350 may determine a motion trajectory 355 of the target object in the camera coordinate system. This motion trajectory may serve as intermediate data during model training and may also be output during model application.
In some embodiments, the second tracking model 350 may determine the first motion bias map 351 through a BEV-based three-dimensional reorganization process.
In some embodiments, the second tracking model 350 identifies the target object from all objects based on the tracking identifications of the objects. Further, the second tracking model 350 may determine the motion trajectory 355 of the target object in the camera coordinate system from the three-dimensional reference position of the target object in each frame and the first motion bias map 351. Correspondingly, the first tracking model 340 may determine the motion trajectory 346 of the target object in the world coordinate system from the three-dimensional reference position of the target object in each frame and the second motion bias map 344.
As an example, the second tracking model 350 includes, for example, a ResNet-based deep convolutional neural network. Such a ResNet receives the temporal image feature map F_i' and the optical flow feature map O_i output by the feature extraction module 310 for the (i-1)-th and i-th frames of the target video 301, and estimates the first motion bias map 351 in the camera coordinate system (e.g., represented as a tensor). The first motion bias map 351 includes the three-dimensional motion biases 352 of all objects. The second tracking model 350 receives the three-dimensional reference positions 336 and corresponding confidences 333 of all objects from the detection model 330 and inputs them, together with the three-dimensional motion biases 352, into a memory unit 353. Based on the tracking identifiers 354 output by the memory unit 353, the second tracking model 350 can distinguish the target object from all objects.
In some embodiments, the second tracking model 350 determines an initial reference position of the target object in a reference frame based on a user selection of the target object in the reference frame of the target video 301. Such a target object is assigned a target tracking identification. The second tracking model 350 determines the first motion bias map 351 based on the temporal image feature map 303 and the optical flow feature map 305, and extracts the three-dimensional motion biases 352 of all objects from the first motion bias map 351. The three-dimensional motion bias 352 may indicate a three-dimensional change in an object's position across adjacent frames. Further, the second tracking model 350 may determine the respective tracking identifications 354 of the objects other than the target object by matching the three-dimensional reference positions 336, confidences 333, and three-dimensional motion biases 352 of all objects in each frame against the initial position of the target object. The second tracking model 350 may then determine the target object from all objects based on their respective tracking identifications 354.
As one example, the terminal device 110 may receive a user's designation and determine the target object from the first frame of the target video 301. The target object is assigned the target tracking identifier 0. The memory unit 353 stores the initial reference position of the target object together with the target tracking identifier 0. For each frame, the second tracking model 350 matches the three-dimensional reference positions and three-dimensional motion biases of all objects whose confidence exceeds the confidence threshold against the initial reference position of the target object. For a matched object, the second tracking model 350 determines that it is the target object and its tracking identifier is 0. For an unmatched object, such as an object newly appearing relative to the first frame, the second tracking model 350 assigns a new tracking identifier, such as tracking identifier 1.
Further, the second tracking model 350 samples the first motion bias map 351 at the three-dimensional reference position of the target object whose confidence exceeds the confidence threshold, and outputs the three-dimensional motion bias Δm_i of the target object. The second tracking model 350 then determines the motion trajectory 355 of the target object in the camera coordinate system based on the three-dimensional motion bias Δm_i of the target object in each frame. For example, the motion trajectory of the k-th target object up to the (i-1)-th frame, combined with the three-dimensional motion bias Δm_i^k, gives its motion trajectory in the i-th frame.
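A sketch of the memory-unit bookkeeping described above: confident detections are matched to stored reference positions, matched objects keep their tracking identifiers, unmatched objects receive new ones, and the target keeps identifier 0. The distance-based matching rule and the threshold are assumptions.

```python
# Sketch: track-ID assignment via a simple memory of last-seen reference positions,
# plus accumulation of camera-frame motion biases for the target (ID 0).
import torch

class TrackMemory:
    def __init__(self, init_target_pos: torch.Tensor, match_thresh: float = 1.0):
        self.positions = {0: init_target_pos.clone()}  # tracking id -> last reference position
        self.next_id = 1
        self.match_thresh = match_thresh

    def update(self, ref_positions: torch.Tensor,  # (K, 3) confident detections in frame i
               motion_biases: torch.Tensor):       # (K, 3) Δm_i sampled for those detections
        ids = []
        for pos, bias in zip(ref_positions, motion_biases):
            predicted_prev = pos - bias            # where this object should have been in frame i-1
            dists = {tid: torch.norm(predicted_prev - p) for tid, p in self.positions.items()}
            tid, d = min(dists.items(), key=lambda kv: kv[1])
            if d > self.match_thresh:              # no match: a newly appearing object
                tid = self.next_id
                self.next_id += 1
            self.positions[tid] = pos.clone()
            ids.append(tid)
        return ids  # the target object is the detection whose id is 0
```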
Fig. 3E illustrates a schematic diagram of an example architecture of a morphology determination model 360, according to some embodiments of the present disclosure. The morphology determination model 360 receives the time-domain image feature map 303 and the optical flow feature map 305 of each frame, and outputs the shape and posture 363 of the target object.
In some embodiments, the morphology determination model 360 determines a morphology feature map 361 for all objects of each frame based on the temporal image feature map 303 and the optical flow feature map 305. Such a morphology feature map 361 indicates the shape and pose of all objects in the corresponding frame. Further, the morphology determination model 360 receives the determined three-dimensional reference position and confidence 341 of the target object of the corresponding frame from the second tracking model 350 and determines the shape and pose 363 of the target object of that frame from the morphology feature map 361.
As one example, the morphology determination model 360 is, for example, a Skinned Multi-Person Linear (SMPL) model for modeling and generating multi-person body poses. Such a model can represent each person's body shape, joint angles, and limb deformations through parameterization. Specifically, based on the temporal image feature map of the i-th frame and the optical flow feature map O_i, the SMPL model determines mesh parameters (e.g., represented as a tensor) and samples the feature vector of the human body at the body center-of-gravity position with the highest confidence corresponding to the target human body. Such a feature vector is input to a fully connected layer 362 to estimate the SMPL parameters of the human body, such as the shape parameter θ_i and the pose parameter β_i.
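A sketch of such a fully connected regression head that samples the morphology feature map at the target's highest-confidence center and regresses SMPL parameters; the feature dimension and the 10-shape/72-pose split are common SMPL conventions assumed here, not figures taken from the patent.

```python
# Sketch: sample a per-person feature vector at the target's center cell and regress
# SMPL shape and pose parameters with a small fully connected head.
import torch
import torch.nn as nn

class SMPLHead(nn.Module):
    def __init__(self, feat_dim: int = 256, n_shape: int = 10, n_pose: int = 72):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, n_shape + n_pose))
        self.n_shape = n_shape

    def forward(self, morph_feature_map: torch.Tensor,  # (C, H, W) morphology features, frame i
                center_yx: torch.Tensor):               # (2,) highest-confidence center of target
        y, x = center_yx.long().tolist()
        feat = morph_feature_map[:, y, x]                # sampled per-person feature vector
        params = self.fc(feat)
        shape, pose = params[:self.n_shape], params[self.n_shape:]
        return shape, pose                               # SMPL shape and pose parameters
```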
The various models included by the object tracking module 320 are described above. In some embodiments, the target tracking module 320 is further configured to reconstruct the target object in three dimensions based on the motion trajectories of the target object in three dimensions and the shape and pose at each frame. Based on the three-dimensional reconstruction results, the target tracking module 320 may output a motion trajectory having a three-dimensional morphology of the target object, such as shown by the motion trajectory 306 of the target object in the example of fig. 3A.
In some embodiments, referring to FIG. 1, the terminal device 110 utilizes the model 130 to implement target tracking. The application of the model 130 is described above. In some embodiments, the model 130 may be trained by the remote device 120, by the terminal device 110, or by other devices, or in a cloud environment. During the model training phase, the model 130 receives a sample video as input and is supervised with the three-dimensional reference position of the sample object in the sample video, its two-dimensional form projected into the image, and its two-dimensional positions in the camera coordinate system and in the world coordinate system. The loss function is built from the ground-truth and predicted values of these quantities. In the model application phase, the model 130 receives the target video and the specified target object, and the morphology of the target object in each frame and its motion trajectories in the world coordinate system and the camera coordinate system can be obtained end to end in a single step.
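The supervision described above might be assembled into a multi-task loss along the following lines; the particular loss types and weights are assumptions, since the text only lists which quantities are supervised.

```python
# Sketch: multi-task training loss over the supervised quantities (assumed L1/L2 choices).
import torch
import torch.nn.functional as F

def total_loss(pred, gt, w=(1.0, 1.0, 1.0, 1.0)):
    # pred / gt: dicts of tensors produced by the network and taken from annotations.
    l_ref3d = F.l1_loss(pred["ref_pos_3d"], gt["ref_pos_3d"])        # 3D reference positions
    l_proj2d = F.l1_loss(pred["keypoints_2d"], gt["keypoints_2d"])   # 2D form projected to image
    l_cam = F.mse_loss(pred["traj_camera"], gt["traj_camera"])       # camera-frame trajectory
    l_world = F.mse_loss(pred["traj_world"], gt["traj_world"])       # world-frame trajectory
    return w[0] * l_ref3d + w[1] * l_proj2d + w[2] * l_cam + w[3] * l_world
```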
In summary, the present disclosure proposes a target tracking network based on a motion bias map representation, which can estimate the three-dimensional shape, pose, and motion trajectory of a target object in the world coordinate system from monocular video shot by a dynamic camera. Such a target tracking network enables a single-step, end-to-end implementation of the algorithm.
Example apparatus
FIG. 4 illustrates a block diagram of an apparatus 400 for target tracking according to some embodiments of the present disclosure. The apparatus 400 may be implemented in the remote device 120 and/or the terminal device 110. The various modules/components in the apparatus 400 may be implemented in hardware, software, firmware, or any combination thereof.
Apparatus 400 includes an extraction module 410 configured to extract a plurality of temporal image feature maps and a plurality of optical flow feature maps between a plurality of frames of a target video. The apparatus 400 further comprises a three-dimensional reference position determination module 420 configured to determine a three-dimensional reference position of at least one object in each of a plurality of frames based on the plurality of temporal image feature maps. The at least one object includes a target object. The apparatus 400 further comprises a motion bias map determination module 430 configured to determine at least a plurality of motion bias maps for a plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps. Each motion offset map indicates a motion offset of at least one object in three-dimensional space in a corresponding frame relative to a previous frame. The apparatus 400 further comprises a motion trajectory determination module 440 configured to determine a motion trajectory of the target object in three-dimensional space based at least on the plurality of motion bias maps by referencing three-dimensional reference positions of the target object in respective ones of the plurality of frames.
In some embodiments, the three-dimensional reference position determination module 420 is further configured to determine, by a three-dimensional rebinning process, a plurality of reference position probability maps for a plurality of frames, each reference position probability map indicating a confidence that each pixel in the corresponding frame belongs to a reference position of the object, and a plurality of reference position offset vector maps, each reference position offset vector map indicating an offset of the pixel in the corresponding frame relative to the reference position of the object, based on the plurality of temporal image feature maps; determining candidate reference positions for at least one object in each of the plurality of frames based on the confidence levels in the plurality of reference position probability maps; and determining a three-dimensional reference position of the at least one object in each of the plurality of frames based on the candidate reference position and the candidate reference position corresponding offset for the at least one object in each of the plurality of frames.
In some embodiments, the three-dimensional reference position determination module 420 is further configured to determine a plurality of first reference position probability maps and a plurality of first reference position offset vector maps for the plurality of frames at the first perspective based on the plurality of temporal image feature maps; determining a plurality of second position probability maps and a plurality of second reference position offset vector maps for the plurality of frames at a second view angle based on the plurality of temporal image feature maps; for each frame of the plurality of frames, determining a reference position probability map for the frame by combining the first reference position probability map and the second reference position probability map for the frame; and for each frame of the plurality of frames, determining a reference position offset vector map for the frame by combining the first reference position offset vector map and the second reference position offset vector map for the frame.
In some embodiments, the first viewing angle comprises a main viewing angle and the second viewing angle comprises a bird's eye view.
In some embodiments, the motion bias map determination module 430 is further configured to determine a plurality of first motion bias maps for a plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each first motion bias map indicating a motion bias of at least one object in a corresponding frame relative to a previous frame under a camera coordinate system; and wherein the motion profile of the target object comprises a motion profile of the target object under a camera coordinate system.
In some embodiments, the motion bias map determination module 430 is further configured to determine a plurality of second motion bias maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each second motion bias map indicating a three-dimensional motion bias of the at least one object in the corresponding frame relative to the previous frame in the world coordinate system; and determining a plurality of three-dimensional orientation maps for the plurality of frames based on the plurality of time-domain image feature maps and the plurality of optical flow feature maps, each three-dimensional orientation map indicating a three-dimensional orientation of at least one object in the corresponding frame under the world coordinate system.
In some embodiments, the motion trajectory determination module 440 is further configured to determine a motion trajectory of the target object in the world coordinate system based on the plurality of second motion bias maps and the plurality of three-dimensional orientation maps by referencing the three-dimensional reference position of the target object.
In some embodiments, the motion trajectory determination module 440 is further configured to identify a target object from the at least one object based on the respective tracking identification of the at least one object; extracting three-dimensional motion offsets of the target object from at least the plurality of motion offset maps, respectively, by three-dimensional reference positions of the target object in each of the plurality of frames; and determining a motion trajectory of the target object in three-dimensional space based at least on the extracted three-dimensional motion bias of the target object.
In some embodiments, the motion trajectory determination module 440 is further configured to determine an initial reference position of the target object in the reference frame based on a user selection of the target object in the reference frames of the plurality of frames, the target object being assigned a target tracking identity; determining a three-dimensional motion offset of at least one object in each frame based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, the three-dimensional motion offset being indicative of a three-dimensional positional change of the at least one object across adjacent frames; determining respective tracking identifications of other objects except the target object in at least one object based on the matching result of the three-dimensional reference position, the confidence coefficient and the three-dimensional motion bias of the at least one object in each frame and the initial position of the target object; and determining a target object from the at least one object based on the respective tracking identifications of the at least one object.
In some embodiments, apparatus 400 further comprises a morphology feature map determination module configured to determine a plurality of object morphology feature maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each object morphology feature map indicating a shape and pose of the at least one object in a corresponding frame; and determine the shape and pose of the target object in the plurality of frames from the plurality of object morphology feature maps based on the three-dimensional reference position of the target object in the plurality of frames.
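By way of example and not limitation, the shape and pose of the target object could be read out of the object morphology feature map at its reference position as below; the channel split (a handful of shape coefficients followed by pose parameters, in the style of parametric body models) is an assumption.

```python
# Illustrative only: slice the morphology feature vector at the reference position
# into shape and pose parameters.
import numpy as np

def shape_and_pose(morph_map: np.ndarray, ref_pixel, n_shape: int = 10):
    r, c = ref_pixel
    params = np.asarray(morph_map[r, c])
    return params[:n_shape], params[n_shape:]   # (shape coefficients, pose parameters)
```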
In some embodiments, the apparatus 400 further comprises a three-dimensional reconstruction module configured to three-dimensionally reconstruct the target object in the plurality of frames based on the motion trajectory of the target object in three-dimensional space and the shape and pose of the target object in the plurality of frames.
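By way of example and not limitation, a per-frame reconstruction could be obtained by posing some parametric template with the recovered shape and pose and translating it to the corresponding trajectory position; `pose_template` below is a placeholder callable, not part of the disclosure.

```python
# Illustrative only: place a posed template at each trajectory position.
import numpy as np

def reconstruct_sequence(trajectory, shapes, poses, pose_template):
    meshes = []
    for position, shape, pose in zip(trajectory, shapes, poses):
        vertices = pose_template(shape, pose)   # (N, 3) vertices in object space
        meshes.append(vertices + position)      # translate into the scene
    return meshes
```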
The units included in apparatus 400 may be implemented in a variety of ways, including software, hardware, firmware, or any combination thereof. In some embodiments, one or more units may be implemented using software and/or firmware, such as machine-executable instructions stored on a storage medium. In addition to or in lieu of machine-executable instructions, some or all of the units in apparatus 400 may be at least partially implemented by one or more hardware logic components. By way of example and not limitation, exemplary types of hardware logic components that can be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
Fig. 5 illustrates a block diagram of an electronic device 500 in which one or more embodiments of the disclosure may be implemented. It should be understood that the electronic device 500 shown in fig. 5 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein. The electronic device 500 shown in fig. 5 may be used to implement the remote device 120 and/or the terminal device 110 of fig. 1.
As shown in fig. 5, the electronic device 500 is in the form of a general-purpose electronic device. The components of electronic device 500 may include, but are not limited to, one or more processors or processing units 510, memory 520, storage 530, one or more communication units 540, one or more input devices 550, and one or more output devices 560. The processing unit 510 may be a real or virtual processor and is capable of performing various processes according to programs stored in the memory 520. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capabilities of electronic device 500.
Electronic device 500 typically includes multiple computer storage media. Such media may be any available media that are accessible by electronic device 500, including, but not limited to, volatile and non-volatile media, removable and non-removable media. The memory 520 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or some combination thereof. Storage device 530 may be a removable or non-removable medium and may include machine-readable media such as flash drives, magnetic disks, or any other media that can store information and/or data (e.g., training data) and that can be accessed within electronic device 500.
The electronic device 500 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 5, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces. Memory 520 may include a computer program product 525 having one or more program modules configured to perform the various methods or acts of the various embodiments of the present disclosure.
The communication unit 540 enables communication with other electronic devices through a communication medium. Additionally, the functionality of the components of electronic device 500 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communication connection. Thus, the electronic device 500 may operate in a networked environment using logical connections to one or more other servers, a network Personal Computer (PC), or another network node.
The input device 550 may be one or more input devices such as a mouse, keyboard, trackball, etc. The output device 560 may be one or more output devices such as a display, speakers, printer, etc. Via the communication unit 540, the electronic device 500 may also communicate, as desired, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with the electronic device 500, or with any device (e.g., network card, modem, etc.) that enables the electronic device 500 to communicate with one or more other electronic devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which one or more computer instructions are stored, wherein the one or more computer instructions are executed by a processor to implement the method described above. According to an exemplary implementation of the present disclosure, there is also provided a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions that are executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of implementations of the present disclosure has been provided for illustrative purposes, is not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various implementations described. The terminology used herein was chosen in order to best explain the principles of each implementation, the practical application or technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand each implementation disclosed herein.

Claims (14)

1. A target tracking method, comprising:
extracting a plurality of temporal image feature maps and a plurality of optical flow feature maps among a plurality of frames of a target video;
determining a three-dimensional reference position of at least one object in each of the plurality of frames based on the plurality of temporal image feature maps, the at least one object comprising a target object;
determining at least a plurality of motion bias maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each motion bias map indicating a motion bias of the at least one object in a corresponding frame relative to a previous frame in three-dimensional space; and
determining a motion trajectory of the target object in three-dimensional space based at least on the plurality of motion bias maps by referencing three-dimensional reference positions of the target object in respective ones of the plurality of frames.
2. The method of claim 1, wherein determining the three-dimensional reference position of the at least one object in each of the plurality of frames comprises:
determining, by a three-dimensional rebinning process, a plurality of reference position probability maps and a plurality of reference position offset vector maps for the plurality of frames based on the plurality of temporal image feature maps, each reference position probability map indicating a confidence that each pixel in the corresponding frame belongs to a reference position of the object, each reference position offset vector map indicating an offset of each pixel in the corresponding frame relative to the reference position of the object;
determining candidate reference positions of the at least one object in each of the plurality of frames based on the confidence in the plurality of reference position probability maps; and
determining the three-dimensional reference position of the at least one object in each of the plurality of frames based on the candidate reference position of the at least one object in each of the plurality of frames and the offset corresponding to the candidate reference position.
3. The method of claim 2, wherein determining the plurality of reference position probability maps and the plurality of reference position offset vector maps for the plurality of frames comprises:
determining a plurality of first reference position probability maps and a plurality of first reference position offset vector maps for the plurality of frames at a first view angle based on the plurality of temporal image feature maps;
determining a plurality of second reference position probability maps and a plurality of second reference position offset vector maps for the plurality of frames at a second view angle based on the plurality of temporal image feature maps;
for each frame of the plurality of frames, determining the reference position probability map for the frame by combining the first reference position probability map and the second reference position probability map for the frame; and
for each frame of the plurality of frames, determining the reference position offset vector map for the frame by combining a first reference position offset vector map and a second reference position offset vector map for the frame.
4. The method of claim 3, wherein the first viewing angle comprises a main viewing angle and the second viewing angle comprises a bird's eye view angle.
5. The method of claim 1, wherein determining at least a plurality of motion bias maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps comprises:
determining a plurality of first motion bias maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each first motion bias map indicating a motion bias of the at least one object in a corresponding frame relative to a previous frame under a camera coordinate system; and
wherein the motion trajectory of the target object comprises the motion trajectory of the target object under the camera coordinate system.
6. The method of claim 1, wherein determining at least a plurality of motion bias maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps comprises:
determining a plurality of second motion bias maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each second motion bias map indicating a three-dimensional motion bias of the at least one object in the corresponding frame relative to a previous frame in a world coordinate system; and
determining a plurality of three-dimensional orientation maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each three-dimensional orientation map indicating a three-dimensional orientation of the at least one object in the corresponding frame under the world coordinate system.
7. The method of claim 6, wherein determining a motion profile of the target object in three-dimensional space comprises:
determining a motion trajectory of the target object in the world coordinate system based on the plurality of second motion bias maps and the plurality of three-dimensional orientation maps by referencing a three-dimensional reference position of the target object.
8. The method of claim 1, wherein determining a motion profile of the target object in three-dimensional space comprises:
identifying the target object from the at least one object based on the respective tracking identification of the at least one object;
extracting three-dimensional motion biases of the target object from at least the plurality of motion bias maps, respectively, by referencing the three-dimensional reference positions of the target object in respective ones of the plurality of frames; and
determining a motion trajectory of the target object in three-dimensional space based at least on the extracted three-dimensional motion biases of the target object.
9. The method of claim 8, wherein identifying the target object from the at least one object comprises:
determining an initial reference position of the target object in a reference frame of the plurality of frames based on user selection of the target object in the reference frame, the target object being assigned a target tracking identifier;
determining a three-dimensional motion bias of the at least one object in each frame based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, the three-dimensional motion bias being indicative of a three-dimensional positional change of the at least one object across adjacent frames;
determining respective tracking identifications of objects other than the target object among the at least one object based on a result of matching the three-dimensional reference position, the confidence, and the three-dimensional motion bias of the at least one object in each frame against the initial reference position of the target object; and
determining the target object from the at least one object based on the respective tracking identifications of the at least one object.
10. The method of claim 1, further comprising:
determining a plurality of object morphology feature maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each object morphology feature map indicating a shape and pose of the at least one object in the corresponding frame; and
determining the shape and pose of the target object in the plurality of frames from the plurality of object morphology feature maps based on the three-dimensional reference position of the target object in the plurality of frames.
11. The method of claim 8, further comprising:
performing three-dimensional reconstruction of the target object in the plurality of frames based on the motion trajectory of the target object in three-dimensional space and the shape and pose of the target object in the plurality of frames.
12. An apparatus for target tracking, comprising:
an extraction module configured to extract a plurality of temporal image feature maps and a plurality of optical flow feature maps between a plurality of frames of a target video;
a three-dimensional reference position determination module configured to determine a three-dimensional reference position of at least one object in each of the plurality of frames, the at least one object comprising a target object, based on the plurality of time-domain image feature maps;
a motion bias map determination module configured to determine at least a plurality of motion bias maps for the plurality of frames based on the plurality of temporal image feature maps and the plurality of optical flow feature maps, each motion bias map indicating a motion bias of the at least one object in a corresponding frame relative to a previous frame in three-dimensional space; and
a motion trajectory determination module configured to determine a motion trajectory of the target object in three-dimensional space based at least on the plurality of motion bias maps by referencing three-dimensional reference positions of the target object in respective ones of the plurality of frames.
13. An electronic device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, which when executed by the at least one processing unit, cause the electronic device to perform the method of any one of claims 1 to 11.
14. A computer readable storage medium having stored thereon a computer program executable by a processor to implement the method of any of claims 1 to 11.
CN202310967443.7A 2023-08-02 2023-08-02 Target tracking method, device, equipment and storage medium Pending CN117152204A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310967443.7A CN117152204A (en) 2023-08-02 2023-08-02 Target tracking method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310967443.7A CN117152204A (en) 2023-08-02 2023-08-02 Target tracking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117152204A true CN117152204A (en) 2023-12-01

Family

ID=88883244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310967443.7A Pending CN117152204A (en) 2023-08-02 2023-08-02 Target tracking method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117152204A (en)

Similar Documents

Publication Publication Date Title
Zheng et al. HybridFusion: Real-time performance capture using a single depth sensor and sparse IMUs
CN110555901B (en) Method, device, equipment and storage medium for positioning and mapping dynamic and static scenes
Weinzaepfel et al. Learning to detect motion boundaries
Del Rincón et al. Tracking human position and lower body parts using Kalman and particle filters constrained by human biomechanics
CN109410316B (en) Method for three-dimensional reconstruction of object, tracking method, related device and storage medium
US20110129118A1 (en) Systems and methods for tracking natural planar shapes for augmented reality applications
Jellal et al. LS-ELAS: Line segment based efficient large scale stereo matching
Puwein et al. Joint camera pose estimation and 3d human pose estimation in a multi-camera setup
CN111445526A (en) Estimation method and estimation device for pose between image frames and storage medium
EP3345123B1 (en) Fast and robust identification of extremities of an object within a scene
Elhayek et al. Fully automatic multi-person human motion capture for vr applications
Chen et al. A particle filtering framework for joint video tracking and pose estimation
Chen et al. Markerless monocular motion capture using image features and physical constraints
CN105809664B (en) Method and device for generating three-dimensional image
White et al. An iterative pose estimation algorithm based on epipolar geometry with application to multi-target tracking
Chang et al. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation
US20170069096A1 (en) Simultaneous localization and mapping initialization
Migniot et al. 3d human tracking in a top view using depth information recorded by the xtion pro-live camera
Singh et al. Fusing semantics and motion state detection for robust visual SLAM
Elqursh et al. Online motion segmentation using dynamic label propagation
Otberdout et al. Hand pose estimation based on deep learning depth map for hand gesture recognition
CN113610967B (en) Three-dimensional point detection method, three-dimensional point detection device, electronic equipment and storage medium
JP2023065296A (en) Planar surface detection apparatus and method
Truong et al. Single object tracking using particle filter framework and saliency-based weighted color histogram
CN117152204A (en) Target tracking method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination