WO2024200575A1 - Object tracking circuitry and object tracking method - Google Patents

Object tracking circuitry and object tracking method

Info

Publication number
WO2024200575A1
Authority
WO
WIPO (PCT)
Prior art keywords
time
data
flight
object tracking
image data
Prior art date
Application number
PCT/EP2024/058367
Other languages
French (fr)
Inventor
Marc Osswald
Original Assignee
Sony Semiconductor Solutions Corporation
Sony Advanced Visual Sensing Ag
Priority date
Filing date
Publication date
Application filed by Sony Semiconductor Solutions Corporation, Sony Advanced Visual Sensing Ag filed Critical Sony Semiconductor Solutions Corporation
Publication of WO2024200575A1 publication Critical patent/WO2024200575A1/en

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/88Lidar systems specially adapted for specific applications
    • G01S17/89Lidar systems specially adapted for specific applications for mapping or imaging
    • G01S17/8943D imaging with simultaneous measurement of time-of-flight at a 2D array of receiver pixels, e.g. time-of-flight cameras or flash lidar
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/86Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality

Definitions

  • the present disclosure generally pertains to object tracking circuitry and to an object tracking method.
  • a global shutter (GS) camera may obtain multiple consecutive images and a movement of an object may be determined.
  • GS global shutter
  • Time-of-flight sensing is generally known.
  • Time-of-flight cameras may determine a distance to an object based on a roundtrip delay of emitted light or based on a deterioration of emitted light.
  • event sensing may be generally known. In event sensing, a change of light conditions may be detected, thereby triggering an event.
  • the disclosure provides object tracking circuitry configured to: obtain time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtain image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generate object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
  • the disclosure provides an object tracking method comprising: obtaining time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtaining image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generating object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
  • Fig. 1 depicts a camera according to the present disclosure
  • Fig. 2 depicts an object tracking method, as it is generally known
  • Fig. 3 depicts a method for object tracking according to the present disclosure, wherein a constant frame rate is applied for ToF measurements;
  • Fig. 4 depicts how the present disclosure avoids a 2D-3D data correspondence problem which is present in the prior art
  • Fig. 5 depicts an embodiment for implicitly generating object data according to the present disclosure
  • Fig. 6 depicts a further embodiment of an object tracking method according to the present disclosure, wherein a varying frame rate is applied for ToF measurements;
  • Fig. 7 depicts, on the left, a conventional ToF illuminator, and on the right, a ToF illuminator according to the present disclosure
  • Fig. 8 depicts an AR/VR system according to the present disclosure
  • Fig. 9a depicts an embodiment in which an HMD tracks whether a hand of a user interacts with a virtual object by illuminating the hand;
  • Fig. 9b depicts an embodiment in which an HMD tracks whether a hand of a user interacts with a virtual object by illuminating the virtual object;
  • Fig. 10 depicts an embodiment of an object tracking method according to the present disclosure in a block diagram
  • Fig. 11 depicts a further embodiment of an object tracking method according to the present disclosure in a block diagram, in which a correspondence between ToF data and image data is determined; and Fig. 12 depicts a further embodiment of an object tracking method according to the present disclosure in a block diagram in which it is decided, based on image data, whether ToF data should be obtained.
  • object tracking is generally known.
  • existing methods lead to imprecise object tracking, particularly when high-speed tracking is desired.
  • 3D data e.g., depth data
  • some embodiments pertain to object tracking circuitry configured to: obtain time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtain image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generate object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
  • Circuitry may include any entity or multitude of entities configured to carry out functions and/or methods, as described herein, such as a processor (e.g., a CPU (central processing unit), a GPU (graphics processing unit), or the like), an FPGA (field-programmable gate array), any type of microprocessor, microcomputer, microarray, or the like. Also, different ones of the elements mentioned above may be combined for realizing circuitry according to the present disclosure.
  • although the circuitry is referred to as “object tracking circuitry”, embodiments of the present disclosure also pertain to object detection circuitry, or the like.
  • some embodiments may pertain to a system, a camera, a computer, or the like, which includes or is connectable (or connected) to circuitry according to the present disclosure.
  • the object tracking circuitry is configured to obtain time-of-flight data.
  • any type of time-of-flight (ToF) technology may be used in order to carry out the embodiments described herein, such as direct time-of-flight (dToF), indirect time-of-flight (iToF), spot time-of-flight, or the like.
  • the ToF data may be obtained, whereby the object may be indicated in the ToF data.
  • the object may be indicated based on an object tracking algorithm, based on a measured depth, based on a shape, or the like.
  • the object tracking circuitry is configured to obtain image data.
  • the image data may be generated based on any imaging technology, e.g., based on CCD (charge coupled device) technology, CMOS (complementary metal oxide semiconductor) technology, or the like.
  • the image data may indicate the object based on any imaging method, as well, e.g., based on color imaging, black and white imaging, based on dynamic vision imaging, event sensing, or the like.
  • Dynamic vision or event sensing imaging may refer to an embodiment in which events may be generated in response to generated charges and the events may be determined (instead of e.g., colors). For example, an event may be determined when there is an intensity change (e.g., determined based on a change in a photocurrent) detected, which is above a predetermined threshold.
  • an intensity change e.g., determined based on a change in a photocurrent
  • a pixel of an event sensor (EVS) or dynamic vision sensor (DVS) may not trigger as long as the photocurrent remains within a predetermined delta, and may only trigger when this delta is exceeded.
  • EVS event sensor
  • DVS dynamic vision sensor
  • such a pixel may asynchronously generate its events.
  • signals may be divided into frames (without limiting the present disclosure in that regard).
  • the image data may indicate, apart from the object, at least one light spot originating from light emitted for the ToF measurement.
  • the “spot” may be realized differently and the present disclosure should not be understood as limiting in that way.
  • a change of the light spot(s) may be detected, thereby detecting the object and/or movement of the object.
  • object data may be generated based on an association of the ToF data and the image data, wherein the light spot may be taken into account, as well, thereby generating object data.
  • the image data include event sensing data, as discussed herein.
  • the association is carried out for each time-of-flight measurement of a plurality of time-of-flight measurements, or for a subset (e.g., every second measurement, or a random number of measurements) of ToF measurements.
  • a frame rate of the image data may be higher than a frame rate of the ToF data and thus, the “correct” image acquisition may be determined for the respective ToF measurement.
  • the association may be carried out based on the at least one light spot such that the assignment of the correct image data to the respective ToF data may be carried out based on the at least one light spot, as well.
  • the association is an association in time, without limiting the present disclosure in that regard since a spatial association may be envisaged alternatively or additionally.
  • the time-of-flight data have a first capture rate (e.g., frame rate), wherein the image data have a second capture rate (e.g., frame rate), and wherein the second capture rate is greater than the first capture rate, and the association is further based on the first and the second capture rate.
  • first capture rate e.g., frame rate
  • second capture rate e.g., frame rate
  • a capture rate corresponds to a frame rate. Since, in some embodiments, asynchronous EVS data may be obtained, the capture rate may correspond to the rate at which the events are generated, which may be different for each pixel in an EVS. On the other hand, the ToF data may have a fixed (or varying) frame rate.
  • the association may be carried out.
  • the object tracking circuitry is further configured to: determine a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data.
  • the determination of the correspondence may correspond to a “live calibration” of a corresponding (camera) system and may be based on the at least one light spot. Since a position of the at least one light spot may be known based on the ToF data (or based on a position of an illuminator with respect to a ToF sensor), a corresponding spot coordinate (for the at least one spot) may be determined for the image data.
  • a coordinate transformation between the two data types may be determined.
  • the association is further based on an emission frequency of the light emitted for the time-of-flight measurement.
  • the emission frequency of a light source may be connected to the frame rate or sampling rate of the ToF data, as commonly known.
  • the emission frequency may also be taken into account for the image data for determining the association, in some embodiments.
  • the association is (further) based on a neural network, as will be discussed below.
  • the object data are used for hand pose estimation.
  • the object tracking circuitry is further configured to: carry out a time-of-flight measurement when it is indicated, in the image data, that a predetermined condition is fulfilled.
  • coarse object data may be determined based on the image data. However, when it is determined, based on the coarse object data, that a finer object detection (or tracking) is necessary, the ToF measurement(s) may start.
  • Some embodiments pertain to an object tracking method including: obtaining time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtaining image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generating object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
  • the object tracking method may be carried out with object tracking circuitry according to the present disclosure or with an object tracking system (e.g., camera).
  • object tracking system e.g., camera
  • some embodiments pertain to an object tracking system, a camera, a computer, or the like, configured to carry out the methods described herein and/or including circuitry, as described herein.
  • the image data include event sensing data, as discussed herein.
  • the association is carried out for each time-of-flight measurement of a plurality of time-of-flight measurements, as discussed herein.
  • the association is an association in time, as discussed herein.
  • the time-of-flight data have a first capture rate, wherein the image data have a second capture rate, and wherein the second capture rate is greater than the first capture rate, and wherein the association is further based on the first and the second capture rate, as discussed herein.
  • the method further includes: determining a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data, as discussed herein.
  • the association is further based on an emission frequency of the light emitted for the time-of-flight measurement, as discussed herein.
  • the association is further based on a neural network, as discussed herein.
  • the object data are used for hand pose estimation, as discussed herein.
  • the method further includes: carrying out a time-of-flight measurement when it is indicated, in the image data, that a predetermined condition is fulfilled, as discussed herein.
  • the methods as described herein are also implemented in some embodiments as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor.
  • a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the methods described herein to be performed.
  • a camera 1 including a dToF unit 2 and an event sensing unit (EVS) 3.
  • EVS event sensing unit
  • the dToF unit 2 includes a transmitter Tx and a receiver Rx.
  • the transmitter Tx emits light spots which are incident on a hand (as object 4), are reflected from the object 4, and are received by the receiver Rx. In response to the reception of the reflection of the spots, ToF data are generated.
  • the transmitter Tx uses infrared (IR) light and the receiver Rx includes a band-pass filter to keep such an interference as small as possible.
  • the EVS unit 3 does not include such a filter, such that the ambient light and the IR light are observed by the EVS unit 3.
  • the ambient light, as well as the reflection of the light spots are captured by the event sensing unit 3, which triggers an event in response to a change in the scene, thereby generating event sensing data.
  • the ToF data and the event sensing data are obtained by object tracking circuitry 5, which, in this embodiment, is a hand pose estimation (HPE) circuitry (a processor, in this embodiment), further configured to generate object data, as discussed herein.
  • HPE hand pose estimation
  • the illuminator (transmitter Tx) is controlled.
  • a hand pose is estimated with a high update rate and low latency.
  • the dToF 2 has a sparse output and hence, a sparse projector (transmitter Tx) is used, but the present disclosure is not limited in that regard.
  • the projector is dynamically configurable, i.e., the projected dots may be changed according to the circumstances, e.g., based on the feedback loop. Multiple exposures are used (spatially and temporally separated) to provide a full (or partial, in some embodiments) scan of the field of view.
  • the EVS 3 is sensitive to passive light and to active light from the transmitter Tx (which, in this embodiment, is in an infrared range).
  • a distance b (referring to the baseline between the projector Tx and the EVS) is very small. Thus, it can be estimated where to expect the projected dots in the EVS frame. If there is a significant baseline (b > 0 or b >> 0), the projection of the dots in the EVS frame also depends on the depth.
  • the depth is known from the ToF measurement, such that it is possible to determine where the dot is expected to appear in the EVS frame even in the case of an existing baseline (b > 0 or b >> 0).
  • the projected dot in the EVS frame is found (either using the depth value from the ToF and the reprojection equations or in some cases, this is not even necessary, and it can be found because it is the only dot that appears in the EVS frame at a given time). Then, the coordinates of the projected dot in the EVS frame can be used, and, with a known baseline and calibration matrices of the EVS and projector, it may be possible to triangulate the ray of the projected dot and the ray of the laser beam (transmitter beam), such that the depth can be calculated.
  • structured light can be used as an additional depth measurement in case the ToF is not reliable or inaccurate (e.g., if the depth is very low) or to further refine the depth.
  • “calibration” and “structured-light measurement” are carried out at the same time assuming partial calibration and depth information are known.
  • the epipolar line of the laser beam may be known, i.e., the dot projection in the EVS frame lies somewhere on this line.
  • the exact position of the dot in the EVS frame may depend on the actual depth of the scene at this point. If there are multiple dots, there might be many candidates on that line (known as the stereo correspondence problem). With additional information, the right dot correspondence may be found. This additional information is the original depth measurement of the dot (known from the ToF measurement), in some embodiments. Thereby, a good initial estimate can be determined of where the projected dot in the EVS frame should be. If the ToF depth measurement is perfect, the prediction is perfect, and thus, the predicted spot corresponds to the real spot in the EVS frame.
  • the prediction will be at a different point, but likely close, such that it can also be found.
  • the detected position may be used to compute depth from structured-light and correct the ToF measurement.
  • This information may be used, for example, in probabilistic filtering such as Kalman filtering; alternatively, this could be formulated as a typical maximum likelihood estimation (MLE) problem in which some parameters, such as calibration parameters but also depth, are jointly estimated based on observed values and known priors, without limiting the present disclosure in that regard.
  • MLE maximum likelihood estimation problem
  • the additional information used herein corresponds to the projected dots in the EVS frame, in some embodiments, which can be used either to refine the calibration (live calibration), for ToF depth measurements, or both.
  • the case b ≈ 0 is a simplified version of the general case. With no (or negligibly small) baseline, there is no need for triangulation or the use of structured light, in some embodiments.
  • the projected dots in the EVS frame do not (or only negligibly) depend on the depth, such that in this case, “only” live-calibration may be carried out since the depth may not be determinable based on the dots in the EVS frame.
  • Sub-depth maps (obtained based on sub-exposures) of the dToF 2 can be directly processed by the HPE 5 without waiting for the full depth scan to be completed. This is possible since high speed data are provided by the EVS 3 which provides a two-dimensional projection from the sub-exposure locations and can be used to put the depth measurements into the right context of the hand 4.
  • live auto-calibration can be achieved according to the embodiments of the present disclosure.
  • the EVS data and the dToF data may be automatically and continuously registered at the highest possible precision without the need of an extrinsic calibration model (since they are oriented based on the dots).
  • power-efficient light projection may be achieved. If the projector pattern is dynamically controlled, the output of the HPE 5 and/or the EVS data may be used in the (fast) feedback loop 6 to determine which region should be illuminated.
  • Fig. 2 depicts a method 10 for object tracking, as it is generally known.
  • a hand pose output is generated at thirty frames per second (30 fps) (which corresponds to a frame rate of the ToF camera).
  • the estimated hand pose output waits for both outputs (of the ToF and the GS camera).
  • motion blur may deteriorate the GS acquisition, such that the issue described above may be even more severe.
  • Fig. 3 depicts a method 20 for object tracking according to the present disclosure.
  • the ToF acquisitions are associated with the correct EVS acquisitions, thereby determining a correspondence.
  • no 2D-3D data correspondence problem may be present, in this embodiment, as discussed under reference of Fig. 4.
  • in Fig. 4, it is shown that, if a hand moves down quickly, each ToF sub-acquisition determines a different location of the hand, but in the EVS acquisitions, the movement of the hand can be determined in more detail, such that the hand can be tracked, as shown on the bottom of Fig. 4 depicting the light spots along which the hand moved.
  • Fig. 5 depicts an embodiment for implicitly generating the object data according to the present disclosure.
  • a neural network 30 is configured to estimate a hand pose, i.e., to function as an HPE, as discussed under reference of Fig. 1.
  • the neural network 30 includes a task head 31 which takes a state as an input and predicts an output, based on a state output by a respective predictor of the predictors 35 to 38 (as discussed in the following). Moreover, the neural network 30 includes a depth and event predictor 35, two event predictors 36 and 37, as well as a further depth and event predictor 38, wherein the depth and event predictors 35 and 38 may be the same and the two event predictors 36 and 37 may also be the same, in some embodiments; only their inputs may be different since the input may correspond to an output of a preceding predictor.
  • the predictors 35 and 38 are configured to obtain sparse depth data and event data (sparse depth + events), whereas the predictors 36 and 37 are configured to obtain (only) event data.
  • a current state z_k is fed into the predictor 35 and into the task head 31.
  • a state z_k+1 is fed into the predictor 36 and the task head 31, and so on.
  • the predictors take as input a current state as well as measurement input, and output an updated state (e.g., in view of z_k, z_k+1 is an updated state).
  • the task head takes the respective state as input and computes a high-frequency output (the 3D hand pose).
  • a state is a collection (at least one) of features or feature embeddings. These features can be predicted by both predictor types, but they may be more accurate when predicted by the predictors 35 or 38.
  • a feature may, for example, include a “finger edge”, a “hand corner”, a “thumb”, or the like, if a hand is supposed to be tracked. More generally, a feature may correspond to information of the object to be tracked or detected and the feature may be learned by the neural network.
  • the neural network is a convolutional neural network, but the present disclosure is not limited to any type of neural network.
  • the inputs to the predictors and to the task heads may be formatted in any way.
  • they are formatted as fixed size tensors, such that events are drawn into a fixed size space-time histogram for concatenation with other data, thereby being suitable for the used CNN.
  • inputs may be partitioned as tokens in case of a spatio-temporal transformer network.
  • a neural network may also be used for (implicitly) carrying out a live auto-calibration, as described herein.
  • a coordinate transformation may be used for projecting ToF coordinates into the EVS coordinate system, or vice versa.
  • the following formula may be used (a standard pinhole reprojection): λ · (x′, y′, 1)ᵀ = K_EVS · (R · K_TOF⁻¹ · (u·z, v·z, z)ᵀ + t).
  • x′ and y′ are predictions in EVS image coordinates of the ToF measurement (u, v, z), based on which the “real” x- and y-image coordinates are determined; u and v are coordinates of the ToF dot in the ToF image.
  • λ is a scale factor which corresponds to the measured z-coordinate in the ToF system; x and y are observed image coordinates of the active dot (corresponding to (u, v)) in the EVS image.
  • the EVS has its optical center at the Cartesian coordinates (0, 0, 0).
  • the ToF system has a translation t and rotation R (R, t are the extrinsics) with respect to the EVS camera.
  • the matrix K_EVS corresponds to a 3×3 intrinsic calibration matrix. It projects EVS camera coordinates into EVS image coordinates.
  • the matrix K_TOF⁻¹ is the inverse of K_TOF, which is a 3×3 intrinsic calibration matrix that projects ToF system coordinates into ToF image coordinates.
  • the dToF dot (u, v) is projected into an EVS frame (x’, y’), the real image location (x, y) (in the EVS frame) is determined, and all events generated by the dot itself are filtered.
  • the EVS image is augmented with a depth value, providing a depth map (x, y, z) which is used in addition to the filtered EVS data for further processing.
  • depth may be further refined. If a good calibration between dToF Tx and EVS is already present, depth may be triangulated based on structured light (STL) triangulation by finding temporal correspondences.
  • STL structured light
  • STL depth may be computed when the ToF measurement is unreliable (e.g., too close to the object, multi-path interference (MPI), or the like). STL may be carried out explicitly (e.g., triangulation) or implicitly (e.g., neural network).
  • MPI multi-path interference
  • Backprojection (with a known but not necessarily precise calibration) to (x′, y′) (as described above) may be used to resolve STL ambiguity and find the real (x, y)-image coordinates in case there are multiple correspondences on a same epipolar line. Thereby, unique correspondences may be found which may not be determinable only based on temporal information.
  • an MLE (maximum likelihood estimation) method may be used to estimate both calibration parameters and depth by maximizing a likelihood of the calibration parameters and the depth based on an estimate/observation of the real location and based on a probabilistic model of the ToF depth measurement and the calibration parameters (e.g., based on the original factory calibration).
  • Fig. 6 depicts a further embodiment of an object tracking method 40 according to the present disclosure. In contrast to the embodiment of Fig. 3, in this embodiment, the frame rate is not constant, but varies.
  • when it is determined that the user interacts (e.g., with a virtual object), a higher frame rate is used, whereas, when it is determined that the user does not interact, a lower frame rate, in a “low precision, low power mode”, is used.
  • Fig. 7 depicts, on the left, a conventional ToF illuminator 50 in which a scene is illuminated in a predetermined pattern (i.e., each cell of a grid is illuminated one after another). Hence, energy is allocated statically in time and space.
  • on the right, Fig. 7 depicts a ToF illuminator 55 according to the present disclosure in which the cells of the grid are illuminated based on the events detected in an EVS. Thereby, energy is allocated dynamically according to scene conditions.
  • Fig. 8 depicts an embodiment of an AR/VR (augmented reality/virtual reality) system 60 used as a head-mounted device.
  • the system 60 includes object tracking circuitry according to the present disclosure, a ToF unit (shown as Tx and Rx, as discussed above), and an EVS unit.
  • the EVS unit includes two input regions 61 (two EVS sensors), which each have a distance “b” with respect to the ToF Tx (wherein, in some embodiments, the distances of the two input regions may differ).
  • two monocular tracking regions (right monocular 2.5D tracking region 62 and left monocular 2.5D tracking region 63) are defined.
  • based on the left and right tracking regions 62 and 63, improved tracking of a left hand and a right hand is possible.
  • a center 3D tracking region 64 is defined due to the ToF unit, thereby providing for precise and fast 3D (hand) tracking, which may be required for immersive virtual object interaction, such as virtual keyboard, or the like.
  • stereo acquisition may be envisaged.
  • Fig. 9a depicts an embodiment in which a head-mounted device (HMD) 70 tracks and illuminates a hand 71 of a user wearing the HMD 70.
  • HMD head-mounted device
  • ToF acquisition is omitted, as shown on the left of Fig. 9a.
  • Fig. 9b depicts the case that, when it is recognized, based on the EVS data, that the hand 71 comes close to the virtual object 72, the area of the virtual object 72 is actively illuminated.
  • both areas may be illuminated, i.e., the embodiments of Figs. 9a and 9b may be combined.
  • Fig. 10 depicts an object tracking method 80 according to the present disclosure in a block diagram.
  • ToF data is obtained, as discussed herein, i.e., a scene is illuminated with light spots/dots and a ToF measurement is carried out.
  • image data are obtained.
  • the image data are event sensing data, as discussed herein.
  • object data are generated with a neural network, as discussed herein.
  • Fig. 11 depicts, in a block diagram, a further embodiment of an object tracking method 90 according to the present disclosure, which is different than the object tracking method 80 of Fig. 10 in that, after the obtaining of image data, at 93 a correspondence between the ToF data and the image data is determined, based on a coordinate transformation, as discussed herein.
  • Fig. 12 depicts a further embodiment of an object tracking method according to the present disclosure in a block diagram. In this embodiment, it is decided, based on the image data, whether ToF data should be obtained.
  • image data are obtained, as discussed herein, as event sensing data.
  • based on the image data it is determined whether there is an indication for a ToF measurement, i.e., it is determined whether a tracked object is within a predetermined distance to a virtual object.
  • object data are generated, as discussed herein.
  • the embodiments describe methods with an exemplary ordering of method steps.
  • the specific ordering of method steps is however given for illustrative purposes only and should not be construed as binding.
  • the ordering of 81 and 82 in the embodiment of Fig. 10 may be exchanged.
  • the ordering of 91 and 92, as well as 93 and 94 in the embodiment of Fig. 11 may be exchanged.
  • the methods described herein can also be implemented as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor.
  • a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the method described to be performed.
  • Object tracking circuitry configured to: obtain time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtain image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generate object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
  • the object tracking circuitry of any one of (1) to (5), further configured to: determine a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data.
  • An object tracking method comprising: obtaining time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtaining image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generating object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
  • a computer program comprising program code causing a computer to perform the method according to any one of (11) to (20), when being carried out on a computer.
  • (22) A non-transitory computer-readable recording medium that stores therein a computer program product, which, when executed by a processor, causes the method according to any one of (11) to (20) to be performed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Electromagnetism (AREA)
  • Human Computer Interaction (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The present disclosure generally pertains to object tracking circuitry configured to: obtain time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtain image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generate object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.

Description

OBJECT TRACKING CIRCUITRY AND OBJECT TRACKING METHOD
TECHNICAL FIELD
The present disclosure generally pertains to object tracking circuitry and to an object tracking method.
TECHNICAL BACKGROUND
Generally, it is known to track objects, e.g., based on camera data.
For example, a global shutter (GS) camera may obtain multiple consecutive images and a movement of an object may be determined.
Moreover, time-of-flight sensing is generally known. Time-of-flight cameras may determine a distance to an object based on a roundtrip delay of emitted light or based on a deterioration of emitted light.
Also, event sensing may be generally known. In event sensing, a change of light conditions may be detected, thereby triggering an event.
Although there exist techniques for tracking objects, it is generally desirable to provide object tracking circuitry and an object tracking method.
SUMMARY
According to a first aspect, the disclosure provides object tracking circuitry configured to: obtain time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtain image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generate object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
According to a second aspect, the disclosure provides an object tracking method comprising: obtaining time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtaining image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generating object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
Further aspects are set forth in the dependent claims, the drawings, and the following description.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments are explained by way of example with respect to the accompanying drawings, in which:
Fig. 1 depicts a camera according to the present disclosure;
Fig. 2 depicts an object tracking method, as it is generally known;
Fig. 3 depicts a method for object tracking according to the present disclosure, wherein a constant frame rate is applied for ToF measurements;
Fig. 4 depicts how the present disclosure avoids a 2D-3D data correspondence problem which is present in the prior art;
Fig. 5 depicts an embodiment for implicitly generating object data according to the present disclosure;
Fig. 6 depicts a further embodiment of an object tracking method according to the present disclosure, wherein a varying frame rate is applied for ToF measurements;
Fig. 7 depicts, on the left, a conventional ToF illuminator, and on the right, a ToF illuminator according to the present disclosure;
Fig. 8 depicts an AR/VR system according to the present disclosure;
Fig. 9a depicts an embodiment in which an HMD tracks whether a hand of a user interacts with a virtual object by illuminating the hand;
Fig. 9b depicts an embodiment in which an HMD tracks whether a hand of a user interacts with a virtual object by illuminating the virtual object;
Fig. 10 depicts an embodiment of an object tracking method according to the present disclosure in a block diagram;
Fig. 11 depicts a further embodiment of an object tracking method according to the present disclosure in a block diagram, in which a correspondence between ToF data and image data is determined; and Fig. 12 depicts a further embodiment of an object tracking method according to the present disclosure in a block diagram in which it is decided, based on image data, whether ToF data should be obtained.
DETAILED DESCRIPTION OF EMBODIMENTS
Before a detailed description of the embodiments starting with Fig. 1 is given, general explanations are made.
As mentioned in the outset, object tracking is generally known. However, it has been recognized that existing methods lead to imprecise object tracking, particularly when high-speed tracking is desired. Moreover, it has been recognized that in existing methods, correspondence between 2D data and 3D data (e.g., depth data) may not be present.
Therefore, some embodiments pertain to object tracking circuitry configured to: obtain time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtain image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generate object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data. Circuitry may include any entity or multitude of entities configured to carry out functions and/or methods, as described herein, such as a processor (e.g., a CPU (central processing unit), a GPU (graphics processing unit), or the like), an FPGA (field-programmable gate array), any type of microprocessor, microcomputer, microarray, or the like. Also, different ones of the elements mentioned above may be combined for realizing circuitry according to the present disclosure.
Although the circuitry is referred to as “object tracking circuitry”, embodiments of the present disclosure also pertain to object detection circuitry, or the like.
Also, some embodiments may pertain to a system, a camera, a computer, or the like, which includes or is connectable (or connected) to circuitry according to the present disclosure.
In some embodiments, the object tracking circuitry is configured to obtain time-of-flight data. Generally, according to the present disclosure, any type of time-of-flight (ToF) technology may be used in order to carry out the embodiments described herein, such as direct time-of-flight (dToF), indirect time-of-flight (iToF), spot time-of-flight, or the like.
Hence, based on a (one or multiple) ToF measurement, the ToF data may be obtained, whereby the object may be indicated in the ToF data. For example, the object may be indicated based on an object tracking algorithm, based on a measured depth, based on a shape, or the like. In some embodiments, the object tracking circuitry is configured to obtain image data. The image data may be generated based on any imaging technology, e.g., based on CCD (charge coupled device) technology, CMOS (complementary metal oxide semiconductor) technology, or the like. The image data may indicate the object based on any imaging method, as well, e.g., based on color imaging, black and white imaging, based on dynamic vision imaging, event sensing, or the like.
Dynamic vision or event sensing imaging may refer to an embodiment in which events may be generated in response to generated charges and the events may be determined (instead of e.g., colors). For example, an event may be determined when there is an intensity change (e.g., determined based on a change in a photocurrent) detected, which is above a predetermined threshold.
Hence, a pixel of an event sensor (EVS) or dynamic vision sensor (DVS) may not trigger as long as the photocurrent remains within a predetermined delta, and may only trigger when this delta is exceeded. Thus, such a pixel may asynchronously generate its events. However, still, such signals may be divided into frames (without limiting the present disclosure in that regard).
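As an illustration of this event-generation principle, the following sketch simulates how an EVS/DVS pixel could emit events from sampled photocurrents; the contrast threshold, the array layout, and the function name are illustrative assumptions and are not taken from the present disclosure.

```python
import numpy as np

def generate_events(frames, timestamps, contrast_threshold=0.15):
    """Emit (t, y, x, polarity) events whenever the log photocurrent of a pixel
    changes by more than `contrast_threshold` since that pixel last triggered.

    `frames` is a (T, H, W) array of photocurrent samples; this is only an
    illustrative simulation of the EVS/DVS behaviour described above.
    """
    log_i = np.log(frames.astype(np.float64) + 1e-6)
    reference = log_i[0].copy()          # per-pixel level at the last event
    events = []
    for t in range(1, len(frames)):
        delta = log_i[t] - reference
        triggered = np.abs(delta) >= contrast_threshold
        ys, xs = np.nonzero(triggered)
        for y, x in zip(ys, xs):
            polarity = 1 if delta[y, x] > 0 else -1
            events.append((timestamps[t], y, x, polarity))
            reference[y, x] = log_i[t, y, x]   # reset reference for this pixel
    return events
```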
The image data may indicate, apart from the object, at least one light spot originating from light emitted for the ToF measurement.
Depending on the used ToF technology (and/or a used illuminator), the “spot” may be realized differently and the present disclosure should not be understood as limiting in that way.
Based on the image data, a change of the light spot(s) may be detected, thereby detecting the object and/or movement of the object.
Hence, object data may be generated based on an association of the ToF data and the image data, wherein the light spot may be taken into account, as well, thereby generating object data.
In some embodiments, the image data include event sensing data, as discussed herein.
In some embodiments, the association is carried out for each time-of-flight measurement of a plurality of time-of-flight measurements, or for a subset (e.g., every second measurement, or a random number of measurements) of ToF measurements.
For example, a frame rate of the image data may be higher than a frame rate of the ToF data and thus, the “correct” image acquisition may be determined for the respective ToF measurement. As indicated above, the association may be carried out based on the at least one light spot such that the assignment of the correct image data to the respective ToF data may be carried out based on the at least one light spot, as well. In some embodiments, the association is an association in time, without limiting the present disclosure in that regard since a spatial association may be envisaged alternatively or additionally.
In some embodiments, the time-of-flight data have a first capture rate (e.g., frame rate), wherein the image data have a second capture rate (e.g., frame rate), and wherein the second capture rate is greater than the first capture rate, and the association is further based on the first and the second capture rate.
However, the present disclosure is not limited to the case that a capture rate corresponds to a frame rate. Since, in some embodiments, asynchronous EVS data may be obtained, the capture rate may correspond to the rate at which the events are generated, which may be different for each pixel in an EVS. On the other hand, the ToF data may have a fixed (or varying) frame rate.
Thus, based on the respective capture rates, the association may be carried out.
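The association in time can be sketched as follows, assuming timestamped ToF frames and an asynchronous, time-sorted event stream; the data layout and function names are assumptions made for illustration only.

```python
def associate_tof_with_events(tof_frames, events, tof_period):
    """Group asynchronous EVS events by the ToF frame whose exposure window they
    fall into. `tof_frames` is a list of (t_start, depth_map) tuples, `events`
    a time-sorted list of (t, y, x, polarity) tuples, and `tof_period` the ToF
    frame interval; all names here are illustrative assumptions.
    """
    associations = {i: [] for i in range(len(tof_frames))}
    frame_idx = 0
    for t, y, x, p in events:
        # Advance to the ToF frame whose window [t_start, t_start + period) contains t.
        while (frame_idx + 1 < len(tof_frames)
               and t >= tof_frames[frame_idx + 1][0]):
            frame_idx += 1
        t_start = tof_frames[frame_idx][0]
        if t_start <= t < t_start + tof_period:
            associations[frame_idx].append((t, y, x, p))
    return associations
```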
In some embodiments, the object tracking circuitry is further configured to: determine a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data.
The determination of the correspondence may correspond to a “live calibration” of a corresponding (camera) system and may be based on the at least one light spot. Since a position of the at least one light spot may be known based on the ToF data (or based on a position of an illuminator with respect to a ToF sensor), a corresponding spot coordinate (for the at least one spot) may be determined for the image data.
In some embodiments, a coordinate transformation between the two data types may be determined.
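A minimal sketch of such a coordinate transformation is given below, assuming a pinhole model with intrinsic matrices for the ToF sensor and the EVS and extrinsics (R, t) between them (the same quantities used in the reprojection formula elsewhere in this document); the matrix values themselves would come from calibration and are not part of the sketch.

```python
import numpy as np

def tof_to_evs_pixel(u, v, z, K_tof, K_evs, R, t):
    """Project a ToF detection (pixel (u, v) with depth z) into EVS image
    coordinates (x', y') via the pinhole model: back-project to a 3D point in
    ToF camera coordinates, transform with the extrinsics (R, t), and project
    with the EVS intrinsics. A sketch; the matrices are assumed to be calibrated.
    """
    p_tof = z * np.linalg.inv(K_tof) @ np.array([u, v, 1.0])   # 3D point, ToF frame
    p_evs = R @ p_tof + t                                      # 3D point, EVS frame
    x_h = K_evs @ p_evs                                        # homogeneous image point
    return x_h[0] / x_h[2], x_h[1] / x_h[2]
```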
In some embodiments, the association is further based on an emission frequency of the light emitted for the time-of-flight measurement.
The emission frequency of a light source may be connected to the frame rate or sampling rate of the ToF data, as commonly known. The emission frequency may also be taken into account for the image data for determining the association, in some embodiments.
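One way the emission frequency could be taken into account, sketched here under the assumption of a periodic emission with known phase, is to keep only the events that coincide with emission pulses; the window width, data layout, and function name are illustrative.

```python
def events_near_emissions(events, emission_frequency_hz, first_pulse_t, window=100e-6):
    """Keep only events whose timestamps lie within +/- `window` seconds of a ToF
    emission pulse, assuming pulses at `emission_frequency_hz` starting at
    `first_pulse_t`. This is an illustrative way to separate active-spot events
    from ambient-light events using the emission timing.
    """
    period = 1.0 / emission_frequency_hz
    selected = []
    for t, y, x, p in events:
        # Distance of this event to the nearest emission pulse.
        phase = (t - first_pulse_t) % period
        distance = min(phase, period - phase)
        if distance <= window:
            selected.append((t, y, x, p))
    return selected
```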
In some embodiments, the association is (further) based on a neural network, as will be discussed below.
In some embodiments, the object data are used for hand pose estimation. In some embodiments, the object tracking circuitry is further configured to: carry out a time-of-flight measurement when it is indicated, in the image data, that a predetermined condition is fulfilled.
For example, coarse object data may be determined based on the image data. However, when it is determined, based on the coarse object data, that a finer object detection (or tracking) is necessary, the ToF measurement(s) may start.
This way, energy may be saved since the ToF measurement may only be carried out when necessary.
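A sketch of such a gated acquisition is given below, assuming the coarse 2D estimate is a simple centroid of recent events and that a callback triggers the actual ToF measurement; both the centroid heuristic and the callback name are assumptions made for illustration.

```python
def maybe_run_tof(events, virtual_object_xy, trigger_radius_px, run_tof_measurement):
    """Trigger a ToF measurement only when the coarse 2D estimate derived from the
    event data comes within `trigger_radius_px` of the virtual object. The
    centroid-based estimate and the callback name are illustrative assumptions.
    """
    if not events:
        return None
    # Coarse object position: centroid of recent event coordinates.
    xs = [x for _, _, x, _ in events]
    ys = [y for _, y, _, _ in events]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    dx, dy = cx - virtual_object_xy[0], cy - virtual_object_xy[1]
    if (dx * dx + dy * dy) ** 0.5 <= trigger_radius_px:
        return run_tof_measurement()   # fine 3D tracking only when needed
    return None                        # stay in the low-power, EVS-only mode
```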
Some embodiments pertain to an object tracking method including: obtaining time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtaining image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generating object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
The object tracking method may be carried out with object tracking circuitry according to the present disclosure or with an object tracking system (e.g., camera). Hence, some embodiments pertain to an object tracking system, a camera, a computer, or the like, configured to carry out the methods described herein and/or including circuitry, as described herein.
In some embodiments, the image data include event sensing data, as discussed herein. In some embodiments, the association is carried out for each time-of-flight measurement of a plurality of time-of-flight measurements, as discussed herein. In some embodiments, the association is an association in time, as discussed herein. In some embodiments, the time-of-flight data have a first capture rate, wherein the image data have a second capture rate, and wherein the second capture rate is greater than the first capture rate, and wherein the association is further based on the first and the second capture rate, as discussed herein.
In some embodiments, the method further includes: determining a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data, as discussed herein. In some embodiments, the association is further based on an emission frequency of the light emitted for the time-of-flight measurement, as discussed herein. In some embodiments, the association is further based on a neural network, as discussed herein. In some embodiments, the object data are used for hand pose estimation, as discussed herein. In some embodiments, the method further includes: carrying out a time-of-flight measurement when it is indicated, in the image data, that a predetermined condition is fulfilled, as discussed herein.
The methods as described herein are also implemented in some embodiments as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor. In some embodiments, also a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the methods described herein to be performed.
Returning to Fig. 1, there is depicted a camera 1 according to the present disclosure including a dToF unit 2 and an event sensing unit (EVS) 3.
The dToF unit 2 includes a transmitter Tx and a receiver Rx. The transmitter Tx emits light spots which are incident on a hand (as object 4), are reflected from the object 4, and are received by the receiver Rx. In response to the reception of the reflection of the spots, ToF data are generated.
Also, ambient light is present (which is symbolized as deriving from the sun, but also room light or any other light source than the transmitter Tx may be considered an ambient light source), which may interfere with the ToF measurement, as generally known. Therefore, in this embodiment, the transmitter Tx uses infrared (IR) light and the receiver Rx includes a band-pass filter to keep such an interference as small as possible. On the other hand, the EVS unit 3 does not include such a filter, such that the ambient light and the IR light are observed by the EVS unit 3.
The ambient light, as well as the reflection of the light spots are captured by the event sensing unit 3, which triggers an event in response to a change in the scene, thereby generating event sensing data.
The ToF data and the event sensing data are obtained by object tracking circuitry 5, which, in this embodiment, is a hand pose estimation (HPE) circuitry (a processor, in this embodiment), further configured to generate object data, as discussed herein.
Based on a feedback loop 6, the illuminator (transmitter Tx) is controlled.
According to the embodiment of Fig. 1, a hand pose is estimated with a high update rate and low latency. The dToF 2 has a sparse output and hence, a sparse projector (transmitter Tx) is used, but the present disclosure is not limited in that regard. The projector is dynamically configurable, i.e., the projected dots may be changed according to the circumstances, e.g., based on the feedback loop. Multiple exposures are used (spatially and temporally separated) to provide a full (or partial, in some embodiments) scan of the field of view.
The EVS 3 is sensitive to passive light and to active light from the transmitter Tx (which, in this embodiment, is in an infrared range).
In this embodiment, it is assumed that a distance b (referring to the baseline between the projector Tx and the EVS) is very small. Thus, it can be estimated where to expect the projected dots in the EVS frame. If there is a significant baseline (b > 0 or b >> 0), the projection of the dots in the EVS frame also depends on the depth.
In this case, the depth is known from the ToF measurement, such that it is possible to determine where the dot is expected to appear in the EVS frame even in the case of an existing baseline (b > 0 or b >> 0).
However, according to the present disclosure, this knowledge is used to refine the depth.
For illustrational purposes, it is assumed, in the following, that the projected dot in the EVS frame is found (either using the depth value from the ToF and the reprojection equations or in some cases, this is not even necessary, and it can be found because it is the only dot that appears in the EVS frame at a given time). Then, the coordinates of the projected dot in the EVS frame can be used, and, with a known baseline and calibration matrices of the EVS and projector, it may be possible to triangulate the ray of the projected dot and the ray of the laser beam (transmitter beam), such that the depth can be calculated.
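The triangulation mentioned above can be sketched as a closest-point intersection of the two rays, assuming both rays are expressed in EVS coordinates and the calibration matrices and baseline are known; this is one possible formulation under those assumptions, not necessarily the one used in the disclosure.

```python
import numpy as np

def triangulate_dot(x, y, K_evs, ray_dir_tx, tx_origin):
    """Triangulate the 3D dot position from the EVS ray through pixel (x, y)
    (camera at the origin) and the transmitter ray starting at `tx_origin`
    with direction `ray_dir_tx`, both expressed in EVS coordinates.
    Returns the midpoint of the closest points of the two rays (least-squares
    intersection); a sketch under the stated calibration assumptions.
    """
    d1 = np.linalg.inv(K_evs) @ np.array([x, y, 1.0])
    d1 /= np.linalg.norm(d1)                       # EVS viewing ray
    d2 = ray_dir_tx / np.linalg.norm(ray_dir_tx)   # transmitter (laser) ray
    o1, o2 = np.zeros(3), np.asarray(tx_origin, dtype=float)
    # Solve for ray parameters s, t minimizing |(o1 + s*d1) - (o2 + t*d2)|.
    A = np.array([[d1 @ d1, -d1 @ d2],
                  [d1 @ d2, -d2 @ d2]])
    b = np.array([d1 @ (o2 - o1), d2 @ (o2 - o1)])
    s, t_param = np.linalg.solve(A, b)
    return 0.5 * ((o1 + s * d1) + (o2 + t_param * d2))
```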
Moreover, according to the present disclosure, structured light can be used as an additional depth measurement in case the ToF is not reliable or inaccurate (e.g., if the depth is very low) or to further refine the depth.
In some embodiments, “calibration” and “structured-light measurement” are carried out at the same time assuming partial calibration and depth information are known.
For example, if a baseline between Tx and EVS is present (b > 0) and if the system is calibrated, this means there may be mathematical equations which describe the relation between a ray of incident light (EVS) and a ray of outgoing light (Tx). Every possible point of the laser ray also lies on a line in the EVS frame (the epipolar line).
Since the system is calibrated, the epipolar line of the laser beam may be known, i.e., the dot projection in the EVS frame lies somewhere on this line. The exact position of the dot in the EVS frame may depend on the actual depth of the scene at this point. If there are multiple dots, there might be many candidates on that line (known as the stereo correspondence problem). With additional information, the right dot correspondence may be found. This additional information is the original depth measurement of the dot (known from the ToF measurement), in some embodiments. Thereby, a good initial estimate can be determined of where the projected dot in the EVS frame should be. If the ToF depth measurement is perfect, the prediction is perfect, and thus, the predicted spot corresponds to the real spot in the EVS frame.
If there is an error in the depth measurement, the prediction will be at a different point, but likely close, such that it can also be found.
Hence, the detected position may be used to compute depth from structured-light and correct the ToF measurement.
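The selection of the correct dot on the epipolar line can be sketched as a nearest-neighbour choice around the position predicted from the ToF depth (for instance with the reprojection sketch above); the candidate representation used here is an assumption.

```python
def resolve_dot_correspondence(predicted_xy, candidates_xy):
    """Among the dot candidates detected on the epipolar line, pick the one
    closest to `predicted_xy`, the position predicted from the ToF depth
    (e.g., with the tof_to_evs_pixel sketch above). Returns the chosen
    candidate and its distance to the prediction.
    """
    x_pred, y_pred = predicted_xy
    best, best_dist = None, float("inf")
    for x, y in candidates_xy:
        dist = ((x - x_pred) ** 2 + (y - y_pred) ** 2) ** 0.5
        if dist < best_dist:
            best, best_dist = (x, y), dist
    return best, best_dist
```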
More generally, if the prediction and observed position in the EVS frame differ, this may mean that either the ToF measurement is wrong, or the calibration is unprecise.
This information may be used, for example, in probabilistic filtering such as Kalman filtering; alternatively, this could be formulated as a typical maximum likelihood estimation (MLE) problem in which some parameters, such as calibration parameters but also depth, are jointly estimated based on observed values and known priors, without limiting the present disclosure in that regard.
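As a minimal illustration of such a joint estimate, a scalar inverse-variance fusion of the ToF depth and the structured-light depth is shown below; it corresponds to a single Kalman/maximum-likelihood update step, and the assumption that both observations are independent with known variances is made purely for the sketch.

```python
def fuse_depths(z_tof, var_tof, z_stl, var_stl):
    """Scalar maximum-likelihood fusion of two independent depth observations
    (equivalently, one Kalman update step with a static state): the fused depth
    is the inverse-variance weighted mean. Variances are assumed to be known
    from sensor characterization.
    """
    w_tof, w_stl = 1.0 / var_tof, 1.0 / var_stl
    z_fused = (w_tof * z_tof + w_stl * z_stl) / (w_tof + w_stl)
    var_fused = 1.0 / (w_tof + w_stl)
    return z_fused, var_fused
```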
The additional information used herein corresponds to the projected dots in the EVS frame, in some embodiments, which can be used either to refine the calibration (live calibration), for ToF depth measurements, or both.
The case b~0, as depicted in Fig. 1, is a simplified version of the general case. With no (or negligibly small) baseline, there is no need for triangulation or the use of structured light, in some embodiments. On the other hand, the projected dots in the EVS frame do not (or only negligibly) depend on the depth, such that in this case, “only” live-calibration may be carried out since the depth may not be determinable based on the dots in the EVS frame.
According to the present disclosure, high speed depth processing is possible. Sub-depth maps (obtained based on sub-exposures) of the dToF 2 can be directly processed by the HPE 5 without waiting for the full depth scan to be completed. This is possible since high speed data are provided by the EVS 3 which provides a two-dimensional projection from the sub-exposure locations and can be used to put the depth measurements into the right context of the hand 4.
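For example, the interleaving of high-rate EVS data with partial depth scans may be sketched as follows (a non-limiting Python example; evs_stream, tof_substream, and hpe are purely hypothetical interfaces used for illustration):

def process_streams(evs_stream, tof_substream, hpe):
    """Process ToF sub-depth maps as soon as they arrive, in the 2D context
    given by the high-rate EVS data, instead of waiting for the full scan.
    """
    state = hpe.initial_state()
    for evs_slice in evs_stream:                   # high-rate 2D updates
        state = hpe.update_with_events(state, evs_slice)
        sub_depth = tof_substream.poll()           # None unless a sub-exposure finished
        if sub_depth is not None:
            # The dots visible in the EVS slice relate the sparse depths to the hand.
            state = hpe.update_with_sparse_depth(state, sub_depth, evs_slice)
        yield hpe.pose(state)                      # pose output at the EVS rate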
Moreover, live auto-calibration can be achieved according to the embodiments of the present disclosure. The EVS data and the dToF data may be automatically and continuously registered at the highest possible precision without the need of an extrinsic calibration model (since they are oriented based on the dots). Furthermore, power efficient light projection may be achieved. If the projector pattern is dynamically controlled, the output of the HPE 5 and/or the EVS data may be used in the (fast) feedback loop 6 to determine which region should be illuminated.
Fig. 2 depicts a method 10 for object tracking, as it is generally known.
Based on a ToF camera and a global shutter (GS) camera, a hand pose output is generated at thirty frames per second (30 fps) (which corresponds to a frame rate of the ToF camera). The hand pose estimation waits for both outputs (of the ToF and the GS camera).
However, it has been recognized that a 2D-3D data correspondence problem may be present. If, during an acquisition, the object moves quickly, it may not be determinable which part of the object acquired with the GS camera corresponds to which depth.
Therefore, in such a system, a good extrinsic calibration (which may also be robust over the lifetime of the system) may be needed, which can be omitted according to the embodiments of the present disclosure.
Moreover, motion blur may deteriorate the GS acquisition, such that the issue described above may be even more severe.
Fig. 3 depicts a method 20 for object tracking according to the present disclosure.
In contrast to the known method described under reference of Fig. 2, many more EVS acquisitions are obtained than ToF acquisitions.
The ToF acquisitions are associated with the correct EVS acquisitions, thereby determining a correspondence. Hence, no 2D-3D data correspondence problem may be present, in this embodiment, as discussed under reference of Fig. 4.
In Fig. 4, it is shown that, if a hand moves down quickly, each ToF sub-acquisition determines a different location of the hand, but in the EVS acquisitions, the movement of the hand can be determined in more detail, such that the hand can be tracked, as shown on the bottom of Fig. 4 depicting the light spots along which the hand moved.
Fig. 5 depicts an embodiment for implicitly generating the object data according to the present disclosure. A neural network 30 is configured to estimate a hand pose, i.e., to function as an HPE, as discussed under reference of Fig. 1.
The neural network 30 includes a task head 31 which takes a state as an input and predicts an output, based on a state output by a respective predictor of the predictors 35 to 38 (as discussed in the following). Moreover, the neural network 30 includes a depth and event predictor 35, two event predictors 36 and 37, as well as a further depth and event predictor 38, wherein the depth and event predictors 35 and 38 may be the same and the two event predictors 36 and 37 may also be the same, in some embodiments; only their inputs may be different since the input may correspond to an output of a preceding predictor.
The predictors 35 and 38 are configured to obtain sparse depth data and event data (sparse depth + events), whereas the predictors 36 and 37 are configured to obtain (only) event data.
Furthermore, a current state z_k is fed into the predictor 35 and into the task head 31. A state z_k+1 is fed into the predictor 36 and the task head 31, and so on.
More generally speaking, the predictors take as input a current state as well as measurement input, and output an updated state (e.g., in view of z_k, z_k+1 is an updated state).
The task head takes the respective state as input and computes a high frequency output (3D hand pose).
According to the present disclosure, a state is a collection (at least one) of features or feature embeddings. These features can be predicted by both predictor types, but they may be more accurate when predicted by the predictors 35 or 38.
The task head uses these features to predict the (hand) output. A feature may, for example, include a “finger edge”, a “hand corner”, a “thumb”, or the like, if a hand is supposed to be tracked. More generally, a feature may correspond to information of the object to be tracked or detected and the feature may be learned by the neural network.
In this embodiment, the neural network is a convolutional neural network, but the present disclosure is not limited to any type of neural network.
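For example, the predictor/task-head structure may be sketched as follows (a non-limiting example using PyTorch; the layer sizes, the GRU-based state update, and the number of joints are illustrative assumptions and not the disclosed architecture):

import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Updates the state from the previous state and a measurement tensor."""
    def __init__(self, meas_ch, state_dim=128):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(meas_ch, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, state_dim))
        self.update = nn.GRUCell(state_dim, state_dim)

    def forward(self, state, measurement):
        return self.update(self.encode(measurement), state)

class TaskHead(nn.Module):
    """Maps a state to the output, e.g. 21 hand keypoints in 3D."""
    def __init__(self, state_dim=128, n_joints=21):
        super().__init__()
        self.n_joints = n_joints
        self.fc = nn.Linear(state_dim, n_joints * 3)

    def forward(self, state):
        return self.fc(state).view(-1, self.n_joints, 3)

# Two predictor types: one for events only, one for events plus sparse depth.
event_pred = Predictor(meas_ch=2)        # e.g. positive/negative event channels
depth_event_pred = Predictor(meas_ch=3)  # event channels plus a sparse depth channel
head = TaskHead()

z_k = torch.zeros(1, 128)                                    # initial state
z_k1 = depth_event_pred(z_k, torch.zeros(1, 3, 64, 64))      # sparse depth + events arrive
pose_k1 = head(z_k1)                                         # high-frequency 3D hand pose
z_k2 = event_pred(z_k1, torch.zeros(1, 2, 64, 64))           # events only
pose_k2 = head(z_k2)

Here, the same depth-and-event predictor instance and the same event-only predictor instance are reused across time steps, consistent with the description above.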
The inputs to the predictors and to the task heads may be formatted in any way. For example, in this embodiment, they are formatted as fixed size tensors, such that events are drawn into a fixed size space-time histogram for concatenation with other data, thereby being suitable for the used CNN. On the other hand, inputs may be partitioned as tokens in case of a spatio-temporal transformer network.
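For example, drawing events into a fixed size space-time histogram may be sketched as follows (a non-limiting Python example; the channel layout is an illustrative choice):

import numpy as np

def events_to_histogram(events, height, width, n_bins, t0, t1):
    """Draw events into a fixed-size space-time histogram.

    events : (N, 4) array with columns x, y, timestamp, polarity (+1/-1),
             with x and y assumed to lie within the sensor resolution
    Returns an array of shape (2, n_bins, height, width), one slice per polarity,
    which may be reshaped to (2 * n_bins, height, width) channels for a CNN.
    """
    hist = np.zeros((2, n_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = (events[:, 3] > 0).astype(int)                                 # 0 = negative, 1 = positive
    b = np.clip(((events[:, 2] - t0) / (t1 - t0) * n_bins).astype(int), 0, n_bins - 1)
    np.add.at(hist, (p, b, y, x), 1.0)                                 # accumulate event counts
    return hist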
A neural network may also be used for (implicitly) carrying out a live auto-calibration, as described herein.
However, such a calibration may be carried out explicitly, in some embodiments. For example, a coordinate transformation may be used for projecting ToF coordinates into the EVS coordinate system, or vice versa. For example, the following formula may be used:
(x', y', 1)^T ≃ K_EVS · ( R · λ · K_ToF^(-1) · (u, v, 1)^T + t ),

where "≃" denotes equality up to the homogeneous scale, i.e., the right-hand side is divided by its third component to obtain x' and y'.

In the above formula, x' and y' are predictions, in EVS image coordinates, of the ToF measurement (u, v, z), based on which the “real” x- and y-image coordinates are determined. u and v are the coordinates of the ToF dot in the ToF image. λ is a scale factor which corresponds to the measured z-coordinate in the ToF system. x and y are the observed image coordinates of the active dot (corresponding to (u, v)) in the EVS image.
It is assumed, in this embodiment, that the EVS has its optical center at the Cartesian coordinates (0, 0, 0). The ToF system has a translation t and a rotation R (R, t are the extrinsics) with respect to the EVS camera.
The matrix K_EVS corresponds to a 3x3 intrinsic calibration matrix. It projects EVS camera coordinates into EVS image coordinates. The matrix K_ToF^(-1) is the inverse of K_ToF, which is a 3x3 intrinsic calibration matrix that projects ToF system coordinates into ToF image coordinates.
Based on such a formula, the dToF dot (u, v) is projected into an EVS frame (x', y'), the real image location (x, y) (in the EVS frame) is determined, and all events generated by the dot itself are filtered. Thereby, the EVS image is augmented with a depth value providing a depth map (x, y, z), which is used in addition to filtered EVS data for further processing.
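For example, the projection of a ToF dot into the EVS frame may be implemented as follows (a non-limiting Python sketch of the formula above; the function name is illustrative):

import numpy as np

def project_tof_dot_to_evs(u, v, z, K_tof, K_evs, R, t):
    """Project a ToF dot (u, v) with measured depth z into EVS image coordinates (x', y')."""
    p_tof = z * np.linalg.inv(K_tof) @ np.array([u, v, 1.0])   # 3D point in ToF coordinates
    p_evs = R @ p_tof + t                                      # 3D point in EVS coordinates
    xh = K_evs @ p_evs                                         # homogeneous EVS image point
    return xh[0] / xh[2], xh[1] / xh[2]                        # normalized (x', y')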
Additionally or alternatively, depth may be further refined. If a good calibration between dToF Tx and EVS is already present, depth may be triangulated based on structured light (STL) triangulation by finding temporal correspondences.
STL depth may be computed when the ToF measurement is unreliable (e.g., too close to the object, multi-path interference (MPI), or the like). STL may be carried out explicitly (e.g., triangulation) or implicitly (e.g., neural network).
Backprojection (with a known but not necessarily precise calibration) (x', y'; as described above) may be used to resolve STL ambiguity and find the real (x, y)-image coordinates in case there are multiple correspondences on a same epipolar line. Thereby, unique correspondences may be found which may not be determinable only based on temporal information.
In some embodiments, an MLE (maximum likelihood estimation) method may be used to estimate both calibration parameters and depth by maximizing a likelihood of the calibration parameters and the depth based on an estimate/observation of the real location and based on a probabilistic model of the ToF depth measurement and the calibration parameters (e.g., based on the original factory calibration).

Fig. 6 depicts a further embodiment of an object tracking method 40 according to the present disclosure. In contrast to the embodiment of Fig. 3, in this embodiment, the frame rate is not constant, but varies.
For high precision, high speed, high power (e.g., for determining whether a user interacts with an object), a higher frame rate is used, whereas, when it is determined that the user does not interact, a lower frame rate, in a “low precision, low power mode” is used.
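For example, the mode selection may be sketched as follows (a non-limiting Python example; the frame rates, the distance thresholds, and the hysteresis band are illustrative assumptions, the latter being one possible way to avoid rapid toggling between modes):

def select_frame_rate(distance_m, current_fps, near=0.15, far=0.25,
                      high_fps=30, low_fps=5):
    """Choose the ToF frame rate from the distance between the tracked hand and
    the virtual object, with a hysteresis band against noisy measurements.
    """
    if distance_m < near:
        return high_fps        # high precision, high speed, high power
    if distance_m > far:
        return low_fps         # low precision, low power
    return current_fps         # inside the hysteresis band: keep the current mode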
Fig. 7 depicts, on the left, a conventional ToF illuminator 50 in which a scene is illuminated in a predetermined pattern (i.e., each cell of a grid is illuminated one after another). Hence, energy is allocated statically in time and space.
On the right, Fig. 7 depicts a ToF illuminator 55 according to the present disclosure in which the cells of the grid are illuminated based on the events detected in an EVS. Thereby, energy is allocated dynamically according to scene conditions.
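For example, the dynamic allocation may be sketched as follows (a non-limiting Python example; the per-cell event counting and the notion of an illumination budget are illustrative assumptions):

import numpy as np

def select_cells_to_illuminate(event_counts, budget):
    """Allocate the illumination budget to the grid cells with the most recent
    event activity instead of scanning all cells in a fixed order.

    event_counts : (rows, cols) number of recent events per projector grid cell
    budget       : number of cells that can be illuminated in this cycle
    Returns a list of (row, col) cells to illuminate.
    """
    order = np.argsort(event_counts, axis=None)[::-1][:budget]   # most active cells first
    return [tuple(np.unravel_index(i, event_counts.shape)) for i in order]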
Fig. 8 depicts an embodiment of an AR/VR (augmented reality/virtual reality) system 60 used as a head-mounted device. The system 60 includes object tracking circuitry according to the present disclosure, a ToF unit (shown as Tx and Rx, as discussed above), and an EVS unit. The EVS unit includes two input regions 61 (two EVS sensors), which each have a distance “b” with respect to the ToF Tx (wherein, in some embodiments, the distances of the two input regions may differ).
According to the present embodiment, due to the two input regions 61, two monocular tracking regions (right monocular 2.5D tracking region 62 and left monocular 2.5D tracking region 63) are defined. For example, due to the left and right tracking regions 62 and 63, improved tracking of a left hand and a right hand is possible.
Moreover, a center 3D tracking region 64 is defined due to the ToF unit, thereby providing for precise and fast 3D (hand) tracking, which may be required for immersive virtual object interaction, such as virtual keyboard, or the like.
In some embodiments, if the EVS regions overlap, stereo acquisition may be envisaged.
Fig. 9a depicts an embodiment in which a head-mounted device (HMD) 70 tracks and illuminates a hand 71 of a user wearing the HMD 70. When it is recognized, based on EVS data, that the hand 71 is far away from a virtual object 72, ToF acquisition is omitted, as shown on the left of Fig. 9a.
On the right, it is depicted that, when it is recognized, based on the EVS data, that the hand 71 comes close (e.g., below a predetermined distance) to the virtual object, active illumination of the hand 71, and thus, a ToF measurement, is carried out. In contrast to this, Fig. 9b depicts the case that, when it is recognized, based on the EVS data, that the hand 71 comes close to the virtual object 72, the area of the virtual object 72 is actively illuminated.
However, in some embodiments, both areas may be illuminated, i.e., the embodiments of Figs. 9a and 9b may be combined.
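For example, the selection of the illuminated region(s) may be sketched as follows (a non-limiting Python example; the threshold value and the region labels are illustrative assumptions):

import numpy as np

def illumination_regions(hand_xyz, object_xyz, threshold_m=0.2,
                         illuminate_hand=True, illuminate_object=True):
    """Decide which regions to actively illuminate for the ToF measurement once
    the hand comes close to the virtual object.

    Returns a list of region labels; an empty list means ToF acquisition is skipped.
    """
    if np.linalg.norm(np.asarray(hand_xyz) - np.asarray(object_xyz)) >= threshold_m:
        return []                                  # hand far away: no ToF measurement
    regions = []
    if illuminate_hand:
        regions.append("hand_area")                # as in the embodiment of Fig. 9a
    if illuminate_object:
        regions.append("virtual_object_area")      # as in the embodiment of Fig. 9b
    return regions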
Fig. 10 depicts an object tracking method 80 according to the present disclosure in a block diagram.
At 81, ToF data is obtained, as discussed herein, i.e., a scene is illuminated with light spots/dots and a ToF measurement is carried out.
At 82, image data are obtained. In this embodiment, the image data are event sensing data, as discussed herein.
At 83, object data are generated with a neural network, as discussed herein.
Fig. 11 depicts, in a block diagram, a further embodiment of an object tracking method 90 according to the present disclosure, which differs from the object tracking method 80 of Fig. 10 in that, after the obtaining of image data, at 93, a correspondence between the ToF data and the image data is determined based on a coordinate transformation, as discussed herein.
Fig. 12 depicts a further embodiment of an object tracking method according to the present disclosure in a block diagram. In this embodiment, it is decided, based on the image data, whether ToF data should be obtained.
At 101, image data are obtained, as discussed herein, as event sensing data.
At 102, based on the image data, it is determined whether there is an indication for a ToF measurement, i.e., it is determined whether a tracked object is within a predetermined distance to a virtual object.
If no, ToF data are not obtained.
If yes, at 103, ToF data are obtained.
At 104, object data are generated, as discussed herein.
It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of method steps is however given for illustrative purposes only and should not be construed as binding. For example, the ordering of 81 and 82 in the embodiment of Fig. 10 may be exchanged. Also, the ordering of 91 and 92, as well as 93 and 94 in the embodiment of Fig. 11 may be exchanged.

The methods described herein can also be implemented as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor. In some embodiments, also a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the method described to be performed.
All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software.
In so far as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure.
Note that the present technology can also be configured as described below.
(1) Object tracking circuitry configured to: obtain time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtain image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generate object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
(2) The object tracking circuitry of (1), wherein the image data include event sensing data.
(3) The object tracking circuitry of (1) or (2), wherein the association is carried out for each time-of-flight measurement of a plurality of time-of-flight measurements.
(4) The object tracking circuitry of any one of (1) to (3), wherein the association is an association in time.
(5) The object tracking circuitry of any one of (1) to (4), wherein the time-of-flight data have a first capture rate, wherein the image data have a second capture rate, and wherein the second capture rate is greater than the first capture rate, and wherein the association is further based on the first and the second capture rate.
(6) The object tracking circuitry of any one of (1) to (5), further configured to: determine a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data.
(7) The object tracking circuitry of any one of (1) to (6), wherein the association is further based on an emission frequency of the light emitted for the time-of-flight measurement.
(8) The object tracking circuitry of any one of (1) to (7), wherein the association is further based on a neural network.
(9) The object tracking circuitry of any one of (1) to (8), wherein the object data are used for hand pose estimation.
(10) The object tracking circuitry of any one of (1) to (9), further configured to: carry out a time-of-flight measurement when it is indicated, in the image data, that a predetermined condition is fulfilled.
(11) An object tracking method comprising: obtaining time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtaining image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generating object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
(12) The object tracking method of (11), wherein the image data include event sensing data.
(13) The object tracking method of (11) or (12), wherein the association is carried out for each time-of-flight measurement of a plurality of time-of-flight measurements.
(14) The object tracking method of any one of (11) to (13), wherein the association is an association in time.
(15) The object tracking method of any one of (11) to (14), wherein the time-of-flight data have a first capture rate, wherein the image data have a second capture rate, and wherein the second capture rate is greater than the first capture rate, and wherein the association is further based on the first and the second capture rate.
(16) The object tracking method of any one of (11) to (15), further comprising: determining a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data.
(17) The object tracking method of any one of (11) to (16), wherein the association is further based on an emission frequency of the light emitted for the time-of-flight measurement.
(18) The object tracking method of any one of (11) to (17), wherein the association is further based on a neural network.
(19) The object tracking method of any one of (11) to (18), wherein the object data are used for hand pose estimation.
(20) The object tracking method of any one of (11) to (19), further comprising: carrying out a time-of-flight measurement when it is indicated, in the image data, that a predetermined condition is fulfilled.
(21) A computer program comprising program code causing a computer to perform the method according to any one of (11) to (20), when being carried out on a computer.
(22) A non-transitory computer-readable recording medium that stores therein a computer program product, which, when executed by a processor, causes the method according to any one of (11) to (20) to be performed.

Claims

1. Object tracking circuitry configured to: obtain time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtain image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generate object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
2. The object tracking circuitry of claim 1, wherein the image data include event sensing data.
3. The object tracking circuitry of claim 1, wherein the association is carried out for each time-of-flight measurement of a plurality of time-of-flight measurements.
4. The object tracking circuitry of claim 1, wherein the association is an association in time.
5. The object tracking circuitry of claim 1, wherein the time-of-flight data have a first capture rate, wherein the image data have a second capture rate, and wherein the second capture rate is greater than the first capture rate, and wherein the association is further based on the first and the second capture rate.
6. The object tracking circuitry of claim 1, further configured to: determine a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data.
7. The object tracking circuitry of claim 1, wherein the association is further based on an emission frequency of the light emitted for the time-of-flight measurement.
8. The object tracking circuitry of claim 1, wherein the association is further based on a neural network.
9. The object tracking circuitry of claim 1, wherein the object data are used for hand pose estimation.
10. The object tracking circuitry of claim 1, further configured to: carry out a time-of-flight measurement when it is indicated, in the image data, that a predetermined condition is fulfilled.
11. An object tracking method comprising: obtaining time-of-flight data, based on a time-of-flight measurement, the time-of-flight data indicating the object; obtaining image data indicating the object and at least one light spot, the light spot originating from light emitted for the time-of-flight measurement; and generating object data based on the time-of-flight data, on the image data, and on an association between the time-of-flight data and the image data, the association being based on the at least one light spot detected in the image data.
12. The object tracking method of claim 11, wherein the image data include event sensing data.
13. The object tracking method of claim 11, wherein the association is carried out for each time-of-flight measurement of a plurality of time-of-flight measurements.
14. The object tracking method of claim 11, wherein the association is an association in time.
15. The object tracking method of claim 11, wherein the time-of-flight data have a first capture rate, wherein the image data have a second capture rate, and wherein the second capture rate is greater than the first capture rate, and wherein the association is further based on the first and the second capture rate.
16. The object tracking method of claim 11, further comprising: determining a correspondence of a first coordinate, indicated by the time-of-flight data, to a second coordinate, indicated by the image data.
17. The object tracking method of claim 11, wherein the association is further based on an emission frequency of the light emitted for the time-of-flight measurement.
18. The object tracking method of claim 11, wherein the association is further based on a neural network.
19. The object tracking method of claim 11, wherein the object data are used for hand pose estimation.
20. The object tracking method of claim 11, further comprising: carrying out a time-of-flight measurement when it is indicated, in the image data, that a predetermined condition is fulfilled.
PCT/EP2024/058367 2023-03-28 2024-03-27 Object tracking circuitry and object tracking method WO2024200575A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP23164676 2023-03-28
EP23164676.1 2023-03-28

Publications (1)

Publication Number Publication Date
WO2024200575A1 true WO2024200575A1 (en) 2024-10-03

Family

ID=85778963

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/058367 WO2024200575A1 (en) 2023-03-28 2024-03-27 Object tracking circuitry and object tracking method

Country Status (1)

Country Link
WO (1) WO2024200575A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017191024A (en) * 2016-04-14 2017-10-19 株式会社インザライフ Reference light point irradiation device for image
WO2022148769A1 (en) * 2021-01-11 2022-07-14 Sony Semiconductor Solutions Corporation Time-of-flight demodulation circuitry, time-of-flight demodulation method, time-of-flight imaging apparatus, time-of-flight imaging apparatus control method
WO2022184557A1 (en) * 2021-03-03 2022-09-09 Sony Semiconductor Solutions Corporation Time-of-flight data generation circuitry and time-of-flight data generation method


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24715173

Country of ref document: EP

Kind code of ref document: A1