WO2019015761A1 - Electronic device, system and method for determining the pose of an object - Google Patents

Publication number
WO2019015761A1
Authority
WO
WIPO (PCT)
Prior art keywords
pose
image data
electronic device
scene
model
Prior art date
Application number
PCT/EP2017/068388
Other languages
French (fr)
Inventor
Sven Meier
Norimasa Kobori
Wadim KEHL
Original Assignee
Toyota Motor Europe
Application filed by Toyota Motor Europe
Priority to PCT/EP2017/068388
Priority to JP2020502372A
Publication of WO2019015761A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/76Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries based on eigen-space representations, e.g. from pose or different illumination conditions; Shape manifolds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20076Probabilistic image processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose


Abstract

The invention relates to an electronic device (1) for determining the pose of an object. The electronic device is configured to: • receive 3D image data of an optical sensor (3), the 3D image data representing the object (O) in a scene, • estimate the object pose with respect to the optical sensor position based on the 3D image data, • identify the closest one from a set of predetermined view positions of a predetermined 3D object model based on the estimated object pose, and • determine the pose of the object in the scene based on the identified closest view position. The invention further relates to a system and a method.

Description

Electronic device, system and method for determining the pose of an object
FIELD OF THE DISCLOSURE
[0001] The present disclosure is related to an electronic device, system and method for determining the pose of an object, in particular for recognizing the pose of a non-stationary object in a scene, where the object pose is not predefined.
BACKGROUND OF THE DISCLOSURE
[0002] Automation is becoming increasingly important in many fields, which also implies an increased need for robotics. While robotic systems have become common in the industrial field, their use is still rather uncommon in the environment of a domestic dwelling, e.g. to serve individual users in daily life. However, also in this field there is a high demand for robotic systems. For example, a robotic system may help an aged person to find and retrieve a specific object, e.g. a pencil.
[0003] One problem with the use of robotic systems in a domestic dwelling is that, in contrast to an industrial application, many tasks cannot be standardized, i.e. predefined and tightly controlled. Thus the robotic system has to be capable of carrying out individually varying tasks. Moreover, the operating conditions in a domestic dwelling are more challenging, e.g. lighting, object disposition etc. Also in other areas where robotic systems are used, the pose of an object of interest can be unknown, i.e. it is not necessarily predefined.
[0004] An important aspect of a robotic system is therefore its capability to find and recognize a specific object, which may be positioned in any location and in any orientation. For this purpose the robotic system may comprise an optical sensor and may be moveable, e.g. may have drivable wheels.
[0005] A further challenge is to determine the pose of an object, e.g. an object which has been recognized in a sensed scene. Determining the exact pose (in particular the 6D pose) can be advantageous, for example, when the object is to be picked up or otherwise manipulated by a robotic system.
[0006] The iterative closest point (ICP) algorithm is one example of an algorithm for matching two shapes such as point clouds or normals, i.e. 3D RGB data. This type of data is commonly used in robotics applications and constitutes a key component algorithm used for pose refinement during object recognition.
[0007] In known object recognition the ICP algorithm is used to match a point cloud coming from the robot's 3D (RGB-D) sensor with a point cloud of a previously known model of a target object. The output of the algorithm is the transformation of the model that best fits the observed data. This is referred to as model-based ICP. Accordingly, in such object detection scenarios, one point cloud comes from a model hypothesis and needs to be aligned to point cloud data from the sensor, hereafter referred to as the 'scene'.
[0008] However, such a conventional approach is computationally costly in finding correspondences between points in the source and destination point clouds as it requires calculating distances between all point pairs to find the closest match. In particular, the use in real-time scenarios can be difficult, as the rendering steps are costly to perform and therefore may cause undesired delays.
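For illustration only, the following is a minimal numpy sketch of the brute-force correspondence step that makes this conventional approach costly; the array names and shapes are assumptions, not taken from the disclosure.

```python
import numpy as np

def brute_force_correspondences(source, scene):
    """For each source (model) point, find the index of the closest scene point.

    source: (N, 3) array, scene: (M, 3) array.  This is the O(N*M) matching
    step that KD-trees -- and the pre-rendered look-up approach of this
    disclosure -- aim to avoid.
    """
    # Pairwise squared distances, shape (N, M)
    d2 = ((source[:, None, :] - scene[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```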
[0009] An object recognition ICP technique using a point-to-plane approach is known e.g. from:
Yang Chen, and Gerard Medioni: Object Modeling by Registration of Multiple Range Images. International Journal of Image and Vision Computing, 10(3), pp. 145-155, 1992.
[0010] In order to accelerate this comparison method, it is known to use a KD-tree (k-dimensional tree) algorithm. However, the use of a KD-tree itself requires building said tree, which is also computationally costly in terms of memory and time. An example can be found in:
K. Tateno, D. Kotake, and S. Uchiyama. Model-based 3D Object Tracking with Online Texture Update. In MVA, 2009.
[0011] US2012114251 (A1) discloses a system for object recognition of a 3D object of a certain object class using a statistical shape model for recovering 3D shapes from a 2D representation of the 3D object and comparing the recovered 3D shape with known 3D to 2D representations of at least one object of the object class.
SUMMARY OF THE DISCLOSURE
[0012] Currently, it remains desirable to provide an electronic device, system and method for determining the pose of an object requiring a reduced computation effort, in particular during the object pose recognition, e.g. in a real-time scenario.
[0013] Therefore, according to the embodiments of the present disclosure, an electronic device for determining the pose of an object is provided. The electronic device is configured to:
- receive 3D image data of an optical sensor, the 3D image data representing the object in a scene,
- estimate the object pose with respect to the optical sensor position based on the 3D image data,
- identify the closest one from a set of predetermined view positions of a predetermined 3D object model based on the estimated object pose, and
- determine the pose of the object in the scene based on the identified closest view position.
[0014] By providing such an electronic device, a faster implementation of an algorithm for matching two shapes such as point clouds or normals, i.e. 3D RGB data, is provided. This type of data is commonly used in robotics applications and constitutes a key component in algorithms used for pose refinement during object recognition.
[0015] In other words, according to the present disclosure the model is desirably pre-rendered, in order to obtain the predetermined view positions (and hence the predetermined views) of the model. Consequently, in pose recognition this data is desirably used for pose refinement without the need for further time-costly rendering. This makes the method particularly interesting for real-time scenarios, as costly rendering steps can be reduced or even avoided during object pose recognition.
[0016] Accordingly, the concept and algorithm proposed by the present disclosure may be used in ICP instead of the conventional KD-tree technique. In particular, the concept and algorithm of the present disclosure may be used for retrieval of the correspondences between the point clouds of the sensed object and the object model.
[0017] The optical sensor desirably senses the scene in which the object is.
[0018] By estimating the object pose with respect to the optical sensor position, the corresponding optical sensor position with respect to the object pose (i.e. in the object space) is desirably estimated at the same time.
[0019] The identification of the closest one from a set of predetermined view positions of a predetermined 3D object model is based on the estimated object pose. It may further be based on the 3D image data representing the object (O) and/or pre-rendered image data of the object model, when seen from the different view positions. In particular, the 3D image data representing the object (O) may be compared with the image data (one data set for each view, comparison with each data set) of the object model. The estimated object pose may be used in the identification process as a guide to find the pose to be determined and/or as a starting point. A predetermined view position is desirably a predetermined point of view from which the object is seen. The predetermined view positions may be distributed at equal distances from each other on one or several orbits around the object or on a sphere around the object. There may be e.g. several hundred (e.g. more than 300) view positions.
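One possible way to obtain a few hundred approximately equidistant view positions on a unit sphere around the model is a Fibonacci lattice. This particular construction is an assumption for illustration; the disclosure only requires the views to be (roughly) evenly distributed.

```python
import numpy as np

def fibonacci_sphere(n_views=300):
    """Return (n_views, 3) unit vectors spread approximately evenly on a sphere."""
    i = np.arange(n_views)
    golden = (1 + 5 ** 0.5) / 2
    z = 1 - 2 * (i + 0.5) / n_views            # heights in (-1, 1)
    r = np.sqrt(1 - z ** 2)                    # radius of the circle at height z
    phi = 2 * np.pi * i / golden               # azimuth driven by the golden ratio
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

view_dirs = fibonacci_sphere(300)              # one unit view vector v_i per pre-rendered view
```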
[0020] The 3D image data of the optical sensor (3) may comprise a point cloud, and/or the 3D object model comprises a point cloud.
[0021] Accordingly, in order to identify the closest view and/or determine the object pose, these point clouds may be compared to each other, or a data set of the rendered point cloud of the 3D image data may be compared to data sets of the pre-rendered views of the point cloud of the model.
[0022] Estimating the object pose may comprise: determining a pose hypothesis of the object by estimating its pose in the scene, and estimating the object pose with respect to the optical sensor position based on the pose hypothesis.
[0023] Estimating the object pose or receiving the 3D image data may comprise: recognizing the object in the scene based on the received 3D image data.
[0024] Accordingly, an object recognition is performed at the beginning of the process, e.g. when the 3D image is received or during (or before) the identification of the closest view. Alternatively, the device may receive the 3D image data, wherein these data already comprise information regarding a recognized object.
[0025] Identifying the closest one from the set of predetermined view positions may be based on determining the sensor position (i.e. its pose or 6D pose) in object space and finding the best matching view thereto from the set of predetermined view positions.
[0026] Accordingly, a simple comparison of the (desirably rendered) data of the sensor view with (pre-) rendered data of the model in the different predetermined views may be performed, in order to identify the closest one of the predetermined view positions.
[0027] Each of the set of predetermined view positions may be linked to a recoded data set, said recoded data set representing rendered image data of the object model when seen from the view position.
[0028] For example, the sets of predetermined view positions may be linked to the recoded data sets in one or several look-up tables (e.g. one per view position). The electronic device may comprise a data storage providing said look-up table and/or providing the sets of predetermined view positions linked to the recoded data sets. A look-up table may also comprise the recoded data set or sets.
[0029] The rendered image data of a recoded data set may comprise: a subsampled point cloud of the object model, a subsampled contour of the model, and/or a subsampled surface model of the model.
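A minimal sketch of how each view position could be linked to its recoded data set in a look-up table follows; the class and field names are assumptions, as the disclosure only requires that each view position references its pre-rendered, subsampled data.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RecodedView:
    """Pre-rendered data of the object model as seen from one view position."""
    view_dir: np.ndarray   # v_i: 3D unit vector of the view position
    points: np.ndarray     # subsampled point cloud of the model, (K, 3), local object space
    contour: np.ndarray    # subsampled contour points, (C, 3)
    normals: np.ndarray    # surface normals of the interior points, (K, 3)

# Look-up table: view index -> recoded data set (one entry per predetermined view)
lookup_table: dict[int, RecodedView] = {}
```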
[0030] Identifying the closest one from the set of predetermined view positions may comprise: for each of the predetermined view positions, projecting the rendered image data of the linked recoded data set into the scene, comparing the rendered image data with the 3D image data representing the object in the scene, and determining for which of the predetermined view positions the deviation between the rendered image data and the 3D image data representing the object in the scene reaches a minimum.
[0031] Said deviation may also be referred to as an error which is to be minimized.
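A sketch of this minimum-deviation search is given below. It assumes a helper project_into_scene(view) that applies the current pose hypothesis to a recoded view, and uses the mean nearest-point distance as the error; both the helper and the error measure are assumptions for illustration.

```python
import numpy as np

def closest_view_by_deviation(lookup_table, scene_points, project_into_scene):
    """Return the index of the view whose projected recoded data deviates least
    from the 3D image data representing the object in the scene.

    scene_points: (M, 3) points of the sensed object.
    project_into_scene: assumed helper mapping a RecodedView to (K, 3) scene-space points.
    """
    best_idx, best_err = None, np.inf
    for idx, view in lookup_table.items():
        projected = project_into_scene(view)
        # Deviation: mean distance of each projected model point to its nearest scene point
        d2 = ((projected[:, None, :] - scene_points[None, :, :]) ** 2).sum(axis=-1)
        err = np.sqrt(d2.min(axis=1)).mean()
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx
```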
[0032] The image data may comprise a pair of a visible light image and a depth image. These data may be the input data to the device.
[0033] The visible light image may comprise the visible part of the electromagnetic spectrum, in particular decomposed into the three bands (RGB: red, green, blue) processed by the human vision system.
[0034] The object pose may be a 6D pose comprising x, y, z location information and θ, φ, ψ rotation information.
[0035] More generally, the object pose may be a mathematical description of the location and orientation of an object in a coordinate system.
[0036] Determining the pose of the object in the scene may comprise: determining the θ, φ, ψ rotation information based on the identified closest view from the set of predetermined view positions of the predetermined 3D object model, and/or determining the x, y, z location information based on projecting the model in its closest view into the scene and comparing the projected model with the 3D image data representing the object in the scene.
[0037] The disclosure further relates to a system for determining the pose of an object, the system comprising:
• an electronic device, in particular as described above, and
• an optical sensor configured to sense the object, the sensor being in particular a 3D camera or a stereo camera.
[0038] Accordingly, the system may be configured to autonomously recognize and locate an object and in particular to determine the pose of said object. For example it may be realized as a moveable robotic system, e.g. with means for retrieving the object.
[0039] The disclosure further relates to a method of determining the pose of an object. The method comprises the steps of:
• receiving 3D image data of an optical sensor (3), the 3D image data representing the object (O) in a scene,
• estimating the object pose with respect to the optical sensor position based on the 3D image data,
• identifying the closest one from a set of predetermined view positions of a predetermined 3D object model based on the estimated object pose, and
• determining the pose of the object in the scene based on the identified closest view position.
[0040] The method may comprise the further steps of: determining a plurality of view positions of the object model forming the set of predetermined view positions, for each view position of the set of predetermined view positions: determining a recoded data set, said recoded data set representing rendered image data of the object model when seen from the view position, and linking the view position to the recoded data set.
[0041] Accordingly, the set of predetermined view positions and/or the related recoded data sets may be predetermined. These data may be used in the object pose determination method, i.e. during use for object pose recognition.
[0042] The method may comprise further method steps which correspond to the functions of the electronic device as described above. The further desirable method steps are described in the following.
[0043] The 3D image data of the optical sensor (3) may comprise a point cloud, and/or the 3D object model comprises a point cloud.
[0044] The step of estimating the object pose may comprise: determining a pose hypothesis of the object by estimating its pose in the scene, and estimating the object pose with respect to the optical sensor position based on the pose hypothesis.
[0045] The step of estimating the object pose or receiving the 3D image data may comprise: recognizing the object in the scene based on the received 3D image data.
[0046] The step of identifying the closest one from the set of predetermined view positions may be based on determining the sensor position (i.e. its pose or 6D pose) in object space and finding the best matching view thereto from the set of predetermined view positions.
[0047] Each of the set of predetermined view positions may be linked to a recoded data set, said recoded data set representing rendered image data of the object model when seen from the view position.
[0048] The rendered image data of a recoded data set may comprise: a subsampled point cloud of the object model, a subsampled contour of the model, and/or a subsampled surface model of the model.
[0049] The step of identifying the closest one from the set of predetermined view positions may comprise: for each of the predetermined view positions, projecting the rendered image data of the linked recoded data set into the scene, comparing the rendered image data with the 3D image data representing the object in the scene, and determining for which of the predetermined view positions the deviation between the rendered image data and the 3D image data representing the object in the scene reaches a minimum.
[0050] The image data may comprise a pair of a visible light image and a depth image. These data may be the input data to the device.
[0051] The object pose may be a 6D pose comprising x, y, z location information and θ, φ, ψ rotation information.
[0052] The step of determining the pose of the object in the scene may comprise: determining the θ, φ, ψ rotation information based on the identified closest view from the set of predetermined view positions of the predetermined 3D object model, and/or determining the x, y, z location information based on projecting the model in its closest view into the scene and comparing the projected model with the 3D image data representing the object in the scene.
[0053] It is intended that combinations of the above-described elements and those within the specification may be made, except where otherwise contradictory.
[0054] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure, as claimed.
[0055] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and, together with the description, serve to explain the principles thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0056] Fig. 1 shows a block diagram of a system with an electronic device according to embodiments of the present disclosure;
[0057] Fig. 2a and 2b show an exemplary scene (fig. 2a) where the pose of an object is determined by using a pre-rendered object model (fig. 2b) according to embodiments of the present disclosure;
[0058] Fig. 3 shows a schematic flow chart illustrating an exemplary method of a (preparatory / pre-rendering) offline-processing of an object model;
[0059] Fig. 4 shows a schematic flow chart illustrating an exemplary method of a (in use) online-processing of image data, where an object pose is determined; and
[0060] Fig. 5 shows a schematic flow chart illustrating an exemplary method of updating a pose hypothesis used in the method of fig. 4.
DESCRIPTION OF THE EMBODIMENTS
[0061] Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
[0062] Fig. 1 shows a block diagram of a system 30 with an electronic device 1 according to embodiments of the present disclosure. The system may comprise a robotic system 10, which may have various functions. For example it may be moveable, e.g. have drivable wheels, and it may have means for retrieving an object, e.g. at least one gripper.
[0063] The electronic device 1 carries out a computer vision algorithm for detecting the presence and location (in particular the pose) of objects in a scene. Robotic systems require this information to be able to find, locate and manipulate objects. The input to the electronic device 1 is a pair of visible light (RGB) and depth images (D). The output of the electronic device 1 is the 6D pose (x, y, z location and θ, φ, ψ rotation around x, y, z) of the target object.
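Later in the description the pose hypothesis is expressed as a rotation matrix R and translation vector t (see paragraph [0082]); the 6D output tuple can be derived from R and t as sketched below. The Euler convention R = Rz(ψ)·Ry(φ)·Rx(θ) is an assumption, since the disclosure does not fix one.

```python
import numpy as np

def pose_to_6d(R, t):
    """Convert a 3x3 rotation matrix R and translation t = (x, y, z) into the
    6D pose (x, y, z, theta, phi, psi), assuming R = Rz(psi) @ Ry(phi) @ Rx(theta)."""
    theta = np.arctan2(R[2, 1], R[2, 2])                      # rotation about x
    phi = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))    # rotation about y
    psi = np.arctan2(R[1, 0], R[0, 0])                        # rotation about z
    return t[0], t[1], t[2], theta, phi, psi
```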
[0064] The electronic device 1 is connected to or comprises a data storage 2. Said data storage may be used to store a target object in the form of a 3D model file providing shape (3D) and appearance (color) information of the object.
[0065] The electronic device 1 may additionally carry out further functions in the system 30. For example, the electronic device may also act as the general purpose ECU (electronic control unit) of the system. The electronic device 1 may comprise an electronic circuit, a processor (shared, dedicated, or group), a combinational logic circuit, a memory that executes one or more software programs, and/or other suitable components that provide the described functionality. In other words, device 1 may be a computer device.
[0066] Device 1 may be external to the (movable) robotic system 10 which is configured to find and retrieve the object. In other words, the computational resources on board the robotic system 10 may be limited; e.g. the robotic system may only transmit the 3D data to the external (and e.g. stationary) electronic device 1, e.g. over Wi-Fi. The result determined by the device 1 may be sent back to the robot.
[0067] The electronic device 1 is further connected to an optical sensor 3, in particular a 3D digital camera 3, e.g. a stereo camera or a Microsoft Kinect® camera. The electronic device 1 and the digital camera may be comprised by a robotic system 10. The digital camera 3 is configured such that it can record a three-dimensional scene, and in particular output digital data providing shape (3D) and appearance (color) information of the scene.
[0068] The output of the digital camera 3 is transmitted to the electronic device 1. Desirably, the output is transmitted instantaneously, i.e. in real time or in quasi real time. Hence, a searched object can also be recognized and located (i.e. its pose determined) by the electronic device in real time or in quasi real time.
[0069] The system 30 may additionally comprise a server 20. The server 20 may be used to do the (preparatory / pre-rendering) offline-processing of an object model, as shown e.g. in fig. 3. The pre-rendered (i.e. recoded) data may then be stored on the server or may be provided to the electronic device. For this purpose, the electronic device 1 may be connectable to the server. For example the electronic device 1 may be connected to the server 20 via a wireless connection. Alternatively or additionally the electronic device 1 may be connectable to the server 20 via a fixed connection, e.g. via a cable. It is also possible that a data transfer between the electronic device 1 and server 20 is achieved by using a portable data storage, e.g. a USB stick. Alternatively, the processing of the server may also be performed by the electronic device 1 itself.
[0070] In the following, the principal concept and algorithm of the present disclosure are described with reference to figs. 2 to 5.
[0071] The present disclosure desirably proposes an improved (i.e. accelerated) implementation of the iterative closest point (ICP) algorithm using e.g. look-up tables. ICP is a commonly used algorithm for the alignment of two point clouds. In object detection scenarios, one point cloud comes from a model hypothesis and needs to be aligned to point cloud data from the sensor, hereafter referred to as the 'scene'.
[0072] Fig. 2a and 2b show an exemplary scene (fig. 2a) where the pose of an object is determined by using a pre-rendered object model (fig. 2b) according to embodiments of the present disclosure.
[0073] The image data representing the sensed object O in a scene S in fig. 2a and the object model M in fig. 2b each comprise or consist of point clouds (as schematically indicated in fig. 2a and 2b). Data from the point cloud (e.g. contour and surface normals) may be subsampled to reduce computational cost. This is shown as black dots in fig. 2a and 2b.
[0074] In detail, fig. 2a shows the current object hypothesis and fig. 2b the closest pre-rendered viewpoint (i.e. view position), each with contour and interior sampling points. The black dots on the interior surface area are used to establish the correspondences, i.e. the black dots in the left view are corresponding points d_i from the scene and the black points in the right view are the corresponding source points s_i from the model.
[0075] As shown, the scene of fig. 2a sensed by the image sensor is upside-down and hence not in the same orientation as the object model of fig. 2b. Accordingly, as in this example, the problem may occur that the closest pre-rendered viewpoint can be determined but the pre-rendered model view linked to said viewpoint is rotated in the viewed plane, i.e. e.g. upside down, with regard to the sensed object.
[0076] For this reason, a cheap yet very effective approximation of the model render space may be used that avoids both online rendering and contour extraction. In an offline stage, viewpoints v_i may be sampled equidistantly on a unit sphere around the object model, the model rendered from each viewpoint, and the 3D contour points extracted, so as to store view-dependent sparse 3D sampling sets in local object space. Since these points are desirably utilized in 3D space, it is neither necessary to sample in scale nor for different in-plane rotations. Finally, as shown in fig. 2a and 2b, for each contour point its 2D gradient orientation is desirably stored, and a set of interior surface points with their normals may be stored as well.
[0077] In brief, the contour points may be utilized to provide a pose determination process which is independent of the rotation of the sensed object in the viewed plane with regard to the object model, as shown in the example of fig. 2a, 2b.
[0078] Fig. 3 shows a schematic flow chart illustrating an exemplary method of a (preparatory / pre-rendering) offline-processing of an object model.
[0079] The object model can be pre-rendered from a dense set of viewpoints and all necessary information such as contour, point cloud and/or surface normal data or their combinations can be extracted. This data is stored together with the 3D unit vertex positions v_i (i.e. the view position). Each v_i then holds a reference to its own data in its own local reference frame. This data is provided as recoded data to the (in use) online-processing of image data as shown in fig. 4.
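A rough sketch of this offline stage follows. The rendering helpers render_depth(model, v) and depth_to_points(depth, v) are assumptions (any rendering back-end could provide them), as is the simple 4-neighbour contour test; the disclosure itself only specifies that contour, point cloud and/or normal data are extracted per view and stored with v_i.

```python
import numpy as np

def build_recoded_views(model, view_dirs, render_depth, depth_to_points, n_samples=200):
    """Offline pre-rendering: for each view direction v_i, render the model,
    extract interior and contour samples, and store them in local object space."""
    rng = np.random.default_rng(0)
    recoded = {}
    for i, v in enumerate(view_dirs):
        depth = render_depth(model, v)                   # (H, W) depth image, 0 = background
        mask = depth > 0
        # Contour pixels: covered pixels with at least one uncovered 4-neighbour
        interior = mask.copy()
        for axis in (0, 1):
            for step in (1, -1):
                interior &= np.roll(mask, step, axis=axis)
        contour_mask = mask & ~interior
        points = depth_to_points(depth, v)                            # all covered 3D points
        contour = depth_to_points(np.where(contour_mask, depth, 0), v)
        recoded[i] = {
            "view_dir": v,
            "points": points[rng.choice(len(points), min(n_samples, len(points)), replace=False)],
            "contour": contour[rng.choice(len(contour), min(n_samples, len(contour)), replace=False)],
        }
    return recoded
```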
[0080] Fig. 4 shows a schematic flow chart illustrating an exemplary method of an (in use) online-processing of image data, where an object pose is determined.
[0081] In fig. 4, first the object is recognized. In this context it is noted that fig. 4 shows an example of a (complete) process of recognizing and locating (i.e. determining the pose of) an object. However, the recognition step may not be a part of the present disclosure but only a preceding process.
[0082] Subsequently a pose hypothesis is determined. If pose refinement is needed, the current pose hypothesis may be provided with a rotation matrix R and translation vector t in camera space (see "Pose hypothesis" in fig. 5). The camera position in object space may be retrieved via (Equation 1):
O := -R^T * t
Further, based on the recoded data (of fig. 3) and the estimated camera position (or the corresponding estimated object position with respect to the camera), the closest view (which is a 3D unit vector) is determined (Equation 2):
V* := argmax_i ( v_i · O / ||O|| )
[0083] By finding the closest view, the 3D (θ, φ, ψ) rotation information of the object pose can be determined.
[0084] To formalize, the model pose [R; t] is provided during tracking, so that rendering can be avoided by simply computing the camera position in object space O (Equation 1). In Equation 2, O is desirably normalized to unit length, and the closest viewpoint V* can then be quickly found via dot products.
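In code, Equations 1 and 2 reduce to a matrix-vector product and a row of dot products against the stored unit view vectors; the sketch below assumes V is the (N, 3) array of pre-rendered view directions v_i.

```python
import numpy as np

def closest_view(R, t, V):
    """Equation 1: camera position in object space, O = -R^T * t.
    Equation 2: closest view index, argmax_i of v_i . (O / ||O||)."""
    O = -R.T @ t                      # camera centre expressed in object coordinates
    O_hat = O / np.linalg.norm(O)     # normalise to unit length
    return int(np.argmax(V @ O_hat))  # dot product with every stored view vector v_i
```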
[0085] All of the data in V* is forward-projected into the scene such that each point s from the (pre-rendered) source view is identified with a corresponding point d in the destination. It is then possible to proceed with solving the established closed-form solution with a singular value decomposition (SVD).
[0086] Since the source points s_i from the model are stored with V*, each s_i can be transformed into the scene with the current hypothesis, yielding a 3D vertex vector p, cf. (Equation 3): p := R*s_i + t
[0087] This only brings the point into the scene but does not yet determine the corresponding point d_i. For this, the transformed point p is back-projected onto the image plane and the scene point cloud is looked up at that image position. So d_i can be determined by referring to the point cloud in the scene in one operation requiring constant time. This process requires less computational cost than KD-trees, which are conventionally used in ICP for retrieval of the correspondences between s_i and d_i.
[0088] In more detail, and as shown in fig. 4, the 3D point p is projected back to the image plane (assuming that the camera image is 2D plus a depth component).
[0089] Accordingly, the 2D location of the point may be determined in (x, y) as x_m = f·(X/Z) and y_m = f·(Y/Z), where f is the focal length of the camera.
[0090] In a subsequent step, the depth information z may be determined at (x, y).
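A sketch of this constant-time correspondence lookup (often called projective data association) is given below. It assumes a pinhole camera with focal length f and principal point (cx, cy) and an organised scene point cloud indexed by pixel; the principal-point shift is an assumption, as the disclosure only states x_m = f·(X/Z) and y_m = f·(Y/Z).

```python
import numpy as np

def find_correspondence(s_i, R, t, scene_cloud, f, cx, cy):
    """Transform model point s_i into the scene (Equation 3), back-project it to
    the image plane and read the scene point d_i at that pixel.

    scene_cloud: (H, W, 3) organised point cloud from the RGB-D sensor.
    Returns d_i, or None if the projection falls outside the image.
    """
    p = R @ s_i + t                          # Equation 3: p := R*s_i + t
    X, Y, Z = p
    if Z <= 0:
        return None                          # point behind the camera
    u = int(round(f * X / Z + cx))           # x_m = f * (X / Z), shifted to pixel coords
    v = int(round(f * Y / Z + cy))           # y_m = f * (Y / Z)
    H, W, _ = scene_cloud.shape
    if not (0 <= v < H and 0 <= u < W):
        return None
    return scene_cloud[v, u]                 # d_i: one constant-time lookup, no KD-tree
```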
[0091] As a result the complete 6D pose can be determined comprising x, y, z location information and θ, φ, ψ rotation information.
[0092] As an additional step, each correspondence is only taken into account for the energy function to be minimized if the distance between the projected source cloud point and the destination cloud point ||(R*s_i + t) - d_i|| is smaller than a threshold tau (equation 4):
∑ ||(R*s_i + t) - d_i|| < tau
[0093] All of the above-mentioned steps may be run with tau decaying with each iteration, thus rendering the algorithm robust to occlusion and outliers.
[0094] It is further noted that DATA_A may desirably be any type of data set. Fig. 3 shows the subsampled point cloud, contour, and surface normals, but the SVD during on-line processing may use only the "point cloud" information, while the "contour" and "surface normal" data may be left unused. What kind of data set should be stored desirably depends on the optimization algorithm of the ICP; the items of DATA_A are not specified in this patent.
[0095] Any type of object recognition algorithm in on-line processing may be used in the present disclosure.
[0096] The 3D model may be stored in any form, e.g. in a file. It desirably consists of a point cloud, or normal, surface or color information, or any combination of these. The format type may be that of DATA_A.
[0097] Fig. 5 shows a schematic flow chart illustrating an exemplary method of updating a pose hypothesis used in the method of fig. 4. This process may be additionally applied, if pose refinement is needed. In this process, the SVD may be applied based on the current pose hypothesis, the source point s and the destination point d, in order to update the hypothesis. A possible SVD evaluation function may be equation 4.
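A minimal sketch of one such SVD-based update follows, using the standard Kabsch/Umeyama closed-form alignment over the correspondences that pass the tau test of equation 4; the inlier handling follows paragraphs [0092] and [0093], while the minimum-inlier check is an added assumption. In a full refinement loop this update would be repeated, re-establishing correspondences each time and shrinking tau between iterations.

```python
import numpy as np

def update_pose(R, t, src, dst, tau):
    """One SVD-based pose-hypothesis update over matched points (s_i, d_i).

    src, dst: (N, 3) source (model) and destination (scene) correspondences.
    Only pairs with ||R*s_i + t - d_i|| < tau contribute (equation 4)."""
    residual = src @ R.T + t - dst
    keep = np.linalg.norm(residual, axis=1) < tau
    s, d = src[keep], dst[keep]
    if len(s) < 3:
        return R, t                                    # not enough inliers to update
    s_c, d_c = s - s.mean(axis=0), d - d.mean(axis=0)  # centre both point sets
    U, _, Vt = np.linalg.svd(s_c.T @ d_c)              # SVD of the cross-covariance
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R_new = Vt.T @ D @ U.T                             # best-fit proper rotation
    t_new = d.mean(axis=0) - R_new @ s.mean(axis=0)
    return R_new, t_new
```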
[0098] Different updating techniques for the pose hypothesis in on-line processing are possible, as any type of ICP optimization can work in the present disclosure.
[0099] Throughout the description, including the claims, the term "comprising a" should be understood as being synonymous with "comprising at least one" unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms "substantially" and/or "approximately" and/or "generally" should be understood to mean falling within such accepted tolerances.
[0100] Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.
[0101] It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.

Claims

1. An electronic device (1) for determining the pose of an object, the electronic device being configured to:
receive 3D image data of an optical sensor (3), the 3D image data representing the object (O) in a scene (S),
estimate the object pose (x, y, z, θ, φ, ψ) with respect to the optical sensor position based on the 3D image data,
identify the closest one from a set of predetermined view positions of a predetermined 3D object model based on the estimated object pose, and determine the pose of the object in the scene based on the identified closest view position.
2. The electronic device (1) according to claim 1, wherein
the 3D image data of the optical sensor (3) comprises a point cloud, and/or the 3D object model comprises a point cloud.
3. The electronic device (1) according to claim 1 or 2,
wherein estimating the object pose comprises:
determining a pose hypothesis of the object by estimating its pose in the scene, and
estimating the object pose with respect to the optical sensor position based on the pose hypothesis.
4. The electronic device (1) according to any one of the preceding claims, wherein estimating the object pose or receiving the 3D image data comprises: recognizing the object in the scene based on the received 3D image data.
5. The electronic device (1) according to any one of the preceding claims, wherein
identifying the closest one from the set of predetermined view positions is based on determining the sensor position in object space and finding the best matching view thereto from the set of predetermined view positions.
6. The electronic device (1) according to any one of the preceding claims, wherein
each of the set of predetermined view positions is linked to a recoded data set, said recoded data set representing rendered image data of the object model when seen from the view position.
7. The electronic device (1) according to claim 6, wherein
the rendered image data of a recoded data set comprises:
a subsampled point cloud of the object model,
a subsampled contour of the model, and/or
a subsampled surface model of the model.
8. The electronic device (1) according to the preceding claim, wherein identifying the closest one from the set of predetermined view positions comprises:
for each of the predetermined view positions, project the rendered image data of the linked recoded data set into the scene,
compare the rendered image data with the 3D image data representing the object in the scene, and
determine, for which of the predetermined view positions the deviation between the rendered image data and the 3D image data representing the object in the scene reaches a minimum.
9. The electronic device (1) according to any one of the preceding claims, wherein
the image data comprises a pair of a visible light image and a depth image.
10. The electronic device (1) according to any one of the preceding claims, wherein
the visible light image comprises the visible part of the electromagnetic spectrum, in particular decomposed into the three bands (RGB) processed by the human vision system.
11. The electronic device (1) according to any one of the preceding claims, wherein the object pose is a 6D pose comprising x, y, z location information and/or θ, φ, ψ rotation information.
12. The electronic device (1) according to any one of the preceding claims, wherein
determining the pose of the object in the scene comprises:
determining the θ, φ, ψ rotation information based on the identified closest view from the set of predetermined view positions of the predetermined 3D object model, and/or
determining the x, y, z location information based on projecting the model in its closest view into the scene and comparing the projected model with the 3D image data representing the object in the scene.
13. A system (30) for determining the pose of an object, the system comprising:
an electronic device (1) according to any one of the preceding claims, and an optical sensor configured to sense the object, the sensor being in particular a 3D camera or a stereo camera.
14. A method for determining the pose of an object,
the method comprising the steps of:
receiving 3D image data of an optical sensor (3), the 3D image data representing the object (O) in a scene (S),
estimating the object pose (x, y, z, θ, φ, ψ) with respect to the optical sensor position based on the 3D image data,
identifying the closest one from a set of predetermined view positions of a predetermined 3D object model based on the estimated object pose, and determining the pose of the object in the scene based on the identified closest view position.
15. The method according to claim 14,
the method comprising the further steps of:
determining a plurality of view positions of the object model forming the set of predetermined view positions,
for each view position of the set of predetermined view positions: determining a recoded data set, said recoded data set representing rendered image data of the object model when seen from the view position, and linking the view position to the recoded data set.

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2017/068388 WO2019015761A1 (en) 2017-07-20 2017-07-20 Electronic device, system and method for determining the pose of an object
JP2020502372A JP6955081B2 (en) 2017-07-20 2017-07-20 Electronic devices, systems and methods for determining object orientation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/068388 WO2019015761A1 (en) 2017-07-20 2017-07-20 Electronic device, system and method for determining the pose of an object

Publications (1)

Publication Number Publication Date
WO2019015761A1 (en) 2019-01-24

Family

ID=59506241

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/068388 WO2019015761A1 (en) 2017-07-20 2017-07-20 Electronic device, system and method for determining the pose of an object

Country Status (2)

Country Link
JP (1) JP6955081B2 (en)
WO (1) WO2019015761A1 (en)



Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4830696B2 (en) * 2006-07-31 2011-12-07 オムロン株式会社 Object recognition device, monitoring system, object recognition method, object recognition program, and recording medium recording the program
JP6004809B2 (en) * 2012-03-13 2016-10-12 キヤノン株式会社 Position / orientation estimation apparatus, information processing apparatus, and information processing method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120114251A1 (en) 2004-08-19 2012-05-10 Apple Inc. 3D Object Recognition

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Index of /content_cvpr_2017/papers", 31 May 2017 (2017-05-31), XP055460577, Retrieved from the Internet <URL:http://openaccess.thecvf.com/content_cvpr_2017/papers/> [retrieved on 20180319] *
ATANASOV NIKOLAY ET AL: "Nonmyopic View Planning for Active Object Classification and Pose Estimation", IEEE TRANSACTIONS ON ROBOTICS, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 30, no. 5, 1 October 2014 (2014-10-01), pages 1078 - 1090, XP011560315, ISSN: 1552-3098, [retrieved on 20140930], DOI: 10.1109/TRO.2014.2320795 *
K. TATENO; D. KOTAKE; S. UCHIYAMA: "Model-based 3D Object Tracking with Online Texture Update", MVA, 2009
KEHL WADIM ET AL: "Real-Time 3D Model Tracking in Color and Depth on a Single CPU Core", IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. PROCEEDINGS, IEEE COMPUTER SOCIETY, US, 31 May 2017 (2017-05-31), pages 465 - 473, XP033249383, ISSN: 1063-6919, [retrieved on 20171106], DOI: 10.1109/CVPR.2017.57 *
YANG CHEN; GERARD MEDIONI: "Object Modeling by Registration of Multiple Range Images", INTERNATIONAL JOURNAL OF IMAGE AND VISION COMPUTING, vol. 10, no. 3, 1992, pages 145 - 155, XP055365779, DOI: doi:10.1016/0262-8856(92)90066-C

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210278864A1 (en) * 2018-07-06 2021-09-09 Verity Studios Ag Methods and systems for estimating the orientation of an object
US11681303B2 (en) * 2018-07-06 2023-06-20 Verity Ag Methods and systems for estimating the orientation of an object
CN110322512A (en) * 2019-06-28 2019-10-11 中国科学院自动化研究所 In conjunction with the segmentation of small sample example and three-dimensional matched object pose estimation method
CN111145253A (en) * 2019-12-12 2020-05-12 深圳先进技术研究院 Efficient object 6D attitude estimation algorithm
CN111145253B (en) * 2019-12-12 2023-04-07 深圳先进技术研究院 Efficient object 6D attitude estimation algorithm

Also Published As

Publication number Publication date
JP2020527270A (en) 2020-09-03
JP6955081B2 (en) 2021-10-27

Similar Documents

Publication Publication Date Title
CN110648361B (en) Real-time pose estimation method and positioning and grabbing system of three-dimensional target object
JP6430064B2 (en) Method and system for aligning data
He et al. Sparse template-based 6-D pose estimation of metal parts using a monocular camera
Muñoz et al. Fast 6D pose estimation for texture-less objects from a single RGB image
Gao et al. A stable and accurate marker-less augmented reality registration method
Imperoli et al. D²CO: Fast and robust registration of 3D textureless objects using the directional chamfer distance
Buchholz Bin-picking: new approaches for a classical problem
Lin et al. Robotic grasping with multi-view image acquisition and model-based pose estimation
Li et al. 3D object recognition and pose estimation for random bin-picking using Partition Viewpoint Feature Histograms
Abbeloos et al. Point pair feature based object detection for random bin picking
CN111149108A (en) Method for identifying object instances and/or orientations of objects
Liu et al. Pose estimation in heavy clutter using a multi-flash camera
Wang et al. Non-iterative SLAM
WO2019015761A1 (en) Electronic device, system and method for determining the pose of an object
Liu et al. 6D pose estimation of occlusion-free objects for robotic Bin-Picking using PPF-MEAM with 2D images (occlusion-free PPF-MEAM)
Wang et al. Textured/textureless object recognition and pose estimation using RGB-D image
Lepetit Recent advances in 3d object and hand pose estimation
CN114092553A (en) Disordered grabbing attitude estimation method based on FPFH (fast Fourier transform and inductively coupled plasma) and ICP (inductively coupled plasma) improved algorithm
CN109872343B (en) Weak texture object posture tracking method, system and device
Martínez et al. Object recognition for manipulation tasks in real domestic settings: A comparative study
Miyake et al. 3D Pose Estimation for the Object with Knowing Color Symbol by Using Correspondence Grouping Algorithm
Nakano Stereo vision based single-shot 6d object pose estimation for bin-picking by a robot manipulator
Lin et al. 3D Pose estimation using genetic-based iterative closest point algorithm
Fan et al. A combined texture-shape global 3d feature descriptor for object recognition and grasping
Akizuki et al. Stable position and pose estimation of industrial parts using evaluation of observability of 3D vector pairs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17746423

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020502372

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17746423

Country of ref document: EP

Kind code of ref document: A1