WO2020156836A1 - Dense 6-dof pose object detector - Google Patents

Dense 6-dof pose object detector

Info

Publication number
WO2020156836A1
WO2020156836A1 (application PCT/EP2020/051136)
Authority
WO
WIPO (PCT)
Prior art keywords
correspondence
color channel
mask
decoder
input image
Prior art date
Application number
PCT/EP2020/051136
Other languages
French (fr)
Inventor
Sergey Zakharov
Slobodan Ilic
Ivan SHUGUROV
Andreas Hutter
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Priority to US17/427,231 priority Critical patent/US11915451B2/en
Priority to CN202080024971.3A priority patent/CN113614735A/en
Priority to EP20702229.4A priority patent/EP3903226A1/en
Publication of WO2020156836A1 publication Critical patent/WO2020156836A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/10Constructive solid geometry [CSG] using solid primitives, e.g. cylinders, cubes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • G06V20/647Three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present invention relates to a computer-implemented method for object detection and pose estimation within an input image, to a system for object detection and pose estimation within an input image, to a method for providing training data for training an artificial intelligence entity for use in said method and/or said system, and to corresponding computer programs and data storage media.
  • the pose estimation is in particular a 6-degree-of-freedom, 6-DoF pose estimation, wherein 6 degrees of freedom relates to the six degrees of freedom of movement a three-dimensional body has: three linear directions of motion along axes of an orthogonal coordinate system as well as three rotary motions, often designated as rolling, pitching and yawing.
  • object detectors localize objects of interest in images in terms of tight bounding boxes around them.
  • a 6DoF pose is sometimes also called a 6D pose.
  • Depth images are created using depth cameras.
  • reliable depth cameras are usually expensive and power-hungry.
  • available low-quality depth sensors are prone to many artifacts resulting from the technology itself as well as from the design of the sensors.
  • depth cameras are usually quite imprecise, have a limited view range, and are not applicable in outdoor environments.
  • RGB color model is an additive color model in which Red, Green and Blue are added together in various ways to reproduce a broad array of colors.
  • RGB images of high quality are much easier to obtain, due both to the comparatively higher quality of RGB sensors (cameras) at comparatively low cost as well as to the comparatively low power consumption of RGB sensors.
  • in RGB images, detecting a full 6DoF pose is a challenge due to perspective ambiguities and significant appearance changes of the object when seen from different viewpoints.
  • works like "R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. Available at arXiv:1802.00434v1", hereafter referred to as "DensePose", are available.
  • the "DensePose" method estimates dense correspondences between vertices of a human body model and humans in the image.
  • the "DensePose" method requires a sophisticated annotation tool and enormous annotation efforts, which makes the method expensive to employ.
  • US 2018/137644 A1 describes methods and systems of performing object pose estimations, in which an image including an object is obtained and a plurality of two-dimensional projections of three-dimensional bounding boxes of the object in the image are determined. The three-dimensional pose of the object is then estimated using the two-dimensional projections of the three-dimensional bounding boxes.
  • a computer-implemented method for object detection and pose estimation within an input image, comprising steps of:
  • an artificial intelligence entity, in particular a trained encoder-decoder (preferably: convolutional) artificial neural network, comprising an encoder head, an ID mask decoder head, a first correspondence color channel decoder head and a second correspondence color channel decoder head;
  • the pre-generated correspondence model bijectively associates points of the object with unique value combinations in the first and the second correspondence color channels.
  • the first correspondence color channel may be a first color channel of an RGB color scheme
  • the second correspondence color channel may be a second color channel of the RGB color scheme different from the first color channel
  • the correspondence color channels do not indicate color values of the pixels of the input image in the respective colors; the correspondence color channels denote, by different levels of intensity of color, spatial correspondences between different points on objects according to the pre-generated bijective association of the points with the unique value combinations in the correspondence color channels. For example, a pixel that is completely red in the RGB input image may still have a 100% level of Blue in the 2D-to-3D-correspondence map which indicates spatial proximity to e.g. points having a 99% level of Blue or the like.
  • the approach described herein does not rely on regressing bounding boxes and using regions-of-interest (ROI) layers but instead uses ID masks to provide a deeper understanding of the objects in the input image. It has been found by the inventors that the present method outperforms existing RGB object detection and 6DoF pose estimation methods (also designated as "pipelines").
  • the 2D-to-3D-correspondence map may in particular be a dense correspondence map as described e.g. in "DensePose" cited above in the sense that the correspondence map for each object covers, with a predefined minimum resolution, all points (surface points and/or wire model vertices).
  • the input image may in particular be an RGB image, e.g. represented by an H x W x 3-dimensional tensor, with H marking the height of the input image in pixels, W the width of the input image in pixels (such that H x W is the total number of pixels of the input image), and 3 stemming from the three color channels Red, Green, and Blue.
  • the ID mask can identify each of a plurality of objects within the input image.
  • the ID mask may be represented by an H x W x N 0+i dimensional tensor, wherein N 0+i is the number of (known and trained) identifiable objects plus 1 for the background, such that for each pixel a feature (or: class) is available that designates with which probability that pixel belongs to each of the identifiable objects or to the background.
  • a specific pixel at height position 100 and at width position 120 may e.g. have 0.15 probability of belong ing to a first object, 0.35 probability of belonging to a second object, 0.4 probability of belonging to a third ob ject, and 0.1 probability of belonging to the background.
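  • As a purely illustrative sketch of this representation (not part of the patent disclosure; the array shapes, the NumPy usage and the convention of placing the background class last are assumptions), the following snippet shows how a finalized per-pixel class decision could be derived from such an H x W x No+1 probability tensor:

```python
import numpy as np

# Hypothetical ID mask output: H x W x (number of objects + 1), where each entry
# holds the probability that the pixel belongs to that object class or, for the
# last index, to the background.
H, W, N_obj = 480, 640, 3
id_mask = np.random.rand(H, W, N_obj + 1)
id_mask /= id_mask.sum(axis=-1, keepdims=True)   # normalize to per-pixel probabilities

# Finalized ID mask: for each pixel, keep the class with the highest probability.
finalized_id_mask = id_mask.argmax(axis=-1)      # shape H x W, values in {0, ..., N_obj}

# e.g. 2 would mean: this pixel is assigned to the third object; N_obj means background.
print(finalized_id_mask[100, 120])
```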
  • the first and the second correspondence color channels U, V are advantageously provided each with a number Nu, Nv of possible classes or features, each indicating the probability for that pixel of belonging to a certain color value in the respective correspondence color channel of the 2D-to-3D-correspondence map.
  • the 2D-to-3D-correspondence map may then be represented by an H x W x Nu x Nv-dimensional tensor.
  • each pixel of the input image (denoted by H and W) is provided with Nu x Nv features, each indicating with which probability that pixel belongs to a specific color combination (combination of levels of intensity of the two colors).
  • the 2D-to-3D-correspondence map may be designed such that it distinguishes between 256 different values (levels of intensity of Blue color) for the first correspondence color channel and between 256 different values (levels of intensity of Green color) for the second correspondence color channel.
  • each pixel has, for each correspondence color channel, the value with the highest probability.
  • the first correspondence color channel is a Blue color channel and/or the second correspondence color channel is a Green color channel. It has been found that this color combination is especially easy to discern by artificial intelligence entities.
  • the determining of the at least one pose estimation uses a Perspective-n-Point, PnP, algorithm.
  • the PnP algorithm may be provided as described e.g. in "Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11):1330-1334. December 2000".
  • the PnP algorithm is used with a Random Sample Consensus, RANSAC, algorithm.
  • the determining of the at least one pose estimation uses a trained artificial neural network entity configured and trained to generate, from the ID mask and the 2D-to-3D-correspondence map, probabilities for each of a plurality of 3D poses of the at least one object, and preferably for all objects identified by the ID mask. Thereafter, the pose for which the highest probability has been determined may be determined to be correct by the pose determining module.
  • the method according to the first aspect further comprises a step of generating the bijective association for at least one object, said bijective association being generated by texturing a 3D representation of the object using a 2D correspondence texture consisting of a plurality of pixels, each pixel having a unique value combination in the first and the second correspondence color channels.
  • Each point of the model, e.g. a vertex of a wire model or a CAD model, is then associated with the unique value combination (or: color shade combination) with which it has been textured.
  • texturing the 3D representation of the object is performed using a spherical projection.
  • A spherical projection uses a full sphere to texture an object from all sides.
  • a computer program comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the first aspect of the invention.
  • a non-transitory computer-readable data storage medium comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the first aspect of the invention.
  • a data stream comprising (or configured to generate) executable program code configured to, when executed (e.g. by a computing device), perform the method according to the first aspect of the invention.
  • a system for object detection and pose estimation within an input image comprising:
  • a computing device configured to implement a trained encoder-decoder (preferably: convolutional) artificial neural network comprising an encoder head, an ID mask decoder head, a first correspondence color channel decoder head and a second correspondence color channel decoder head;
  • the ID mask decoder head is configured and trained to generate an ID mask identifying objects and background in the received input image;
  • the first correspondence color channel decoder head is configured and trained to generate a first correspondence color channel of a dense 2D-to-3D-correspondence map for objects within the received input image;
  • the second correspondence color channel decoder head is configured and trained to generate a second correspondence color channel of the dense 2D-to-3D-correspondence map;
  • the computing device is further configured to implement a combining module and a pose determining module; wherein the combining module is configured to generate the dense 2D-to-3D-correspondence map using the generated first correspondence color channel and the generated second correspondence color channel; and
  • the pose determining module is configured to determine, for at least one object identified by the ID mask (and preferably for all objects identified by the ID mask), a pose estimation (in particular a 6DoF pose estimation) based on the generated 2D-to-3D-correspondence map and on a pre-generated bijective association of points of the object with unique value combinations in the first and the second correspondence color channels.
  • the computing device may be realised as any device, or any means, for computing, in particular for executing a software, an app, or an algorithm.
  • the computing device may comprise a central processing unit (CPU) and a memory operatively connected to the CPU.
  • the computing device may also comprise an array of CPUs, an array of graphical processing units (GPUs), at least one application-specific integrated circuit (ASIC), at least one field-programmable gate array, or any combination of the foregoing.
  • modules of the system may be implemented by a cloud computing platform.
  • a method for providing training data for training an encoder-decoder (preferably: convolutional) artificial neural network for use in the method according to an embodiment of the first aspect, comprising:
  • This method allows a large amount of training data to be provided easily for training the encoder-decoder convolutional artificial neural network to a desired degree of accuracy.
  • a 3D representation for at least one object is provided, for example a CAD model.
  • the 3D representation may be rendered in each of the plurality of poses on the black background, or, expressed in other words, may be rendered from each of a plurality of virtual camera viewpoints.
  • This approach makes it possible to use known tools and software for rendering 3D representations of objects from different viewpoints, e.g. using graphical engines running on arrays of graphical processing units (GPUs).
  • Each of the images resulting from the rendering may be cropped, in particular so as to cut off all pixels not comprising a single object.
  • a corresponding RGB image 2D patch and a corresponding ground truth ID mask patch are generated.
  • the corresponding RGB image 2D patch shows the actual appearance of the object in the corresponding pose, i.e. from the corresponding viewpoint.
  • the corresponding ground truth ID mask patch separates the RGB image 2D patch from the black background, i.e. it identifies the pixels within the RGB image 2D patch shape as belonging to a specific class (either one of the known object classes or the background class).
  • a corresponding depth channel, obtained when rendering the 3D representation, is used to generate a depth map for the corresponding object in the corresponding pose.
  • the depth map may then be used to generate a bounding box used to crop each resulting image, respectively.
  • Pixels corresponding to depths of points of the object will generally have finite depth values below a certain threshold, whereas pixels corresponding to the background may have infinite depth values or at least depth values larger than the certain threshold. Cropping may then be performed by cutting off all pixels with depth values larger than the certain threshold.
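  • A minimal sketch of this depth-based cropping, assuming NumPy arrays and an arbitrary example threshold (the function name and the exact background encoding are assumptions for illustration only):

```python
import numpy as np

def crop_with_depth_mask(rgb, depth, threshold=1e3):
    """Crop a rendering to a tight bounding box around the rendered object.

    `rgb` is an H x W x 3 rendering and `depth` an H x W depth map in which
    background pixels carry values above `threshold` (or infinity), as
    described for the rendered depth channel above.
    """
    foreground = depth < threshold              # mask of object pixels
    ys, xs = np.nonzero(foreground)
    if ys.size == 0:
        return rgb, foreground                  # nothing to crop
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return rgb[y0:y1, x0:x1], foreground[y0:y1, x0:x1]
```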
  • a system for providing training data comprising an input interface for receiving the plurality of background images (and, optionally, for receiving additional data such as 3D representations of the objects, or data based on which such 3D representations of the objects may be generated), a computing device configured to perform the method according to an embodiment of the third aspect, and an output interface for outputting the sample input images together with their ground truth ID masks and their ground truth 2D-to-3D-correspondence maps as a training data set.
  • the computing device may be realized as an online data generator running on multiple CPU threads constantly putting prepared batches in a queue, from which they are picked as inputs to an encoder-decoder architecture being trained.
  • the computing device may be configured to continuously provide training data to an encoder-decoder architecture being trained.
  • the invention also provides a computer program comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention.
  • the invention also provides a non-transitory computer-readable data storage medium comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention.
  • the invention also provides a data stream comprising (or configured to generate) executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention.
  • the plurality of tuples is provided continuously during training.
  • the tuples may be generated dynamically, or on-line, during the performing of the training method. This saves the need for generating and storing enormous quantities of data before performing the training.
  • the plurality of tuples may be provided continuously during training, e.g. by the method according to the third aspect being performed by an online data generator running on multiple CPU threads constantly putting prepared batches in a queue, from which they are picked as inputs to an encoder-decoder architecture being trained.
  • a system for training an encoder-decoder convolutional artificial neural network comprising:
  • an input interface configured to receive a plurality of tuples (preferably provided using a method according to an embodiment of the third aspect) of corresponding sample input images, ground truth ID masks and ground truth 2D-to-3D-correspondence maps;
  • a computing device configured to train an encoder-decoder convolutional artificial neural network configured to receive the sample input images as input and to output both an ID mask and a 2D-to-3D-correspondence map as output, the training being performed using a loss function penalizing deviations of the output from the ground truth ID mask and the ground truth 2D-to-3D-correspondence map.
  • the invention also provides a computer program comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the fifth aspect of the invention.
  • the invention also provides a non-transitory computer-readable data storage medium comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the fifth aspect of the invention.
  • the invention also provides a data stream comprising (or configured to generate) executable program code configured to, when executed (e.g. by a computing device), perform the method according to the fifth aspect of the invention.
  • Fig. 1 schematically shows a block diagram illustrating a system according to an embodiment of the second aspect;
  • Fig. 2 schematically illustrates a method according to an embodiment of the third aspect
  • FIG. 3 schematically illustrates a method according to an embodiment of the fifth aspect
  • Fig. 4 schematically illustrates a computer program product according to an embodiment
  • Fig. 5 schematically illustrates a data storage medium according to an embodiment.

Detailed description of the invention
  • Fig. 1 schematically illustrates a computer-implemented method for object detection and pose estimation within an input image according to an embodiment of the first aspect, as well as a system 1000 for object detection and pose estimation within an input image according to an embodiment of the second aspect.
  • the system 1000 comprises an input interface 10, a computing device 100, and an output interface 50.
  • an input image 1 is received, for example by the input interface 10 of the system 1000.
  • the input interface 10 may consist of, or comprise, a local interface connected to e.g. a bus system of a factory or a hospital and/or an interface for remote connections such as an interface for connecting to a wireless LAN or WAN connection, in particular to a cloud computing system and/or the Internet.
  • the input image 1 is preferably an RGB image, e.g. the input image 1 is preferably represented by an H x W x 3-dimensional tensor, with H marking the height of the input image 1 in pixels, W the width of the input image 1 in pixels (such that H x W is the total number of pixels of the input image), and 3 stemming from the three RGB color channels Red, Green, and Blue (RGB).
  • the input image preferably comprises one or more objects 11 which the method or the system is intended to detect (or: identify) and for which 6-DoF pose information is required.
  • In a step S20, the received input image 1 is input into a trained artificial intelligence entity, in particular a trained encoder-decoder convolutional artificial neural network 20.
  • the trained encoder-decoder convolutional artificial neural network 20 is implemented as comprising an encoder head 22, an ID mask decoder head 24, a first correspondence color channel decoder head 26 and a second correspondence color channel decoder head 28. It should be understood that additional color decoder heads, associated with additional colors, may be provided for additional reliability and/or redundancy. In particular, a third color decoder head may optionally be provided.
  • the trained encoder-decoder convolutional artificial neural network 20 is configured and trained for detection and pose determination of specific, previously known objects.
  • objects may comprise robots, workpieces, vehicles, movable equipment, source materials and/or the like.
  • the input image 1 is shown as comprising a teddy bear 11, an egg carton 12, and a camera 13, before a background 14.
  • the encoder head 22 may, for example, be realized by a 12-layer ResNet-like architecture featuring residual layers which allow for faster convergence.
  • the ResNet architecture is described, for example, in "K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778. IEEE."
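  • As a hedged illustration of the kind of ResNet-like building block meant here (a sketch only; channel sizes and layer names are assumptions, not the concrete encoder of the disclosure), a basic residual block could look as follows in PyTorch:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet-style block: two 3x3 convolutions plus a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # residual (skip) connection enabling faster convergence
```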
  • the trained encoder-decoder convolutional artificial neural network 20 may be implemented by the computing device 100.
  • the computing device 100 may be realised as any device, or any means, for computing, in particular for executing a software, an app, or an algorithm.
  • the computing device may comprise a central processing unit (CPU) and a memory operatively connected to the CPU.
  • the computing device 100 may also comprise an array of CPUs, an array of graphical processing units (GPUs), at least one application-specific integrated circuit (ASIC), at least one field-programmable gate array, or any combination of the foregoing.
  • Some, or even all, modules (in particular decoder and/or encoder heads) of the system 1000 may be implemented by a cloud computing platform as a computing device 100.
  • In a step S22, the received input image 1 is encoded by the encoder head 22 into a latent (or: hidden) representation with latent features.
  • In a step S24, the ID mask decoder head 24 generates an ID mask 34 identifying objects 11, 12, 13 and background 14. In other words, the ID mask decoder head 24 is configured and trained to generate, from the received input image 1, the ID mask 34.
  • the ID mask decoder head 24 is configured and trained to generate the ID mask 34 for all objects 11, 12, 13 known to the trained encoder-decoder convolutional artificial neural network 20 at the same time in the same data structure, by providing for each pixel of the input image and for each known object a probability for that pixel to belong to that object.
  • the known objects thus are represented by predefined classes, and the ID mask 34 comprises a feature for each class (object classes plus the background class).
  • the ID mask 34 may, for example, be represented by an H x W x No+1-dimensional tensor, wherein No+1 is the number of (known and trained) identifiable objects 11, 12, 13 plus 1 for the background 14, such that for each pixel a feature is available that designates with which probability that pixel belongs to each of the identifiable objects 11, 12, 13 or to the background 14.
  • In a step S35, it may be determined (e.g. by an object identifier module 35 implemented by the computing device 100) for each pixel to which object 11, 12, 13 or background 14 it belongs, preferably by determining for each pixel the feature of the ID mask 34 with the highest probability value. The result of this determination may be stored in a finalized ID mask.
  • a specific pixel at height position 100 and at width position 120 may e.g. have 0.15 probability of belonging to a first object (teddy bear 11), 0.35 probability of belonging to a second object (carton of eggs 12), 0.4 probability of belonging to a third object (camera 13), and 0.1 probability of belonging to the background 14.
  • the first correspondence color channel decoder head 26 generates a first correspondence color channel 36 of a 2D-to-3D-correspondence map 31 for objects 11, 12, 13 within the received input image 1.
  • the first correspondence color channel decoder head 26 is configured and trained (and implemented) to generate, from the received input image 1 as its input, the first correspondence color channel 36 of the 2D-to-3D-correspondence map 31 for objects 11, 12, 13 within the received input image 1.
  • the first correspondence color channel 36 is a Blue color channel of an RGB color scheme.
  • the second correspondence color channel decoder head 28 generates a second correspondence color channel 38 of the 2D-to-3D-correspondence map 31 for objects 11, 12, 13 within the received input image 1.
  • the second correspondence color channel decoder head 28 is configured and trained (and implemented) to generate from the received input image 1 as its input the second correspondence color channel 38 of the 2D-to-3D-correspondence map 31 for objects 11, 12, 13 within the received input image 1.
  • the second correspondence color channel 38 is a Green color channel of an RGB color scheme. It is evident that also other color channels may be used for the first and the second correspondence color channel 36, 38, or that even three color channels may be used, gaining additional reliability and redundancy at the expense of computational resources.
  • the first and the second correspondence color channels 36, 38 (which may be designated as U and V, respectively) are advantageously provided each with a number Nu, Nv of features (or: classes), each indicating the probability for that pixel of belonging to a certain color value in the respective correspondence color channel 36, 38 of the 2D-to-3D-correspondence map.
  • the 2D-to-3D-correspondence map may then be represented by an H x W x Nu x Nv-dimensional tensor.
  • each pixel of the input image (identified by its H and W values) is provided with Nu x Nv features, each indicating with which probability that pixel belongs to a specific color combination (combination of levels of intensity of the two colors).
  • the 2D-to-3D-correspondence map 31 may be designed such that it distinguishes between 256 different values (levels of intensity of Blue color) for the first correspondence color channel and between 256 different values (levels of intensity of Green color) for the second correspondence color channel 38.
  • the ID mask decoder head 24 and the first and second correspondence color channel decoder heads 26, 28 upsample the latent features generated by the encoder head 22 up to the original input image size using a stack of bilinear interpolations followed by convolutional layers.
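  • A possible decoder head of this kind is sketched below in PyTorch; the number of upsampling stages and channel sizes are assumptions chosen only to illustrate the pattern of bilinear interpolation followed by convolutional layers:

```python
import torch.nn as nn

class DecoderHead(nn.Module):
    """Upsamples latent encoder features back towards the input resolution.

    Each stage doubles the spatial size via bilinear interpolation and then
    applies a convolution; a final 1x1 convolution yields one score map per
    class (e.g. No+1 for the ID mask head, Nu or Nv for a color channel head).
    """

    def __init__(self, in_channels, num_classes, num_stages=4):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(num_stages):
            layers += [
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
            channels //= 2
        layers.append(nn.Conv2d(channels, num_classes, kernel_size=1))
        self.decode = nn.Sequential(*layers)

    def forward(self, latent):
        return self.decode(latent)
```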
  • the proposed method is agnostic to a particular choice of encoder-decoder architecture such that any known encoder-decoder architecture may be used for the encoder-decoder convolutional artificial neural network 20.
  • In a step S32, the 2D-to-3D-correspondence map 31 is generated using the generated first correspondence color channel 36 and the generated second correspondence color channel 38, in particular in that each pixel is assigned a color shade combination given by the feature (or: class representing a level of intensity of color) with the highest probability from the first correspondence color channel 36 as well as by the feature (or: class representing a level of intensity) with the highest probability from the second correspondence color channel 38.
  • if, for example, for a given pixel the feature with the highest probability corresponds to intensity level 245 in the first correspondence color channel 36 and to intensity level 136 in the second correspondence color channel 38, the 2D-to-3D-correspondence map 31 will then mark that pixel with the color shade combination 245/136.
  • a respective single correspondence color channel image may be generated in which each pixel stores the class (i.e. level of intensity of color) in the respective color for which the highest probability has been determined.
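  • The combining step can be illustrated by the following sketch (tensor shapes and the function name are assumptions), which takes the per-pixel class scores of the two heads and keeps, per pixel, the most likely intensity level in each correspondence color channel:

```python
import torch

def combine_correspondence_channels(u_logits, v_logits):
    """Build a 2D-to-3D correspondence map from the two color channel heads.

    `u_logits` and `v_logits` are assumed to have shapes (Nu, H, W) and
    (Nv, H, W): per-pixel scores over the discrete intensity levels of the
    first (e.g. Blue) and second (e.g. Green) correspondence color channels.
    The returned (H, W, 2) map stores, per pixel, the most likely level in
    each channel, i.e. the color shade combination.
    """
    u = torch.argmax(u_logits, dim=0)    # H x W, values 0..Nu-1
    v = torch.argmax(v_logits, dim=0)    # H x W, values 0..Nv-1
    return torch.stack((u, v), dim=-1)   # H x W x 2 correspondence map
```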
  • Step S32 may be performed by a combining module 32 implemented by the computing device 100 of the system 1000.
  • In a step S40, a pose estimation 51, 52, 53 for at least one object 11, 12, 13, preferably for all objects 11, 12, 13 within the input image 1, is generated based on the generated 2D-to-3D-correspondence map 31 and on a corresponding pre-generated bijective association of points of the at least one object with unique value combinations in the first and the second correspondence color channels 36, 38.
  • a corresponding pose estimation 51, 52, 53 is generated (or: determined, or provided).
  • the points of the at least one object may be points on the surface of the object or vertices of a wire model of the object or of a wire model approximating the object 11, 12, 13.
  • Step S40 may be performed by a pose determining module 40 implemented by the computing device 100 of the system 1000.
  • the pose determining module 40 is configured to generate a pose estimation for at least one object, preferably for all objects within the input image 1, based on the generated 2D-to-3D-correspondence map 31 and on a corresponding pre-generated bijective association of points of the at least one object with unique value combinations in the first and the second correspondence color channels 36, 38.
  • Step S40 may utilize an algorithm 42 such as the known Perspective-n-Point, PnP, algorithm, optionally combined with a Random Sample Consensus, RANSAC, algorithm.
  • the PnP algorithm estimates the pose (i.e. relative orientation of each object relative to a viewpoint from which the input image 1 was taken, e.g. the location of a camera) given correspondences and intrinsic parameters of the camera.
  • the PnP algorithm refined with the RANSAC algorithm is more robust against possible outliers given many correspondences.
  • the PnP algorithm may be provided as described e.g. in "Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11):1330-1334. December 2000".
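  • A hedged sketch of such a PnP/RANSAC step using OpenCV is given below; it assumes that the dense 2D-3D correspondences have already been collected by looking up the color shade combinations of the correspondence model, and the function name and parameter values are illustrative only:

```python
import cv2
import numpy as np

def estimate_pose(points_3d, points_2d, camera_matrix):
    """Estimate a 6DoF pose from dense 2D-3D correspondences via PnP + RANSAC.

    `points_3d`: Nx3 object-space points looked up via the correspondence model.
    `points_2d`: Nx2 matching pixel locations in the input image.
    `camera_matrix`: 3x3 intrinsic camera matrix.
    """
    dist_coeffs = np.zeros(5)   # assuming an undistorted input image
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        camera_matrix,
        dist_coeffs,
        reprojectionError=3.0,
        iterationsCount=150,
    )
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)   # convert rotation vector to 3x3 matrix R
    return rotation, tvec               # pose: rotation R and translation T
```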
  • a trained artificial intelligence entity 44 such as a trained artificial neural network may be employed to generate the pose determination from the 2D-to-3D-correspondence map 31 as well as from the information about the objects 11, 12, 13 and the background 14 contained in the ID mask 34 or in a finalized ID mask as described in the foregoing.
  • the result of step S40, or, respectively, of the pose determining module 40, is then output, e.g. by the output interface 50 of the system 1000.
  • the output interface 50 may consist of, or comprise, a local interface connected to e.g. a bus system of a factory or a hospital and/or an interface for remote connections such as an interface for connecting to a wireless LAN or WAN connection, in particular to a cloud computing system and/or the Internet.
  • the determined pose estimation 51, 52, 53 may be output, for example, by twelve values: nine rotation values Rij forming a rotation matrix describing the orientation of each object 11, 12, 13, and three location values Tx, Ty, Tz forming a vector describing the center of mass, or space point, of each object 11, 12, 13.
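  • As a small worked example of how these twelve values describe a rigid transform (the numeric values below are made up for illustration), the rotation matrix and translation vector can be assembled into a homogeneous 4x4 pose and applied to a model point:

```python
import numpy as np

# Nine rotation values Rij (row-major) and three location values Tx, Ty, Tz.
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])      # example: 90 degree rotation about the x-axis
T = np.array([0.10, -0.05, 0.80])    # example: object center 0.8 m in front of the camera

pose = np.eye(4)                     # homogeneous 4x4 transform [R | T]
pose[:3, :3] = R
pose[:3, 3] = T

model_point = np.array([0.02, 0.0, 0.01, 1.0])   # a point in object coordinates
camera_point = pose @ model_point                # the same point in camera coordinates
```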
  • the bijective association of points of the objects 11, 12, 13 with unique value combinations in the first and second correspondence color channels 36, 38 may be generated by the steps as described in the following with respect to Fig. 2. It should be understood that these steps may be performed as part of the method according to the first aspect, in particular for each object 11, 12, 13 that shall be known by the trained encoder-decoder convolutional artificial neural network 20, or, in other words, for which the encoder-decoder convolutional artificial neural network 20 shall be trained.
  • the steps as described with respect to Fig. 2 may also be steps of a method according to the third aspect, i.e. of a method for providing training data for training an encoder-decoder convolutional artificial neural network.
  • a respective RGB image 2D patch corresponding to each of a plurality of poses of that object is provided, in particular generated.
  • a corresponding ground truth ID mask patch is provided, in particular generated.
  • a 3D representation (or: model, for example a CAD model) of at least one object 11, 12, 13, or for all objects 11, 12, 13, may be provided and rendered as an RGB image before a black background from different viewpoints, each viewpoint corresponding to one pose of the object 11, 12, 13 with respect to the viewpoint.
  • the rendered RGB image may be cropped to comprise as few pixels besides one single object 11, 12, 13 as possible, and the background may be cut off. The result is the RGB image 2D patch that exactly covers the corresponding object 11, 12, 13 in the corresponding pose (i.e. from the corresponding viewpoint).
  • a first sub-step is to render them in different poses.
  • the poses are defined e.g. by the vertices of an icosahedron ("sampling vertices") placed around the 3D representation of each object, respectively.
  • the triangles of each icosahedron may be recursively subdivided into four smaller ones until the desired density of sampling vertices of the 3D representation, each corresponding to a pose, is obtained. For example, 4 subdivisions are used.
  • the virtual view camera may be rotated at each sampling vertex around its viewing direction between two limits with a fixed stride, for example from -30 to 30 degrees with a stride of 5 degrees, to model in-plane rotations resulting in yet additional poses. Then, for each of the poses, each object is rendered on a black background and both RGB and depth channels are stored. Using the depth channels, a depth map may be generated for each pose. Having the renderings at hand, the generated depth map can be used as a mask to define a tight bounding box for each generated rendering, i.e. a bounding box comprising as few pixels as possible besides the object 11, 12, 13 of interest in the rendering.
  • the image may then be cropped with this bounding box.
  • Masks separating these patches from the background are stored as ground truth ID mask patches, and the corresponding poses (i.e. virtual camera positions or relative orientation of the object with respect to the virtual camera) are stored as ground truth poses.
  • pairs of RGB image 2D patches and corresponding ground truth ID mask patches can also be provided by detecting and annotating real world objects in real world images.
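  • The viewpoint sampling described above can be sketched as follows, assuming the trimesh library for the subdivided icosahedron and a simplified look-at construction for the virtual camera (all names and the radius value are assumptions):

```python
import numpy as np
import trimesh

def sample_viewpoints(radius=0.6, subdivisions=4, inplane_degrees=range(-30, 31, 5)):
    """Sample virtual camera rotations on a subdivided icosahedron (icosphere).

    Each sphere vertex yields one viewing direction towards an object at the
    origin; for each direction, in-plane rotations from -30 to 30 degrees with
    a 5 degree stride are added. Returns a list of 3x3 camera rotation matrices.
    """
    sphere = trimesh.creation.icosphere(subdivisions=subdivisions, radius=radius)
    rotations = []
    for vertex in sphere.vertices:
        forward = -vertex / np.linalg.norm(vertex)       # camera looks at the origin
        up = np.array([0.0, 0.0, 1.0])
        if abs(np.dot(up, forward)) > 0.99:              # avoid a degenerate 'up' vector
            up = np.array([0.0, 1.0, 0.0])
        right = np.cross(up, forward)
        right /= np.linalg.norm(right)
        true_up = np.cross(forward, right)
        base = np.stack((right, true_up, forward))       # rows: camera coordinate axes
        for angle in inplane_degrees:                    # model in-plane rotations
            a = np.deg2rad(angle)
            roll = np.array([[np.cos(a), -np.sin(a), 0.0],
                             [np.sin(a),  np.cos(a), 0.0],
                             [0.0,        0.0,       1.0]])
            rotations.append(roll @ base)
    return rotations
```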
  • a corresponding ground truth 2D-to-3D-correspondence map patch is provided, in particular generated.
  • step S130 may comprise providing, or generating, a correspondence model for each object 11, 12, 13, wherein the correspondence model is generated by texturing the 3D representation of the object 11, 12, 13 using a 2D correspondence texture consisting of a plurality of pixels, each pixel having a unique value combination in the first and the second correspondence color channels 36, 38.
  • Texturing the 3D representation may be performed using a spherical projection of a 2D correspondence texture onto the 3D representation of the object 11, 12, 13.
  • the 2D correspondence texture may e.g. be a 2D image with color intensity levels ranging from e.g. 1 to 255 for both Blue (first dimension) and Green (second dimension).
  • ground truth 2D-to-3D-correspondence map patches can be generated, i.e. patches covering the objects in the respective pose but showing not real RGB color values but the color combination shades of the 2D correspondence texture which indicate the spatial arrangement of the pixels with respect to each other.
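  • One way to approximate such a spherical projection in code is sketched below (an illustrative assumption only: for strongly non-convex objects the mapping is only approximately bijective, and the function name and value ranges are made up):

```python
import numpy as np

def spherical_correspondence_colors(vertices, levels=255):
    """Assign each model vertex a (Blue, Green) value combination.

    The vertex positions (Nx3, centered on the object) are projected onto a
    sphere; the two spherical angles are quantized into `levels` discrete
    intensity steps (here 1..255) for the first (Blue) and second (Green)
    correspondence color channels, mimicking a spherical projection of a 2D
    correspondence texture onto the 3D representation.
    """
    centered = vertices - vertices.mean(axis=0)
    x, y, z = centered[:, 0], centered[:, 1], centered[:, 2]
    r = np.linalg.norm(centered, axis=1) + 1e-9
    azimuth = np.arctan2(y, x)                          # range (-pi, pi]
    elevation = np.arccos(np.clip(z / r, -1.0, 1.0))    # range [0, pi]
    blue = 1 + np.round((azimuth + np.pi) / (2 * np.pi) * (levels - 1)).astype(int)
    green = 1 + np.round(elevation / np.pi * (levels - 1)).astype(int)
    return np.stack((blue, green), axis=1)              # per-vertex color shade combination
```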
  • a background, preferably a plurality of background images, is provided, for example images from the Microsoft (Trademark) Common Objects in Context, MSCOCO, dataset.
  • Varying the background images between a larger number of background images has the advantage that during training the encoder-decoder architecture does not overfit to the backgrounds. In other words, this ensures that the encoder-decoder architecture generalizes to different backgrounds and prevents it from overfitting to backgrounds seen during training. Moreover, it forces the encoder-decoder architecture to learn the model's features needed for pose estimation rather than to learn contextual features which might not be present in images when the scene changes.
  • In a step S150, at least one of the provided RGB image 2D patches (comprising ideally all of the objects in all of the poses) is arranged onto at least one of the plurality of background images in order to generate a sample input image.
  • this sample input image is also augmented, e.g. by random changes in brightness, saturation, and/or contrast values, and/or by adding a Gaussian noise.
  • the corresponding ground truth 2D-to-3D-correspondence map patches are arranged (in the same positions and orientations as the RGB image 2D patches onto the chosen background image) onto a black background (with the same dimensions as the chosen background image) to provide a ground truth for the 2D-to-3D-correspondence map for the generated sample input image.
  • the corresponding ground truth ID mask patches are arranged onto the black background (in the same positions and orientations as the RGB image 2D patches onto the chosen background image) to provide a ground truth for the ID mask for the generated sample input image.
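  • The composition and augmentation of a sample input image can be sketched as follows (a NumPy-based assumption with made-up augmentation parameters; the same patch positions would be reused for the ground truth ID mask and correspondence map images):

```python
import numpy as np

def compose_training_sample(background, patches, rng=np.random.default_rng()):
    """Paste RGB object patches onto a background image and augment the result.

    `patches` is assumed to be a list of (rgb_patch, mask_patch) pairs, where
    `mask_patch` marks the object pixels within the patch.
    """
    sample = background.astype(np.float32).copy()
    H, W = sample.shape[:2]
    for rgb_patch, mask_patch in patches:
        h, w = rgb_patch.shape[:2]
        y = rng.integers(0, max(1, H - h))
        x = rng.integers(0, max(1, W - w))
        region = sample[y:y + h, x:x + w]
        region[mask_patch] = rgb_patch[mask_patch]       # paste object pixels only

    sample *= rng.uniform(0.8, 1.2)                      # random brightness change
    sample += rng.normal(0.0, 5.0, sample.shape)         # additive Gaussian noise
    return np.clip(sample, 0, 255).astype(np.uint8)
```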
  • Steps S110 to S170 may be performed in an online fashion, i.e. performed dynamically whenever another training set is required when training an encoder-decoder architecture.
  • a system for providing training data may be provided, with an input interface for receiving the plurality of background images (and, optionally, for receiving additional data such as the 3D representations of the objects, or data based on which such 3D representations of the objects may be generated), a computing device configured to perform steps S110 to S170, and an output interface for outputting the sample input images together with their ground truth ID masks and their ground truth 2D-to-3D-correspondence maps as a training data set.
  • the computing device may be realized as an online data generator running on multiple CPU threads constantly putting prepared batches in a queue, from which they are picked as inputs to an encoder-decoder architecture being trained.
  • the invention also provides a computer program comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention, in particular as described with respect to Fig. 2.
  • the invention also provides a non-transitory computer-readable data storage medium comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention, in particular as described with respect to Fig. 2.
  • the invention also provides a data stream comprising (or configured to generate) executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention, in particular as described with respect to Fig. 2.
  • Fig. 3 schematically illustrates a method according to an embodiment of the fifth aspect, for training an encoder-decoder architecture for use in an embodiment of the method according to the first aspect and/or for use in an embodiment of the system according to the second aspect.
  • In a step S210, a plurality of tuples of corresponding sample input images, ground truth ID masks and ground truth 2D-to-3D-correspondence maps is provided. This may comprise performing the method according to the third aspect, in particular the method as described with respect to Fig. 2.
  • step S210 may be performed in an online fashion, i.e. tuples are generated continuously, possibly on multiple parallel CPU threads.
  • the tuples, in particular when they are generated based on real world data so that they are much more limited in number, may be divided into a train subset and a test subset which do not overlap. Preferably, between 10% and 20% of the tuples (preferably 15%) may be used for the train subset and the remainder for the test subset.
  • the tuples for the same object are selected such that the relative orientations of the poses between them are larger than a predefined threshold. This guarantees, for a corresponding minimum number of poses, that the tuples selected cover each object from all sides.
  • an encoder-decoder convolutional artificial neural network 20 configured to receive the sample input images as input and to output both an ID mask and a 2D-to-3D-correspondence map as output is trained, the training being performed using a loss function penalizing deviations of the output from the ground truth ID mask and the ground truth 2D-to-3D-correspondence map.
  • a composite loss function Lcomp given by a sum of individual loss functions for mask loss Lm, first correspondence color channel loss LU and second correspondence color channel loss LV may be used: Lcomp = a x Lm + b x LU + c x LV.
  • Here, x as usual denotes multiplication and a, b, and c may be weight factors that can also be set to 1.
  • Mask loss Lm indicates and penalizes loss caused by deviation of the result of the encoder-decoder architecture for the ID mask from the ground truth ID mask.
  • First/second correspondence color channel loss LU, LV indicates and penalizes loss caused by deviation of the result of the encoder-decoder architecture for the 2D-to-3D-correspondence map from the ground truth 2D-to-3D-correspondence map.
  • Lm may be a weighted version of a multi-class cross entropy function, given e.g. by a standard weighted cross entropy of the form Lm = -wc x log(e^yc / Sum_j e^yj), wherein the sum Sum_j e^yj runs over e to the power of yj for j running from 1 to No+1, wherein No+1 is the total number of classes (i.e. number of object classes, or different objects, plus 1 for the background class), yj denotes the score predicted for class j, c denotes the ground truth class of the pixel, and wc denotes the rescaling weight of that class.
  • the rescaling weights are preferably set to 0.01 for the background class and 1 for each object class.
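  • A hedged PyTorch sketch of this composite loss (tensor layouts, the position of the background class and the default weights are assumptions) could look like this:

```python
import torch
import torch.nn as nn

class CompositeLoss(nn.Module):
    """Composite loss Lcomp = a x Lm + b x LU + c x LV.

    Lm is a weighted multi-class cross entropy over the ID mask classes
    (rescaling weight 0.01 for the background class, 1 for each object class);
    LU and LV are cross entropies over the discrete intensity levels of the two
    correspondence color channels. Predictions are assumed to be shaped
    (batch, classes, H, W) and targets (batch, H, W).
    """

    def __init__(self, num_objects, a=1.0, b=1.0, c=1.0):
        super().__init__()
        class_weights = torch.ones(num_objects + 1)
        class_weights[-1] = 0.01                 # background class assumed at the last index
        self.mask_loss = nn.CrossEntropyLoss(weight=class_weights)
        self.u_loss = nn.CrossEntropyLoss()
        self.v_loss = nn.CrossEntropyLoss()
        self.a, self.b, self.c = a, b, c

    def forward(self, mask_pred, u_pred, v_pred, mask_gt, u_gt, v_gt):
        return (self.a * self.mask_loss(mask_pred, mask_gt)
                + self.b * self.u_loss(u_pred, u_gt)
                + self.c * self.v_loss(v_pred, v_gt))
```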
  • Fig. 4 schematically illustrates a computer program product 300 comprising executable program code 350.
  • the executable program code 350 may be configured to perform, when executed (e.g. by a computing device), the method according to the first aspect.
  • the executable program code 350 may be configured to perform, when executed (e.g. by a computing device), the method according to the third aspect, or the method according to the fifth aspect.
  • Fig. 5 schematically illustrates a non-transitory computer-readable data storage medium 400 comprising executable program code 450 configured to, when executed (e.g. by a computing device), perform the method according to the first aspect.
  • the executable program code 450 may be configured to perform, when executed (e.g. by a computing device), the method according to the third aspect, or the method according to the fifth aspect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Graphics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a system for object detection and pose estimation within an input image (1), as well as several interrelated methods and systems. A 6-degree-of-freedom object detection and pose estimation is performed using a trained encoder-decoder convolutional artificial neural network (20) comprising an encoder head (22), an ID mask decoder head (24), a first correspondence color channel decoder head (26) and a second correspondence color channel decoder head (28). The ID mask decoder head (24) creates an ID mask for identifying objects, and the color channel decoder heads (26, 28) are used to create a 2D-to-3D-correspondence map (31). For at least one object (11, 12, 13) identified by the ID mask (34), a pose estimation (51, 52, 53) based on the generated 2D-to-3D-correspondence map (31) and on a pregenerated bijective association of points of the object with unique value combinations in the first and the second correspondence color channels (36, 38) is generated.

Description

Dense 6-DoF Pose Object Detector
The present invention relates to a computer-implemented method for object detection and pose estimation within an input image, to a system for object detection and pose estimation within an input image, to a method for providing training data for training an artificial intelligence entity for use in said method and/or said system, and to corresponding computer programs and data storage media.
The pose estimation is in particular a 6-degree-of-freedom, 6-DoF pose estimation, wherein 6 degrees of freedom relates to the six degrees of freedom of movement a three-dimensional body has: three linear directions of motion along axes of an orthogonal coordinate system as well as three rotary motions, often designated as rolling, pitching and yawing.
Object detection has always been an important problem in computer vision and a large body of research has been dedicated to it in the past. With the advent of deep learning, new techniques became feasible.
Typically, object detectors localize objects of interest in images in terms of tight bounding boxes around them. However, in many applications, e.g. augmented reality, robotics, machine vision etc., this is not enough and a full 6DoF pose (sometimes also called 6D pose) is required.
While this problem is comparatively simple to solve in depth images, the challenges are shifted to procuring depth images of sufficient quality with comparatively little effort. Depth images are created using depth cameras. However, reliable depth cameras are usually expensive and power-hungry. On the other hand, available low-quality depth sensors are prone to many artifacts resulting from the technology itself as well as from the design of the sensors. Moreover, depth cameras are usually quite imprecise, have a limited view range, and are not applicable in outdoor environments.
Apart from depth images, also more conventional RGB images can in principle be used for object detection and pose estimation. The RGB color model is an additive color model in which Red, Green and Blue are added together in various ways to reproduce a broad array of colors.
In contrast to the problems with depth images, RGB images of high quality are much easier to obtain, due both to the comparatively higher quality of RGB sensors (cameras) at comparatively low cost as well as to the comparatively low power consumption of RGB sensors. However, in RGB images detecting a full 6DoF pose is a challenge due to perspective ambiguities and significant appearance changes of the object when seen from different viewpoints.
Recent deep learning-based approaches based on RGB images include works like:
"W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the International Conference on Computer Vision (ICCV 2017), Venice, Italy, pages 22-29", hereafter referred to as "SSD6D" for "Single Shot Multibox Detector 6D", or:
"B. Tekin, S. N. Sinha, and P. Fua . Real-Time Seamless Single Shot 6D Object Pose Prediction. Available at
arXiv: 1711.08848v5", hereafter referred to as "Y0L06D" for "You Only Look Once 6D" . The pose estimation in these works is, however, when no additional information is provided, rel atively imprecise.
In the field of detecting people and their poses in images, works like "R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. Available at arXiv:1802.00434v1", hereafter referred to as "DensePose", are available. The "DensePose" method estimates dense correspondences between vertices of a human body model and humans in the image. However, the "DensePose" method requires a sophisticated annotation tool and enormous annotation efforts, which makes the method expensive to employ.
US 2018/137644 A1 describes methods and systems of performing object pose estimations, in which an image including an object is obtained and a plurality of two-dimensional projections of three-dimensional bounding boxes of the object in the image are determined. The three-dimensional pose of the object is then estimated using the two-dimensional projections of the three-dimensional bounding boxes.
The scientific publication Feng Y., Wu F., Shao X., Wang Y., Zhou X. (2018), "Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network", in: Ferrari V., Hebert M., Sminchisescu C., Weiss Y. (eds) Computer Vision - ECCV 2018, ECCV 2018, Lecture Notes in Computer Science, vol 11218, Springer, Cham, describes 3D face reconstruction based on 2D images of faces by using a UV position map which is a 2D image recording 3D positions of all points in a so-called UV space. A weight matrix is applied which weights certain features of the face higher than others when it comes to pose estimation.
It is therefore one of the objects of the present invention to provide a method and a system, and methods for providing such a method and a system, for object detection and pose determination with increased accuracy.
This object is solved by the subject-matter of the independent claims.
According to a first aspect, a computer-implemented method for object detection and pose estimation within an input image is provided, the method comprising steps of:
receiving an input image;
inputting the received input image into an artificial intelligence entity, in particular a trained encoder-decoder (preferably: convolutional) artificial neural network, comprising an encoder head, an ID mask decoder head, a first correspondence color channel decoder head and a second correspondence color channel decoder head;
generating, using the ID mask decoder head, from the received input image an ID mask identifying objects and a background in the received input image;
generating, using the first correspondence color channel decoder head, from the received input image a first correspondence color channel of a (robust dense) 2D-to-3D-correspondence map for objects within the received input image;
generating, using (or: based on) the second correspondence color channel decoder head, from the received input image a second correspondence color channel of the 2D-to-3D-correspondence map; generating the 2D-to-3D-correspondence map using the generated first correspondence color channel and the generated second correspondence color channel; and
determining (e.g. using a pose determining module), for at least one object identified by the ID mask, a pose estimation (in particular a 6-DoF pose estimation) based on the generated 2D-to-3D-correspondence map and on a pre-generated correspondence model of the object, wherein the pre-generated correspondence model bijectively associates points of the object with unique value combinations in the first and the second correspondence color channels.
Specifically, the first correspondence color channel may be a first color channel of an RGB color scheme, and/or the second correspondence color channel may be a second color channel of the RGB color scheme different from the first color channel.
It should be understood that the correspondence color chan nels do not indicate color values of the pixels of the input image in the respective colors; the correspondence color channels denote, by different levels of intensity of color, spatial correspondences between different points on objects according to the pre-generated bijective association of the points with the unique value combinations in the correspond ence color channels. For example, a pixel that is completely red in the RGB input image may still have a 100% level of Blue in the 2D-to-3D-correspondence map which indicates spa tial proximity to e.g. points having a 99% level of Blue or the like.
The inventors have found that formulating color regression problems as discrete color classification problems results in much faster convergence and superior quality of the 2D-3D matching.
The approach described herein does not rely on regressing bounding boxes and using regions-of-interest (ROI) layers but instead uses ID masks to provide a deeper understanding of the objects in the input image. It has been found by the inventors that the present method outperforms existing RGB object detection and 6DoF pose estimation methods (also designated as "pipelines").
The 2D-to-3D-correspondence map may in particular be a dense correspondence map as described e.g. in "DensePose" cited above, in the sense that the correspondence map for each object covers all points (surface points and/or wire model vertices) with a predefined minimum resolution.
Note that other works usually only compute a limited number of 2D-3D correspondences, e.g. nine for YOLO6D. Therefore, these approaches can be referred to as "coarse". In contrast, in the present case, many more than nine correspondences are obtained, hence the term "dense". As a result, with the present method a final object pose can be computed more robustly: if some correspondences are missing, there are still other ones.
The input image may in particular be an RGB image, e.g. represented by an H x W x 3-dimensional tensor, with H marking the height of the input image in pixels, W the width of the input image in pixels (such that H x W is the total number of pixels of the input image), and 3 stemming from the three color channels Red, Green, and Blue. Advantageously, the ID mask can identify each of a plurality of objects within the input image.
The ID mask may be represented by an H x W x No+1-dimensional tensor, wherein No+1 is the number of (known and trained) identifiable objects plus 1 for the background, such that for each pixel a feature (or: class) is available that designates with which probability that pixel belongs to each of the identifiable objects or to the background.
For example, when only three objects are known and trained (No+1 = 4), then a specific pixel at height position 100 and at width position 120 may e.g. have a 0.15 probability of belonging to a first object, a 0.35 probability of belonging to a second object, a 0.4 probability of belonging to a third object, and a 0.1 probability of belonging to the background. The vector at the entry H=100, W=120 may then e.g. read [0.15, 0.35, 0.4, 0.1]. In some embodiments it may then be finally decided, when detecting the objects in the input image, that each pixel belongs to the class (i.e. object or background) with the highest probability. In the present example, it would then be decided that the pixel at H=100 and W=120 belongs to the third object (probability of 40%).
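Purely as an illustration of this per-pixel decision (not part of the claimed subject-matter), a minimal NumPy sketch is given below; it assumes the ID mask is available as an H x W x (No+1) probability array, and the function name is illustrative only:

```python
import numpy as np

def finalize_id_mask(id_mask_probs: np.ndarray) -> np.ndarray:
    """Collapse an H x W x (No+1) probability tensor into a per-pixel class map.

    Channels 0..No-1 correspond to object classes, the last channel to the background.
    """
    # For every pixel, keep the class with the highest probability.
    return np.argmax(id_mask_probs, axis=-1)

# Toy example matching the text: three known objects plus background (No+1 = 4).
probs = np.zeros((200, 200, 4))
probs[100, 120] = [0.15, 0.35, 0.40, 0.10]
final_mask = finalize_id_mask(probs)
print(final_mask[100, 120])  # -> 2, i.e. the third object (0-based class index)
```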
The first and the second correspondence color channels U, V are advantageously each provided with a number Nu, Nv of possible classes or features, each indicating the probability for that pixel of belonging to a certain color value in the respective correspondence color channel of the 2D-to-3D-correspondence map. The 2D-to-3D-correspondence map may then be represented by an H x W x Nu x Nv-dimensional tensor. In this way, each pixel of the input image (denoted by H and W) is provided with Nu x Nv features, each indicating with which probability that pixel belongs to a specific color combination (combination of levels of intensity of the two colors).
For example, the 2D-to-3D-correspondence map may be designed such that it distinguishes between 256 different values (levels of intensity of Blue color) for the first correspondence color channel and between 256 different values (levels of intensity of Green color) for the second correspondence color channel. The 2D-to-3D-correspondence map thus comprises 256 x 256 = 65536 uniquely determined pixels, each having its own combination shade of blue and/or green color.
In some embodiments it may then be finally decided that each pixel has, for each correspondence color channel, the value with the highest probability. For example, the above-mentioned pixel at H=100 and W=120 may have the highest probability (e.g. of 0.12 or 12%) of having level of intensity 245 out of 255 in the first correspondence color channel, and the highest probability (e.g. of 0.16 or 16%) of having level of intensity 136 out of 255 in the second correspondence color channel. It may then be finally decided that this pixel at H=100 and W=120 has the value combination 245 (first correspondence color channel) / 136 (second correspondence color channel).
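As an illustration only, a minimal NumPy sketch of this per-channel decision, assuming the two decoder heads output separate H x W x 256 probability arrays (the function name is an illustrative assumption):

```python
import numpy as np

def decode_correspondence_channels(u_probs: np.ndarray, v_probs: np.ndarray) -> np.ndarray:
    """Turn two H x W x 256 classification outputs into an H x W x 2 map of
    discrete color levels (first and second correspondence color channel)."""
    u = np.argmax(u_probs, axis=-1)   # e.g. level of intensity of Blue, 0..255
    v = np.argmax(v_probs, axis=-1)   # e.g. level of intensity of Green, 0..255
    return np.stack([u, v], axis=-1)  # per-pixel value combination (u, v)
```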
The correspondence maps bijectively associate points of the objects with unique value combinations in the first and second correspondence color channels, i.e. each combination shade of colors corresponds uniquely to a point of the object and vice versa. This means that, consequently, it is decided that the pixel at H=100 and W=120 corresponds to the point on the third object which is associated with that color combination shade 245/136.

In some advantageous embodiments, the first correspondence color channel is a Blue color channel and/or the second correspondence color channel is a Green color channel. It has been found that this color combination is especially easy to discern for artificial intelligence entities.
In some advantageous embodiments, the determining of the at least one pose estimation uses a Perspective-n-Point, PnP, algorithm. The PnP algorithm may be provided as described e.g. in "Z. Zhang, A flexible new technique for camera calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11):1330-1334, December 2000".
In some advantageous embodiments, the PnP algorithm is used with a Random Sample Consensus, RANSAC, algorithm.
In some advantageous embodiments, the determining of the at least one pose estimation uses a trained artificial neural network entity configured and trained to generate, from the ID mask and the 2D-to-3D-correspondence map, probabilities for each of a plurality of 3D poses of the at least one object, and preferably for all objects identified by the ID mask. Thereafter, the pose for which the highest probability has been determined may be determined to be correct by the pose determining module.

In some advantageous embodiments, the method according to the first aspect further comprises a step of generating the bijective association for at least one object, said bijective association being generated by texturing a 3D representation of the object using a 2D correspondence texture consisting of a plurality of pixels, each pixel having a unique value combination in the first and the second correspondence color channels. Each point of the model, e.g. a vertex of a wire model or a CAD model, is then associated with the unique value combination (or: color shade combination) with which it has been textured.

In some advantageous embodiments, texturing the 3D representation of the object is performed using a spherical projection. A spherical projection uses a full sphere to texture an object from all sides.
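Purely as an illustrative sketch of how such a projection could assign value combinations to model vertices (the function name, the 0..255 level range and the angle parameterization are assumptions, and a pure spherical projection is only approximately bijective for strongly non-convex models):

```python
import numpy as np

def spherical_correspondence_colors(vertices: np.ndarray, levels: int = 256) -> np.ndarray:
    """Assign each 3D model vertex a (first, second) correspondence color value
    by projecting it onto a sphere around the model and sampling a 2D gradient."""
    centered = vertices - vertices.mean(axis=0)
    x, y, z = centered[:, 0], centered[:, 1], centered[:, 2]
    r = np.linalg.norm(centered, axis=1) + 1e-9
    azimuth = np.arctan2(y, x)                       # range [-pi, pi)
    inclination = np.arccos(np.clip(z / r, -1, 1))   # range [0, pi]
    u = ((azimuth + np.pi) / (2 * np.pi)) * (levels - 1)  # first channel, e.g. Blue
    v = (inclination / np.pi) * (levels - 1)              # second channel, e.g. Green
    return np.stack([np.rint(u), np.rint(v)], axis=1).astype(np.uint8)
```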
According to a further aspect, a computer program is provided, comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the first aspect of the invention.

According to another aspect, a non-transitory computer-readable data storage medium is provided, comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the first aspect of the invention.

According to yet another aspect, a data stream is provided, comprising (or configured to generate) executable program code configured to, when executed (e.g. by a computing device), perform the method according to the first aspect of the invention.
According to a second aspect, a system for object detection and pose estimation within an input image is provided, the system comprising:
an input interface for receiving an input image;
a computing device configured to implement a trained encoder-decoder (preferably: convolutional) artificial neural network comprising an encoder head, an ID mask decoder head, a first correspondence color channel decoder head and a second correspondence color channel decoder head;
wherein the ID mask decoder head is configured and trained to generate an ID mask identifying objects and background in the received input image;
wherein the first correspondence color channel decoder head is configured and trained to generate a first correspondence color channel of a dense 2D-to-3D-correspondence map for objects within the received input image;
wherein the second correspondence color channel decoder head is configured and trained to generate a second correspondence color channel of the dense 2D-to-3D-correspondence map;
wherein the computing device is further configured to implement a combining module and a pose determining module;
wherein the combining module is configured to generate the dense 2D-to-3D-correspondence map using the generated first correspondence color channel and the generated second correspondence color channel; and
wherein the pose determining module is configured to determine, for at least one object identified by the ID mask (and preferably for all objects identified by the ID mask), a pose estimation (in particular a 6DoF pose estimation) based on the generated 2D-to-3D-correspondence map and on a pre-generated bijective association of points of the object with unique value combinations in the first and the second correspondence color channels.
The computing device may be realised as any device, or any means, for computing, in particular for executing a software, an app, or an algorithm. For example, the computing device may comprise a central processing unit (CPU) and a memory operatively connected to the CPU. The computing device may also comprise an array of CPUs, an array of graphical processing units (GPUs), at least one application-specific integrated circuit (ASIC), at least one field-programmable gate array, or any combination of the foregoing.
Some, or even all, modules of the system may be implemented by a cloud computing platform.
According to a third aspect, a method for providing training data for training an encoder-decoder (preferably: convolutional) artificial neural network for use in the method according to an embodiment of the first aspect is provided, the method comprising:
providing, for each of a plurality of objects, a respective RGB image 2D patch corresponding to each of a plurality of poses of that object;
providing, for each of the plurality of objects and for each of the plurality of poses, a corresponding ground truth ID mask patch;
providing, for each of the plurality of objects and for each of the plurality of poses, a corresponding ground truth 2D-to-3D-correspondence map patch;
providing a plurality of background images;
arranging at least one of the provided RGB image 2D patches onto at least one of the plurality of background images in order to generate a sample input image;
correspondingly arranging the corresponding ground truth 2D-to-3D-correspondence map patches onto a black background to provide a ground truth for the 2D-to-3D-correspondence map for the generated sample input image; and
correspondingly arranging the corresponding ground truth ID mask patches onto the black background to provide a ground truth for the ID mask for the generated sample input image.
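Purely as an illustration of this compositing step, a minimal NumPy sketch is given below; it assumes that the crops and the boolean object mask inside the crop are already available, and all names are illustrative:

```python
import numpy as np

def compose_training_sample(background, rgb_patch, id_patch, corr_patch, obj_id, top, left):
    """Paste one object crop and its ground-truth patches into a training tuple.

    background: H x W x 3 RGB image; rgb_patch / corr_patch: h x w x 3 crops;
    id_patch: h x w boolean mask of the object inside the crop; obj_id: class index.
    """
    sample = background.copy()
    gt_corr = np.zeros_like(background)                  # black background
    gt_ids = np.zeros(background.shape[:2], np.uint8)    # 0 = background class
    h, w = id_patch.shape
    region = (slice(top, top + h), slice(left, left + w))
    sample[region][id_patch] = rgb_patch[id_patch]       # object pixels onto background
    gt_corr[region][id_patch] = corr_patch[id_patch]     # correspondence colors onto black
    gt_ids[region][id_patch] = obj_id                    # object ID onto black
    return sample, gt_ids, gt_corr
```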
This method makes it possible to easily provide a large amount of training data for training the encoder-decoder convolutional artificial neural network to a desired degree of accuracy.
In some advantageous embodiments of the method according to the third aspect, a 3D representation (or: 3D model) is provided for at least one object, for example a CAD model. The 3D representation may be rendered in each of the plurality of poses on the black background, or, expressed in other words, may be rendered from each of a plurality of virtual camera viewpoints. This approach makes it possible to use known tools and software for rendering 3D representations of objects from different viewpoints, e.g. using graphical engines running on arrays of graphical processing units (GPUs).

It is evidently equivalent whether one chooses to define the (virtual) camera view angle as fixed (as it might be during the inference stage, with the actual camera producing input images being fixed e.g. to a wall) while the objects are able to take different poses, or whether one chooses to define the object as fixed while the (virtual) camera is able to take different viewpoints. In the inference stage, of course, both the object(s) and a camera producing the input images may be able to move with respect to each other.

Each of the images resulting from the rendering may be cropped, in particular so as to cut off all pixels not comprising a single object. Based on each of the cropped resulting images (or, if no cropping is performed, on the uncropped resulting images), a corresponding RGB image 2D patch and a corresponding ground truth ID mask patch are generated. The corresponding RGB image 2D patch shows the actual appearance of the object in the corresponding pose, i.e. from the corresponding viewpoint. The corresponding ground truth ID mask patch separates the RGB image 2D patch from the black background, i.e. it identifies the pixels within the RGB image 2D patch shape as belonging to a specific class (either one of the known object classes or the background class).

In some advantageous embodiments, when rendering the 3D representation, a corresponding depth channel is used to generate a depth map for the corresponding object in the corresponding pose. The depth map may then be used to generate a bounding box used to crop each resulting image, respectively. Pixels corresponding to depths of points of the object will generally have finite depth values below a certain threshold, whereas pixels corresponding to the background may have infinite depth values or at least depth values larger than the certain threshold. Cropping may then be performed by cutting off all pixels with depth values larger than the certain threshold.
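As an illustration only, a minimal NumPy sketch of such a depth-based crop; it assumes that background pixels carry either zero or very large depth values, and the threshold and margin are illustrative assumptions:

```python
import numpy as np

def crop_with_depth(rendered_rgb, depth_map, max_depth=10.0, margin=2):
    """Use the depth channel of a rendering to find a tight bounding box
    around the rendered object and crop the RGB rendering accordingly."""
    ys, xs = np.where((depth_map > 0) & (depth_map < max_depth))  # valid object pixels
    if ys.size == 0:
        return rendered_rgb  # nothing rendered; return the image unchanged
    top = max(ys.min() - margin, 0)
    bottom = min(ys.max() + margin, depth_map.shape[0] - 1)
    left = max(xs.min() - margin, 0)
    right = min(xs.max() + margin, depth_map.shape[1] - 1)
    return rendered_rgb[top:bottom + 1, left:right + 1]
```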
According to a fourth aspect, a system for providing training data is provided, comprising an input interface for receiving the plurality of background images (and, optionally, for receiving additional data such as 3D representations of the objects, or data based on which such 3D representations of the objects may be generated), a computing device configured to perform the method according to an embodiment of the method according to the third aspect, and an output interface for outputting the sample input images together with their ground truth ID masks and their ground truth 2D-to-3D-correspondence maps as a training data set.

Preferably, the computing device may be realized as an online data generator running on multiple CPU threads, constantly putting prepared batches in a queue from which they are picked as inputs to an encoder-decoder architecture being trained. In other words, the computing device may be configured to continuously provide training data to an encoder-decoder architecture being trained.
The invention also provides a computer program comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention.

The invention also provides a non-transitory computer-readable data storage medium comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention.

The invention also provides a data stream comprising (or configured to generate) executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention.
According to a fifth aspect, a method for training an encoder-decoder (preferably: convolutional) artificial neural network is provided, the method comprising:
providing (preferably using a method according to an embodiment of the third aspect) a plurality of tuples of corresponding sample input images, ground truth ID masks and ground truth 2D-to-3D-correspondence maps; and
training an encoder-decoder (preferably: convolutional) artificial neural network configured to receive the sample input images as input and to output both an ID mask and a 2D-to-3D-correspondence map as output, the training being performed using a loss function penalizing deviations of the output from the ground truth ID mask and the ground truth 2D-to-3D-correspondence map.
In some advantageous embodiments, the plurality of tuples is provided continuously during training. In other words, the tuples may be generated dynamically, or on-line, during the performing of the training method. This saves the need for generating and storing enormous quantities of data before performing the training.

The plurality of tuples may be provided continuously during training, e.g. by the method according to the third aspect being performed by an online data generator running on multiple CPU threads constantly putting prepared batches in a queue, from which they are picked as inputs to an encoder-decoder architecture being trained.
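Purely as a schematic illustration of such an online generator, a minimal Python sketch using worker threads and a bounded queue (function names are illustrative; in practice worker processes rather than Python threads may be preferable because of the interpreter lock):

```python
import queue
import threading

def start_online_generator(make_tuple, num_threads=4, maxsize=32):
    """Continuously produce (sample image, GT ID mask, GT correspondence map)
    tuples on several worker threads and hand them out through a queue."""
    batches = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            batches.put(make_tuple())  # blocks whenever the queue is full

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
    return batches

# Training-loop side (schematic):
# batch_queue = start_online_generator(compose_training_sample_factory)
# sample, gt_ids, gt_corr = batch_queue.get()
```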
According to another aspect, a system for training an encoder-decoder convolutional artificial neural network is provided, the system comprising:
an input interface configured to receive a plurality of tuples (preferably provided using a method according to an embodiment of the third aspect) of corresponding sample input images, ground truth ID masks and ground truth 2D-to-3D-correspondence maps; and
a computing device configured to train an encoder-decoder convolutional artificial neural network configured to receive the sample input images as input and to output both an ID mask and a 2D-to-3D-correspondence map as output, the training being performed using a loss function penalizing deviations of the output from the ground truth ID mask and the ground truth 2D-to-3D-correspondence map.
The invention also provides a computer program comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the fifth aspect of the invention.

The invention also provides a non-transitory computer-readable data storage medium comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the fifth aspect of the invention.

The invention also provides a data stream comprising (or configured to generate) executable program code configured to, when executed (e.g. by a computing device), perform the method according to the fifth aspect of the invention.

Further advantageous variants and embodiments are described and comprised in the dependent claims as well as in the following description in combination with the drawings.
Brief description of the drawings
The invention will be explained in greater detail with reference to exemplary embodiments depicted in the appended drawings.
The accompanying drawings are included to provide a further understanding of the present invention and are incorporated in and constitute a part of this specification. The drawings illustrate the embodiments of the present invention and together with the description serve to explain the principles of the invention.

Other embodiments of the present invention and many of the intended advantages of the present invention will be readily appreciated as they become better understood by reference to the following detailed description. Like reference numerals designate corresponding similar parts. It shall be understood that method steps are numbered for easier reference but that said numbering does not necessarily imply steps being performed in that order unless explicitly or implicitly described otherwise. In particular, steps may also be performed in a different order than indicated by their numbering. Some steps may be performed simultaneously or in an overlapping manner.
Fig. 1 schematically shows a block diagram illustrating a system according to an embodiment of the second aspect;
Fig. 2 schematically illustrates a method according to an embodiment of the third aspect;
Fig. 3 schematically illustrates a method according to an embodiment of the fifth aspect;
Fig. 4 schematically illustrates a computer program product according to an embodiment; and
Fig. 5 schematically illustrates a data storage medium according to an embodiment.

Detailed description of the invention
Fig. 1 schematically illustrates a computer-implemented method for object detection and pose estimation within an input image according to an embodiment of the first aspect, as well as a system 1000 for object detection and pose estimation within an input image according to an embodiment of the second aspect.

It should be understood that all advantageous options, variants and modifications described herein and in the foregoing with respect to embodiments of the method according to the first aspect may be equally applied to, or provided in, embodiments of the system according to the second aspect, and vice versa.

In the following, the method will be explained with reference to Fig. 1 and in particular in connection with features of the system 1000. It should be understood that the method is, however, not restricted to being performed with the system 1000, and vice versa.
The system 1000 comprises an input interface 10, a computing device 100, and an output interface 50.
In a first step S10 of the method, an input image 1 is received, for example by the input interface 10 of the system 1000. The input interface 10 may consist of, or comprise, a local interface connected e.g. to a bus system of a factory or a hospital and/or an interface for remote connections, such as an interface for connecting to a wireless LAN or WAN connection, in particular to a cloud computing system and/or the Internet.

The input image 1 is preferably an RGB image, e.g. the input image 1 is preferably represented by an H x W x 3-dimensional tensor, with H marking the height of the input image 1 in pixels, W the width of the input image 1 in pixels (such that H x W is the total number of pixels of the input image), and 3 stemming from the three color channels Red, Green, and Blue (RGB).

The input image preferably comprises one or more objects 11 which the method or the system is intended to detect (or: identify) and for which 6-DoF pose information is required.
In a step S20, the received input image 1 is input into a trained artificial intelligence entity, in particular a trained encoder-decoder convolutional artificial neural network 20. The trained encoder-decoder convolutional artificial neural network 20 is implemented as comprising an encoder head 22, an ID mask decoder head 24, a first correspondence color channel decoder head 26 and a second correspondence color channel decoder head 28. It should be understood that additional color decoder heads, associated with additional colors, may be provided for additional reliability and/or redundancy. In particular, a third color decoder head may optionally be provided.

The trained encoder-decoder convolutional artificial neural network 20 is configured and trained for detection and pose determination of specific, previously known objects. For example, in a factory environment, such objects may comprise robots, workpieces, vehicles, movable equipment, source materials and/or the like. As an example of objects, in Fig. 1 the input image 1 is shown as comprising a teddy bear 11, an egg carton 12, and a camera 13, before a background 14.

The encoder head 22 may, for example, be realized by a 12-layer ResNet-like architecture featuring residual layers which allow for faster convergence. The ResNet architecture is described, for example, in "K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, IEEE".

However, in principle the proposed method is agnostic to the particular choice of encoder-decoder architecture, of which a large variety is known. Any other backbone architecture can be used as well without any need to change the conceptual principles of the method.
In the system 1000, the trained encoder-decoder convolutional artificial neural network 20 may be implemented by the computing device 100.

The computing device 100 may be realised as any device, or any means, for computing, in particular for executing a software, an app, or an algorithm. For example, the computing device may comprise a central processing unit (CPU) and a memory operatively connected to the CPU. The computing device 100 may also comprise an array of CPUs, an array of graphical processing units (GPUs), at least one application-specific integrated circuit (ASIC), at least one field-programmable gate array, or any combination of the foregoing. Some, or even all, modules (in particular decoder and/or encoder heads) of the system 1000 may be implemented by a cloud computing platform as a computing device 100.
In a step S22 the received input image 1 is encoded by the encoder head 22 into a latent (or: hidden) representation with latent features.
In a step S24, the ID mask decoder head 24 generates an ID mask 34 identifying objects 11, 12, 13 and background 14. In other words, the ID mask decoder head 24 is configured and trained to generate, from the received input image 1, the ID mask 34.

Advantageously, the ID mask decoder head 24 is configured and trained to generate the ID mask 34 for all objects 11, 12, 13 known to the trained encoder-decoder convolutional artificial neural network 20 at the same time in the same data structure, by providing for each pixel of the input image and for each known object a probability for that pixel to belong to that object. The known objects are thus represented by predefined classes, and the ID mask 34 comprises a feature for each class (object classes plus the background class).

The ID mask 34 may, for example, be represented by an H x W x No+1-dimensional tensor, wherein No+1 is the number of (known and trained) identifiable objects 11, 12, 13 plus 1 for the background 14, such that for each pixel a feature is available that designates with which probability that pixel belongs to each of the identifiable objects 11, 12, 13 or to the background 14.

In an optional step S35, it may be determined (e.g. by an object identifier module 35 implemented by the computing device 100) for each pixel to which object 11, 12, 13 or background 14 it belongs, preferably by determining for each pixel the feature of the ID mask 34 with the highest probability value. The result of this determination may be stored in a finalized ID mask.

For example, when only the three objects 11, 12, 13 described above are known and trained (No+1 = 4), then a specific pixel at height position 100 and at width position 120 may e.g. have a 0.15 probability of belonging to the first object (teddy bear 11), a 0.35 probability of belonging to the second object (carton of eggs 12), a 0.4 probability of belonging to the third object (camera 13), and a 0.1 probability of belonging to the background 14. The feature vector at the entry H=100, W=120 may then e.g. read [0.15, 0.35, 0.4, 0.1].
In a step S26, the first correspondence color channel decoder head 26 generates a first correspondence color channel 36 of a 2D-to-3D-correspondence map 31 for objects 11, 12, 13 within the received input image 1. In other words, the first correspondence color channel decoder head 26 is configured and trained (and implemented) to generate, from the received input image 1 as its input, the first correspondence color channel 36 of the 2D-to-3D-correspondence map 31 for objects 11, 12, 13 within the received input image 1. In the present example, the first correspondence color channel 36 is a Blue color channel of an RGB color scheme.

In a step S28, the second correspondence color channel decoder head 28 generates a second correspondence color channel 38 of the 2D-to-3D-correspondence map 31 for objects 11, 12, 13 within the received input image 1. In other words, the second correspondence color channel decoder head 28 is configured and trained (and implemented) to generate, from the received input image 1 as its input, the second correspondence color channel 38 of the 2D-to-3D-correspondence map 31 for objects 11, 12, 13 within the received input image 1. In the present example, the second correspondence color channel 38 is a Green color channel of an RGB color scheme. It is evident that other color channels may also be used for the first and the second correspondence color channels 36, 38, or that even three color channels may be used, gaining additional reliability and redundancy at the expense of computational resources.

The first and the second correspondence color channels 36, 38 (which may be designated as U and V, respectively) are advantageously each provided with a number Nu, Nv of features (or: classes), each indicating the probability for that pixel of belonging to a certain color value in the respective correspondence color channel 36, 38 of the 2D-to-3D-correspondence map. The 2D-to-3D-correspondence map may then be represented by an H x W x Nu x Nv-dimensional tensor. In this way, each pixel of the input image (identified by its H and W values) is provided with Nu x Nv features, each indicating with which probability that pixel belongs to a specific color combination (combination of levels of intensity of the two colors).

For example, the 2D-to-3D-correspondence map 31 may be designed such that it distinguishes between 256 different values (levels of intensity of Blue color) for the first correspondence color channel 36 and between 256 different values (levels of intensity of Green color) for the second correspondence color channel 38. The 2D-to-3D-correspondence map thus comprises 256 x 256 = 65536 uniquely determined pixels, each having its own combination shade of blue and/or green color.
The ID mask decoder head 24 and the first and second correspondence color channel decoder heads 26, 28 upsample the latent features generated by the encoder head 22 up to the original size using a stack of bilinear interpolations followed by convolutional layers. However, it should be understood that, again, the proposed method is agnostic to a particular choice of encoder-decoder architecture, such that any known encoder-decoder architecture may be used for the encoder-decoder convolutional artificial neural network 20.

In a step S32, the 2D-to-3D-correspondence map 31 is generated using the generated first correspondence color channel 36 and the generated second correspondence color channel 38, in particular in that each pixel is assigned a color shade combination given by the feature (or: class representing a level of intensity of color) with the highest probability from the first correspondence color channel 36 as well as by the feature (or: class representing a level of intensity) with the highest probability from the second correspondence color channel 38.

For example, the above-mentioned pixel at H=100 and W=120 may have the highest probability (e.g. of 0.12 or 12%) of having level of intensity 245 out of 255 in the first correspondence color channel 36, and the highest probability (e.g. of 0.16 or 16%) of having level of intensity 136 out of 255 in the second correspondence color channel 38. It may then be finally decided that this pixel at H=100 and W=120 has the value combination 245 (first correspondence color channel 36) / 136 (second correspondence color channel 38). The 2D-to-3D-correspondence map 31 will then mark that pixel with the color shade combination 245/136.

Alternatively, already in steps S26 and S28, respectively, a respective single correspondence color channel image may be generated in which each pixel stores the class (i.e. level of intensity of color) in the respective color for which the highest probability has been determined. In the above example, the first color (Blue) single correspondence color channel image would have the pixel at H=100 and W=120 at value 245, and the second color (Green) single correspondence color channel image would have the same pixel at value 136.
Step S32 may be performed by a combining module 32 implemented by the computing device 100 of the system 1000.
In a step S40, a pose estimation 51, 52, 53 for at least one object 11, 12, 13, preferably for all objects 11, 12, 13 within the input image 1, is generated based on the generated 2D-to-3D-correspondence map 31 and on a corresponding pre-generated bijective association of points of the at least one object with unique value combinations in the first and the second correspondence color channels 36, 38. Preferably, for each object 11, 12, 13 identified by the ID mask 34, a corresponding pose estimation 51, 52, 53 is generated (or: determined, or provided).

The points of the at least one object may be points on the surface of the object or vertices of a wire model of the object or of a wire model approximating the object 11, 12, 13.

Step S40 may be performed by a pose determining module 40 implemented by the computing device 100 of the system 1000. In other words, the pose determining module 40 is configured to generate a pose estimation for at least one object, preferably for all objects within the input image 1, based on the generated 2D-to-3D-correspondence map 31 and on a corresponding pre-generated bijective association of points of the at least one object with unique value combinations in the first and the second correspondence color channels 36, 38.

Step S40 may utilize an algorithm 42 such as the known Perspective-n-Point, PnP, algorithm, optionally combined with a Random Sample Consensus, RANSAC, algorithm. The PnP algorithm estimates the pose (i.e. the relative orientation of each object with respect to the viewpoint from which the input image 1 was taken, e.g. the location of a camera) given correspondences and the intrinsic parameters of the camera. The PnP algorithm refined with the RANSAC algorithm is more robust against possible outliers given many correspondences. The PnP algorithm may be provided as described e.g. in "Z. Zhang, A flexible new technique for camera calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11):1330-1334, December 2000".
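Purely as an illustration of this step, a minimal Python sketch using OpenCV's RANSAC-based PnP solver; it assumes a lookup table mapping each (first, second) color value combination of the pre-generated correspondence model to a 3D model point, and a known camera matrix (all names are illustrative):

```python
import cv2
import numpy as np

def estimate_pose(corr_map, obj_mask, color_to_point, camera_matrix):
    """Recover a 6-DoF pose from dense 2D-3D correspondences.

    corr_map: H x W x 2 decoded color-level map; obj_mask: boolean H x W mask
    of one object from the ID mask; color_to_point: dict (u, v) -> 3D model point.
    """
    img_pts, obj_pts = [], []
    ys, xs = np.where(obj_mask)
    for y, x in zip(ys, xs):
        key = tuple(int(c) for c in corr_map[y, x])
        if key in color_to_point:
            img_pts.append([x, y])
            obj_pts.append(color_to_point[key])
    if len(obj_pts) < 4:          # PnP needs at least four correspondences
        return None
    obj_pts = np.asarray(obj_pts, dtype=np.float64)
    img_pts = np.asarray(img_pts, dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, camera_matrix, None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)    # 3x3 rotation matrix (nine rotation values Rij)
    return R, tvec                # plus the translation vector Tx, Ty, Tz
```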
Alternatively or additionally, a trained artificial intelligence entity 44 such as a trained artificial neural network may be employed to generate the pose determination from the 2D-to-3D-correspondence map 31 as well as from the information about the objects 11, 12, 13 and the background 14 contained in the ID mask 34 or in a finalized ID mask as described in the foregoing.
The result of step S40, or, respectively, of the pose determining module 40, is then output, e.g. by the output interface 50 of the system 1000. The output interface 50 may consist of, or comprise, a local interface connected e.g. to a bus system of a factory or a hospital and/or an interface for remote connections, such as an interface for connecting to a wireless LAN or WAN connection, in particular to a cloud computing system and/or the Internet. The determined pose estimation 51, 52, 53 may be output, for example, as twelve values: nine rotation values Rij forming a rotation matrix describing the orientation of each object 11, 12, 13, and three location values Tx, Ty, Tz forming a vector describing the center of mass, or space point, of each object 11, 12, 13.
The bijective association of points of the objects 11, 12, 13 with unique value combinations in the first and second correspondence color channels 36, 38 may be generated by the steps described in the following with respect to Fig. 2. It should be understood that these steps may be performed as part of the method according to the first aspect, in particular for each object 11, 12, 13 that shall be known by the trained encoder-decoder convolutional artificial neural network 20, or, in other words, for which the encoder-decoder convolutional artificial neural network 20 shall be trained.

However, the steps described with respect to Fig. 2 may also be steps of a method according to the third aspect, i.e. of a method for providing training data for training an encoder-decoder convolutional artificial neural network.

In a step S110, for each of a plurality of objects 11, 12, 13, a respective RGB image 2D patch corresponding to each of a plurality of poses of that object is provided, in particular generated.

In a step S120, for each of the plurality of objects and for each of the plurality of poses, a corresponding ground truth ID mask patch is provided, in particular generated.

For example, a 3D representation (or: model, for example a CAD model) of at least one object 11, 12, 13, or of all objects 11, 12, 13, may be provided and rendered as an RGB image before a black background from different viewpoints, each viewpoint corresponding to one pose of the object 11, 12, 13 with respect to the viewpoint. For generating the respective RGB image 2D patch, the rendered RGB image may be cropped so as to comprise as few pixels beside one single object 11, 12, 13 as possible, and the background may be cut off. The result is the RGB image 2D patch that exactly covers the corresponding object 11, 12, 13 in the corresponding pose (i.e. from the corresponding viewpoint).

According to one variant, for given 3D representations of the objects 11, 12, 13 of interest, a first sub-step is to render them in different poses. The poses are defined e.g. by the vertices of an icosahedron ("sampling vertices") placed around the 3D representation of each object, respectively. To achieve finer sampling, the triangles of each icosahedron may be recursively subdivided into four smaller ones until the desired density of sampling vertices, each corresponding to a pose, is obtained. For example, 4 subdivisions may be used.
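As an illustration only, a minimal sketch of one such subdivision step (plain Python lists for vertices and faces; this is a generic icosphere refinement, not necessarily the specific implementation used):

```python
import numpy as np

def subdivide_on_sphere(vertices, faces):
    """One refinement step: split every triangle into four and push the new
    midpoints back onto the unit sphere, yielding finer viewpoint sampling."""
    verts = [np.asarray(v, dtype=float) / np.linalg.norm(v) for v in vertices]
    midpoint_cache, new_faces = {}, []

    def midpoint(i, j):
        key = tuple(sorted((i, j)))
        if key not in midpoint_cache:
            m = (verts[i] + verts[j]) / 2.0
            verts.append(m / np.linalg.norm(m))   # project back onto the sphere
            midpoint_cache[key] = len(verts) - 1
        return midpoint_cache[key]

    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return verts, new_faces
```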
Additionally, the virtual view camera may be rotated at each sampling vertex around its viewing direction between two limits with a fixed stride, for example from -30 to 30 degrees with a stride of 5 degrees, to model in-plane rotations, resulting in yet additional poses. Then, for each of the poses, each object is rendered on a black background and both the RGB and the depth channels are stored. Using the depth channels, a depth map may be generated for each pose. Having the renderings at hand, the generated depth map can be used as a mask to define a tight bounding box for each generated rendering, i.e. a bounding box comprising as few pixels as possible besides the object 11, 12, 13 of interest in the rendering.

The image may then be cropped with this bounding box. The RGB patches are cut out from the backgrounds and stored as the RGB image 2D patches. Masks separating these patches from the background are stored as ground truth ID mask patches, and the corresponding poses (i.e. virtual camera positions or relative orientations of the object with respect to the virtual camera) are stored as ground truth poses.
Of course, pairs of RGB image 2D patches and corresponding ground truth ID mask patches can also be provided by detecting and annotating real world objects in real world images.

In a step S130, for each of the plurality of objects 11, 12, 13 and for each of the plurality of poses, a corresponding ground truth 2D-to-3D-correspondence map patch is provided, in particular generated.

For example, step S130 may comprise providing, or generating, a correspondence model for each object 11, 12, 13, wherein the correspondence model is generated by texturing the 3D representation of the object 11, 12, 13 using a 2D correspondence texture consisting of a plurality of pixels, each pixel having a unique value combination in the first and the second correspondence color channels 36, 38. Texturing the 3D representation may be performed using a spherical projection of a 2D correspondence texture onto the 3D representation of the object 11, 12, 13. The 2D correspondence texture may e.g. be a 2D image with color intensity levels ranging from e.g. 1 to 255 for both Blue (first dimension) and Green (second dimension).

In the same way as described above for steps S110 and S120, using the same poses and the same black background, ground truth 2D-to-3D-correspondence map patches can then be generated, i.e. patches covering the objects in the respective pose but showing not the real RGB color values but the color combination shades of the 2D correspondence texture, which indicate the spatial arrangement of the pixels with respect to each other.

After performing steps S110, S120 and S130, ground truths for both the ID mask and the 2D-to-3D-correspondence map for single objects 11, 12, 13 in single poses are prepared, together with the corresponding RGB image 2D patch.
In a step S140, at least one background image, preferably a plurality of background images, is provided, for example images from the Microsoft (trademark) Common Objects in Context, MSCOCO, dataset. Varying the background images over a larger number of background images has the advantage that during training the encoder-decoder architecture does not overfit to the backgrounds. In other words, this ensures that the encoder-decoder architecture generalizes to different backgrounds and prevents it from overfitting to backgrounds seen during training. Moreover, it forces the encoder-decoder architecture to learn the model's features needed for pose estimation rather than to learn contextual features which might not be present in images when the scene changes.

Then, in a step S150, at least one of the provided RGB image 2D patches (comprising ideally all of the objects in all of the poses) is arranged onto at least one of the plurality of background images in order to generate a sample input image. Optionally, this sample input image is also augmented, e.g. by random changes in brightness, saturation, and/or contrast values, and/or by adding Gaussian noise.
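Purely as an illustration of such an augmentation, a minimal NumPy sketch; the jitter ranges and noise level are illustrative assumptions, not values prescribed by the method:

```python
import numpy as np

def augment(sample_rgb, rng=np.random):
    """Random brightness / contrast / saturation jitter plus Gaussian noise,
    applied only to the sample image (the ground truths stay untouched)."""
    img = sample_rgb.astype(np.float32) / 255.0
    img = img * rng.uniform(0.8, 1.2)                    # brightness
    img = (img - 0.5) * rng.uniform(0.8, 1.2) + 0.5      # contrast
    gray = img.mean(axis=-1, keepdims=True)
    img = gray + (img - gray) * rng.uniform(0.8, 1.2)    # saturation
    img = img + rng.normal(0.0, 0.02, img.shape)         # Gaussian noise
    return (np.clip(img, 0.0, 1.0) * 255).astype(np.uint8)
```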
Correspondingly, in a step S160, the corresponding ground truth 2D-to-3D-correspondence map patches are arranged onto a black background (with the same dimensions as the chosen background image), in the same positions and orientations as the RGB image 2D patches on the chosen background image, to provide a ground truth for the 2D-to-3D-correspondence map for the generated sample input image.

Correspondingly, in a step S170, the corresponding ground truth ID mask patches are arranged onto the black background, in the same positions and orientations as the RGB image 2D patches on the chosen background image, to provide a ground truth for the ID mask for the generated sample input image.

All of the steps S110 to S170, or at least steps S150 to S170, may be performed in an online fashion, i.e. performed dynamically whenever another training set is required when training an encoder-decoder architecture.
A system for providing training data may be provided, with an input interface for receiving the plurality of background images (and, optionally, for receiving additional data such as the 3D representations of the objects, or data based on which such 3D representations of the objects may be generated), a computing device configured to perform steps S110 to S170, and an output interface for outputting the sample input images together with their ground truth ID masks and their ground truth 2D-to-3D-correspondence maps as a training data set.

Preferably, the computing device may be realized as an online data generator running on multiple CPU threads, constantly putting prepared batches in a queue from which they are picked as inputs to an encoder-decoder architecture being trained.
The invention also provides a computer program comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention, in particular as described with respect to Fig. 2.

The invention also provides a non-transitory computer-readable data storage medium comprising executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention, in particular as described with respect to Fig. 2.

The invention also provides a data stream comprising (or configured to generate) executable program code configured to, when executed (e.g. by a computing device), perform the method according to the third aspect of the invention, in particular as described with respect to Fig. 2.

Fig. 3 schematically illustrates a method according to an embodiment of the fifth aspect, for training an encoder-decoder architecture for use in an embodiment of the method according to the first aspect and/or for use in an embodiment of the system according to the second aspect.
In a step S210, a plurality of tuples of corresponding sample input images, ground truth ID masks and ground truth 2D-to-3D-correspondence maps is provided. This may comprise performing the method according to the third aspect, in particular the method described with respect to Fig. 2.

Advantageously, step S210 is performed in an online fashion, i.e. tuples are generated continuously, possibly on multiple parallel CPU threads.

The tuples, in particular when they are generated based on real world data so that they are much more limited in number, may be divided into a train subset and a test subset which do not overlap. Preferably, between 10% and 20% of the tuples (preferably 15%) may be used for the train subset and the remainder for the test subset.

Preferably, the tuples for the same object are selected such that the relative orientation of the poses between them is larger than a predefined threshold. This guarantees, for a corresponding minimum number of poses, that the selected tuples cover each object from all sides.

In a step S220, an encoder-decoder convolutional artificial neural network 20 configured to receive the sample input images as input and to output both an ID mask and a 2D-to-3D-correspondence map as output is trained, the training being performed using a loss function penalizing deviations of the output from the ground truth ID mask and the ground truth 2D-to-3D-correspondence map.
For example, a composite loss function Lcomp, given by a weighted sum of individual loss functions for the mask loss Lm, the first correspondence color channel loss LU and the second correspondence color channel loss LV, may be used:

Lcomp = a x Lm + b x LU + c x LV,

wherein x denotes multiplication and a, b, and c are weight factors that may also be set to 1. The mask loss Lm indicates and penalizes loss caused by deviation of the result of the encoder-decoder architecture for the ID mask from the ground truth ID mask. The first and second correspondence color channel losses LU, LV indicate and penalize loss caused by deviation of the result of the encoder-decoder architecture for the 2D-to-3D-correspondence map from the ground truth 2D-to-3D-correspondence map.

LU and LV may be defined as multi-class cross-entropy functions. Lm may be a weighted version of a multi-class cross-entropy function, given e.g. by:

Lm(y, c) = wc x ( -yc + log( sum over j = 1 ... No+1 of exp(yj) ) ),

wherein yc is the output score of the class c, wc is the rescaling weight for the class c (first object / second object / ... / background), and No+1 is the total number of classes (i.e. the number of object classes, or different objects, plus 1 for the background class). The rescaling weights are preferably set to 0.01 for the background class and 1 for each object class.
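Purely as an illustration, a minimal PyTorch-style sketch of such a composite loss; the function and tensor names are illustrative, and it assumes that the background is the last class index and that the decoder heads output dense N x C x H x W logits:

```python
import torch
import torch.nn.functional as F

def composite_loss(mask_logits, u_logits, v_logits,
                   gt_mask, gt_u, gt_v, num_classes,
                   a=1.0, b=1.0, c=1.0):
    """Lcomp = a*Lm + b*LU + c*LV with a down-weighted background class.

    *_logits: N x C x H x W network outputs; gt_*: N x H x W integer targets.
    """
    class_weights = torch.ones(num_classes, device=mask_logits.device)
    class_weights[-1] = 0.01   # background assumed to be the last class index
    l_m = F.cross_entropy(mask_logits, gt_mask, weight=class_weights)  # weighted mask loss
    l_u = F.cross_entropy(u_logits, gt_u)   # first correspondence color channel loss
    l_v = F.cross_entropy(v_logits, gt_v)   # second correspondence color channel loss
    return a * l_m + b * l_u + c * l_v
```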
In the foregoing detailed description, various features are grouped together in one or more examples with the purpose of streamlining the disclosure. It is to be understood that the above description is intended to be illustrative and not restrictive. It is intended to cover all alternatives, modifications and equivalents. Many other examples will be apparent to one skilled in the art upon reviewing the above specification.
Fig. 4 schematically illustrates a computer program product 300 comprising executable program code 350. The executable program code 350 may be configured to perform, when executed (e.g. by a computing device), the method according to the first aspect. Alternatively, the executable program code 350 may be configured to perform, when executed (e.g. by a computing device), the method according to the third aspect, or the method according to the fifth aspect.
Fig. 5 schematically illustrates a non-transitory computer-readable data storage medium 400 comprising executable program code 450 configured to, when executed (e.g. by a computing device), perform the method according to the first aspect. Alternatively, the executable program code 450 may be configured to perform, when executed (e.g. by a computing device), the method according to the third aspect, or the method according to the fifth aspect.
Reference Signs
1 input image
10 input interface
11 first object (teddy bear)
12 second object (carton of eggs)
13 third object (camera)
14 background
20 encoder-decoder artificial neural network
22 encoder head
24 ID mask decoder head
26 first correspondence color channel decoder head
28 second correspondence color channel decoder head
31 2D-to-3D-correspondence map
32 combining module
34 ID mask
35 object identifier module
36 first correspondence color channel
38 second correspondence color channel
40 pose determining module
42 algorithm
44 artificial intelligence entity
50 output interface
51 pose estimation for first object
52 pose estimation for second object
53 pose estimation for third object
100 computing device
300 computer program product
350 program code
400 data storage medium
450 program code
1000 system
S110 to S170 method steps
S210 to S220 method steps

Patent Claims
1. A computer-implemented method for object detection and pose estimation within an input image (1), comprising the steps of:
receiving (S10) an input image (1);
inputting (S20) the received input image (1) into a trained encoder-decoder convolutional artificial neural network (20) comprising an encoder head (22), an ID mask decoder head (24), a first correspondence color channel decoder head (26) and a second correspondence color channel decoder head (28);
generating (S24), using the ID mask decoder head (24), an ID mask (34) identifying objects (11, 12, 13) and background (14) in the received input image (1);
generating (S26), using the first correspondence color channel decoder head (26), a first correspondence color channel (36) of a 2D-to-3D-correspondence map (31) for objects (11, 12, 13) within the received input image (1);
generating (S28), using the second correspondence color channel decoder head (28), a second correspondence color channel (38) of the 2D-to-3D-correspondence map (31);
generating (S32) the 2D-to-3D-correspondence map (31) using the generated first correspondence color channel (36) and the generated second correspondence color channel (38); and
determining (S40), for at least one object (11, 12, 13) identified by the ID mask (34), a pose estimation (51, 52, 53) based on the generated 2D-to-3D-correspondence map (31) and on a pre-generated bijective association of points of the object with unique value combinations in the first and the second correspondence color channels (36, 38).
2. The method of claim 1, wherein the first correspondence color channel (36) is a Blue color channel and/or wherein the second correspondence color channel (38) is a Green color channel.

3. The method of claim 1 or claim 2, wherein the determining (S40) of the pose estimation (51, 52, 53) uses a Perspective-n-Point, PnP, algorithm (42).
4. The method of claim 3, wherein the PnP algorithm (42) is used with a Random Sample Consensus, RANSAC, algorithm.
5. The method of any of claims 1 to 4, wherein the determining (S40) of the pose estimation (51, 52, 53) uses a trained artificial neural network entity (44) configured and trained to generate, from the ID mask (34) and the 2D-to-3D-correspondence map (31), probabilities for each of a plurality of 3D poses of the at least one object (11, 12, 13).

6. The method of any of claims 1 to 5, further comprising a step of generating the bijective association for at least one object (11, 12, 13), said bijective association being generated by texturing a 3D representation of the object using a 2D correspondence texture consisting of a plurality of pixels, each pixel having a unique value combination in the first and the second correspondence color channels (36, 38).

7. The method of claim 6, wherein texturing the 3D representation of the at least one object (11, 12, 13) is performed using a spherical projection.
8. A method for providing training data for training an encoder-decoder convolutional artificial neural network (20) for use in the method according to any of claims 1 to 7, comprising:
providing (S110), for each of a plurality of objects (11, 12, 13), a respective RGB image 2D patch corresponding to each of a plurality of poses of that object (11, 12, 13);
providing (S120), for each of the plurality of objects (11, 12, 13) and for each of the plurality of poses, a corresponding ground truth ID mask patch;
providing (S130), for each of the plurality of objects (11, 12, 13) and for each of the plurality of poses, a corresponding ground truth 2D-to-3D-correspondence map patch;
providing (S140) a plurality of background images;
arranging (S150) at least one of the provided RGB image 2D patches onto at least one of the plurality of background images in order to generate a sample input image;
correspondingly arranging (S160) the corresponding ground truth 2D-to-3D-correspondence map patches onto a black background to provide a ground truth for the 2D-to-3D-correspondence map for the generated sample input image; and
correspondingly arranging (S170) the corresponding ground truth ID mask patches onto the black background to provide a ground truth ID mask for the generated sample input image.
9. The method of claim 8, wherein a 3D representation for at least one object is provided, and wherein the 3D representation is rendered in each of the plurality of poses on the black background; wherein each of the resulting images is cropped; and wherein, based on each of the cropped resulting images, a corresponding RGB image 2D patch and a corresponding ground truth ID mask patch separating the RGB image 2D patch from the black background are generated.

10. The method of claim 9, wherein, when rendering the 3D representation, a corresponding depth channel is used to generate a depth map, and wherein the depth map is used to generate a bounding box used to crop each resulting image, respectively.
11. A method for training an encoder-decoder convolutional artificial neural network (20), comprising:
providing a plurality of tuples of corresponding sample input images, ground truth ID masks and ground truth 2D-to-3D-correspondence maps; and
training an encoder-decoder convolutional artificial neural network (20) configured to receive the sample input images as input and to output both an ID mask (34) identifying objects within the sample input images and a 2D-to-3D-correspondence map (31) as output, the training being performed using a loss function penalizing deviations of the output from the ground truth ID mask and the ground truth 2D-to-3D-correspondence map.

12. The method of claim 11, wherein the plurality of tuples is provided continuously during training.
13. System (1000) for object detection and pose estimation within an input image (1), comprising:
an input interface (10) for receiving an input image
(l) ;
a computing device (100) configured to implement a trained encoder-decoder convolutional artificial neural net work (20) comprising an encoder head (28), an ID mask decoder head (22), a first correspondence color channel decoder head (24) and a second correspondence color channel decoder head (26) ;
wherein the ID mask decoder head (24) is configured and trained to generate (S24) an ID mask (34) identifying objects (11, 12, 13) and background (14) in the received input image (10) ;
wherein the first correspondence color channel decoder head (26) is configured and trained to generate (S26) a first correspondence color channel (36) of a 2D-to-3D-correspondence map (31) for objects (11, 12, 13) within the received input image (1);
wherein the second correspondence color channel decoder head (28) is configured and trained to generate (S28) a second correspondence color channel (38) of the 2D-to-3D-correspondence map (31);
wherein the computing device is further configured to implement a combining module (32) and a pose determining module (40);
wherein the combining module (32) is configured to generate (S32) the 2D-to-3D-correspondence map (31) using the generated first correspondence color channel (36) and the generated second correspondence color channel (38); and
wherein the pose determining module (40) is configured to determine (S40), for an object (11, 12, 13) identified by the ID mask (34), a pose estimation (51, 52, 53) based on the generated 2D-to-3D-correspondence map (31) and on a pre-generated bijective association of points of the object with unique value combinations in the first and the second correspondence color channels (36, 38).
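To make the data flow of the claimed system more concrete, the sketch below combines the two predicted correspondence colour channels into pixel-wise 2D-to-3D correspondences and recovers a 6-DoF pose with RANSAC-based PnP. The lookup table uv_to_xyz standing in for the pre-generated bijective association, the camera matrix handling and the use of OpenCV's solvePnPRansac are illustrative assumptions, not features recited in the claim.

```python
import numpy as np
import cv2  # OpenCV is used here only for the PnP solver

def estimate_pose(id_mask, corr_u, corr_v, obj_id, uv_to_xyz, camera_matrix):
    """Estimate the pose of one object identified by the ID mask.

    uv_to_xyz maps (u, v) colour-value pairs to 3D model points, i.e. it plays
    the role of the pre-generated bijective association of object points with
    unique value combinations in the two correspondence colour channels."""
    ys, xs = np.nonzero(id_mask == obj_id)
    points_2d, points_3d = [], []
    for y, x in zip(ys, xs):
        key = (int(corr_u[y, x]), int(corr_v[y, x]))
        if key in uv_to_xyz:                       # combine the two channels into the map
            points_2d.append([x, y])
            points_3d.append(uv_to_xyz[key])

    if len(points_2d) < 6:                         # too few correspondences for a stable PnP fit
        return None

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float32),
        np.asarray(points_2d, dtype=np.float32),
        camera_matrix, None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                     # rotation matrix and translation vector
    return R, tvec
```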
14. Computer program product (300) comprising executable program code (350) configured to, when executed by a computing device (100), perform the method according to any of claims 1 to 7.
15. Non-transitory, computer-readable data storage medium (400) comprising executable program code (450) configured to, when executed by a computing device (100), perform the method according to any of claims 1 to 7.
PCT/EP2020/051136 2019-02-01 2020-01-17 Dense 6-dof pose object detector WO2020156836A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US17/427,231 US11915451B2 (en) 2019-02-01 2020-01-17 Dense 6-DoF pose object detector
CN202080024971.3A CN113614735A (en) 2019-02-01 2020-01-17 Dense 6-DoF pose object detector
EP20702229.4A EP3903226A1 (en) 2019-02-01 2020-01-17 Dense 6-dof pose object detector

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP19155034 2019-02-01
EP19155034.2 2019-02-01

Publications (1)

Publication Number Publication Date
WO2020156836A1 true WO2020156836A1 (en) 2020-08-06

Family

ID=65279418

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/051136 WO2020156836A1 (en) 2019-02-01 2020-01-17 Dense 6-dof pose object detector

Country Status (4)

Country Link
US (1) US11915451B2 (en)
EP (1) EP3903226A1 (en)
CN (1) CN113614735A (en)
WO (1) WO2020156836A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220414928A1 (en) * 2021-06-25 2022-12-29 Intrinsic Innovation Llc Systems and methods for generating and using visual datasets for training computer vision models
CN116894907B (en) * 2023-09-11 2023-11-21 菲特(天津)检测技术有限公司 RGBD camera texture mapping optimization method and system


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2899363B1 (en) * 2006-03-31 2009-05-29 Imra Europ Sas Soc Par Actions METHOD AND DEVICE FOR DETECTING THE MOVEMENT OF OBJECTS ON IMAGES OF A SCENE
US10235606B2 (en) * 2015-07-22 2019-03-19 Siemens Healthcare Gmbh Method and system for convolutional neural network regression based 2D/3D image registration
WO2019005999A1 (en) * 2017-06-28 2019-01-03 Magic Leap, Inc. Method and system for performing simultaneous localization and mapping using convolutional image transformation
US10733755B2 (en) * 2017-07-18 2020-08-04 Qualcomm Incorporated Learning geometric differentials for matching 3D models to objects in a 2D image
US10769411B2 (en) * 2017-11-15 2020-09-08 Qualcomm Technologies, Inc. Pose estimation and model retrieval for objects in images
US10695911B2 (en) * 2018-01-12 2020-06-30 Futurewei Technologies, Inc. Robot navigation and object tracking

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137644A1 (en) 2016-11-11 2018-05-17 Qualcomm Incorporated Methods and systems of performing object pose estimation

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
FENG, Y.; WU, F.; SHAO, X.; WANG, Y.; ZHOU, X.: "Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network", in: Computer Vision - ECCV 2018, Lecture Notes in Computer Science, vol. 11218, Springer, 2018
K. HE; X. ZHANG; S. REN; J. SUN: "Deep Residual Learning for Image Recognition", in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pages 770-778
W. KEHL; F. MANHARDT; F. TOMBARI; S. ILIC; N. NAVAB: "SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again", in: Proceedings of the International Conference on Computer Vision (ICCV 2017), Venice, Italy, pages 22-29
YINLIN HU ET AL: "Segmentation-driven 6D Object Pose Estimation", arXiv.org, Cornell University Library, Ithaca, NY, 6 December 2018 (2018-12-06), XP081199992 *
Z. ZHANG: "A flexible new technique for camera calibration", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, December 2000 (2000-12-01), pages 1330-1334
ZAKHAROV, SERGEY ET AL: "DPOD: 6D Pose Object Detector and Refiner", 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 27 October 2019 (2019-10-27), pages 1941-1950, XP033723971, DOI: 10.1109/ICCV.2019.00203 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4064125A1 (en) 2021-03-22 2022-09-28 Siemens Aktiengesellschaft Multi-dimensional object pose regression
WO2022200082A1 (en) 2021-03-22 2022-09-29 Siemens Aktiengesellschaft Multi-dimensional object pose regression
CN113393522A (en) * 2021-05-27 2021-09-14 湖南大学 6D pose estimation method based on depth information regressed from a monocular RGB camera
EP4242981A1 (en) 2022-03-11 2023-09-13 Siemens Aktiengesellschaft Multi-stage object pose estimation
WO2023169838A1 (en) 2022-03-11 2023-09-14 Siemens Aktiengesellschaft Multi-stage object pose estimation

Also Published As

Publication number Publication date
US20220101639A1 (en) 2022-03-31
US11915451B2 (en) 2024-02-27
EP3903226A1 (en) 2021-11-03
CN113614735A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
WO2020156836A1 (en) Dense 6-dof pose object detector
Rukhovich et al. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection
Zakharov et al. Dpod: 6d pose object detector and refiner
JP7178396B2 (en) Method and computer system for generating data for estimating 3D pose of object included in input image
CN104395932B (en) Method for registering data
CN111292408B (en) Shadow generation method based on attention mechanism
Joshi et al. Deepurl: Deep pose estimation framework for underwater relative localization
US20220327730A1 (en) Method for training neural network, system for training neural network, and neural network
Gao et al. Local feature performance evaluation for structure-from-motion and multi-view stereo using simulated city-scale aerial imagery
DE102022100360A1 (en) MACHINE LEARNING FRAMEWORK APPLIED IN A SEMI-SUPERVISED SETTING TO PERFORM INSTANCE TRACKING IN A SEQUENCE OF IMAGE FRAMES
DE102022113244A1 (en) Joint shape and appearance optimization through topology scanning
CN114581571A (en) Monocular human body reconstruction method and device based on IMU and forward deformation field
CN117036612A (en) Three-dimensional reconstruction method based on nerve radiation field
Jiang et al. H2-Mapping: Real-time Dense Mapping Using Hierarchical Hybrid Representation
KR20230150867A (en) Multi-view neural person prediction using implicit discriminative renderer to capture facial expressions, body posture geometry, and clothing performance
CN115018979A (en) Image reconstruction method, apparatus, electronic device, storage medium, and program product
Berenguel-Baeta et al. Fredsnet: Joint monocular depth and semantic segmentation with fast fourier convolutions
US20240037788A1 (en) 3d pose estimation in robotics
KR20220123091A (en) Image processing methods, devices, devices and storage media
Aziz et al. Evaluation of visual attention models for robots
Berenguel-Baeta et al. Fredsnet: Joint monocular depth and semantic segmentation with fast fourier convolutions from single panoramas
US20220058484A1 (en) Method for training a neural network to deliver the viewpoints of objects using unlabeled pairs of images, and the corresponding system
CN115205487A (en) Monocular camera face reconstruction method and device
CN113159158A (en) License plate correction and reconstruction method and system based on generation countermeasure network
WO2022139784A1 (en) Learning articulated shape reconstruction from imagery

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20702229; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2020702229; Country of ref document: EP; Effective date: 20210728)