WO2019192745A1 - Object recognition from images using CAD models as prior - Google Patents

Object recognition from images using CAD models as prior

Info

Publication number
WO2019192745A1
WO2019192745A1 (PCT/EP2018/080427)
Authority
WO
WIPO (PCT)
Prior art keywords: network, image, recognition, cluttered, task
Application number
PCT/EP2018/080427
Other languages
French (fr)
Inventor
Benjamin PLANCHE
Sergey Zakharov
Andreas Hutter
Slobodan Ilic
Ziyan Wu
Original Assignee
Siemens Aktiengesellschaft
Application filed by Siemens Aktiengesellschaft
Publication of WO2019192745A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components, by matching or filtering
    • G06V 10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters, with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The invention relates to a method for recovering an object from a cluttered image. The invention also relates to a computer program product and a computer-readable storage medium comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the mentioned method. Further, the invention relates to methods for training a recognition system to recover an object from such a cluttered image. In addition, the invention relates to such a recognition system.

Description

Specification
Object recognition from images using CAD models as prior
The invention relates to a method for recovering an object from a cluttered image. The invention also relates to a computer program product and a computer-readable storage medium comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the mentioned method. Further, the invention relates to methods for training a recognition system to recover an object from such a cluttered image. In addition, the invention relates to such a recognition system.
A reliable machine-based recognition of an object from an image, as obtained e.g. from a photo or video camera, is a challenging task. Known recognition systems typically comprise detection means, such as a camera, and a computer-implemented method by which the nature - in other words, the class or category - of the object or the pose of the object can be recognized. The recognition system shall, for instance, be able to recognize whether e.g. a cat, a car or a cactus is depicted in the image and/or what the pose of the object is relative to the camera.
As a concrete example, the recognition system receives a color image as input. One object (e.g. a cat) of a predetermined set of objects (e.g. cat, car and cactus) is depicted in the input image. The object is depicted in a cluttered way, i.e. it is depicted before a specific background, at specific lighting conditions, partially occluded, noisy, etc. The task for the recognition system is to tell which one of the predetermined objects is actually depicted in the input image (here, the cat). This task is also referred to as "object classification".
Another exemplary task for the recognition system would be to evaluate whether the cat is shown from the front, the back or from the side. This task is also referred to as "pose estimation".
Another exemplary task for recognition systems would be to detect objects (single or multiple) in an image, e.g. by defining bounding boxes. This task is also referred to as "object detection". Thus, the difference between object classification and object detection is that the latter only identifies that some object is depicted in the image, while the former also determines the class (or instance) of the object.
Yet another common task for recognition systems in this context would be to determine how many cats are actually depicted in the image, even if they partially mask, i.e. occlude, each other. This task is also referred to as "object counting".
As the recognition system shall in real life be capable of autonomously recovering the object from an unseen cluttered image, it needs to be trained beforehand.
A traditional approach to training the recognition system is to train it with a large amount of real, cluttered images depicting cats with, for instance, different appearances and before different backgrounds. This means that a large amount of labelled images of cats (and cars and cacti) needs to be provided in order to train the recognition system.
Apart from the fact that the provision of large amounts of real, labelled training images is a time-consuming and tedious task, it may even be impossible in certain circumstances. For instance, in industrial applications where components of a machine need to be identified by the recognition system, it would not be acceptable to build up a collection of training images of the components of the machine, in particular if the machine is unique because it is customized.
To solve the problem of lacking real training data, it has been proposed to train the recognition system purely on synthetic images. In contrast to a real image, a synthetic image is obtained by simulation based on certain input data. Input data which, at least in industrial applications, is widely available are computer-aided design (CAD) models of the components of, for instance, the machine which shall be recognized.
CAD models usually only contain purely semantic and geometrical information, i.e. they do not contain any visual information. In other words, CAD models as such are assumed to be texture-less. Texture information as well as lighting and shading information would only be contained in an image after a rendering process, which is understood as the process of generating an image (or "scene") containing the geometry, viewpoint, texture, lighting and shading information based on a 2D or 3D model.
The present invention focusses on texture-less CAD models as input data, i.e. as priors, for the generation of training data. It is known to generate color images from texture-less CAD models. These color images can be used as training images for the recognition system. The training images, which are cluttered color images comprising the object to be recognized before a background and including lighting, shading, textures, noise, occlusions, etc., are obtained conventionally using a graphics processing unit (GPU). The recognition system subsequently uses the synthetically generated cluttered color images as input images during its training phase.
Hence, the recognition system has the task to identify the desired feature of the object (e.g. class or pose) from the synthetic cluttered color image. This training is deemed to be a supervised training, as the results of the recognition system (e.g. the statement that a cactus and not a car is depicted in the cluttered image) are compared with the true results, which are known, as they represent the input CAD models which have been used to generate the cluttered images. After many iteration steps, which are carried out during the training phase, the recognition system gets more and more accurate in determining the required features of the object being depicted in the synthetic cluttered image.
After the recognition system is trained, it can be used for identifying the nature and/or features of the object in unseen, real cluttered images. Generally, the object to be recognized in the image needs to be one for which the recognition system has been trained during the training phase. Depending on the training level of the recognition system, the recognition system can thus be used to more or less accurately determine the desired features on unseen, real cluttered color images.
A severe and well-known problem for computer vision methods that rely on synthetic data is the so-called realism gap, as the knowledge acquired on these modalities usually poorly translates to the more complex real domain, resulting in a dramatic accuracy drop. Several ways to tackle this issue have been investigated so far.
A first obvious solution is to improve the quality and realism of the synthetic models. Several works try to push forward simulation tools for sensing devices and environmental phenomena. State-of-the-art depth sensor simulators work fairly well, for instance, as the mechanisms impairing depth scans have been well studied and can be rather well reproduced. In the case of color data, however, the problem lies not in the sensor simulation but in the actual complexity and variability of the color domain (e.g. sensitivity to lighting conditions, texture changes with wear-and-tear, etc.). This makes it extremely arduous to come up with a satisfactory mapping, unless precise, exhaustive synthetic models are provided (e.g. by capturing realistic textures). Proper modeling of the target classes is, however, often not enough, as recognition methods would also need information on their environment (background, occlusions, etc.) to be applied to real-life scenarios.
For this reason, and complementing simulation tools, recent methods based on convolutional neural networks (CNNs) try to further bridge the realism gap by learning a mapping from rendered to real data, directly in the image domain. Mostly based on unsupervised conditional generative adversarial networks (GANs) (such as Bousmalis et al.: "Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks", arXiv:1612.05424) or style-transfer solutions, these methods still need a set of real samples to learn their mapping.
There are circumstances, however, where the provision of real samples is impossible or only possible with considerable effort.
It is thus an objective of the present invention to provide a recognition system under the constraint that the only inputs available are texture-less CAD models.
This objective is achieved by the concept disclosed in the independent claims. Advantageous embodiments and variations are described in the dependent claims and the drawings accompanying the description.
According to one aspect of the invention, there is provided a method to train a task-specific recognition network. The objective of the task-specific recognition network is to recover an object from a cluttered image. The recognition network comprises an artificial neural network. The method comprises the following steps:
receiving synthetic cluttered images as input, wherein the cluttered images are the output of an augmentation pipeline which augments synthetic normal maps into synthetic cluttered images;
giving a low-dimensional vector as output, wherein the low-dimensional vector describes an aspect of the object;
converting the low-dimensional vector into a normal map by means of a renderer network, wherein the renderer network comprises an artificial neural network;
comparing the normal map given by the renderer network and the normal map fed into the augmentation pipeline; and
optimizing the neural network of the recognition network and the neural network of the renderer network such that a task-specific loss of the recognition network, a segmentation loss of the renderer network and a foreground loss of the renderer network are minimized, as sketched below.
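The interplay of the two networks and the losses during one training step can be illustrated in code. The following Python/PyTorch fragment is a minimal sketch under stated assumptions, not the claimed implementation: the network interfaces (a recognition network returning a feature vector, class scores and a pose estimate; a renderer network mapping the feature vector back to a normal map), the unweighted sum of the losses and the helper names are illustrative only. The individual loss terms are sketched further below.

import torch

def training_step(G_rec, G_ren, augment, clean_normal_maps, gt_classes, gt_poses, optimizer):
    # Augmentation pipeline: clean synthetic normal maps -> cluttered color images
    cluttered_images = augment(clean_normal_maps)

    # Recognition network: cluttered image -> low-dimensional output vector
    features, class_logits, pose_estimates = G_rec(cluttered_images)

    # Renderer network: low-dimensional vector -> recovered normal map
    recovered_maps = G_ren(features)

    # Task-specific losses of the recognition network ...
    loss = class_loss(class_logits, gt_classes) + pose_loss(pose_estimates, gt_poses)
    # ... plus segmentation and foreground losses of the renderer network,
    # computed against the normal maps originally fed into the pipeline
    loss = loss + segmentation_loss(recovered_maps, clean_normal_maps)
    loss = loss + foreground_loss(recovered_maps, clean_normal_maps)

    # One optimizer over the parameters of BOTH networks, so that all
    # losses are jointly minimized
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()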
In the context of this patent application, "recovering" an object comprises recognizing (i.e. determining) the class of the object (sometimes also referred to as the "instance" of the object), its pose relative to the camera, or other properties of the object. Note that the present patent application is not directed towards the detection of an object in a cluttered image, wherein "object detection" is understood as figuring out if and where an object is represented somewhere in a larger image. For "object detection", usually a 2D bounding box containing the detected object is returned. For instance, if an image of a green field with several cows in it is given, a "cow-detector" would return bounding boxes for each cow representation in this picture. For "object classification", it is assumed that the image contains only one element, and the classifier must return the proper label for the whole image (e.g. picture of "cow" or picture of "cat").
While object detection can be seen as a more difficult task, most current detection methods are class-specific, i.e. they are trained to detect only one class of objects (e.g. cow-detector, cat-detector). This means that as many detectors as target classes are conventionally needed, and all these different detectors have to be applied to each image (e.g. given an image, apply the cow-detector to spot cows in it, then the cat-detector to spot cats, etc.).
As mentioned above, the present solution aims, however, at classifying instead of detecting the object(s) in the image. Alternatively or additionally, the present solution is directed at estimating the pose of the object relative to the camera.
Artificial neural networks (ANNs) are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Artificial neural networks "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules.
An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.
In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called 'edges'. Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), oftentimes passing through a multitude of hidden layers in-between.
A task-specific recognition network trained in the way described above is advantageously used to recover an object from an unseen, real cluttered image.
A "cluttered" image of an object is understood as an image wherein some kind of disturbance, in other words nuisance, has been added to. The "clutter" comprises, for instance, a background behind the object which is depicted in the image, noise in the illustration of the object and/or the back ground, shadows, surface texture, blurring, rotation, trans lation, flipping or resizing of the object, partial occlu sions of the object and changing the color of the object. In contrast to cluttered images, a normal map or a depth map which is directly obtained from the CAD model and does as such not contain any clutter, is also referred to as a
"clean" normal or depth map, respectively.
A "CAD model" (computer-aided design model) , which is some times also referred to as "CADD model" (computer-aided design and drafting model) , is understood as a design for which cre ation computer systems including workstations have been used. CAD output is often in the form of electronic files for printing, machining, or other manufacturing operations. CAD models may generally be two-dimensional (2D) or three- dimensional (3D) .
A "texture-less CAD model" is understood as a CAD model which only contains pure semantic and geometrical information, but no information regarding e.g. its appearance (color, texture, material type) , scene (position of light sources, cameras, peripheral objects) or animation (how the model moves, if this is the case) . It will be one of the tasks of the augmen tation pipeline to add random appearance or scene features to the clean normal map of the texture-less CAD model. Texture information basically includes the color information, the surface roughness, and the surface shininess for each point of the object's surface. Note that for many 3D models some parts of the objects are only distinguishable because of the changes in the texture information is known for each point of the object's surface.
Note that the constraint of only reverting to texture-less CAD models during the training of the recognition network is challenging. To solve this task successfully, an advanced augmentation pipeline which applies random textures to the generated images is advantageously used. A benefit of training the recognition network with random textures, instead of being able to rely on real texture information, is that the recognition network is assumed to be robust against changes in the texture of the real objects (e.g. caused by dirtiness, wear-and-tear, repainting, etc.). This property is also referred to as being "texture-blind". The overall color of objects, but also the changes and irregularities in their texture or between their different parts, is usually really useful information for recognition; hence the trickiness of training recognition methods without it.
A normal map is a representation of the surface normals of a 3D model from a particular viewpoint, stored in a two-dimensional colored image, also referred to as an RGB (i.e. red/green/blue) image. Herein, each color corresponds to the orientation of the surface normal. A "normal map" therefore creates the impression of a three-dimensional image, but only occupies little storage space. Normal mapping, sometimes also referred to as "Dot3 bump mapping", is a known technique from 3D computer graphics. It is primarily used in video games.
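The storage of normals as colors can be made concrete with a few lines of code. This is a minimal sketch assuming the common convention that each unit-normal component in [-1, 1] is mapped linearly to an 8-bit channel in [0, 255]; the text does not prescribe a particular encoding.

import numpy as np

def normals_to_rgb(normals: np.ndarray) -> np.ndarray:
    # Encode an (H, W, 3) array of unit surface normals as an RGB image:
    # each component in [-1, 1] is mapped linearly to [0, 255], so a normal
    # facing the camera, (0, 0, 1), becomes the bluish pixel (127, 127, 255).
    return ((normals + 1.0) * 0.5 * 255.0).astype(np.uint8)

def rgb_to_normals(image: np.ndarray) -> np.ndarray:
    # Inverse mapping: 8-bit RGB back to (approximately) unit normals.
    n = image.astype(np.float32) / 255.0 * 2.0 - 1.0
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-8)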
"Augmenting" refers to the transformation of a normal map in to a color image by means of adding certain types of nuisanc es (or "clutter") to the clean normal map. Note that by transforming a normal map into a color image, on the one hand information is lost as no more precise 3D representation of the object surface is present at the color image. On the oth er hand, external information is added to the image, due to e.g. lighting condition and texture information. If, however, the added information is randomly defined, it is more noise than semantic information.
One important aspect of the present invention is that the claimed method does not target building realistic images from the texture-less CAD models for the training of the task-specific recognition network. Instead, the recognition network is purely trained on synthetic data, namely synthetic normal maps, which are directly obtained from the CAD models.
Another aspect is that a normal map is created from the texture-less input CAD model during the training of the recognition network. The creation of normal maps instead of images has the enormous advantage that it can be carried out by the central processing unit (CPU) of the recognition system instead of by the GPU. The consequence is that the created normal maps do not need to be stored separately. Instead, they can be used directly by the recognition network. Therefore, the generation and processing of the normal maps from the CAD models according to the invention can be referred to as an "online" process, while the conventional process of generating and processing images from CAD models can be referred to as an "offline" process.
In the following, the recognition network and the renderer network are described in more detail.
The task of the recognition network is to estimate an aspect of the object, such as its class (in other words, its category or instance) or its pose relative to the camera. The result is output as a low-dimensional vector. The estimation is achieved by using an artificial neural network with a multitude of layers. The ANN can, for instance, be designed as a convolutional neural network.
The deviation of the estimated class from the real class, the so-called ground truth class, is quantified by a value which is referred to as the class loss. Likewise, the deviation of the estimated pose from the real pose, i.e. the ground truth pose, is quantified by a value which is referred to as the pose loss. It is desired to minimize these losses during optimization of the recognition network.
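The two task-specific losses can be instantiated, for example, as follows. This is a hedged sketch: cross-entropy is a standard choice for the class loss, and the pose is assumed here to be parametrized as a unit quaternion, which the text does not mandate.

import torch
import torch.nn.functional as F

def class_loss(class_logits: torch.Tensor, true_class: torch.Tensor) -> torch.Tensor:
    # L_c: cross-entropy between predicted class scores and the ground-truth class.
    return F.cross_entropy(class_logits, true_class)

def pose_loss(pred_quat: torch.Tensor, true_quat: torch.Tensor) -> torch.Tensor:
    # L_p: distance between estimated and ground-truth pose. Unit quaternions
    # are assumed (true_quat already normalized); 1 - |<q1, q2>| is invariant
    # to the q / -q ambiguity of quaternion rotations.
    pred = F.normalize(pred_quat, dim=-1)
    dot = (pred * true_quat).sum(dim=-1).abs()
    return (1.0 - dot).mean()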
A task-specific recognition of certain aspects of the object(s) which are depicted in the cluttered images can be achieved with a certain accuracy by applying the recognition network as described above.
The inventors of the present invention have realized, though, that the provision of a second artificial neural network acting as a refinement of the recognition network may greatly enhance the accuracy of the results. Therefore, a renderer network comprising an artificial neural network is provided. The renderer network takes the low-dimensional vector given by the recognition network as input and converts it to a normal map. The conversion is carried out and supported by a trained ANN. Training of the renderer network is realized by comparing the normal map which is output by the renderer network with the normal map which initially has been fed into the augmentation pipeline. In a more detailed and more formal way, two further losses are introduced, namely a segmentation loss and a foreground loss. The segmentation loss describes whether the contour (or shape) of the object is correctly predicted by the normal map given by the renderer network.
The segmentation loss describes, in other words, whether the object is correctly distinguished from the background.
The foreground loss describes whether the normal vectors of the object correctly represent the ground truth shape of the respective object.
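The text does not spell out the exact formulas, but a plausible instantiation of the two losses is sketched below, assuming normal maps as (N, 3, H, W) tensors whose background pixels are zero: the segmentation loss compares object silhouettes, and the foreground loss compares the normal vectors themselves, restricted to the ground-truth object region.

import torch
import torch.nn.functional as F

def segmentation_loss(pred_map: torch.Tensor, gt_map: torch.Tensor) -> torch.Tensor:
    # L_g: is the object's silhouette (object vs. background) predicted
    # correctly? The masks are derived by thresholding (an assumption);
    # binary cross-entropy compares the two silhouettes.
    gt_mask = (gt_map.abs().sum(dim=1, keepdim=True) > 0).float()
    pred_mask = pred_map.abs().sum(dim=1, keepdim=True).clamp(0.0, 1.0)
    return F.binary_cross_entropy(pred_mask, gt_mask)

def foreground_loss(pred_map: torch.Tensor, gt_map: torch.Tensor) -> torch.Tensor:
    # L_f: do the predicted normals match the ground-truth shape on the object?
    # Masked L1 distance over the ground-truth foreground (an assumption).
    gt_mask = (gt_map.abs().sum(dim=1, keepdim=True) > 0).float()
    return (F.l1_loss(pred_map, gt_map, reduction="none") * gt_mask).mean()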
By introducing these two further losses and feeding this input back to the recognition network, the recognition network learns to estimate even more correctly the task which is set to it.
In an embodiment of the invention, yet another loss mechanism is introduced: the triplet loss. For this purpose, three images are compared with each other: an anchor image, a puller image and a pusher image. The anchor image is the image which shall be evaluated by the recognition network. It represents, i.e. illustrates, a specific object which needs to be recognized. The puller image comprises an object which is similar to the anchor image, while the pusher image depicts an object which is clearly different in class and/or pose in relation to the anchor image. The recognition network produces a feature vector for each of the images. The task of the recognition network is to increase the Euclidean distance between the feature vectors of the anchor image and the pusher image, as well as to decrease the Euclidean distance between the feature vectors of the anchor image and the puller image. In other words, the recognition network shall output closely-spaced, i.e. similar, feature vectors when comparing the anchor and puller image. In contrast, if the anchor image and pusher image are given to the network, it shall group them distinctly apart from each other by means of attributing clearly different feature vectors.
An advantage of evaluating these triplets and additionally minimizing the triplet loss is that the task given to the recognition network is solved even more accurately, as sketched below.
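A minimal sketch of this triplet loss, using squared Euclidean distances and a margin hinge; the margin value 0.2 is an assumption, as the text fixes neither the distance variant nor a margin.

import torch
import torch.nn.functional as F

def triplet_loss(f_anchor: torch.Tensor, f_puller: torch.Tensor,
                 f_pusher: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    # L_t: pull the anchor's feature vector towards the puller's and push it
    # away from the pusher's; zero once the pusher is at least `margin`
    # farther away than the puller.
    d_pull = (f_anchor - f_puller).pow(2).sum(dim=-1)
    d_push = (f_anchor - f_pusher).pow(2).sum(dim=-1)
    return F.relu(d_pull - d_push + margin).mean()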
After the recognition network has been trained, it can be used to recover an object from a real, unseen cluttered image. The recognition network has the potential to determine the required features, e.g. class or pose of the depicted object, with a high accuracy. It has to be borne in mind, however, that the training in principle needs to be carried out again if new objects with new CAD models as input shall be recovered by the network.
In particular, the object is directly recovered from the cluttered image by applying the trained task-specific recognition network on the cluttered image.
If, exemplarily, the task of the recognition network is to estimate the pose of the depicted object relative to the camera, a "direct recovery" means that the low-dimensional vector, which is also referred to as the "feature vector", is in particular not compared to a "codebook", i.e. a dictionary of feature vectors with their pose labels generated at training time. Instead, the recognition network directly processes the image patches to return the explicit class and pose. Consequently, there is no need to generate and store a huge codebook for each class.
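At use time, direct recovery therefore reduces to a single forward pass, with no nearest-neighbour search in a codebook. A minimal sketch, reusing the recognition-network interface assumed in the training-loop sketch above:

import torch

@torch.no_grad()
def recover(G_rec, real_cluttered_image: torch.Tensor):
    # One forward pass of the trained recognition network; the output is
    # decoded into an explicit class and pose, not matched against a
    # stored dictionary of labelled feature vectors.
    features, class_logits, pose_estimate = G_rec(real_cluttered_image.unsqueeze(0))
    return class_logits.argmax(dim=-1).item(), pose_estimate.squeeze(0)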
Embodiments of the invention are now described, by way of example only, with the help of the accompanying drawings, of which:
Figure 1 shows a recognition system according to the prior art; and
Figure 2 shows a recognition system according to an embodiment of the invention.
Figure 1 illustrates a method for recognizing an object from an image according to the prior art. In a first phase, a recognition system T' is trained. Therefore, this phase is referred to as the training phase 110. After the training has finished, in a second phase, the trained recognition system T' is used for recognizing an object from a cluttered image 121, which is unknown to the recognition system and which is a real, cluttered image. Therefore, the second phase is referred to as the use phase 120.
During the training phase 110, synthetic cluttered images 112 are fed into the recognition system T'. The cluttered images 112 are obtained from texture-less CAD models 111. The creation of the cluttered images 112 based on the CAD model 111 is carried out by a graphics processing unit (GPU), a processor which is designed for creating graphics, i.e. images, purely from CAD model data. The images are stored at a memory space of the recognition system.
Note that the cluttered image 112 does not only display the object of the CAD model 111 as such. Generally, a texture and a color are given to the object; shading due to a simulated lighting of the object is considered; the object may be partly occluded; other objects may be displayed in the same image; the entire image contains noise; and the image generally contains a background. Therefore, the image is referred to as a cluttered image 112. The cluttering may be chosen fully randomly; certain constraints, e.g. for the occlusions or the noise, are, however, possible.
For every object which shall be accurately recognized by the recognition network in the use phase, a significant number of cluttered images are simulated by the GPU. The perspective from which the object is seen is identical for each simulated image in the first place; however, the "clutter", i.e. the background, lighting, noise, etc., is different for every image.
In addition, the perspective from which the object is seen is changed. A hemisphere around and above the CAD model of the object is virtually created and a desired number of viewpoints are defined. For each viewpoint, i.e. for each perspective, a significant number of cluttered images are simulated by the GPU, as described above. By this procedure, a large number of images depicting the same object from different viewpoints with different "clutter" is obtained.
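Such a set of viewpoints can be generated, for instance, with a regular azimuth/elevation grid. The sketch below is one possible sampling; the text only requires "a desired number of viewpoints" on a hemisphere, so the grid and the elevation range are assumptions.

import numpy as np

def hemisphere_viewpoints(n_azimuth: int = 12, n_elevation: int = 4,
                          radius: float = 1.0) -> np.ndarray:
    # Camera positions on a virtual hemisphere around and above the object.
    # The elevation range of 10..80 degrees is an arbitrary example choice
    # that avoids duplicate samples at the pole.
    azimuths = np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False)
    elevations = np.linspace(np.deg2rad(10.0), np.deg2rad(80.0), n_elevation)
    points = [
        (radius * np.cos(el) * np.cos(az),
         radius * np.cos(el) * np.sin(az),
         radius * np.sin(el))
        for el in elevations
        for az in azimuths
    ]
    return np.asarray(points)  # shape: (n_elevation * n_azimuth, 3)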
The recognition network T' analyzes the synthetic cluttered images 112, wherein a specific task is set for the recognition network. For instance, the task could be to recognize the nature, i.e. the class or category, of the object, e.g. whether the object being depicted in the cluttered image is a cow, a cat or a cactus. In this case, the recognition network needs to be trained with the CAD models of all mentioned objects (here, cow, cat and cactus). Another task for the recognition network could be to identify the pose of the object, namely whether the object is depicted in a top view, from the front, the back or from one of the sides (in case the object has well-defined front, back, top and bottom sides). As the algorithm of the recognition network depends on the task that the recognition network is expected to solve during the use phase, the recognition network is also referred to as the task-specific recognition network T'.
Note that a drawback of the described prior art concept is that every generated image needs to be stored at a memory space of the recognition system. After being stored in the system, it can immediately be fed into the recognition network. Alternatively, this can be done after all images have been created.
The recognition network T' is trained in a supervised manner. It must make its decision regarding the task given to it and transmit or display its output 113, e.g. the class or the pose of the object. As the recognition system inherently knows the solution to the task, the output 113 can be evaluated automatically. Thus, the evaluation of the accuracy of the recognition system can be carried out by the recognition system itself.
After the recognition network T' is trained to a sufficient degree, the use phase 120 can commence. Herein, images 121, which are unknown to the recognition network T', are given as an input to the recognition network T'. Obviously, the images are cluttered, and the images are real instead of synthetic. However, due to the training phase 110 of the recognition network T', a reasonable accuracy of the recognition network T' can be achieved. Still, the already mentioned drawbacks persist: a finite number of training data, which need to be stored at a storage site separately, and an accuracy which is not optimal.
Figure 2 illustrates an exemplary embodiment of the inventive concept. In a first phase, the training phase 210, a task-specific recognition network Grec and a renderer network Gren are trained for solving a specific task, e.g. recognizing the class or the pose of an object. In a second phase, the use phase 220, an unseen, real cluttered image 221 is evaluated by the trained recognition network Grec with the help of the trained renderer network Gren. As a result, an output 222 is issued representing the solution of the task given to the recognition network Grec, e.g. to recognize and identify the nature and/or a specific feature of the object displayed in the real cluttered input image 221.
Note that one step of interest is the generation of normal maps when rendering color images from 3D models. In other words, the process of generating color images is made lightweight by separating it into the 3D projection step (i.e. computing the projected geometry as seen from target viewpoints, in terms of normal maps) and the augmentation step (i.e. computing the color representation). The introduction of normal maps as an intermediary dataset has the advantage that the GPU-intensive step of directly rendering augmented images from 3D models is substituted by two separate steps, wherein the second step, i.e. the conversion from normal maps into augmented images, can be carried out by the CPU. In particular, the first step of converting the geometric information of the CAD model into a normal map can in principle be performed beforehand, while the step of augmenting the clean normal map can be done online, for instance in parallel with training a recognition unit which uses the augmented images as input data.
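This split can be illustrated with a generator that produces training batches on the fly: the projection step (here a placeholder callable render_normal_map, which may simply read precomputed maps) feeds the CPU-side augmentation sketched earlier, and no image library is ever written to disk. The function names are assumptions for illustration.

import numpy as np

def training_stream(cad_models, render_normal_map, augment, batch_size=32):
    # Endless stream of (cluttered images, labels): the "online", effectively
    # "infinite" training data. Both callables are placeholders; the label is
    # assumed to be the index of the CAD model for a classification task.
    rng = np.random.default_rng()
    while True:
        batch, labels = [], []
        for _ in range(batch_size):
            idx = int(rng.integers(len(cad_models)))
            clean_map = render_normal_map(cad_models[idx])  # clean normal map (212)
            batch.append(augment(clean_map))                # cluttered image (213)
            labels.append(idx)
        yield np.stack(batch), np.asarray(labels)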
The recognition network Grec is trained for a specific task. Exemplary tasks are the identification of the class or the pose of the object being depicted in the normal map 212. The recognition network Grec gives as output 222 the corresponding solution to the given task.
The training of the recognition network Grec is performed in a supervised manner. As the recognition system "knows" the solution of the task, i.e. as it knows the class or pose of the object which is transformed into the normal map 212 and subsequently fed into the recognition network Grec, it can correct or confirm the output 222 of the recognition network Grec. Thus, the recognition network Grec learns by itself and without human interaction.
It is worth mentioning that the recognition network Grec can in principle be trained with an unlimited number of training data. As the training occurs "on the fly", in other words "online", no library of training images needs to be built up, in contrast to the prior art method explained above, where this is mandatory. Therefore, the training of the recognition network Grec is sometimes referred to as being carried out on an "infinite" number of training data.
Note that, in contrast to what could be imagined intuitively, it is not the target to generate an image as realistic as possible from the texture-less input CAD model. Rather, the recognition unit shall, figuratively speaking, become "texture-blind", which means that the object shall be recognized in the cluttered image irrespective of the background, the shading, eventual occlusion, etc.
Further note that, in addition, the perspective from which the object is seen is changed. A hemisphere around and above the CAD model of the object is virtually created and a desired number of viewpoints are defined. For each viewpoint, i.e. for each perspective, a significant number of cluttered images are simulated by the CPU. By this procedure, a large number of images depicting the same object from different viewpoints with different "clutter" is obtained.
During the training phase 210, the synthetic normal maps 212 are converted into synthetic cluttered images 213 via an augmentation pipeline A. The augmentation pipeline augments the received normal map by adding texture, noise, partial occlusions, etc., and at the same time converts the normal maps into color images. An example of an augmentation pipeline is given in Marcus D. Bloice, Christof Stocker and Andreas Holzinger: "Augmentor: An Image Augmentation Library for Machine Learning", arXiv:1708.04680v1.
After both the recognition network Grec and the renderer network Gren are trained, the recognition network Grec can be used in "real life". During the use phase 220, an unseen, real cluttered image 221 of an object is given to the recognition network Grec, which gives the required output 222, e.g. the class and/or the pose of the object.
During the training, a task-specific loss, which could be a class loss Lc or a pose loss Lp or both, is minimized. Optionally, a triplet loss Lt is also minimized during the optimization. The minimization of these losses directly improves the neural network of the recognition network Grec. In addition, the normal map 214 generated by the renderer network Gren is compared with the normal map 212 which has initially been generated based on the input CAD model 211.
In practice, the comparison between the two normal maps 212, 214 involves the minimization of a segmentation loss Lg and a foreground loss Lf. The minimization of Lg and Lf does not only optimize the neural network of the renderer network Gren, but also improves the recognition network Grec. As a result, an improved accuracy of the output 222 of the recognition network Grec is achieved once the system is used for the recovery of real, unseen images.
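In a PyTorch-style implementation, this coupling falls out of using a single optimizer over both parameter sets, so that the gradients of Lg and Lf flow through the renderer network back into the recognition network. A two-line sketch, assuming Grec and Gren are torch.nn.Module instances (the learning rate is an arbitrary example value):

import itertools
import torch

# Gradients of L_g and L_f reach G_rec through the shared feature vector,
# so minimizing them improves both networks at once.
optimizer = torch.optim.Adam(
    itertools.chain(G_rec.parameters(), G_ren.parameters()), lr=1e-4)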

Claims

Claims
1. Method to train a task-specific recognition network (Grec) for recovering an object from a cluttered image (221), wherein the recognition network (Grec) comprises an artificial neural network and the method comprises the following steps: receiving synthetic cluttered images (213) as input, wherein the cluttered images (213) are the output of an augmentation pipeline (A) which augments synthetic normal maps (212) into synthetic cluttered images (213), giving a low-dimensional vector as output, wherein the low-dimensional vector describes an aspect of the object,
converting the low-dimensional vector into a normal map (214) by means of a renderer network (Gren), wherein the renderer network (Gren) comprises an artificial neural network,
comparing the normal map (214) given by the renderer network (Gren) and the normal map (212) fed into the augmentation pipeline (A), and
optimizing the neural network of the recognition network (Grec) and the neural network of the renderer network (Gren) such that a task-specific loss of the recognition network (Grec), a segmentation loss (Lg) of the renderer network (Gren) and a foreground loss (Lf) of the renderer network (Gren) are minimized.
2. Method according to claim 1,
wherein the low-dimensional vector describes a class of the object.
3. Method according to claim 1,
wherein the low-dimensional vector describes a pose of the object relative to the camera.
4. Method according to one of the preceding claims, wherein the synthetic normal maps (212) which are the input of the augmentation pipeline (A) are obtained from texture-less CAD models.
5. Method according to one of the preceding claims,
wherein the training of the task-specific recognition network (Grec) is purely based on texture-less CAD models.
6. Method according to one of the preceding claims,
wherein the neural network of the recognition network (Grec) and the neural network of the renderer network (Gren) are further optimized by minimizing a triplet loss (Lt).
7. Method according to claim 6,
wherein a multiple of three normal maps (212) is fed into the augmentation pipeline (A).
8. Method to recover an object from a cluttered image (221) by means of a task-specific recognition network (Grec) being trained according to one of the preceding claims.
9. Method according to claim 8,
wherein the object is directly recovered from the cluttered image (221) by applying the trained task-specific recognition network (Grec) on the cluttered image (221).
10. A recognition system for recovering an object from a cluttered image (221), the recognition system comprising a task-specific recognition network (Grec) trained according to one of the claims 1 to 7.
11. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of a method according to one of the preceding claims.
12. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the steps of a method according to one of the preceding claims.
PCT/EP2018/080427 2018-04-06 2018-11-07 Object recognition from images using cad models as prior WO2019192745A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862653770P 2018-04-06 2018-04-06
US62/653,770 2018-04-06

Publications (1)

Publication Number Publication Date
WO2019192745A1 true WO2019192745A1 (en) 2019-10-10

Family

ID=64477085

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/080427 WO2019192745A1 (en) 2018-04-06 2018-11-07 Object recognition from images using cad models as prior

Country Status (1)

Country Link
WO (1) WO2019192745A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489354A (en) * 2020-05-18 2020-08-04 国网浙江省电力有限公司检修分公司 Method and device for detecting bird nest on power tower, server and storage medium
FR3109236A1 (en) * 2020-04-10 2021-10-15 Total Sa Element classification process

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BENJAMIN PLANCHE ET AL: "Seeing Beyond Appearance - Mapping Real Images into Geometrical Domains for Unsupervised CAD-based Recognition", 9 October 2018 (2018-10-09), XP055561666, Retrieved from the Internet <URL:https://arxiv.org/pdf/1810.04158.pdf> *
KONSTANTINOS BOUSMALIS ET AL: "Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks", IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. PROCEEDINGS, 1 July 2017 (2017-07-01), US, pages 95 - 104, XP055561657, ISSN: 1063-6919, DOI: 10.1109/CVPR.2017.18 *
MARCUS D BLOICE ET AL: "Augmentor: An Image Augmentation Library for Machine Learning", THE JOURNAL OF OPEN SOURCE SOFTWARE, vol. 2, no. 19, 11 August 2017 (2017-08-11), pages 432, XP055561652, DOI: 10.21105/joss.00432 *
SERGEY ZAKHAROV ET AL: "Keep it Unreal: Bridging the Realism Gap for 2.5D Recognition with Geometry Priors Only", 2018 INTERNATIONAL CONFERENCE ON 3D VISION (3DV), 24 May 2018 (2018-05-24), pages 1 - 11, XP055561661, ISBN: 978-1-5386-8425-2, DOI: 10.1109/3DV.2018.00012 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3109236A1 (en) * 2020-04-10 2021-10-15 Total Sa Element classification process
CN111489354A (en) * 2020-05-18 2020-08-04 国网浙江省电力有限公司检修分公司 Method and device for detecting bird nest on power tower, server and storage medium
CN111489354B (en) * 2020-05-18 2023-07-14 国网浙江省电力有限公司检修分公司 Method and device for detecting bird nest on electric power tower, server and storage medium

Similar Documents

Publication Publication Date Title
US10902343B2 (en) Deep-learning motion priors for full-body performance capture in real-time
US11928592B2 (en) Visual sign language translation training device and method
US10846828B2 (en) De-noising images using machine learning
US11568109B2 (en) Experience learning in virtual world
US11977976B2 (en) Experience learning in virtual world
US20160342861A1 (en) Method for Training Classifiers to Detect Objects Represented in Images of Target Environments
US20210232926A1 (en) Mapping images to the synthetic domain
US10650524B2 (en) Designing effective inter-pixel information flow for natural image matting
EP3759649B1 (en) Object recognition from images using cad models as prior
Kaskman et al. 6 dof pose estimation of textureless objects from multiple rgb frames
Jung et al. Deformable 3d gaussian splatting for animatable human avatars
Baudron et al. E3d: event-based 3d shape reconstruction
WO2019192745A1 (en) Object recognition from images using cad models as prior
CN110546687B (en) Image processing device and two-dimensional image generation program
EP4150577A1 (en) Learning articulated shape reconstruction from imagery
Vanherle et al. Real-time detection of 2d tool landmarks with synthetic training data
Kelly et al. Visiongpt-3d: A generalized multimodal agent for enhanced 3d vision understanding
CN116363085B (en) Industrial part target detection method based on small sample learning and virtual synthesized data
Rial-Farràs et al. UV-based reconstruction of 3D garments from a single RGB image
Duignan Exploring Advanced Methodologies for the Generation of Synthetic Data
Yu Semi-supervised three-dimensional reconstruction framework with generative adversarial networks
CN118262017A (en) System and method for training and representing three-dimensional objects using implicit representation networks
CN116363278A (en) Method, device, computer equipment and storage medium for generating animation
Kljun et al. FUSE: Towards FUture SErvices based on AI for Generating Augmented Reality Experiences

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18808232

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18808232

Country of ref document: EP

Kind code of ref document: A1