DE102021202337A1 - Device and method for training a machine learning model for recognizing an object topology of an object in an image of the object

Info

Publication number: DE102021202337A1
Application number: DE102021202337.1A
Authority: DE
Inventors: Markus Spies
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2021-03-10
Filing date: 2021-03-10
Publication date: 2022-09-15
Also published as: CN115081635A

Abstract

According to various embodiments, a method for training a machine learning model is described, comprising: generating training data image pairs, each training data image pair having a training camera image showing an object and a descriptor value target image, wherein generating the training camera image and the descriptor value target image comprises selecting a position for the object relative to a given camera, selecting an orientation for the object relative to the given camera, adjusting the orientation to a reference orientation that is specified for the selected position and the selected orientation and in which the object has the same projection onto the image plane of the camera as in the selected orientation, generating the training camera image by generating a camera image of the object with the selected position and the reference orientation as seen by the camera, and generating the descriptor value target image; and training the machine learning model by supervised learning using the training data image pairs as training data to generate descriptor value images for input camera images.

Description

Various exemplary embodiments relate generally to a device and a method for training a machine learning model to recognize an object topology of an object in an image of the object.

In order to enable flexible production or processing of objects by a robot, it is desirable for the robot to be able to handle an object regardless of the pose in which the object is placed in the robot's workspace. The robot should therefore be able to recognize which parts of the object are located at which positions, so that it can, for example, grasp the object at the correct place in order to attach it to another object, or weld the object at its current location. This means that the robot should be able to recognize the pose (position and orientation) of the object, for example from one or more images captured by a camera attached to the robot. One approach to achieving this is to determine descriptors, i.e. points (vectors) in a predefined descriptor space, for parts of the object (i.e. pixels of the object represented in an image plane), the robot being trained to assign the same descriptors to the same parts of an object regardless of the current pose of the object, and thus to recognize the topology of the object in the image, so that it is then known, for example, which corner of the object is located where in the image. If the pose of the camera is known, the pose of the object can in turn be inferred. Recognition of the topology can be realized with a machine learning model that is trained accordingly. Efficient methods for training such machine learning models are desirable.

According to various embodiments, a method is provided for training a machine learning model to recognize an object topology of an object in an image of the object, comprising: determining a 3D model of an object, the 3D model having a mesh of vertices; determining a descriptor value for each vertex of the mesh; generating training data image pairs, each training data image pair having a training camera image showing the object and a descriptor value target image, wherein generating the training camera image and the descriptor value target image comprises selecting a position for the object relative to a given camera, selecting an orientation for the object relative to the given camera, adjusting the orientation to a reference orientation that is specified for the selected position and the selected orientation and in which the object has the same projection onto the image plane of the camera as in the selected orientation, generating the training camera image by generating a camera image of the object with the selected position and the reference orientation as seen by the camera, and generating the descriptor value target image by assigning, for each vertex position of the object in the training camera image, the descriptor value determined for the vertex at that vertex position to the corresponding position in the descriptor value target image; and training the machine learning model by supervised learning using the training data image pairs as training data to generate descriptor value images for input camera images.

Adjusting the orientation to a reference orientation yields unambiguous descriptor value target images for training the machine learning model. The method described above thus enables efficient and robust training of machine learning models that recognize an object topology of an object in an image of the object (e.g. by generating a feature map, i.e. a descriptor image, with (pixel) values in a descriptor space). This can then be used to determine a pose of the object in three-dimensional space and ultimately to control a robot to handle the object.
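The assignment of vertex descriptor values to image positions can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with focal length `f` and principal point `(cx, cy)`, and it resolves collisions within a pixel by keeping the vertex closest to the camera; these choices, and all function and parameter names, are assumptions made for illustration and are not taken from the description.

```python
import math

def transform(R, t, p):
    """Apply rotation R (3x3 nested lists) and translation t to point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def project(point_cam, f, cx, cy):
    """Pinhole projection of a point in camera coordinates to integer pixel coordinates."""
    x, y, z = point_cam
    return round(f * x / z + cx), round(f * y / z + cy)

def descriptor_target_image(vertices, descriptors, R, t, f, cx, cy, width, height):
    """Build a height x width image that stores, at each projected vertex
    position, the descriptor of that vertex (None marks background)."""
    image = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for vertex, descriptor in zip(vertices, descriptors):
        p = transform(R, t, vertex)          # object frame -> camera frame
        if p[2] <= 0:
            continue                          # vertex lies behind the camera
        col, row = project(p, f, cx, cy)
        inside = 0 <= col < width and 0 <= row < height
        if inside and p[2] < depth[row][col]:
            depth[row][col] = p[2]            # keep the vertex nearest to the camera
            image[row][col] = descriptor
    return image
```

In this sketch `R` and `t` would be the reference orientation and the selected position, so that the target image is paired with a camera image rendered for the same pose.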

Embodiment 1 is a method for training a machine learning model to recognize an object topology of an object from a camera image of the object, as described above.

Embodiment 2 is the method according to embodiment 1, wherein the reference orientation for the object is given by aligning, for each axis of symmetry of the object, an axis of the object that is perpendicular to the axis of symmetry with a predetermined reference axis.

Aligning axes allows the reference orientation to be defined in a simple manner.

Embodiment 3 is the method according to embodiment 2, wherein the reference axis is given by the direction of the camera as seen from the object.

This reference axis is easy to determine and thus provides a readily available way of determining the reference orientation.
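One way to picture the alignment of embodiments 2 and 3 is the following sketch for a single axis of symmetry. It assumes a continuous rotational symmetry about that axis, so that any rotation about it leaves the projection unchanged, and it rotates the object about the axis until a chosen perpendicular object axis points as far as possible toward the camera. The function names, the argument conventions (unit vectors, rotation matrices as nested lists), and the handling of the degenerate case are assumptions for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]

def matvec(R, v):
    return [dot(row, v) for row in R]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def axis_angle(u, theta):
    """Rodrigues formula: rotation by theta about the unit axis u."""
    c, s = math.cos(theta), math.sin(theta)
    K = [[0, -u[2], u[1]], [u[2], 0, -u[0]], [-u[1], u[0], 0]]
    return [[c * (i == j) + s * K[i][j] + (1 - c) * u[i] * u[j] for j in range(3)]
            for i in range(3)]

def reference_orientation(R, sym_axis, perp_axis, to_camera):
    """Rotate orientation R about the object's symmetry axis so that perp_axis
    is aligned with the component of to_camera perpendicular to that axis.
    sym_axis and perp_axis are unit vectors in the object frame (perpendicular
    to each other); to_camera points from the object to the camera."""
    u = matvec(R, sym_axis)                   # symmetry axis in the camera/world frame
    a = matvec(R, perp_axis)                  # perpendicular object axis in that frame
    along = dot(to_camera, u)
    d = [to_camera[i] - along * u[i] for i in range(3)]
    if math.sqrt(dot(d, d)) < 1e-9:
        return R                              # camera looks along the axis: leave R unchanged
    theta = math.atan2(dot(u, cross(a, d)), dot(a, d))  # signed angle from a to d about u
    return matmul(axis_angle(u, theta), R)
```

Under the stated symmetry assumption, two selected orientations that differ only by a rotation about the symmetry axis are mapped to the same reference orientation, which is what makes the resulting target image unambiguous.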

Embodiment 4 is the method according to any one of embodiments 1 to 3, wherein training data image pairs are generated for a plurality of different selected positions and a plurality of different selected orientations.

This makes it possible to train the machine learning model (e.g. of a robot with a robot controller that implements the machine learning model) to recognize the topology of an object regardless of the pose of the object, e.g. in the workspace of the robot.

Embodiment 5 is a method for controlling a robot, comprising: training a machine learning model according to any one of embodiments 1 to 4, obtaining a camera image showing the object, supplying the camera image to the machine learning model, determining a pose of the object from the output of the machine learning model, and controlling the robot depending on the determined pose of the object.

Embodiment 6 is the method according to embodiment 5, wherein determining the pose of the object comprises determining vertex positions of the object in the camera image using the descriptor value image output by the machine learning model for the camera image.

The descriptor value image generated by the machine learning model trained as described above allows the pose to be determined effectively and with little effort, e.g. using a PnP solving method.

Embodiment 7 is the method according to embodiment 5 or 6, wherein determining the pose of the object comprises determining the position of a specific part of the object, and wherein controlling the robot depending on the determined pose of the object comprises controlling an end effector of the robot to move to the position of the part of the object and to interact with the part of the object.

Embodiment 8 is a software or hardware agent, in particular a robot, comprising: a camera configured to provide image data of an object, a controller configured to implement a machine learning model, and a training device configured to train the machine learning model by means of the method according to any one of embodiments 1 to 7.

Embodiment 9 is the software or hardware agent according to embodiment 8, comprising at least one actuator, wherein the controller is configured to control the at least one actuator using an output of the machine learning model.

Embodiment 10 is a computer program comprising instructions that, when executed by a processor, cause the processor to perform a method according to any one of embodiments 1 to 7.

Embodiment 11 is a computer-readable medium storing instructions that, when executed by a processor, cause the processor to perform a method according to any one of embodiments 1 to 7.

Exemplary embodiments of the invention are shown in the figures and are explained in more detail below. In the drawings, like reference characters generally refer to the same parts throughout the several views. The drawings are not necessarily to scale, emphasis instead generally being placed on illustrating the principles of the invention.

FIG. 1 shows a robot.
FIG. 2 illustrates determining the pose of an object from a camera image of the object.
FIG. 3 illustrates training a neural network according to one embodiment.
FIG. 4 illustrates the problem of symmetric objects in training, using a cube as an example.
FIG. 5 shows a first example of descriptor images of an object in an initial pose and a reference pose.
FIG. 6 shows a second example of descriptor images of an object in an initial pose and a reference pose.
FIG. 7 shows a flow chart for a method for training a machine learning model to recognize an object topology of an object in an image of the object according to one embodiment.

The various embodiments, in particular the exemplary embodiments described below, can be implemented by means of one or more circuits. In one embodiment, a "circuit" can be understood as any type of logic-implementing entity, which can be hardware, software, firmware, or a combination thereof. Therefore, in one embodiment, a "circuit" can be a hardwired logic circuit or a programmable logic circuit, such as a programmable processor, for example a microprocessor. A "circuit" can also be software that is implemented or executed by a processor, for example any type of computer program. Any other way of implementing the respective functions, which are described in more detail below, can be understood as a "circuit" in accordance with an alternative embodiment.

FIG. 1 shows a robot 100.

The robot 100 includes a robot arm 101, for example an industrial robot arm for handling or assembling a workpiece (or one or more other objects). The robot arm 101 includes manipulators 102, 103, 104 and a base (or support) 105 by which the manipulators 102, 103, 104 are supported. The term "manipulator" refers to the movable components of the robot arm 101, the actuation of which enables physical interaction with the environment, e.g. in order to carry out a task. For control, the robot 100 includes a (robot) controller 106, which is designed to implement the interaction with the environment according to a control program. The last component 104 of the manipulators 102, 103, 104 (the one farthest from the support 105) is also referred to as the end effector 104 and can include one or more tools, such as a welding torch, a gripping instrument, a painting device, or the like.

The other manipulators 102, 103 (which are closer to the support 105) can form a positioning device so that, together with the end effector 104, the robot arm 101 is provided with the end effector 104 at its end. The robot arm 101 is a mechanical arm that can provide functions similar to those of a human arm (possibly with a tool at its end).

The robot arm 101 can include joint elements 107, 108, 109 that connect the manipulators 102, 103, 104 to each other and to the support 105. A joint element 107, 108, 109 can have one or more joints, each of which can provide rotational movement (i.e. rotation) and/or translational movement (i.e. displacement) of the associated manipulators relative to one another. The movement of the manipulators 102, 103, 104 can be initiated by means of actuators that are controlled by the controller 106.

The term "actuator" can be understood as a component that is designed to effect a mechanism or process in response to being driven. The actuator can convert instructions issued by the controller 106 (the so-called activation) into mechanical movements. The actuator, e.g. an electromechanical converter, can be designed to convert electrical energy into mechanical energy in response to being driven.

The term "controller" can be understood as any type of logic-implementing entity, which can include, for example, a circuit and/or a processor that is capable of executing software, firmware, or a combination thereof stored in a storage medium, and that can issue instructions, e.g. to an actuator in the present example. The controller can be configured, for example, by program code (e.g. software) to control the operation of a system, a robot in the present example.

In the present example, the controller 106 includes one or more processors 110 and a memory 111 that stores code and data on the basis of which the processor 110 controls the robot arm 101. According to various embodiments, the controller 106 controls the robot arm 101 on the basis of a machine learning model 112 that is stored in the memory 111.

The controller 106 uses the machine learning model 112 to determine the pose of an object 113 that is placed, for example, in a workspace of the robot arm 101. Depending on the determined pose, the controller 106 can decide, for example, which part of the object 113 should be gripped by the end effector 104.

The controller 106 determines the pose by applying the machine learning model 112 to one or more camera images of the object 113. The robot 100 can, for example, be equipped with one or more cameras 114 that enable it to capture images of its workspace. The camera 114 is, for example, attached to the robot arm 101 so that the robot can capture images of the object 113 from different perspectives by moving the robot arm 101 around. However, one or more fixed cameras can also be provided.

The machine learning model 112 is, for example, a (deep) neural network that generates a feature map for a camera image, e.g. in the form of an image in a feature space, which makes it possible to assign points in the (2D) camera image to points of the (3D) object.

For example, the machine learning model 112 can be trained to assign a specific (unique) feature value (also referred to as a descriptor value) in the feature space to a specific corner of the object. If a camera image is then supplied to the machine learning model 112 and the machine learning model 112 assigns this feature value to a point of the camera image, it can be concluded that the corner is located at that point (i.e. at a location in space whose projection onto the camera plane corresponds to the point in the camera image). If the positions of several points of the object in the camera image are known in this way, the pose of the object in space (the so-called 6D pose) can be determined, for example by using a so-called PnP (perspective-n-point) solving method.

The PnP problem is the problem of determining a 6D pose (i.e. position and orientation) of an object from a 2D image when the correspondence between points of the 2D representation of the object in the 2D image and points (typically vertices) of the 3D object is known.
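In the usual pinhole camera formulation, which is assumed here and is not spelled out in the description, the problem can be written as follows. The intrinsic camera matrix $K$, the rotation $R$, the translation $t$, the scale factors $\lambda_i$, the image points $(u_i, v_i)$, and the object points $X_i$ are notation introduced for this illustration only.

```latex
% Projection of the i-th 3D object point X_i to the pixel (u_i, v_i)
\lambda_i \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix}
  = K \left( R\, X_i + t \right), \qquad i = 1, \dots, n

% PnP as a least-squares problem over the pose, with \pi(x, y, z) = (x/z,\; y/z)
(\hat{R}, \hat{t}) = \arg\min_{R \in SO(3),\; t \in \mathbb{R}^3}
  \sum_{i=1}^{n} \left\| \begin{pmatrix} u_i \\ v_i \end{pmatrix}
  - \pi\!\left( K \left( R\, X_i + t \right) \right) \right\|^2
```

The pair $(R, t)$ describes the pose of the object relative to the camera; inverting the transformation gives the pose of the camera relative to the object.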

FIG. 2 illustrates the PnP problem.

A camera 201 takes an image 202 of a cube 203. The cube 203 is thus projected onto the camera image plane 204. Assuming that the corners of the cube are distinguishable (e.g. because they have different colors), the correspondence between the vertices of the 3D model (i.e. CAD model) of the cube 203 and the pixels in the image 202 can be specified. The PnP problem is to determine the pose of the camera 201 relative to the object 203 or, equivalently, the pose of the object 203 relative to the camera 201 (depending on which coordinate system is used as the reference).
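The projection that the PnP problem inverts can be illustrated with a minimal sketch. It assumes an ideal pinhole camera; the focal length `f`, the principal point `(cx, cy)` and the helper names `project` and `transform` are hypothetical and not taken from the description.

```python
# Minimal pinhole-projection sketch (assumed camera model, hypothetical names).

def project(point, f=500.0, cx=320.0, cy=240.0):
    """Project a 3D point given in camera coordinates onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (f * x / z + cx, f * y / z + cy)

def transform(R, t, p):
    """Apply an object pose (rotation matrix R, translation t) to point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

# Vertices of a unit cube in object coordinates, centred at the origin.
CUBE = [(sx * 0.5, sy * 0.5, sz * 0.5)
        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pose_t = (0.0, 0.0, 5.0)  # cube placed 5 units in front of the camera

# Forward model: known pose -> pixel positions of the eight corners.
pixels = [project(transform(IDENTITY, pose_t, v)) for v in CUBE]
```

A PnP solver works in the opposite direction: given the pixel positions and the corresponding cube vertices, it recovers the rotation and translation.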

Solving the PnP problem requires the correspondence between points in the 2D object image 202 and 3D object points (e.g. vertices of the 3D model). To obtain this correspondence, a machine learning model can be used, as explained above, that assigns descriptor values to points in the 2D object image 202. Since it is known which 3D object points have which descriptor values, the correspondence can be established.
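How descriptor values might yield the 2D-3D correspondences can be sketched as a nearest-descriptor lookup. The function name `match_descriptors`, the dictionary layout and the distance threshold `max_dist` are assumptions made for illustration; the description does not prescribe a particular matching rule.

```python
import math

def match_descriptors(pixel_descriptors, vertex_descriptors, max_dist=0.1):
    """Return (pixel, vertex_id) pairs found by nearest-descriptor lookup.

    pixel_descriptors:  dict (u, v) -> descriptor tuple predicted by the model
    vertex_descriptors: dict vertex_id -> known descriptor tuple of that vertex
    max_dist:           assumed threshold above which a vertex counts as not visible
    """
    matches = []
    for vertex_id, d_ref in vertex_descriptors.items():
        best_pixel, best_dist = None, float("inf")
        for pixel, d in pixel_descriptors.items():
            dist = math.dist(d, d_ref)  # Euclidean distance in descriptor space
            if dist < best_dist:
                best_pixel, best_dist = pixel, dist
        if best_pixel is not None and best_dist <= max_dist:
            matches.append((best_pixel, vertex_id))
    return matches

# Toy data: three predicted pixels, three model vertices (vertex 2 is not visible).
pixel_desc = {(10, 20): (0.1, 0.9), (30, 40): (0.8, 0.2), (50, 60): (0.5, 0.5)}
vertex_desc = {0: (0.1, 0.9), 1: (0.82, 0.21), 2: (0.0, 0.0)}
matches = match_descriptors(pixel_desc, vertex_desc)
```

The resulting pixel-vertex pairs are the kind of input a PnP solver expects, for example OpenCV's `solvePnP`.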

The machine learning model 112 must be suitably trained for this task.

An example of a machine learning model 112 for object recognition is a dense object net. A dense object net maps an image (e.g. an RGB image provided by the camera 114) onto a descriptor space image of arbitrary dimension (dimension D). However, other machine learning models 112 can also be used, in particular models that do not necessarily generate a "dense" feature map but merely assign descriptor values to certain points (e.g. corners) of the object.

FIG. 3 illustrates the training of a neural network 300 according to one embodiment.

The neural network 300 is a fully convolutional network that maps an h × w × 3 tensor (input image) to an h × w × D tensor (output image).

It comprises several stages 304 of convolutional layers followed by a pooling layer, upsampling layers 305, and skip connections 306 that combine the outputs of different layers.

For training, the neural network 300 receives a training camera image 301 and outputs an output image 302 with pixel values in the descriptor space (e.g. color components corresponding to descriptor vector components). A training loss is calculated between the output image 302 and the target image 303 associated with the training camera image. This can be done for a set of training camera images: the training loss is averaged over the training camera images, and the weights of the neural network 300 are trained by stochastic gradient descent on the training loss. The training loss calculated between the output image 302 and the target image 303 is, for example, an L2 loss function (to minimize the pixel-wise least-squares error between the target image 303 and the output image 302).
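The pixel-wise L2 loss and its averaging over a set of training images can be written out as a short sketch. Images are represented here as nested lists of descriptor tuples purely for illustration; the function names `l2_loss` and `batch_loss` are hypothetical, and a practical implementation would use a tensor library with automatic differentiation.

```python
def l2_loss(output, target):
    """Mean over all pixels of the squared descriptor error (h x w x D nested lists)."""
    total, count = 0.0, 0
    for out_row, tgt_row in zip(output, target):
        for out_px, tgt_px in zip(out_row, tgt_row):
            total += sum((o - t) ** 2 for o, t in zip(out_px, tgt_px))
            count += 1
    return total / count

def batch_loss(outputs, targets):
    """Average the per-image loss over a set of training images."""
    return sum(l2_loss(o, t) for o, t in zip(outputs, targets)) / len(outputs)

# Two 1 x 2 images with 2-dimensional descriptors.
out_a = [[(0.0, 0.0), (1.0, 1.0)]]
tgt_a = [[(0.0, 0.0), (0.0, 0.0)]]
out_b = [[(0.5, 0.5), (0.5, 0.5)]]
tgt_b = [[(0.5, 0.5), (0.5, 0.5)]]

loss = batch_loss([out_a, out_b], [tgt_a, tgt_b])
```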

The training camera image 301 shows an object, and the target image and the output image contain values (possibly tuples of values, i.e. vectors) in the descriptor space. The values in the descriptor space can be mapped to colors, so that the output image 302 (and likewise the target image 303) resembles a heat map of the object.
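One possible way to map descriptor values to colors is a linear rescaling of each descriptor component to an 8-bit color channel. The sketch below assumes descriptors with at most three components and known value bounds `lo` and `hi`; the function name and the zero-padding of unused channels are illustrative choices, not part of the description.

```python
def descriptors_to_rgb(image, lo, hi):
    """Map an h x w image of d-dimensional descriptors (d <= 3) to 8-bit RGB triples."""
    def to_byte(value, a, b):
        # Linear rescaling of [a, b] to [0, 255]; a degenerate range maps to 0.
        return 0 if b == a else round(255 * (value - a) / (b - a))

    rgb = []
    for row in image:
        rgb.append([
            # Unused color channels (when d < 3) are filled with 0.
            tuple(to_byte(px[k], lo[k], hi[k]) if k < len(px) else 0
                  for k in range(3))
            for px in row
        ])
    return rgb

# A 1 x 2 image with 2-dimensional descriptors in the range [0, 1].
colors = descriptors_to_rgb([[(0.0, 1.0), (1.0, 0.0)]], lo=(0.0, 0.0), hi=(1.0, 1.0))
```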

The vectors in the descriptor space (also referred to as (possibly dense) descriptors) are, for example, d-dimensional vectors (e.g. d is 1, 2 or 3) that are assigned to each pixel in the respective image (e.g. to each pixel of the input image 301, assuming that the input image 301 and the output image 302 have the same dimensions). The descriptors implicitly encode the surface topology of the object shown in the input image 301, invariant to its pose or the camera position.

Given a 3D model of the object, it is possible to analytically determine a unique descriptor vector for each vertex of the 3D model and thus to perform supervised training of the neural network 300. However, self-supervised training of the machine learning model 112 is also possible.

However, problems can arise when training the machine learning model 112 to map camera images to images in descriptor space (i.e. feature maps) if an object is symmetric. Symmetry means that the object has the same two-dimensional image for several object poses, while (assuming that the descriptor values are unique) different descriptor values have to be assigned to the same points in the two-dimensional image (e.g. according to the ground truth, i.e. according to the target image). This creates problems in training the machine learning model 112, since it receives different targets for the same input.

FIG. 4 illustrates the problem of symmetric objects using a cube as an example.

A cube shown in a camera image 401 can have several poses that look identical when projected onto the camera image plane, because a rotation by 90 degrees about one of its symmetry axes does not change its 2D image. However, if the descriptor values are unique (i.e. in particular different for each of its eight corners), a different descriptor image 402, 403, 404 results depending on the object pose.

This means that, during training, the machine learning model can receive different descriptor space target images 402, 403, 404 for the same input (camera image 401).
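The ambiguity can be reproduced with a small sketch. It assumes a wireframe cube (occlusion is ignored), a camera looking along the z-axis, and vertex indices standing in for unique descriptor values; all names are hypothetical.

```python
def rot_z_90(p):
    """Rotate a point by 90 degrees about the z-axis (exact integer arithmetic)."""
    x, y, z = p
    return (-y, x, z)

def project(p, depth=5.0):
    """Pinhole projection with unit focal length; the cube centre is `depth` away."""
    x, y, z = p
    return (x / (z + depth), y / (z + depth))

# Cube corners, indexed 0..7; the index stands in for a unique descriptor value.
CUBE = {i: v for i, v in enumerate(
    (sx, sy, sz) for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1))}

def descriptor_image(vertices):
    """Map each projected pixel position to the descriptor of the vertex seen there."""
    return {project(v): vid for vid, v in vertices.items()}

pose_a = CUBE
pose_b = {vid: rot_z_90(v) for vid, v in CUBE.items()}  # symmetry-equivalent pose

img_a = descriptor_image(pose_a)
img_b = descriptor_image(pose_b)

same_camera_image = set(img_a) == set(img_b)  # identical pixel positions
same_target_image = img_a == img_b            # but different descriptors per pixel
```

Here `same_camera_image` is true while `same_target_image` is false: the two poses are indistinguishable in the camera image, yet they would produce conflicting training targets.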

According to various embodiments, an object pose is therefore normalized before it is used for training. For supervised training, this can mean that an object pose is normalized before a training camera image and a training descriptor value target image are generated for the object pose and supplied to the machine learning model as training data.

Normalizing the object poses involves changing the object pose of a symmetric object to a unique equivalent object pose (i.e. an object pose with the same projection onto the camera image plane).

FIG. 5 shows a first example of descriptor images 501, 502 of an object in an initial pose and in a reference pose.

In the first descriptor image 501, the object is in an original (initial) pose. However, this pose (and thus the first descriptor image 501) is not used for training. Instead, the object is brought into a reference pose (specifically into a reference orientation) to which the second descriptor image 502 corresponds (i.e. rendering the object in the reference pose as a descriptor image yields the second descriptor image 502). The position of the object can remain unchanged.

The shape of the object in the camera image plane (here the drawing plane) is the same in both poses, but the descriptor images generated by rendering are different (the descriptor values are indicated here as shading).

FIG. 6 shows a second example of descriptor images 601, 602 of an object in an initial pose and in a reference pose.

In the first descriptor image 601, the object is in an original (initial) pose. This descriptor image 601 is not used for training. Instead, the object is brought into a reference pose (specifically into a reference orientation) and rendered in that pose, which yields the second descriptor image 602. The position of the object can remain unchanged.

Again, owing to the symmetry of the object, the shape of the object in the camera image plane (here the drawing plane) is the same in both poses, but the generated descriptor images are different. Only the second descriptor image 602, and not the first descriptor image 601, is used as the descriptor target image for training the machine learning model 112, so that the machine learning model 112 has a unique target for the input camera image of the object.

The following is an example of an algorithm that brings an object from a given original pose into a reference pose (here specifically a reference orientation).

Main routine:

normalize_pose, for computing the normalized orientation of an object. Input:

- Orientation of the object (i.e. rotation of the object starting from a default orientation)
- Position of the object (i.e. translation of the object starting from a default position)
- Symmetries of the object (e.g. given as axes of rotation and the associated possible angles of rotation; these axes are defined, for example, as the x-axis, y-axis and z-axis of the object; not all three symmetries have to be present for every object, i.e. the symmetry for an axis can also be zero)

Operations:

1.) Rotate about the z-axis and minimize the difference between the x-axis and the vector from the object to the camera:
- R <- get_normalized_orientation(R, T, axis of rotation = 2, symmetry = symmetry in z)
2.) Rotate about the y-axis and minimize the difference between the x-axis and the vector from the object to the camera:
- R <- get_normalized_orientation(R, T, axis of rotation = 1, symmetry = symmetry in y)
3.) Rotate about the x-axis and minimize the difference between the y-axis and the vector pointing up (which is perpendicular to the vector pointing from the object towards the camera):
- R <- get_normalized_orientation(R, T, axis of rotation = 0, symmetry = symmetry in x)
4.) Output the resulting orientation of the object

Subroutine:

get_normalized_orientation, for computing the normalized orientation about one axis.

There are three cases:

Axis of rotation 0: Rotate about the x-axis and minimize the difference between the y-axis and the vector pointing up.
Axis of rotation 1: Rotate about the y-axis and minimize the difference between the x-axis and the vector from the object to the camera.
Axis of rotation 2: Rotate about the z-axis and minimize the difference between the x-axis and the vector from the object to the camera.

Input:

- Orientation of the object
- Position of the object
- Axis of rotation
- Symmetry for the respective axis of rotation (e.g. angles by which the object can be rotated about the axis of rotation without changing its projection)

Operations:

1.) Calculate the rotation angle, i.e. the angle by which the object can be rotated into the respective reference pose; depending on the case, this is the rotation angle that minimizes the difference between the two respective vectors
2.) Rotate the object about the axis of rotation by the calculated rotation angle
3.) Output the resulting orientation of the object
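The main routine and the subroutine can be sketched as follows. This is one possible reading, under several assumptions that are not fixed by the description: the orientation R is a 3 × 3 rotation matrix from object to camera coordinates, the camera sits at the origin, each symmetry is given as the smallest rotation angle in radians (or `None` if absent), "difference" is measured by the dot product of unit vectors, and "up" is the direction (0, -1, 0) in camera coordinates. The sketch proceeds greedily axis by axis and does not handle continuous symmetries or an object located exactly along the up direction.

```python
import math

def rot(axis, angle):
    """Rotation matrix about coordinate axis 0 (x), 1 (y) or 2 (z)."""
    c, s = math.cos(angle), math.sin(angle)
    if axis == 0:
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == 1:
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def column(R, j):
    """Object axis j expressed in camera coordinates."""
    return tuple(R[i][j] for i in range(3))

def get_normalized_orientation(R, T, axis, symmetry):
    """Choose the symmetry-equivalent rotation about `axis` that best aligns the reference axis."""
    if not symmetry:                        # None or 0: no symmetry about this axis
        return R
    to_camera = unit(tuple(-t for t in T))  # vector from the object to the camera
    if axis == 0:
        up = (0.0, -1.0, 0.0)               # assumed "up" direction in camera coordinates
        k = dot(up, to_camera)
        # Remove the component along the viewing direction so that the target is perpendicular to it.
        target = unit(tuple(u - k * d for u, d in zip(up, to_camera)))
        ref = 1                             # align the object's y-axis
    else:
        target, ref = to_camera, 0          # align the object's x-axis
    steps = round(2 * math.pi / symmetry)
    # Rotating about the object's own axis by a symmetry angle leaves its appearance unchanged.
    candidates = [matmul(R, rot(axis, n * symmetry)) for n in range(steps)]
    return max(candidates, key=lambda C: dot(column(C, ref), target))

def normalize_pose(R, T, symmetries):
    """symmetries = (sym_x, sym_y, sym_z); the axes are processed in the order z, y, x."""
    for axis in (2, 1, 0):
        R = get_normalized_orientation(R, T, axis, symmetries[axis])
    return R

# Two orientations of a cube that differ by a 90 degree turn about the cube's z-axis.
quarter = math.pi / 2
cube_sym = (quarter, quarter, quarter)
R1 = matmul(rot(2, 0.3), matmul(rot(1, 0.5), rot(0, 0.2)))
R2 = matmul(R1, rot(2, quarter))
T = (0.1, -0.2, 2.0)
N1 = normalize_pose(R1, T, cube_sym)
N2 = normalize_pose(R2, T, cube_sym)
```

In this example both orientations are mapped to the same reference orientation (up to floating-point rounding), so they would yield the same descriptor target image.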

It should be noted that there is a respective reference pose for each projection of the object onto the camera image plane. Different projections have different reference poses.

Once the object is in the reference pose, a descriptor image associated with the camera image (which corresponds to the projection of the object onto the camera image plane) can be rendered. The camera image and the descriptor image can then be used together as a training data element for supervised training.

In summary, according to various embodiments, a method is provided as described below with reference to FIG. 7.

FIG. 7 shows a flow diagram 700 for a method for training a machine learning model for recognizing an object topology of an object in an image of the object.

In 701, a 3D model of an object is determined, the 3D model having a grid of vertices.

In 702, a descriptor value is determined for each vertex of the grid.

In 703, training data image pairs are generated, each training data image pair comprising a training camera image showing the object and a descriptor value target image, wherein generating the training camera image and the descriptor value target image comprises:

• In 704, selecting a position for the object relative to a given camera.
• In 705, selecting an orientation for the object relative to the given camera.
• In 706, adjusting the orientation to a reference orientation that is predetermined for the selected position and the selected orientation and in which the object has the same projection onto the image plane of the camera as in the selected orientation.
• In 707, generating the training camera image by generating a camera image of the object with the selected position and the reference orientation as seen from the camera.
• In 708, generating the descriptor value target image by assigning, for each vertex position of the object in the training camera image, the descriptor value determined for the vertex at that vertex position to the corresponding position in the descriptor value target image.
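Step 708 can be illustrated with a minimal sketch that writes per-vertex descriptor values into a target image. The pinhole parameters `f`, `cx`, `cy`, the function name `make_target_image`, the use of `None` for background pixels and the rule that the nearest vertex wins when several vertices fall on the same pixel are assumptions made for illustration; a full renderer would also interpolate descriptors across the faces of the model.

```python
def make_target_image(vertices, descriptors, R, t, height, width, f, cx, cy):
    """Write each vertex descriptor to the pixel the vertex projects to (None = background)."""
    image = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for v, d in zip(vertices, descriptors):
        # Object coordinates -> camera coordinates for the (normalized) pose R, t.
        x, y, z = (sum(R[i][j] * v[j] for j in range(3)) + t[i] for i in range(3))
        if z <= 0:
            continue                      # behind the camera
        col, row = round(f * x / z + cx), round(f * y / z + cy)
        if 0 <= row < height and 0 <= col < width and z < depth[row][col]:
            depth[row][col] = z           # assumed rule: the nearest vertex wins
            image[row][col] = d
    return image

IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
descriptors = [(0.1,), (0.5,), (0.9,)]   # one descriptor value per vertex (step 702)

target = make_target_image(vertices, descriptors, IDENTITY, (0.0, 0.0, 2.0),
                           height=5, width=5, f=2.0, cx=2.0, cy=2.0)
```

In this toy example the third vertex lies behind the first one along the same viewing ray, so only the first vertex's descriptor appears at the centre pixel.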

In 709, the machine learning model is trained by supervised learning, using the training data image pairs as training data, to generate descriptor value images for supplied camera images.

In other words, according to various embodiments, the object pose is normalized for each object pose before training. Equivalent poses (i.e. poses that yield the same projection onto the camera image plane) of a symmetric object are mapped to a unique object pose. When the machine learning model is trained, it accordingly receives unique targets (e.g. ground truths for supervised training), which enables efficient and robust training.

The camera images are, for example, RGB images, but can also be other types of camera images such as depth images or thermal images. The output of the trained machine learning model can be used to determine object poses, for example for controlling a robot, e.g. for assembling a larger object from sub-objects, for moving objects, etc. The approach of FIG. 7 can be used for any pose determination method that involves generating a feature map and applying a PnP solver to the feature map.

A "robot" can be understood to mean any physical system (with a mechanical part whose movement is controlled), such as a computer-controlled machine, a vehicle, a household appliance, a power tool, a manufacturing machine, a personal assistant or an access control system.

According to one embodiment, the method is computer-implemented.

Although the invention has been shown and described primarily with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the following claims. The scope of the invention is therefore determined by the appended claims, and all changes that come within the meaning and range of equivalency of the claims are intended to be embraced.

Claims

1. A method for training a machine learning model for recognizing an object topology of an object from a camera image of the object, comprising: determining a 3D model of an object, the 3D model having a grid of vertices; determining a descriptor value for each vertex of the grid; generating training data image pairs, each training data image pair comprising a training camera image showing the object and a descriptor value target image, wherein generating the training camera image and the descriptor value target image comprises: selecting a position for the object relative to a given camera; selecting an orientation for the object relative to the given camera; adjusting the orientation to a reference orientation that is predetermined for the selected position and the selected orientation and in which the object has the same projection onto the image plane of the camera as in the selected orientation; generating the training camera image by generating a camera image of the object with the selected position and the reference orientation as seen from the camera; and generating the descriptor value target image by assigning, for each vertex position of the object in the training camera image, the descriptor value determined for the vertex at the vertex position to the position in the descriptor value target image; and training the machine learning model by supervised learning, using the training data image pairs as training data, to generate descriptor value images for supplied camera images.

2. The method according to claim 1, wherein the reference orientation for the object is given by aligning, for each axis of symmetry of the object, an axis of the object that is perpendicular to the axis of symmetry with a predetermined reference axis.

3. The method according to claim 2, wherein the reference axis is given by the direction of the camera as seen from the object.

4. The method according to any one of claims 1 to 3, wherein training data image pairs are generated for a plurality of different selected positions and a plurality of different selected orientations.

5. A method for controlling a robot, comprising: training a machine learning model according to any one of claims 1 to 4; obtaining a camera image showing the object; supplying the camera image to the machine learning model; determining a pose of the object from the output of the machine learning model; and controlling the robot depending on the determined pose of the object.

6. The method according to claim 5, wherein determining the pose of the object comprises determining vertex positions of the object in the camera image using the descriptor value image output by the machine learning model for the camera image.

7. The method according to claim 5 or 6, wherein determining the pose of the object comprises determining the position of a particular part of the object, and wherein controlling the robot depending on the determined pose of the object comprises controlling an end effector of the robot to move to the position of the part of the object and to interact with the part of the object.

8. A software or hardware agent, in particular a robot, comprising: a camera configured to provide image data of an object; a controller configured to implement a machine learning model; and a training device configured to train the machine learning model using the method according to any one of claims 1 to 7.

9. The software or hardware agent according to claim 8, comprising at least one actuator, wherein the controller is configured to control the at least one actuator using an output of the machine learning model.

10. A computer program comprising instructions which, when executed by a processor, cause the processor to perform a method according to any one of claims 1 to 7.

11. A computer-readable medium storing instructions which, when executed by a processor, cause the processor to perform a method according to any one of claims 1 to 7.