DE102021202570A1

DE102021202570A1 - DEVICE AND METHOD FOR LOCATING LOCATIONS OF OBJECTS FROM CAMERA IMAGES OF THE OBJECTS

Info

Publication number: DE102021202570A1
Application number: DE102021202570.6A
Authority: DE
Inventors: Andras Gabor Kupcsik; Philipp Christian Schillinger
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2021-03-16
Filing date: 2021-03-16
Publication date: 2022-09-22
Also published as: JP2022142773A; CN115082550A

Abstract

According to various embodiments, a method for locating locations of objects from camera images of the objects is described, in which a machine learning model for mapping camera images to descriptor images is trained, the locations to be located on the respective object are determined for each camera image by means of the descriptors provided by the trained machine learning model, and the descriptors used to determine the locations to be located are updated over the course of the sequence of camera images.

Description

The present disclosure relates to devices and methods for locating locations of objects from camera images of the objects.

In order to enable flexible production or processing of objects by a robot, it is desirable for the robot to be able to handle an object regardless of the pose in which the object is placed in the robot's workspace. The robot should therefore be able to recognize which parts of the object are at which positions, so that it can, for example, grasp the object at the correct location, e.g. in order to attach it to another object, or weld the object at its current location. This means that the robot should be able to recognize the pose (position and orientation) of the object, or regions of the object such as a barcode, from one or more images captured by a camera attached to the robot. One approach to achieving this is to determine descriptors, i.e. points (vectors) in a predefined descriptor space, for parts of the object (i.e. pixels of the object represented in a camera image plane), the robot being trained to assign the same descriptors to the same parts of an object regardless of the current pose of the object, and thus to recognize the topology of the object in the image, so that it is then known, for example, which corner of the object is located where in the image. If the pose of the camera is known, the pose of the object or the position of regions of the object in three-dimensional space can in turn be inferred. The recognition of the topology can be realized with a machine learning model that is trained accordingly.

However, difficulties also arise with this approach because, for example, the machine learning model may not always assign a descriptor to the same location of the object in different camera images, owing to different lighting conditions (in particular reflections) or insufficient training. This can lead to inaccuracies in pose determination or, more generally, in the determination of locations of the object. Accordingly, approaches that achieve greater accuracy, e.g. in determining poses, are desirable.
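The step from an image position to a position in three-dimensional space, given a known camera pose, can be sketched as follows. This is only an illustration under stated assumptions: a pinhole camera model, a depth value available for the pixel, and a camera pose given as a rotation matrix and a translation. The names `backproject`, `camera_to_world` and all numeric values are hypothetical and do not appear in the disclosure.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with depth along the optical axis to camera coordinates.

    Assumes a pinhole camera with focal lengths fx, fy and principal point (cx, cy).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam, rotation, translation):
    """Apply the camera pose: p_world = rotation * p_cam + translation."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical intrinsics and camera pose (camera 1 m along the world z-axis).
FX = FY = 500.0
CX, CY = 320.0, 240.0
ROTATION = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
TRANSLATION = (0.0, 0.0, 1.0)

# A located pixel 100 px right of the principal point, seen at 2 m depth.
p_cam = backproject(420.0, 240.0, 2.0, FX, FY, CX, CY)
p_world = camera_to_world(p_cam, ROTATION, TRANSLATION)
```

With these assumed values the pixel maps to a point 0.4 m to the side of the optical axis, shifted by the camera translation in the world frame.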

According to various embodiments, a method for locating locations of objects from camera images of the objects is provided, comprising: defining the locations to be located for an object type of the objects; determining a reference for the relative position of the locations to be located; training, for the object type, a machine learning model for mapping camera images, each camera image showing an object of the object type, to descriptor images, wherein a descriptor image to which a camera image is to be mapped has, for a location of the object that the camera image shows at an image position, a descriptor of that location of the object at that image position; setting a reference set of descriptors to an initial set of descriptors; receiving a temporal sequence of camera images, each camera image showing an object of the object type; and locating, for each camera image, the locations to be located on the respective object by:

mapping the camera image to a descriptor image using the trained machine learning model; and
identifying the locations of the object to be located for the reference set of descriptors by searching for the descriptors of the reference set of descriptors in the descriptor image;

and further comprising, at least for a part of the camera images of the sequence of camera images:

determining the relative position of the located locations for the reference set, and comparing the reference for the relative position of the locations to be located with the relative position of the located locations determined for the reference set;
identifying the locations of the object to be located for a test set of descriptors by searching for the descriptors of the test set of descriptors in the descriptor image;
determining the relative position of the located locations for the test set, and comparing the reference for the relative position of the locations to be located with the relative position of the located locations determined for the test set; and
updating the descriptors of the reference set to the descriptors of the test set if, for one or more of the camera images, the agreement between the reference for the relative position of the locations to be located and the determined relative position of the located locations is better for the test set than for the reference set.
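The per-image steps above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the descriptor image is a nested list of descriptor tuples, the search is a nearest-neighbour lookup, the relative position is taken as pairwise distances in pixel coordinates for brevity, and the reference set is replaced as soon as the test set agrees better on a single image. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def find_descriptor(descriptor_image, descriptor):
    """Return the (row, col) whose descriptor is closest to `descriptor`."""
    best_dist, best_pos = math.inf, None
    for r, row in enumerate(descriptor_image):
        for c, candidate in enumerate(row):
            dist = math.dist(candidate, descriptor)
            if dist < best_dist:
                best_dist, best_pos = dist, (r, c)
    return best_pos


def locate(descriptor_image, descriptor_set):
    """Locate one image position per descriptor of the set."""
    return [find_descriptor(descriptor_image, d) for d in descriptor_set]


def pairwise_distances(points):
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def mismatch(reference_distances, points):
    """Sum of absolute deviations from the reference relative position."""
    return sum(abs(a - b) for a, b in zip(reference_distances, pairwise_distances(points)))


def process_image(descriptor_image, reference_set, test_set, reference_distances):
    """Locate with the reference set; adopt the test set if it agrees better."""
    located_ref = locate(descriptor_image, reference_set)
    located_test = locate(descriptor_image, test_set)
    if mismatch(reference_distances, located_test) < mismatch(reference_distances, located_ref):
        reference_set = test_set
    return located_ref, reference_set


# Toy 3x3 descriptor image with one-dimensional descriptors (stand-in for model output).
descriptor_image = [
    [(0.0,), (0.1,), (0.2,)],
    [(0.3,), (0.4,), (0.5,)],
    [(0.6,), (0.7,), (0.8,)],
]
# The two locations to be located are opposite corners of the toy object.
reference_distances = [math.dist((0, 0), (2, 2))]
reference_set = [(0.0,), (0.52,)]  # second descriptor has drifted
test_set = [(0.0,), (0.79,)]

located, new_reference_set = process_image(
    descriptor_image, reference_set, test_set, reference_distances
)
```

In this toy case the drifted reference descriptor lands on pixel (1, 2) instead of the corner (2, 2), while the test set reproduces the reference distance exactly, so the test set is adopted.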

In cases where multiple descriptors are used (e.g. tracked), such as for pose determination or for determining (or tracking) a region, the method described above allows a more accurate localization of locations of the object, since the descriptors used are improved over the sequence of camera images in the sense that they better fulfill their theoretical property, namely that they are always assigned to the same locations in a view-invariant manner. In particular, the method described above increases robustness to variations in lighting conditions etc. within a sequence of camera images.

In this way, the method described above enables, for example, the reliable picking up (e.g. gripping) of an object in any position of the object, or the exact determination of certain regions of the object, such as the location of a barcode.

Various exemplary embodiments are specified below.

Exemplary embodiment 1 is the method described above for locating locations of objects from camera images of the objects.

Exemplary embodiment 2 is the method according to exemplary embodiment 1, wherein the relative position comprises the pairwise distances of the locations to be located, or of the located locations, in three-dimensional space.

This evaluates the extent to which the located locations are correctly located in three-dimensional space. In particular, the use of distances in three-dimensional space (instead of only distances in the camera image plane) ensures that detection errors that manifest themselves only as a deviation perpendicular to the camera image plane are taken into account.
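The benefit of three-dimensional distances can be illustrated with a small sketch. It assumes, as a simplification, that the x and y coordinates of a point stand in for its position in the camera image plane and z for the direction perpendicular to it; the point values and the helper `distance_error` are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def distance_error(reference, located):
    """Sum of absolute differences between corresponding pairwise distances."""
    return sum(
        abs(a - b)
        for a, b in zip(pairwise_distances(reference), pairwise_distances(located))
    )


# Hypothetical reference locations (metres); z is perpendicular to the image plane.
reference_points = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.1, 1.0)]
# The third location is detected with an error in z only.
located_points = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.1, 1.05)]

error_3d = distance_error(reference_points, located_points)
# Same comparison restricted to the in-plane coordinates.
error_2d = distance_error(
    [p[:2] for p in reference_points], [p[:2] for p in located_points]
)
```

Here `error_2d` is zero because the in-plane coordinates are unchanged, whereas `error_3d` is positive, so only the three-dimensional comparison registers the detection error.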

Exemplary embodiment 3 is the method according to exemplary embodiment 1 or 2, wherein the locations to be located for the object type are defined on a reference camera image of an object of the object type, the reference camera image is fed to the machine learning model, and the reference set of descriptors is set to the descriptors of the locations to be located in the descriptor image output by the machine learning model for the reference camera image.

In this way, the locations to be located and a good initial value for the reference set of descriptors can be defined easily, since this reference set corresponds to the locations to be located at least for the reference camera image. Over the course of the sequence of camera images, the reference set used to locate the locations to be located can then be improved.
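Initializing the reference set from a reference camera image amounts to reading descriptors at the selected image positions. A minimal sketch, assuming the descriptor image for the reference camera image is already available as a nested list indexed by row and column (the data and the function name are hypothetical):

```python
def initial_reference_set(reference_descriptor_image, pixel_positions):
    """Read the descriptor at each selected (row, col) position."""
    return [reference_descriptor_image[r][c] for r, c in pixel_positions]


# Stand-in for the model output on the reference camera image (2x2 pixels, D = 2).
reference_descriptor_image = [
    [(0.1, 0.9), (0.2, 0.8)],
    [(0.3, 0.7), (0.4, 0.6)],
]
# Locations to be located, selected on the reference camera image.
selected_positions = [(0, 1), (1, 0)]

reference_set = initial_reference_set(reference_descriptor_image, selected_positions)
```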

Exemplary embodiment 4 is the method according to any one of exemplary embodiments 1 to 3, wherein the test set is chosen within a restricted range of the initial set of descriptors.

For example, the restricted range allows only a certain relative deviation of the descriptors of the test set from those of the initial set. This avoids descriptors that deviate too strongly from the initial set, and are therefore unlikely to be suitable for multiple camera images, being tested (unnecessarily) or even being adopted into the reference set on the basis of one (or a few) camera images. In particular, unstable behavior of the descriptor adaptation process can be avoided.
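One possible reading of such a restricted range is a per-component band around the initial descriptors. The sketch below clamps each component of a candidate test set to within a relative deviation of the corresponding initial value; the function name, the default of 10 % and the component-wise interpretation are assumptions made for illustration only.

```python
def clip_to_range(test_set, initial_set, max_relative_deviation=0.1):
    """Clamp each descriptor component to a relative band around its initial value.

    Note: with a purely relative band, a component whose initial value is 0
    is not allowed to move at all.
    """
    clipped = []
    for test_descriptor, initial_descriptor in zip(test_set, initial_set):
        new_descriptor = []
        for t, i in zip(test_descriptor, initial_descriptor):
            margin = abs(i) * max_relative_deviation
            new_descriptor.append(min(max(t, i - margin), i + margin))
        clipped.append(tuple(new_descriptor))
    return clipped


initial_set = [(1.0, -2.0)]
candidate_test_set = [(1.5, -2.1)]  # first component deviates by 50 %
restricted_test_set = clip_to_range(candidate_test_set, initial_set)
```

The first component is pulled back to the edge of its band, while the second, which deviates by only 5 %, is left unchanged.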

Exemplary embodiment 5 is the method according to any one of exemplary embodiments 1 to 4, comprising determining the test set of descriptors by means of a covariance matrix adaptation evolution strategy (CMA-ES) method.

This enables the efficient optimization of a function that is not available in closed form, in this case the mapping from descriptors to the achieved accuracy of determining the locations of the object, given a large number of evaluation points, as arises over the course of the camera images.
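The idea of proposing test sets by a sampling-based search over a black-box objective can be sketched with a much simpler relative of CMA-ES. The code below is a cross-entropy style evolution strategy with a diagonal (per-component) variance; it is not CMA-ES itself, which additionally adapts a full covariance matrix and a step size. The objective `localization_error` is a hypothetical stand-in for the mismatch measured on camera images, and all parameter values are assumptions.

```python
import random


def evolve(objective, start, sigma=0.5, population=20, elite=5, generations=40, seed=0):
    """Minimize `objective` by repeatedly sampling around a mean and refitting to the elites."""
    rng = random.Random(seed)
    mean = list(start)
    sigmas = [sigma] * len(mean)
    best, best_score = list(mean), objective(mean)
    for _generation in range(generations):
        # Sample candidate descriptor vectors (candidate test sets, flattened).
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, sigmas)] for _ in range(population)
        ]
        candidates.sort(key=objective)
        elites = candidates[:elite]
        top_score = objective(elites[0])
        if top_score < best_score:
            best, best_score = elites[0], top_score
        # Refit mean and per-component spread to the elite samples.
        mean = [sum(e[i] for e in elites) / elite for i in range(len(mean))]
        sigmas = [
            max(1e-6, (sum((e[i] - mean[i]) ** 2 for e in elites) / elite) ** 0.5)
            for i in range(len(mean))
        ]
    return best, best_score


# Hypothetical black-box objective: squared distance to an unknown ideal descriptor.
IDEAL = (0.3, -0.2)


def localization_error(flat_descriptors):
    return sum((d - t) ** 2 for d, t in zip(flat_descriptors, IDEAL))


start = [0.0, 0.0]
best, best_score = evolve(localization_error, start)
```

By construction the returned score is never worse than that of the starting point, which mirrors the rule that the reference set is only replaced by a test set that agrees better with the reference.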

Exemplary embodiment 6 is a method for controlling a robot, comprising:

locating locations of an object to be handled by the robot according to any one of exemplary embodiments 1 to 5,
determining a pose of the object from the located locations and controlling the robot depending on the determined pose
and/or
determining a region of the object (e.g. to be scanned or otherwise relevant for the processing or control) from the located locations and controlling the robot depending on the determined region.
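How a pose might be derived from located locations is not detailed at this point. One simple possibility, shown purely as an assumption, is to build an object coordinate frame from three non-collinear located locations in three-dimensional space: the first gives the position, the direction to the second gives the x-axis, and the normal of the plane through all three gives the z-axis. The function names and point values are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def pose_from_points(p0, p1, p2):
    """Return (position, (x_axis, y_axis, z_axis)) of a right-handed object frame.

    Requires three non-collinear points.
    """
    x_axis = normalize(sub(p1, p0))
    z_axis = normalize(cross(x_axis, sub(p2, p0)))
    y_axis = cross(z_axis, x_axis)
    return p0, (x_axis, y_axis, z_axis)


# Hypothetical located locations: the object lies flat, rotated by 90 degrees about z.
position, (x_axis, y_axis, z_axis) = pose_from_points(
    (1.0, 2.0, 0.0), (1.0, 3.0, 0.0), (0.0, 2.0, 0.0)
)
```

With more than three located locations, a least-squares rigid alignment against the reference locations would be a natural alternative; the three-point frame is used here only because it needs no linear algebra library.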

Exemplary embodiment 7 is a software or hardware agent, in particular a robot, comprising:

a camera configured to provide camera images of objects;
a control device configured to carry out the method according to any one of exemplary embodiments 1 to 6.

Exemplary embodiment 8 is the software or hardware agent according to exemplary embodiment 7, comprising at least one actuator, wherein the control device is configured to control the at least one actuator using the located locations.

Exemplary embodiment 9 is a computer program comprising instructions that, when executed by a processor, cause the processor to carry out a method according to any one of exemplary embodiments 1 to 6.

Exemplary embodiment 10 is a computer-readable medium storing instructions that, when executed by a processor, cause the processor to carry out a method according to any one of exemplary embodiments 1 to 6.

In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various aspects are described with reference to the following drawings:

Figure 1 shows a robot.
Figure 2 illustrates the training of a neural network according to an embodiment.
Figure 3 illustrates the determination of an object pose or gripping pose according to an embodiment.
Figure 4 illustrates the adaptation of descriptors used to locate locations of objects over the course of a sequence of camera images.
Figure 5 shows a flowchart for a method for locating locations of objects from camera images of the objects.

The following detailed description refers to the accompanying drawings, which show, by way of illustration, specific details and aspects of this disclosure in which the invention may be practiced. Other aspects may be utilized, and structural, logical, and electrical changes may be made, without departing from the scope of the invention. The various aspects of this disclosure are not necessarily mutually exclusive, as some aspects of this disclosure can be combined with one or more other aspects of this disclosure to form new aspects.

Various examples are described in more detail below.

Figure 1 shows a robot 100.

The robot 100 includes a robot arm 101, for example an industrial robot arm for handling or assembling a workpiece (or one or more other objects). The robot arm 101 includes manipulators 102, 103, 104 and a base (or support) 105 by which the manipulators 102, 103, 104 are supported. The term "manipulator" refers to the movable components of the robot arm 101 whose actuation enables physical interaction with the environment, e.g. in order to carry out a task. For control, the robot 100 includes a (robot) control device 106 that is designed to implement the interaction with the environment according to a control program. The last component 104 of the manipulators 102, 103, 104 (the one farthest from the base 105) is also referred to as the end effector 104 and may include one or more tools, such as a welding torch, a gripping instrument, a painting device, or the like.

The other manipulators 102, 103 (which are closer to the base 105) can form a positioning device so that, together with the end effector 104, the robot arm 101 is provided with the end effector 104 at its end. The robot arm 101 is a mechanical arm that can provide functions similar to those of a human arm (possibly with a tool at its end).

The robot arm 101 may include joint elements 107, 108, 109 that connect the manipulators 102, 103, 104 to each other and to the base 105. A joint element 107, 108, 109 may have one or more joints, each of which can provide rotatable motion (i.e. rotation) and/or translational motion (i.e. displacement) of the associated manipulators relative to one another. The movement of the manipulators 102, 103, 104 can be initiated by means of actuators that are controlled by the control device 106.

The term "actuator" can be understood as a component that is designed to bring about a mechanism or process in response to being driven. The actuator can implement instructions created by the control device 106 (the so-called activation) as mechanical movements. The actuator, e.g. an electromechanical converter, may be designed to convert electrical energy into mechanical energy in response to being driven.

The term "control device" can be understood as any type of logic-implementing entity, which may include, for example, a circuit and/or a processor capable of executing software, firmware, or a combination thereof stored in a storage medium, and which can issue instructions, e.g. to an actuator in the present example. The control device may be configured, for example, by program code (e.g. software) to control the operation of a system, a robot in the present example.

In the present example, the control device 106 includes one or more processors 110 and a memory 111 that stores code and data on the basis of which the processor 110 controls the robot arm 101. According to various embodiments, the control device 106 controls the robot arm 101 on the basis of a machine learning model 112 stored in the memory 111.

According to various embodiments, the machine learning model 112 is designed and trained to enable the robot 100 to recognize, from camera images, a pick-up pose of an object 113 that is placed, for example, in a workspace of the robot arm 101, for example for a robot that is to pick objects out of a bin ("bin picking"). This means that the robot 100 recognizes how it can pick up the object 113, i.e. how it has to orient its end effector 104 and where it has to move it in order to pick up (e.g. grip) the object 113. The pick-up pose is understood to contain sufficient information for picking up, i.e. information about the orientation and position of the object 113 that is sufficient to determine how the object 113 can be gripped. The pick-up pose does not necessarily have to contain the complete orientation information about the object 113 since, for example, in the case of an object 113 with a rotationally symmetrical part for gripping, it may be irrelevant how the rotationally symmetrical part is rotated about its axis of rotation.

The robot 100 may, for example, be equipped with one or more cameras 114 that enable it to record images of its workspace. A camera 114 is attached to the robot arm 101, for example, so that the robot can take images of the object 113 from different perspectives by moving the robot arm 101 around.

An example of a machine learning model 112 for object recognition is a dense object net. A dense object net maps an image (e.g. an RGB image provided by the camera 114) to a descriptor space image of a chosen dimension D. However, other machine learning models 112 can also be used, in particular those that do not necessarily generate a "dense" feature map but merely assign descriptor values to certain points (e.g. corners) of the object.

According to various embodiments, an approach for recognizing an object and its pose is used under the assumption that a 3D model (e.g. a CAD (Computer Aided Design) model) of the object is known, which is typically the case for industrial assembly or processing tasks. Non-linear dimensionality reduction techniques can be used to calculate optimal target images for the training input images of a neural network. Thus, according to various embodiments, supervised training of a neural network is used. It is also possible to record RGBD images (RGB plus depth information) of an object and to determine a 3D model of the object from them. Alternatively, self-supervised training can be performed, in which the machine learning model itself learns descriptors for locations of the object.

For supervised training, according to one embodiment, a data collection is first carried out in order to generate training data for training the machine learning model 112. In particular, registered RGB (red-green-blue) images are collected, for example. A registered image here means an RGB image with known intrinsic and extrinsic camera parameters. In a real-world scenario, a camera 114 attached to a robot (e.g. a camera attached to a robot wrist) is used, for example, to scan an object while the robot (e.g. robot arm 101) moves around. In a simulated scenario, photorealistically generated RGB images with known object poses are used.

After the RGB images have been collected, target images for the RGB images are rendered for supervised training of a neural network.

It is assumed that the pose of each object in world coordinates is known in each collected RGB image. This is straightforward for a simulated scenario but requires manual setup for a real-world scenario, e.g. placing the object at predefined positions. RGBD (RGB plus depth information) images can also be used to determine the position of an object.

With this information, and using a vertex descriptor computation technique such as the one described below, a descriptor image (i.e. training output image, also referred to as target image or ground-truth image) is rendered for each RGB image (i.e. training input image).

Once a target image has been generated for each RGB image, i.e. pairs of RGB images and target images have been formed, these pairs of training input image and associated target image can be used as training data for training a neural network, as illustrated in FIG. 2.

FIG. 2 illustrates the training of a neural network 200 according to one embodiment.

The neural network 200 is a fully convolutional network that maps an h × w × 3 tensor (input image) to an h × w × D tensor (output image).

It comprises several stages 204 of convolutional layers, each followed by a pooling layer, as well as upsampling layers 205 and skip connections 206 for combining the outputs of different layers.

For training, the neural network 200 receives a training input image 201 and outputs an output image 202 with pixel values in descriptor space (e.g. color components according to descriptor vector components). A training loss is calculated between the output image 202 and the target image 203 associated with the training input image. This can be done for a batch of training input images, with the training loss averaged over the training input images, and the weights of the neural network 200 are trained by stochastic gradient descent using the training loss. The training loss calculated between the output image 202 and the target image 203 is, for example, an L2 loss function (in order to minimize a pixel-wise least-squares error between the target image 203 and the output image 202).
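Purely by way of illustration, the pixel-wise L2 loss and its averaging over a batch could be sketched as follows in Python. The function names `l2_loss` and `batch_loss` are hypothetical, the images are represented as nested lists, and dividing by the number of pixels is an assumption of this sketch; the disclosure only specifies a pixel-wise least-squares error.

```python
def l2_loss(output, target):
    """Squared descriptor error between two h x w x D images (nested lists),
    averaged over the pixels (averaging per pixel is an assumption)."""
    total = 0.0
    count = 0
    for out_row, tgt_row in zip(output, target):
        for out_px, tgt_px in zip(out_row, tgt_row):
            # Squared Euclidean distance between the two descriptor vectors.
            total += sum((o - t) ** 2 for o, t in zip(out_px, tgt_px))
            count += 1
    return total / count


def batch_loss(outputs, targets):
    """Average the per-image loss over a batch of (output, target) pairs."""
    losses = [l2_loss(o, t) for o, t in zip(outputs, targets)]
    return sum(losses) / len(losses)


# Toy 1 x 2 image with D = 2 descriptor components per pixel.
output = [[[0.0, 1.0], [2.0, 2.0]]]
target = [[[0.0, 0.0], [2.0, 4.0]]]
loss = l2_loss(output, target)
```

In an actual training loop, the gradient of such a loss with respect to the network weights would be used for the stochastic gradient descent step; that part is omitted here.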

The training input image 201 shows an object, and the target image as well as the output image contain vectors in descriptor space. The vectors in descriptor space can be mapped to colors, so that the output image 202 (as well as the target image 203) resembles a heat map of the object.

The vectors in descriptor space (also referred to as (dense) descriptors) are d-dimensional vectors (e.g. d is 1, 2 or 3) that are assigned to each pixel in the respective image (e.g. to each pixel of the input image 201, assuming that the input image 201 and the output image 202 have the same dimensions). The dense descriptors implicitly encode the surface topology of the object shown in the input image 201, invariant to its pose or to the camera position.

If a 3D model of the object is given, it is possible to analytically determine an optimal and unique descriptor for each vertex of the 3D model of the object. According to various embodiments, target images for registered RGB images are generated using these optimal descriptors (or estimates of these descriptors determined by an optimization), which results in fully supervised training of the neural network 200. In addition, the descriptor space becomes explainable and optimal regardless of the chosen descriptor dimension d.

Once the machine learning model 112, e.g. the neural network 200, has been trained to map camera images of an object 113 to descriptor images, the following procedure can be used to determine a pick-up pose of an object 113 in an unknown position.

First, several reference points $p_i$, $i = 1, \dots, N$, are selected on the object 113 and descriptors of these reference points are determined. These reference points are the locations of the object that are to be located in a later ("new") camera image, and the descriptors of the reference points form the (first, i.e. initial) reference set of descriptors (which can be adapted over the course of a sequence of camera images as described below). The selection can be made by recording a camera image of the object 113, selecting reference pixels $(u_i, v_i)$ on the object (and thus corresponding reference points of the object), and having the neural network 200 map the camera image to a descriptor image. The descriptors at the positions in the descriptor image given by the positions of the reference pixels can then be taken as the descriptors of the reference points, i.e. the descriptors of the reference points are $d_i = I^d(u_i, v_i)$, where $I^d = f(I; \theta)$ is the descriptor image, $f$ is the mapping (from camera image to descriptor image) implemented by the neural network, $I$ is the camera image, and $\theta$ are the weights of the machine learning model 200.

If the object 113 is now in an unknown position, a new camera image $I_{\text{new}}$ is recorded and an associated descriptor image $I^d_{\text{new}} = f(I_{\text{new}}; \theta)$ is determined by means of the machine learning model. In this new descriptor image, descriptors are then searched for that are as close as possible to the reference descriptors $d_i$, for example by

$$(u_i, v_i)^* = \operatorname*{argmin}_{(u_i, v_i)} \left\lVert I^d_{\text{new}}(u_i, v_i) - d_i \right\rVert_2^2 \quad \text{for all } i = 1, \dots, N.$$
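The search described above can be sketched, under simplifying assumptions, as a brute-force nearest-neighbour search over all pixels of the new descriptor image. The function name `find_best_match` and the toy data are hypothetical; the descriptor image is represented as a nested list indexed as `image[v][u]`, which is a convention chosen for this sketch only.

```python
def find_best_match(descriptor_image, reference_descriptor):
    """Return the pixel (u, v) whose descriptor has the smallest squared
    L2 distance to the reference descriptor, together with that distance."""
    best_pixel, best_dist = None, float("inf")
    for v, row in enumerate(descriptor_image):
        for u, desc in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(desc, reference_descriptor))
            if dist < best_dist:
                best_pixel, best_dist = (u, v), dist
    return best_pixel, best_dist


# Toy descriptor image with 2 rows, 3 columns and 2-dimensional descriptors.
new_descriptor_image = [
    [[0.0, 0.0], [0.5, 0.1], [1.0, 0.0]],
    [[0.0, 1.0], [0.5, 0.9], [1.0, 1.0]],
]
# Reference descriptors d_i, as they might have been read from an earlier image.
reference_descriptors = [[0.9, 0.1], [0.1, 0.9]]

# One estimated pixel position per reference descriptor.
matches = [find_best_match(new_descriptor_image, d)[0]
           for d in reference_descriptors]
```

The returned distance could additionally serve as a simple indication of how well a reference descriptor was found; this use is not taken from the disclosure.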

From the positions $(u_i, v_i)^*$ of the reference points thus determined or estimated in the descriptor image $I^d_{\text{new}}$ (and thus correspondingly in the new camera image $I_{\text{new}}$), the positions of the reference points in three-dimensional space are determined. For example, a depth image is recorded together with the camera image $I_{\text{new}}$ (or the camera image $I_{\text{new}}$ has a depth channel, e.g. it is an RGBD image), so that the three-dimensional position of the i-th reference point $p_i$ can be determined from $(u_i, v_i)^*$ (by projecting the depth value at the position $(u_i, v_i)^*$ into the respective workspace coordinate system).
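One possible way of carrying out this projection is sketched below, assuming an ideal pinhole camera model, which the disclosure does not prescribe. The intrinsic parameters `fx`, `fy`, `cx`, `cy`, the extrinsic `rotation` and `translation`, and both function names are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map a pixel (u, v) with a depth value to a 3D point in the camera
    frame, assuming a pinhole model with focal lengths fx, fy and
    principal point (cx, cy)."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


def camera_to_workspace(point, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) that
    takes camera coordinates to workspace coordinates."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


# Toy example: pixel 100 columns right of the principal point, 0.5 m away.
point_cam = backproject(u=420, v=240, depth=0.5,
                        fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Camera assumed axis-aligned with the workspace and offset by 1 m along z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point_ws = camera_to_workspace(point_cam, identity, (0.0, 0.0, 1.0))
```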

If the positions in space of several reference points are known, a pick-up pose can be determined from them, as shown in FIG. 3.

For example, the positions in space of two reference points $p_1$ and $p_2$ on the object 300 are determined and the two are linearly combined, e.g. their mean is taken, in order to define an anchor point 304. In order to define a gripping orientation, a first axis 301 is defined along the direction given by $p_1$ and $p_2$, and a second axis 302 through the anchor point 304 is chosen arbitrarily, e.g. in the z-axis direction of the camera 114 or in the direction of an axis of the workspace coordinate system. A third axis 303 through the anchor point 304 can be calculated as the vector product of the direction vector of the first axis 301 and the direction vector of the second axis 302. The three axes 301 to 303 and the anchor point 304 define a pick-up pose for the object 300. A robot can then be controlled so that it grips around the shaft of the object 300, which extends in the direction of the first axis. The reference points $p_1$ and $p_2$ are chosen, for example, such that they lie along the shaft as shown, i.e. along an elongate section of the object that is suitable for gripping.
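The construction of the pick-up pose from two reference points can be sketched as follows. The function name `pickup_pose` and the parameter `second_axis` are hypothetical; normalizing the direction vectors and rejecting a second axis parallel to the first are additions of this sketch that are not stated in the disclosure.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    norm = math.sqrt(sum(x * x for x in a))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(x / norm for x in a)


def pickup_pose(p1, p2, second_axis=(0.0, 0.0, 1.0)):
    """Anchor point and three axis directions from two 3D reference points.

    second_axis is the arbitrarily chosen second direction, e.g. the camera
    z-axis; it must not be parallel to the line through p1 and p2.
    """
    anchor = tuple((a + b) / 2.0 for a, b in zip(p1, p2))   # mean of p1, p2
    axis1 = normalize(sub(p2, p1))                          # along the shaft
    axis2 = normalize(second_axis)                          # chosen freely
    axis3 = normalize(cross(axis1, axis2))                  # vector product
    return anchor, axis1, axis2, axis3


anchor, axis1, axis2, axis3 = pickup_pose((0.0, 0.0, 0.5), (0.2, 0.0, 0.5))
```

If a strictly orthonormal frame is required, the second axis could afterwards be recomputed as the vector product of the third and first axes; whether this is necessary depends on the robot interface used.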

Analogously, three or more reference points can be arranged on a gripping surface of an object, so that a full 6D pick-up pose of the object, or the orientation of a gripping surface at which the object can be picked up (gripped or sucked on), can be determined from their positions.

It should be noted that the gripper does not necessarily have to have the form of pincers but can also have, for example, a suction device for sucking onto the object at a suitable surface and thereby picking it up. In order to bring the suction device into the correct position, it may be desired in this case, for example, to determine a pick-up pose that specifies the orientation and position of a surface of the object that is suitable for suction. This can be done, for example, by determining an anchor point and a plane normal vector at the anchor point.
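As a sketch of one way to obtain such an anchor point and normal vector, three reference points on the suction surface can be used: their centroid serves as the anchor point, and the normalized vector product of two edge vectors serves as the plane normal. The choice of the centroid and the function name `surface_pose` are assumptions of this sketch and are not taken from the disclosure.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def surface_pose(p1, p2, p3):
    """Anchor point (centroid) and unit normal of the plane through three
    3D reference points that are not collinear."""
    anchor = tuple((a + b + c) / 3.0 for a, b, c in zip(p1, p2, p3))
    n = cross(sub(p2, p1), sub(p3, p1))
    norm = math.sqrt(sum(x * x for x in n))
    if norm == 0.0:
        raise ValueError("reference points are collinear")
    return anchor, tuple(x / norm for x in n)


anchor, normal = surface_pose((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

The sign of the normal depends on the order of the points, so the order would have to be fixed (or the normal flipped towards the camera) before it is used to orient a suction device.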

More than three reference points can also be used to determine a pick-up pose, e.g. in order to reduce errors by averaging.

Similarly to a pick-up pose, the controller 106 can also determine areas on the object 300, e.g. a bounding box of a barcode that is provided on the object 300 in order to identify the object. The controller 106 can then, for example, control the robot arm 101 such that it holds the object 300 in front of a camera in such a way that the camera can read the barcode.

As explained above, a dense object net maps a (e.g. RGB) camera image of an object or of a part of it, recorded from an arbitrary viewing angle, to a descriptor image that assigns a multidimensional descriptor to each pixel (or each position) in the input camera image. In theory, the descriptors have the property that a given point on the surface of the object is always associated with the same descriptor, regardless of the viewing angle. This property is useful for various applications, e.g. for detecting areas on the surface of the target object by detecting several descriptors, e.g. of corner points of the area, or for detecting a pick-up pose as described above with reference to FIG. 3. With additional depth information, i.e. RGBD input data (RGB images plus depth information) instead of RGB images alone, the detected points can be projected into 3D space in order to fully define such an area.

In practice, however, the mapping of points on the object to view-invariant descriptors is usually not perfect, and its quality degrades due to factors such as a small number of visual features, low dimensionality of the descriptor space, insufficient training data for some parts, ambiguity due to symmetric parts or similar objects, significantly different viewing angles, or significantly different external conditions such as ambient lighting, invalid depth information due to reflective surfaces, etc. in the camera images that show the different views of an object.

According to various embodiments, the robustness against such factors is increased in that initially selected descriptors are not tracked statically but are optimized over the course of a sequence of camera images by being modified (typically slightly), so that an improved determination of locations on an object (and thus of a pose or of an area given by the locations, e.g. corner points of a bounding box) is made possible. In other words, during operation ("online"), the selection of the descriptors used is improved over time (i.e. over the course of a received temporal sequence of camera images).

The selection of the descriptors can be carried out by means of a black-box optimization method. Examples are Bayesian optimization (BO) and CMA-ES (covariance matrix adaptation evolution strategy). According to one embodiment, CMA-ES is used because it performs better for a high number of samples, which is typically a relevant aspect in this application.

To improve the descriptors over the course of a sequence of camera images, a set of descriptors (e.g. three descriptors for pose detection or four descriptors for defining a quadrilateral area) is first defined as the first reference (i.e. the initial value of the set of descriptors), for example by a user selecting a corresponding number of locations on an object in a camera image, which are then mapped to descriptors by the machine learning model. This reference set of descriptors is adapted over time (i.e. over the course of the sequence of camera images). Specifically, it is replaced by a test set of descriptors if the test set provides a higher accuracy. There is therefore a current reference set of descriptors for each camera image, beginning with the descriptor set chosen as the first reference. According to one embodiment, several test sets of descriptors are evaluated in parallel, i.e. on the same image or the same subsequence of images, and the one that provides the highest accuracy is selected as the new reference set.
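The replacement rule for the reference set can be sketched as follows. The function `select_reference_set` and the toy accuracy function are hypothetical; in particular, `toy_accuracy` is only a stand-in for the determination accuracy and has no basis in the disclosure.

```python
def select_reference_set(reference_set, test_sets, accuracy):
    """Keep the current reference set unless one of the test sets, evaluated
    with the same accuracy function, achieves a strictly higher accuracy."""
    best_set, best_acc = reference_set, accuracy(reference_set)
    for candidate in test_sets:
        acc = accuracy(candidate)
        if acc > best_acc:
            best_set, best_acc = candidate, acc
    return best_set, best_acc


# Stand-in accuracy: negative squared distance to a made-up "ideal" set.
IDEAL = [[1.0, 0.0], [0.0, 1.0]]


def toy_accuracy(descriptor_set):
    return -sum((a - b) ** 2
                for d, ideal in zip(descriptor_set, IDEAL)
                for a, b in zip(d, ideal))


reference = [[0.8, 0.0], [0.0, 0.8]]
test_a = [[0.9, 0.0], [0.0, 0.9]]   # slightly closer to IDEAL
test_b = [[0.5, 0.0], [0.0, 0.5]]   # further away from IDEAL
new_reference, new_accuracy = select_reference_set(
    reference, [test_a, test_b], toy_accuracy)
```

Evaluating all candidates with the same accuracy function corresponds to evaluating the test sets on the same image or the same subsequence of images.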

The following is then carried out for each camera image of the sequence of camera images:

(A) For the camera image and the (current) reference set of descriptors, a determination accuracy is determined that indicates how well the locations of the object to be determined are determined (e.g. how well a target area is determined).
(B) A new set of descriptors (test set) is proposed in order to increase the determination accuracy for future camera images of the sequence.

For (A), a quality measure of a set of descriptors is used as the determination accuracy. Instead of (or in addition to) evaluating descriptors individually (e.g. by calculating a confidence that an identified location is unique), the quality of a descriptor set is assessed on the basis of the relative spatial position of the locations identified by its descriptors on the surface of the object (in three-dimensional space):

(A1) For each descriptor, the position (pixel) in the camera image that the machine learning model maps to the descriptor is selected.
(A2) Each of these positions in the camera image plane is mapped to the associated 3D position using depth information (and e.g. intrinsic camera calibration parameters).
(A3) A measure is computed that captures the relative position of the 3D positions, e.g. pairwise (e.g. Euclidean) distances between the 3D positions.
(A4) The computed measure is compared with a reference measure. The result of the comparison (e.g. a mean of the differences between the pairwise distances of the 3D positions and the reference distances) indicates the determination accuracy. The lower the comparison result, the higher the determination accuracy. For example, the value of the determination accuracy is set to the reciprocal of the comparison result. The reference measure or reference distances can be determined, for example, in a first camera image or measured on the real object. For example, the distances between the corner points of a barcode (or other prominent locations) on the real object are measured.
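Steps (A2) to (A4) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the disclosure: it assumes a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, a depth image indexed as `[row, column]`, the mean absolute difference as the comparison result, and a small `eps` to avoid division by zero. All function and parameter names are hypothetical.

```python
import numpy as np

def backproject(pixels, depth, fx, fy, cx, cy):
    """(A2) Map (u, v) pixels to 3D camera-frame points (pinhole model, assumed)."""
    pixels = np.asarray(pixels)
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v, u]                      # depth image is indexed [row, column]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def pairwise_distances(points):
    """(A3) Euclidean distances between all unordered pairs of 3D points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(len(points), k=1)
    return dist[i, j]

def determination_accuracy(points_3d, reference_distances, eps=1e-9):
    """(A4) Reciprocal of the mean absolute deviation from the reference distances.

    eps is an assumption added here so that a perfect match does not divide by zero.
    """
    deviation = np.mean(np.abs(pairwise_distances(points_3d) - reference_distances))
    return 1.0 / (deviation + eps)
```

The reference distances would be obtained once, either by applying `pairwise_distances` to the 3D positions from a first camera image or from measurements on the real object, and must be listed in the same pair order.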

For (B), an optimization method is carried out for an objective function (not known in closed form, but evaluable by means of (A)) that receives a set of descriptors as input and outputs the accuracy measure for the camera image. (The accuracy measure can also be a mean of the accuracy measures of several camera images.) A black-box optimization method can be used as follows to improve the reference set of descriptors over the course of the sequence of camera images (or to adapt it to changed conditions, e.g. changed lighting).
(B1) Select one or more promising test sets of descriptors (e.g. as proposed by the black-box optimization method, e.g. according to an acquisition function).
(B2) Evaluate the test set and the reference set by determining the accuracy measure for each (for one camera image or over several camera images).
(B3) If the accuracy measure for the test set is better than for the reference set, make the test set the new reference set (i.e. update the previous reference set to the test set).

The optimization starts with the first reference for the descriptors as the reference set. Provision can be made for the descriptors not to be adapted completely freely, but only within a limited range around the initial reference set. This ensures that the descriptors do not move too far away from the initial reference set.
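The loop (B1) to (B3), together with the restriction to a limited range around the initial reference set, might look as follows. This sketch substitutes a plain Gaussian perturbation for the proposal step; it is a stand-in for a real black-box optimizer (the claims mention a covariance matrix adaptation evolution strategy), not that method itself. The names `propose_test_set`, `update_reference`, `radius`, `sigma` and the callable `accuracy_fn` are assumptions made for illustration.

```python
import numpy as np

def propose_test_set(reference, initial, radius, sigma, rng):
    """(B1) Perturb the reference set and clip it to a box around the initial set."""
    candidate = reference + sigma * rng.standard_normal(reference.shape)
    return np.clip(candidate, initial - radius, initial + radius)

def update_reference(reference, initial, accuracy_fn, radius=0.5, sigma=0.1, rng=None):
    """(B2)/(B3) Evaluate test set and reference set; keep whichever scores higher."""
    rng = np.random.default_rng() if rng is None else rng
    test = propose_test_set(reference, initial, radius, sigma, rng)
    if accuracy_fn(test) > accuracy_fn(reference):
        return test          # the test set becomes the new reference set
    return reference         # otherwise the reference set is kept
```

Here `accuracy_fn` would wrap the evaluation according to (A) for the current camera image, or a mean over several images. Because the reference set is only replaced when the test set scores higher on the same evaluation, the accuracy of the kept set on that evaluation never decreases.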

The control device 106 can carry out the above-described method for adapting the descriptors fully automatically, without a human user being involved (except possibly at the beginning, to specify the first reference). Using camera images from the target application instead of a generic dataset enables optimal performance. Therefore, according to one embodiment, the method is used online during operation of the application (e.g. while a robot is being used to pick objects from a box) in order to automatically improve the quality of the results over time. For each newly received camera image of the sequence of camera images, the control device 106 can automatically select which set of descriptors performs best and should be used for that particular image. For example, the control device 106 uses the current reference set if it is better than a current test set evaluated for the current camera image, or vice versa.

FIG. 4 illustrates the procedure described above.

First, descriptors 401 are specified manually as the first reference (represented here by their positions in the first descriptor image 402 belonging to a first camera image). To determine the accuracy measure according to (A) for the (first) reference set, their pairwise distances 403 are determined.

For a second camera image, the positions of the descriptors in the associated second descriptor image 404 are shifted in this example due to detection errors. For example, a descriptor 405 is assigned to a wrong location by the machine learning model due to light reflections.

Therefore, according to (B), a new descriptor 406 is selected for the descriptor set, with which the pairwise distances between the 3D positions of the locations given by the descriptors agree better with the determined distances 403, i.e. the shape of the determined area agrees better with the original shape. This adaptation (or testing for a possible adaptation) is carried out repeatedly for new camera images (e.g. for each received camera image of a sequence of camera images).

In summary, according to various embodiments, a method as illustrated in FIG. 5 is provided.

FIG. 5 shows a flowchart 500 for a method for locating locations of objects from camera images of the objects.

In 501, the locations to be localized are specified for an object type of the objects.

In 502, a reference for the relative position of the locations to be localized is determined.

In 503, a machine learning model is trained for the object type to map camera images, each camera image showing an object of the object type, to descriptor images, wherein a descriptor image to which a camera image is to be mapped has, for a location of the object that the camera image shows at an image position, a descriptor of that location of the object at the image position.

In 504, a reference set of descriptors is set to an initial set of descriptors.

In 505, a temporal sequence of camera images is received, each camera image showing an object of the object type.

In 506, for each camera image, the locations to be localized are localized on the respective object by mapping the camera image to a descriptor image using the trained machine learning model and identifying the locations of the object to be localized for the reference set of descriptors by searching for the descriptors of the reference set in the descriptor image (e.g. by searching for positions in the descriptor image whose descriptors are as close as possible to the descriptors of the reference set, as described above in connection with FIG. 3). The locations to be localized are thus localized in the camera image plane and can be localized in three-dimensional space, e.g. by means of depth information or by solving a PnP (Perspective-n-Point) problem.
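The descriptor search in 506 can be illustrated with a brute-force nearest-neighbour lookup. The sketch assumes that the descriptor image is an array of shape (H, W, D) and that "as close as possible" means smallest Euclidean distance in descriptor space; both are assumptions, as is the function name `locate_descriptors`.

```python
import numpy as np

def locate_descriptors(descriptor_image, reference_set):
    """Return, per reference descriptor, the (u, v) pixel with the closest descriptor.

    descriptor_image: array of shape (H, W, D); reference_set: array of shape (K, D).
    """
    h, w, d = descriptor_image.shape
    flat = descriptor_image.reshape(-1, d)                          # (H*W, D)
    dists = np.linalg.norm(flat[None, :, :] - reference_set[:, None, :], axis=-1)
    idx = np.argmin(dists, axis=1)                                  # best pixel per descriptor
    v, u = np.unravel_index(idx, (h, w))                            # row = v, column = u
    return np.stack([u, v], axis=1)
```

The returned pixel positions lie in the camera image plane; mapping them into three-dimensional space would be a separate step using depth information or a PnP solver.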

In 507, the following is carried out for at least some of the camera images of the sequence of camera images:

determining the relative position of the localized locations for the reference set;
comparing the reference for the relative position of the locations to be localized with the relative position of the localized locations determined for the reference set;
identifying the locations of the object to be localized for a test set of descriptors by searching for the descriptors of the test set in the descriptor image;
determining the relative position of the localized locations for the test set;
comparing the reference for the relative position of the locations to be localized with the relative position of the localized locations determined for the test set; and
updating the descriptors of the reference set to the descriptors of the test set if, for one or more of the camera images, the agreement between the reference for the relative position of the locations to be localized and the determined relative position of the localized locations is better for the test set than for the reference set.

It should be noted that 506 and 507 are carried out in parallel or alternately, e.g. 507 is carried out for each camera image (or after a predetermined number of camera images) in the course of the localization process of 506.

In other words, according to various embodiments, a machine learning model is trained to map camera images to descriptor images, and for each camera image the locations to be localized on the respective object are determined using the descriptors provided by the trained machine learning model. The descriptors used to determine the locations to be localized are updated (i.e. adapted) over the course of the sequence of camera images, e.g. to follow changing lighting conditions or to compensate for errors in, or a lack of optimality of, the first reference set of descriptors.

The objects are instances of the object type, i.e. they all have, for example, the same shape specified by the object type. For example, the objects are components with a specific shape. However, there can also be differences in shape as long as the topology of the objects is the same. For example, the object type can be "shoe", the locations to be localized can be edge points of the shoe tongue, and the objects can be different shoes.

The machine learning model is, for example, a neural network. However, other machine learning models that are trained accordingly can also be used.

According to various embodiments, the machine learning model assigns descriptors to pixels of the object (in the image plane of the respective camera image). This can be regarded as an indirect encoding of the surface topology of the object. This connection between descriptors and the surface topology can be made explicit by rendering, in order to map the descriptors onto the image plane. It should be noted that descriptors on faces (i.e. at points that are not vertices) of the object model can be determined by interpolation. For example, if a face is given by three vertices of the object model with their respective descriptors y₁, y₂, y₃, then the descriptor y at any point of the face can be calculated as a weighted sum of these values,
y = w₁ · y₁ + w₂ · y₂ + w₃ · y₃.
In other words, the descriptors at the vertices are interpolated.
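One common way to choose the weights w₁, w₂, w₃ for a triangular face is to use the barycentric coordinates of the point within the triangle. The text does not specify how the weights are obtained, so this choice is an assumption; the function names are hypothetical as well.

```python
import numpy as np

def barycentric_weights(p, a, b, c):
    """Barycentric coordinates (w1, w2, w3) of point p in triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01          # zero only for a degenerate triangle
    w2 = (d11 * d20 - d01 * d21) / denom
    w3 = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - w2 - w3, w2, w3])

def interpolate_descriptor(p, vertices, descriptors):
    """Descriptor at p as the weighted sum w1*y1 + w2*y2 + w3*y3."""
    w = barycentric_weights(p, *vertices)  # vertices: array of shape (3, 3)
    return w @ descriptors                 # descriptors: array of shape (3, D)
```

With this choice the weights sum to one, and the interpolated descriptor equals the vertex descriptor when the point coincides with a vertex.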

To generate image pairs for training data for the machine learning model, for example, an image (e.g. an RGB image) including the object (or several objects) with known 3D (e.g. CAD) model and known pose (in a global, i.e. world, coordinate system) is mapped to a (dense) descriptor image that is optimal in the sense that it is generated by a search for descriptors that minimizes the deviation of geometric properties (in particular the proximity of points of the object) between the object model and its representation (embedding) in descriptor space. In practical use, the theoretically optimal solution of the minimization will generally not be found, since the search is restricted to a certain search space. Nevertheless, an estimate of the minimum is determined within the constraints of a practical application (available computational accuracy, maximum number of iterations, etc.).

Each training data image pair comprises a training input image of the object and a target image, the target image being generated by projecting the descriptors of the vertices visible in the training input image onto the training input image plane according to the pose that the object has in the training input image. The images, together with their associated target images, are used for supervised training of the machine learning model.

The machine learning model is thus trained to recognize unique features of an object (or several objects). This information can be used for various robot control applications by evaluating the machine learning model in real time, e.g. predicting an object grasping pose for assembly. It should be noted that the supervised training approach allows symmetry information to be encoded explicitly.

The method of FIG. 5 can be carried out by one or more computers that include one or more data processing units. The term "data processing unit" can be understood as any type of entity that enables the processing of data or signals. For example, the data or signals can be processed according to at least one (i.e. one or more than one) specific function performed by the data processing unit. A data processing unit may include or be formed from an analog circuit, a digital circuit, a mixed-signal circuit, a logic circuit, a microprocessor, a microcontroller, a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a programmable gate array (FPGA), an integrated circuit or any combination thereof. Any other way of implementing the respective functions, which will be described in more detail below, can also be understood as a data processing unit or logic circuitry. It is understood that one or more of the method steps described in detail herein can be carried out (e.g. implemented) by a data processing unit via one or more specific functions performed by the data processing unit.

Various embodiments can receive and use sensor signals from various sensors, such as a camera (e.g. RGB or RGB-D), video, radar, LiDAR, ultrasonic or thermal imaging sensor, etc., for example to obtain sensor data showing an object. Embodiments can be used to autonomously control a robot, e.g. a robot manipulator, in order to accomplish various manipulation tasks in various scenarios. In particular, embodiments are applicable to the control and monitoring of the execution of manipulation tasks, e.g. in assembly lines.

Although specific embodiments have been illustrated and described herein, one of ordinary skill in the art should recognize that a variety of alternative and/or equivalent implementations may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. Therefore, it is intended that this invention be limited only by the claims and their equivalents.

Claims

Method for locating locations of objects from camera images of the objects, comprising:
determining the locations to be localized for an object type of the objects;
determining a reference for the relative position of the locations to be localized;
training, for the object type, a machine learning model for mapping camera images, each camera image showing an object of the object type, to descriptor images, wherein a descriptor image to which a camera image is to be mapped has, for a location of the object that the camera image shows at an image position, a descriptor of the location of the object at the image position;
setting a reference set of descriptors to an initial set of descriptors;
receiving a temporal sequence of camera images, each camera image showing an object of the object type;
localizing, for each camera image, the locations to be localized on the respective object by:
mapping the camera image to a descriptor image using the trained machine learning model; and
identifying the locations of the object to be localized for the reference set of descriptors by searching for the descriptors of the reference set of descriptors in the descriptor image; and
for at least some of the camera images of the sequence of camera images:
determining the relative position of the localized locations for the reference set;
comparing the reference for the relative position of the locations to be localized with the relative position of the localized locations determined for the reference set;
identifying the locations of the object to be localized for a test set of descriptors by searching for the descriptors of the test set of descriptors in the descriptor image;
determining the relative position of the localized locations for the test set;
comparing the reference for the relative position of the locations to be localized with the relative position of the localized locations determined for the test set; and
updating the descriptors of the reference set to the descriptors of the test set if, for one or more of the camera images, the agreement between the reference for the relative position of the locations to be localized and the determined relative position of the localized locations is better for the test set than for the reference set.

Method according to claim 1, wherein the relative position comprises the pairwise distances of the locations to be localized or of the localized locations in three-dimensional space.

Method according to claim 1 or 2, wherein the locations to be localized for the object type are specified on a reference camera image of an object of the object type, the reference camera image is fed to the machine learning model, and the reference set of descriptors is set to the descriptors of the locations to be localized in the descriptor image output by the machine learning model for the reference camera image.

Method according to any one of claims 1 to 3, wherein the test set is chosen within a restricted range around the initial set of descriptors.

Method according to any one of claims 1 to 4, comprising determining the test set of descriptors using a covariance matrix adaptation evolution strategy method.

Method for controlling a robot, comprising: localizing locations of an object to be handled by the robot according to any one of claims 1 to 5; and determining a pose of the object from the localized locations and controlling the robot depending on the determined pose, and/or determining an area of the object from the localized locations and controlling the robot depending on the determined area.

Software or hardware agent, in particular a robot, comprising: a camera configured to provide camera images of objects; and a control device configured to carry out the method according to any one of claims 1 to 6.

Software or hardware agent according to claim 7, which has at least one actuator, wherein the control device is configured to control the at least one actuator using the localized locations.

A computer program comprising instructions which, when executed by a processor, cause the processor to perform a method according to any one of claims 1 to 6.

A computer-readable medium storing instructions which, when executed by a processor, cause the processor to perform a method according to any one of claims 1 to 6.