DE102022202321A1

DE102022202321A1 - Device and method for controlling a robot

Info

Publication number: DE102022202321A1
Application number: DE102022202321.8A
Authority: DE
Inventors: Dotan Di Castro; Yakov Miron; Vladimir TCHUIEV
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2022-03-08
Filing date: 2022-03-08
Publication date: 2023-09-14

Abstract

According to various embodiments, a method for training a neural network to detect objects in image data is described, comprising, for each image of a plurality of training images, determining, for each of a plurality of object bounding boxes, bounding box coordinates, object hierarchy information, and classification information about objects corresponding to the object bounding boxes, and training the neural network to reduce a loss between, on the one hand, ground truth for the hierarchy information, the bounding box coordinates and the classification information and, on the other hand, the hierarchy information, the bounding box coordinates and the classification information respectively determined for the training images.

Description

State of the art

The present disclosure relates to devices and methods for controlling a robot.

Robotic devices can be trained for object manipulation using machine learning. This typically involves training a machine learning model to perform object detection, as described, for example, in N. Carion et al., “End-to-end object detection with transformers,” in European Conference on Computer Vision, Springer, 2020, pages 213-229, referred to herein as Reference 1, or in X. Zhu et al., “Deformable DETR: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020, referred to herein as Reference 2.

In particular, object manipulation with a robot arm is of great interest and has a wide range of applications, e.g., in industrial manufacturing or search and rescue. These may in particular be applications in which many objects are piled up and the objects are therefore difficult to detect due to occlusions etc. Even in such cases, however, objects may need to be detected reliably. Furthermore, their stacking relationships should be determined, e.g., in order to decide which object should be manipulated first.

Accordingly, reliable approaches are desirable for detecting objects when multiple objects are arranged together, e.g., in the workspace of a robot, such that stacking relationships exist between them (i.e., the objects are at least partially piled up), and for detecting these stacking relationships.

Disclosure of the invention

According to various embodiments, a method for training a neural network to detect objects in image data is provided, comprising, for each image of a plurality of training images,

• determining feature vectors from the training image by an encoder of a transformer encoder-decoder network;
• determining, for each of a plurality of object bounding boxes, bounding box coordinates from the feature vectors by a decoder of the encoder-decoder network and a bounding box head;
• determining object hierarchy information from the bounding box coordinates for the plurality of objects and from features for the bounding boxes derived from the feature vectors, the object hierarchy information specifying stacking relationships of the plurality of objects with respect to one another; and
• determining classification information about objects corresponding to the object bounding boxes by a classification head.

The method further includes training the neural network to reduce a loss between, on the one hand, ground truth for the hierarchy information, the bounding box coordinates and the classification information and, on the other hand, the hierarchy information, the bounding box coordinates and the classification information respectively determined for the training images.

By taking bounding boxes, classification information for the bounding boxes and adjacency information into account during training according to the above method, i.e., by training the neural network to determine bounding boxes and classifications for the bounding boxes and training it to determine adjacency information, high performance of the (trained) neural network can be achieved for these three inference results. Here, performance refers, for example, to a combination of classification accuracy, bounding box inference accuracy, and accuracy of the object relationship inference via the adjacency matrix. For example, it is derived from the achievable object precision (OP), which describes how many of the detected tuples (detection, classification and relationship) are entirely correct, and the achievable object recall (OR), which describes how many of the target tuples in an image were correctly detected. Performance may also refer, for example, to image accuracy, which indicates the percentage of images for which both OP and OR are 100%, in other words, the images in which all target tuples were correctly detected and nothing more.
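The three measures described above can be illustrated with a minimal sketch. The tuple encoding used here (object id, class label, ids of the objects it rests on) and the function names are assumptions made for illustration only; the text does not specify how tuples are represented or matched.

```python
from typing import List, Set, Tuple

# Hypothetical encoding of one (detection, classification, relationship) tuple:
# (object id, class label, ids of the objects this object rests on).
Tup = Tuple[int, str, Tuple[int, ...]]

def object_precision(pred: Set[Tup], target: Set[Tup]) -> float:
    """Share of detected tuples that are entirely correct (OP)."""
    if not pred:
        # Convention chosen here: an empty prediction is only perfect
        # when nothing was expected.
        return 1.0 if not target else 0.0
    return len(pred & target) / len(pred)

def object_recall(pred: Set[Tup], target: Set[Tup]) -> float:
    """Share of target tuples that were correctly detected (OR)."""
    if not target:
        return 1.0
    return len(pred & target) / len(target)

def image_accuracy(preds: List[Set[Tup]], targets: List[Set[Tup]]) -> float:
    """Share of images for which both OP and OR are 100%."""
    exact = sum(
        1 for p, t in zip(preds, targets)
        if object_precision(p, t) == 1.0 and object_recall(p, t) == 1.0
    )
    return exact / len(preds)

# Image 1: both target tuples found, plus one spurious detection.
t1 = {(0, "cup", (1,)), (1, "cup", ())}
p1 = t1 | {(2, "box", ())}
# Image 2: exact match.
t2 = {(0, "box", ())}
p2 = {(0, "box", ())}

op1 = object_precision(p1, t1)   # 2 of 3 detected tuples are correct
or1 = object_recall(p1, t1)      # both target tuples were found
acc = image_accuracy([p1, p2], [t1, t2])
```

Image 1 reaches full recall but not full precision because of the extra detection, so only image 2 counts toward image accuracy.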

By taking the adjacency information into account, the neural network is able to infer the spatial relationships between objects shown in images and can thus use this information when detecting objects, for example by accounting for shadows cast by other objects and for parts of an object that are occluded. The determined adjacency information (also referred to as object hierarchy information, e.g., in the form of an adjacency matrix) may or may not be used further by downstream components. For example, a robot device controller may control a robot device based purely on the output of the vision task (bounding boxes and classification). In other words, in one embodiment, the object hierarchy information (adjacency information) is used only for training.

Various examples are described below.

Example 1 is the method for training a neural network to detect objects in image data as described above.

Example 2 is the method of Example 1, comprising transforming the feature vectors into one feature vector for each object bounding box, concatenating, for each object bounding box, the feature vector for the bounding box with the bounding box coordinates for the object bounding box to form an input vector, and determining the object hierarchy information from the input vectors.

The feature values are thus transformed by a dimension transformation to match the number of bounding boxes. Through concatenation, this enables convenient input into neural network layers for determining the object hierarchy.
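The concatenation in Example 2 can be sketched in a few lines of NumPy. The sizes follow the figures quoted for Example 3 (300 bounding box queries, input vectors of length 260); the split into a 256-dimensional feature vector plus 4 box coordinates is an assumption, as is the random placeholder data.

```python
import numpy as np

Q = 300        # number of bounding box queries (from the text)
D = 256        # assumed per-box feature size, chosen so that D + 4 = 260
rng = np.random.default_rng(0)

# Placeholder data standing in for the network's intermediate outputs.
box_features = rng.standard_normal((Q, D))   # one feature vector per bounding box
box_coords = rng.uniform(size=(Q, 4))        # four coordinates per bounding box

# One input vector per bounding box: features followed by coordinates.
inputs = np.concatenate([box_features, box_coords], axis=1)
```

Each row of `inputs` then describes one bounding box by both its appearance features and its position, which is what the hierarchy layers receive.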

Example 3 is the method of Example 2, comprising determining the object hierarchy information from the input vectors by a plurality of processing channels, each processing channel receiving the input vectors for the object bounding boxes and producing an adjacency matrix component, and multiplying the adjacency matrix components produced by the channels to generate an adjacency matrix that specifies the object hierarchy information.

For example, the adjacency matrix is square and the adjacency matrix components are rectangular and yield the adjacency matrix when multiplied. For example, each channel output has dimensions Q x H, where Q = 300 (the bounding box queries) and H = 260 (the hidden size of the encoder plus the size of the bounding box coordinates). Multiplying the outputs of the channels yields the adjacency matrix. This matrix multiplication makes the elements of the adjacency matrix invariant with respect to the order in which objects are detected: for example, the relationship between objects 125 and 262 should be identical if they were instead identified as objects 231 and 51. In other words, the query order does not matter.
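The order invariance claimed for the matrix product can be checked numerically. The sketch below is not the network from the disclosure: it assumes two channels, replaces each channel's MLP with a single row-wise layer, and applies a sigmoid to the product; all of these choices, and the names `channel` and `adjacency`, are illustrative assumptions.

```python
import numpy as np

def channel(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stand-in for one processing channel: maps each input vector independently."""
    return np.tanh(x @ w)

def adjacency(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Multiply two Q x H channel outputs into a Q x Q matrix with entries in (0, 1)."""
    product = channel(x, w1) @ channel(x, w2).T
    return 1.0 / (1.0 + np.exp(-product))

rng = np.random.default_rng(0)
Q, H = 6, 260                       # small Q for illustration; H as in the text
x = rng.standard_normal((Q, H))     # input vectors, one per bounding box
w1 = rng.standard_normal((H, H)) / np.sqrt(H)
w2 = rng.standard_normal((H, H)) / np.sqrt(H)

A = adjacency(x, w1, w2)

# Present the same boxes in a different query order.
perm = rng.permutation(Q)
A_perm = adjacency(x[perm], w1, w2)

# Reordering the queries only reorders rows and columns of the result;
# the entry relating any two given objects is unchanged.
same_relations = np.allclose(A_perm, A[np.ix_(perm, perm)])
```

Because each channel processes its rows independently, permuting the inputs permutes the rows of both channel outputs in the same way, and the product inherits that permutation on rows and columns.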

Example 4 is the method of any one of Examples 1 to 3, wherein the neural network is trained using a combined loss made up of a cross-entropy loss for the classification information, a binary element-wise loss for the object hierarchy information, and a linear combination of an absolute deviation loss and a generalized overlap loss for the bounding box coordinates.

This loss provides robust training while taking all three outputs of the neural network (classifications, bounding boxes, and object hierarchy) into account.
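A combined loss of this shape might be sketched as follows. Several points are assumptions rather than statements from the text: the binary element-wise loss is taken to be binary cross-entropy, the generalized overlap loss is taken to be one minus the generalized intersection over union, boxes are given as corner coordinates (x1, y1, x2, y2) with positive area, predictions are already matched one-to-one with ground truth, and the weights are arbitrary placeholder values.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy; logits is (N, C), labels is (N,) with class indices."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def binary_elementwise(adj_pred, adj_gt, eps=1e-7):
    """Binary cross-entropy averaged over all adjacency matrix entries."""
    p = np.clip(adj_pred, eps, 1.0 - eps)
    return -(adj_gt * np.log(p) + (1.0 - adj_gt) * np.log(1.0 - p)).mean()

def giou_loss(a, b):
    """Mean of 1 - generalized IoU for row-wise paired boxes (x1, y1, x2, y2)."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    iw = np.clip(np.minimum(a[:, 2], b[:, 2]) - np.maximum(a[:, 0], b[:, 0]), 0.0, None)
    ih = np.clip(np.minimum(a[:, 3], b[:, 3]) - np.maximum(a[:, 1], b[:, 1]), 0.0, None)
    inter = iw * ih
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both boxes.
    hull = ((np.maximum(a[:, 2], b[:, 2]) - np.minimum(a[:, 0], b[:, 0]))
            * (np.maximum(a[:, 3], b[:, 3]) - np.minimum(a[:, 1], b[:, 1])))
    giou = inter / union - (hull - union) / hull
    return (1.0 - giou).mean()

def combined_loss(logits, labels, adj_pred, adj_gt, boxes_pred, boxes_gt,
                  w_cls=1.0, w_adj=1.0, w_l1=5.0, w_giou=2.0):
    """Weighted sum of the three loss parts; the weights are placeholders."""
    l1 = np.abs(boxes_pred - boxes_gt).sum(axis=1).mean()
    return (w_cls * cross_entropy(logits, labels)
            + w_adj * binary_elementwise(adj_pred, adj_gt)
            + w_l1 * l1
            + w_giou * giou_loss(boxes_pred, boxes_gt))

# Two objects, object 0 related to object 1 in the ground-truth adjacency matrix.
labels = np.array([0, 1])
boxes_gt = np.array([[0.0, 0.0, 1.0, 1.0], [0.5, 0.5, 2.0, 2.0]])
adj_gt = np.array([[0.0, 1.0], [0.0, 0.0]])
good_logits = np.array([[20.0, -20.0], [-20.0, 20.0]])

loss_good = combined_loss(good_logits, labels, adj_gt, adj_gt, boxes_gt, boxes_gt)
loss_bad = combined_loss(-good_logits, labels, 1.0 - adj_gt, adj_gt,
                         boxes_gt + 0.3, boxes_gt)
```

A near-perfect prediction drives all three parts toward zero, while a prediction that is wrong in class, adjacency and box position is penalized by every term.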

Example 5 is a method for controlling a robot device, comprising training a machine learning model according to any one of Examples 1 to 4, capturing a camera image showing one or more objects, feeding the camera image to the neural network to detect the one or more objects, and controlling the robot taking the detected one or more objects into account.

Example 6 is a controller configured to perform a method according to any one of Examples 1 to 5.

Example 7 is a computer program comprising instructions that, when executed by a computer, cause the computer to perform a method according to any one of Examples 1 to 5.

Example 8 is a computer-readable medium comprising instructions that, when executed by a computer, cause the computer to perform a method according to any one of Examples 1 to 5.

In the drawings, similar reference numerals generally refer to the same parts throughout the different views. The drawings are not necessarily to scale; emphasis is instead generally placed on illustrating the principles of the invention. In the following description, various aspects are described with reference to the following drawings, in which:

Figure 1 shows a robot.
Figure 2 illustrates operation of the controller for a bin-picking task.
Figure 3 shows an example scenario with three objects.
Figure 4 shows a ground truth object hierarchy diagram and an example of an estimated object hierarchy diagram for the three-object scenario of Figure 3.
Figure 5 shows a flowchart illustrating a method for controlling a robot device according to an embodiment.

The following detailed description refers to the accompanying drawings, which show, by way of illustration, specific details and aspects of this disclosure in which the invention may be practiced. Other aspects may be used, and structural, logical and electrical changes may be made, without departing from the scope of the invention. The various aspects of this disclosure are not necessarily mutually exclusive, as some aspects of this disclosure may be combined with one or more other aspects of this disclosure to form new aspects.

Various examples are described in more detail below.

Figure 1 shows a robot 100.

The robot 100 includes a robot arm 101, for example an industrial robot arm for handling or assembling a workpiece (or one or more other objects). The robot arm 101 includes manipulators 102, 103, 104 and a base (or support) 105 by which the manipulators 102, 103, 104 are supported. The term “manipulator” refers to the movable members of the robot arm 101, the actuation of which enables physical interaction with the environment, e.g., to carry out a task. For control, the robot 100 includes a (robot) controller 106 configured to implement the interaction with the environment according to a control program. The last member 104 (farthest from the support 105) of the manipulators 102, 103, 104 is also referred to as the end effector 104 and includes a gripping tool (which may also be a suction gripper).

The other manipulators 102, 103 (closer to the support 105) may form a positioning device such that, together with the end effector 104, the robot arm 101 with the end effector 104 at its end is provided. The robot arm 101 is a mechanical arm that can provide functions similar to those of a human arm.

The robot arm 101 may include joint elements 107, 108, 109 that connect the manipulators 102, 103, 104 to each other and to the support 105. A joint element 107, 108, 109 may have one or more joints, each of which may provide rotational motion and/or translational motion (i.e., displacement) of the associated manipulators relative to each other. The movement of the manipulators 102, 103, 104 may be initiated by actuators controlled by the controller 106.

The term “actuator” may be understood as a component adapted to affect a mechanism or process in response to being driven. The actuator can convert instructions issued by the controller 106 (the so-called activation) into mechanical movements. The actuator, e.g., an electromechanical converter, may be configured to convert electrical energy into mechanical energy in response to being driven.

The term “controller” may be understood as any type of logic-implementing entity, which may include, for example, a circuit and/or a processor capable of executing software stored in a storage medium, firmware, or a combination thereof, and which can issue instructions, e.g., to an actuator in the present example. The controller may be configured, for example, by program code (e.g., software) to control the operation of a system, a robot in the present example.

In the present example, the controller 106 includes one or more processors 110 and a memory 111 that stores code and data based on which the processor 110 controls the robot arm 101. According to various embodiments, the controller 106 controls the robot arm 101 based on a machine learning model 112 (e.g., comprising one or more neural networks) stored in the memory 111.

The task of the robot is, for example, to perform bin picking, i.e., to grasp one object of a plurality of objects 113 (where grasping also includes picking up the object 113 with a suction cup) and, for example, to show the object 113 to a scanner or to move the object 113 to another bin. In order to determine the object 113 to be picked up and a suitable grasp location on the object 113, the controller 106 uses images of the robot workspace in which the objects 113 are located. These images may be provided by a camera 114 that is attached to the robot arm 101 (or mounted in some other way such that the controller can control the viewpoint of the camera 114).

Detecting objects can be difficult when there is a pile of objects, since in captured images objects may be at least partially occluded and one object may cast a shadow on another object.

On the other hand, when there is a pile of objects 113, i.e., the objects are not isolated and may be stacked on top of one another, so that they cannot be manipulated without affecting other objects, care must be taken to avoid the pile collapsing when a supporting object is removed (which could damage objects or injure people). Also with regard to the efficiency of robot operation, it may be desirable to respect the object hierarchy (i.e., the mutual dependence of objects with respect to their stacking order), since it may be easier and require less force to grasp an object from the top of a pile than to try to pull an object out from below.

Therefore, in some use cases, e.g., when there is a pile of objects, the controller 106 should take the object hierarchy into account when deciding which object to pick up. This means that the robot arm 101 should not pick up an object on which another object rests that could fall and break when the robot arm 101 picks up the supporting object. The object hierarchy (information) is thus intended here to describe the relationships between objects 113 that are stacked on top of one another (i.e., have a stacking relationship), e.g., a first cup stacked on a second cup, etc. In applications that require the manipulation of objects in dense clutter, knowledge of the object hierarchy enables efficient manipulation (and the avoidance of damage to the objects).
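As an illustration of how a controller could use such hierarchy information, the following sketch selects the objects that carry no other object. The adjacency matrix convention (entry [i, j] above a threshold means object i rests on object j), the threshold value and the function name `pickable` are assumptions made for this example and are not taken from the text.

```python
import numpy as np

def pickable(adj: np.ndarray, threshold: float = 0.5) -> list:
    """Return indices of objects on which no other object rests.

    Assumed convention: adj[i, j] > threshold means object i rests on object j.
    """
    rests_on = adj > threshold
    np.fill_diagonal(rests_on, False)   # an object does not rest on itself
    covered = rests_on.any(axis=0)      # column j is True if something rests on j
    return [int(j) for j in np.flatnonzero(~covered)]

# Three stacked objects: object 0 rests on object 1, object 1 rests on object 2.
adj = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])
candidates = pickable(adj)
```

In this stack only object 0 is free, so it would be removed first; after each pick the remaining scene would be re-evaluated.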

However, inferring this relationship (i.e., the object hierarchy) from a single RGB image is an inherently difficult task, since multiple occluded objects may have to be handled under varying lighting conditions. According to various embodiments, a machine learning model (corresponding, e.g., to the machine learning model 112) is provided in the form of a neural network referred to as an adjacency network (Adj-Net), which detects all objects in a scene shown in a color (RGB) image and infers their spatial relationship from the given color image. This in fact improves object detection and can therefore be advantageous even if the adjacency information (the object hierarchy, i.e., the spatial relationship) itself is not used in the further control.

Figure 2 illustrates the operation of the controller 106 for a bin-picking task using such an adjacency network.

The controller 106 performs the following:

1. Capturing an image 201 of the scene (i.e., the robot workspace containing the objects 113) by means of the camera 114.
2. Passing the image 201 through the adjacency network 202 (which operates in particular as an object detector) to jointly predict N bounding box proposals 212 (by a bounding box MLP (multi-layer perceptron) 203 of the neural network 202), class scores 213 for the bounding boxes (by a classifier MLP 204 of the neural network 202), and a single adjacency matrix 211 of size N × N (by adjacency MLPs 205 of the neural network 202, whose outputs are combined by a combination operation 206).
3. Filtering the "None" classes out of the output of the object detector to produce a reduced output of bounding boxes 214, classes 215, and a smaller adjacency matrix 216.
4. Forwarding these outputs to further components of the controller (e.g., for determining the actual grasping pose and/or selecting the object to be grasped using these outputs).
5. Returning to (1), i.e., capturing a new image 201, if required (e.g., to grasp the next object).
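Step 3 of the list above can be sketched as follows. This is only an illustrative sketch, assuming that a proposal is dropped when its most likely class is the "None" class; the source does not state the exact filtering rule. The function names, the box format, and all numeric values are hypothetical.

```python
def argmax(values):
    """Index of the largest entry in a list of numbers."""
    return max(range(len(values)), key=lambda k: values[k])


def filter_none(boxes, class_probs, adjacency, none_index):
    """Keep only proposals whose most likely class is not the "None" class.

    Returns the reduced boxes, their predicted class indices, and the
    adjacency matrix restricted to the kept rows and columns.
    """
    keep = [i for i, probs in enumerate(class_probs) if argmax(probs) != none_index]
    kept_boxes = [boxes[i] for i in keep]
    kept_classes = [argmax(class_probs[i]) for i in keep]
    kept_adjacency = [[adjacency[i][j] for j in keep] for i in keep]
    return kept_boxes, kept_classes, kept_adjacency


# Hypothetical example: N = 3 proposals, 2 object classes plus "None" (index 2).
boxes = [[0.1, 0.1, 0.2, 0.2], [0.5, 0.5, 0.3, 0.3], [0.8, 0.8, 0.1, 0.1]]
class_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.6, 0.2]]
adjacency = [[0.0, 0.3, 0.9], [0.1, 0.0, 0.2], [0.05, 0.4, 0.0]]

kept_boxes, kept_classes, kept_adjacency = filter_none(
    boxes, class_probs, adjacency, none_index=2
)
print(kept_classes)    # [0, 1]
print(kept_adjacency)  # [[0.0, 0.9], [0.05, 0.0]]
```

The second proposal is removed, so the 3 × 3 matrix shrinks to the 2 × 2 matrix over the remaining proposals, mirroring the reduction from matrix 211 to matrix 216.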

The images 201 are, for example, RGB-D images (i.e., color images with depth information). Other types of image data, e.g., video, radar, LiDAR, ultrasound, motion, thermal, or sonar images, can also be used (alternatively or additionally). These can be captured by corresponding image sensors (cameras).

The further control components 210 (e.g., an action selector) can, for example, use the adjacency matrix 209 (in general, object hierarchy information about the objects) to determine one or more of the following:

• The safest order in which to grasp objects, e.g., for piles of fragile objects.
• The best object to push in order to spread out the clutter for efficient grasping, e.g., for object-specific grasping tasks.
• The best next viewpoint for the new image in order to improve the probabilistic representation of the scene.

The controller can then control the robot accordingly.
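As an illustration of the first use case, a grasp order can be derived from the adjacency matrix with a simple greedy rule. The sketch below is a hypothetical heuristic, not the action selector of the source: it assumes that entry `adjacency[i][j]` is the probability that object i rests on object j, and it repeatedly picks the remaining object that is least likely to support another remaining object. The matrix values are invented.

```python
def safest_grasp_order(adjacency):
    """Greedy order: always take the object least likely to carry another one."""
    remaining = list(range(len(adjacency)))
    order = []
    while remaining:
        def risk(j):
            # Highest probability that some remaining object i lies on object j.
            return max((adjacency[i][j] for i in remaining if i != j), default=0.0)

        best = min(remaining, key=risk)
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical scene: objects 0 and 2 probably rest on object 1.
adjacency = [
    [0.0, 0.8, 0.1],
    [0.1, 0.0, 0.05],
    [0.3, 0.9, 0.0],
]
print(safest_grasp_order(adjacency))  # [2, 0, 1]
```

Object 1 is grasped last because it most likely supports both other objects.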

The machine learning model 112 is trained by a machine learning (ML) method (either by the controller 106 or by another device, from which it is then stored in the controller 106) to produce, from an image input, an adjacency matrix output 211 that represents a probabilistic object hierarchy graph.

Figure 3 shows an example scenario with three objects: a mouse (denoted object a), a roll of adhesive tape (denoted object b), and a box (denoted object c).

The object hierarchy can be represented as a directed graph. A directed graph

$G \triangleq \{V, E\}$

is a set of nodes V connected by directed edges E. If the object hierarchy is known with certainty, this representation suffices to capture the relationship between the objects. Each node represents an object, and each edge represents the stacking relationship between the connected nodes. For example, if an object b is placed on top of an object a, this is represented by nodes v_a and v_b with an edge ε_ba extending from b to a, where v ∈ V and ε ∈ E.

However, if the hierarchy is unknown and is inferred from incomplete data, e.g., from images of a scene, this representation is insufficient and error-prone because of the uncertainty in the inference. Therefore, according to various embodiments, the object hierarchy is represented as a weighted directed graph

$G \triangleq \{V, E, W\}$

in which each edge ε ∈ E is assigned a weight ω_ε ∈ W. The weight is the probability that the edge exists given an image I, i.e., the probability that the corresponding objects are stacked in the direction of the edge:

$\omega_{\varepsilon} \triangleq \mathbb{P}(\varepsilon \mid I)$

It is understood that, being a probability, each A_i,j has its support between 0 and 1. In addition, the diagonal elements are all equal to 0.

Figure 4 shows a ground truth object hierarchy graph 401 and an example of an estimated object hierarchy graph (with probabilistic, i.e., confidence, information) 402 for the three-object scenario of Figure 3. The ground truth graph 401 specifies that object a lies directly on top of both objects b and c, and the estimated object hierarchy graph 402 gives an estimate with probabilistic edge weights that express the confidence in the respective relationships between the objects.

According to various embodiments, an object hierarchy graph G is represented in the form of an adjacency matrix, written simply as A to abbreviate the notation. The adjacency matrix is a square matrix of dimensions N_V × N_V, where N_V is the number of nodes, i.e., of objects detected in the scene. Each element A_i,j corresponds to the edge from object i to object j:

$A_{i,j} = \omega_{\varepsilon}^{ij} \triangleq \mathbb{P}(\varepsilon_{ij} \mid I)$

In the example of Figure 4, the ground truth adjacency matrix corresponding to the ground truth object hierarchy graph is

$A^{gt} = \begin{bmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},$

whereas the estimated adjacency matrix representing the estimated object hierarchy graph is

$A = \begin{bmatrix} 0 & 0.9 & 0.6 \\ 0.2 & 0 & 0.7 \\ 0 & 0 & 0 \end{bmatrix}.$
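The two matrices of this example can be compared directly. The sketch below lists the weighted edges of the estimated graph and binarizes the estimate. The decision threshold of 0.5 and the helper names are assumptions made for illustration; the source does not prescribe a threshold.

```python
NAMES = ["a", "b", "c"]
A_GT = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
A_EST = [[0.0, 0.9, 0.6], [0.2, 0.0, 0.7], [0.0, 0.0, 0.0]]


def weighted_edges(matrix, names):
    """List (from, to, weight) for every entry with non-zero weight."""
    return [
        (names[i], names[j], w)
        for i, row in enumerate(matrix)
        for j, w in enumerate(row)
        if w > 0
    ]


def binarize(matrix, threshold=0.5):
    """Turn edge probabilities into hard 0/1 decisions (assumed threshold)."""
    return [[1 if w > threshold else 0 for w in row] for row in matrix]


print(weighted_edges(A_EST, NAMES))
# [('a', 'b', 0.9), ('a', 'c', 0.6), ('b', 'a', 0.2), ('b', 'c', 0.7)]
print(binarize(A_EST) == A_GT)  # True
```

With this threshold the low-confidence edge from b to a (weight 0.2) is discarded, and the binarized estimate coincides with the ground truth matrix.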

In the following, the network architecture of the adjacency network Adj-Net 202 is described in more detail. According to various embodiments, the adjacency network 202 uses as its backbone a transformer encoder-decoder (object detector) network (as described, e.g., in Reference 1), formed by an encoder 207 and a decoder 208. The encoder 207 may be preceded by a visual feature extractor 217 (e.g., a convolutional neural network such as ResNet) that receives the current image 201 as input. The decoder 208 is supplied with queries 218, which can be viewed as specifying regions of the input image 201 in which the adjacency network 202 searches for objects. The number of queries is, for example, a predefined hyperparameter, whereas the parameters of the queries (i.e., image locations) can be learned during training of the adjacency network 202.

The adjacency network 202 has an adjacency head (implemented by a projection operation 209, adjacency MLPs 205, and the combination operation 206), which receives from the encoder 207 a matrix of feature vectors of size E × H, where E is the number of encoder feature vectors and H is the dimension of each feature vector.

Object detection is also an important component of Adj-Net 202, which therefore has both a classification head and a bounding box prediction head (implemented by the classifier MLP 204 and the bounding box MLP 203, respectively). The decoder 208 outputs a matrix of feature vectors of size Q × H, where Q is a preset number of possible bounding box candidates. Each feature vector passes through the bounding box MLP 203, producing a set of predicted bounding box coordinates as a matrix of size Q × 4. The classifier MLP 204 receives, for each candidate, the concatenation of the decoder output feature vector and the bounding box coordinates, and outputs Q probability vectors 213, one for each bounding box, each of length C+1, where C is the number of candidate classes and the additional entry corresponds to an "empty" class intended to filter out bounding box candidates that either mark regions without objects or overlap with earlier object detections.

To predict the adjacency matrix 211, the encoder output is passed through the projection block 209, which serves a dual purpose: first, it projects the encoder output to a feature matrix of size Q × H so that it matches the bounding box proposals; second, it reasons about the relationships between encoder features and attempts to relate them to the hierarchy between the objects.

The feature matrix is then concatenated with the bounding box predictions 212 and split into two channels. Each channel is passed through a neural-network-based adjacency block (i.e., an adjacency MLP 205) that preserves the dimensions; the outputs of the two channels are then multiplied and passed through a sigmoid layer (i.e., processed by the combination operation 206) to produce a preliminary Q × Q adjacency matrix 211.

Each row and each column of the adjacency matrix 211 corresponds to a bounding box proposal, and the classifier filters the bounding box proposals via the "empty" class, i.e., proposals for which the classifier predicts the "empty" class are discarded.

Let x_e denote the output of the encoder 207, x_b the bounding box proposal coordinates 212, and χ'_e the projected encoder output concatenated with x_b, and let the functions g(·), f_1(·) and f_2(·) describe the projection block 209 and the two adjacency blocks 205, respectively. The adjacency head operation can then be described as follows:

$\chi'_{e} = g(x_{e}^{T}) \cup x_{b}$

$A_{p} = \mathrm{Sigmoid}\left(\frac{f_{1}(\chi'_{e})^{T} \cdot f_{2}(\chi'_{e})}{H}\right)$

where ∪ denotes concatenation and A_p is the preliminary adjacency matrix 211.
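The shape bookkeeping of the adjacency head can be followed in a small sketch. It is a minimal illustration, not the network of the source: each of g, f_1 and f_2 is replaced by a single random linear map instead of an MLP, the features are stored with the feature dimension first so that the product yields a Q × Q matrix, and the toy sizes are far smaller than a realistic configuration. All function and variable names are hypothetical.

```python
import math
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def adjacency_head(x_e, x_b, w_g, w_1, w_2, hidden_dim):
    """x_e: E x H encoder output, x_b: Q x 4 box coordinates.

    w_g (Q x E) stands in for g; w_1 and w_2 ((H+4) x (H+4)) stand in for
    f_1 and f_2. Single linear maps replace the MLPs (assumption).
    """
    projected = matmul(transpose(x_e), transpose(w_g))  # g(x_e^T): H x Q
    chi = projected + transpose(x_b)                    # concatenation: (H+4) x Q
    f1 = matmul(w_1, chi)                               # (H+4) x Q
    f2 = matmul(w_2, chi)                               # (H+4) x Q
    scores = matmul(transpose(f1), f2)                  # Q x Q
    return [[sigmoid(s / hidden_dim) for s in row] for row in scores]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
E, H, Q = 6, 4, 3  # toy sizes chosen for readability
a_p = adjacency_head(
    x_e=random_matrix(E, H, rng),
    x_b=[[rng.random() for _ in range(4)] for _ in range(Q)],
    w_g=random_matrix(Q, E, rng),
    w_1=random_matrix(H + 4, H + 4, rng),
    w_2=random_matrix(H + 4, H + 4, rng),
    hidden_dim=H,
)
print(len(a_p), len(a_p[0]))  # 3 3
```

Using two separate maps for the two channels makes the product asymmetric in general, which a directed stacking relation requires: the entry for "i on j" need not equal the entry for "j on i".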

For example, the hidden state dimension is H = 256, the encoder outputs E = 17821 feature vectors, and Q = 300 bounding box proposals are considered. The size of each class probability vector depends on the problem at hand. For example, for testing with the VMRD (Visual Manipulation Relationship Dataset) dataset, C+1 = 32. A binary classification (i.e., object or no object, C+1 = 2) can be applied to many real-world problems.

As described above, multi-layer perceptrons (MLPs) are used, for example, for the projection block 209, the bounding box head, the classification head, and both channels of the adjacency head.

The Adj-Net 202 can be trained in a supervised manner on a training set consisting of RGB image inputs and labels comprising the ground truth classes, bounding boxes, and spatial relationships. Two examples of such datasets that are available online are VMRD and REGRAD (Relational Grasp Dataset). Given a labeled image, Adj-Net 202 produces raw bounding boxes, classification scores, and an adjacency matrix.

To compute a loss, the predicted bounding boxes are matched to the "closest" target bounding boxes and classes. The Hungarian algorithm, for example, can be used to match the bounding boxes. In addition, the relevant rows and columns of A_p are extracted, and a matched adjacency matrix A_m is formed that contains the entries corresponding to the matched bounding boxes.

For example, there are many (e.g., 300) proposed bounding boxes, so that (in many use cases) most of them do not correspond to any object. The adjacency matrix A_p describes the relationships between all of these 300 proposals. For each proposal, the classification head determines whether it should be discarded (via the "no object" class) or not. If, e.g., 5 bounding boxes are not discarded, A_m is the 5×5 adjacency matrix describing the relationships between these objects. In training, using the Hungarian matching algorithm as the matcher, the matcher matches the "closest" of the 300 proposals to the targets of the target adjacency matrix. Assuming the ground truth is a 4×4 adjacency matrix, the matcher uses an internal cost function to find the 4 proposals closest to the targets, and training is carried out with respect to these matched proposals. This matching is not required for inference (i.e., after training).
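The matching step and the extraction of A_m can be sketched on a toy example. Note the substitutions: instead of the Hungarian algorithm, the sketch searches all assignments by brute force, which gives the same optimum but is only feasible for very small problems, and the cost is assumed to be the plain L1 distance between boxes, whereas the source only mentions an unspecified internal cost function. All names and values are hypothetical.

```python
from itertools import permutations


def l1_distance(box_a, box_b):
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_proposals(pred_boxes, target_boxes):
    """For each target, return the index of its matched proposal.

    Brute-force minimum-cost assignment; assumes there are at least as
    many proposals as targets.
    """
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(pred_boxes)), len(target_boxes)):
        cost = sum(
            l1_distance(pred_boxes[p], target_boxes[t]) for t, p in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best)


def matched_adjacency(a_p, matched):
    """Pick the rows and columns of A_p that belong to matched proposals."""
    return [[a_p[i][j] for j in matched] for i in matched]


pred_boxes = [
    [0.1, 0.1, 0.2, 0.2],
    [0.9, 0.9, 0.1, 0.1],
    [0.5, 0.5, 0.3, 0.3],
    [0.0, 0.0, 1.0, 1.0],
]
target_boxes = [[0.52, 0.5, 0.3, 0.28], [0.1, 0.12, 0.2, 0.2]]
a_p = [
    [0.0, 0.1, 0.2, 0.1],
    [0.3, 0.0, 0.1, 0.2],
    [0.9, 0.2, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

matched = match_proposals(pred_boxes, target_boxes)
print(matched)                          # [2, 0]
print(matched_adjacency(a_p, matched))  # [[0.0, 0.9], [0.2, 0.0]]
```

The rows and columns of A_m follow the order of the targets, so A_m can be compared entry by entry with the ground truth adjacency matrix.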

The classification head is trained with a weighted cross-entropy loss. For an image I with proposals p ∈ P and classes c, the weighted cross-entropy loss is:

$L_{c} = -\sum_{p \in P} \sum_{c=1}^{C+1} w_{c} \cdot \mathbb{P}_{gt}(c_{p} \mid I) \log \mathbb{P}(c_{p} \mid I)$
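A minimal numeric sketch of this loss follows. It assumes one-hot ground truth labels, in which case the inner sum over classes reduces to the term of the true class. The class weights, probabilities, and names are hypothetical.

```python
import math


def weighted_cross_entropy(pred_probs, target_classes, class_weights):
    """Sum of -w_c * log P(c | I) over proposals, with one-hot targets."""
    loss = 0.0
    for probs, target in zip(pred_probs, target_classes):
        loss -= class_weights[target] * math.log(probs[target])
    return loss


# Two object classes plus the additional class (index 2), which is down-weighted.
class_weights = [1.0, 1.0, 0.1]
pred_probs = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]
target_classes = [0, 2, 2]

loss_c = weighted_cross_entropy(pred_probs, target_classes, class_weights)
print(round(loss_c, 4))
```

Here the single real object contributes -log 0.8 with full weight, while the two proposals of the additional class contribute only a tenth of their log terms.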

Since the number of detections whose ground truth class is "unknown" (the additional, "empty" class) is generally much larger than the number of objects in the scene, this class is weighted with a small w_C+1 << 1, while w_c = 1 for the other classes. For the matched bounding boxes, a linear combination of the ℓ1 (absolute deviation) loss

$L_{l_{1}}$

and the generalized intersection over union (GIoU) loss

$L_{GIoU}$

is used.
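The two box losses can be illustrated for a single pair of boxes. The sketch assumes boxes in corner format (x1, y1, x2, y2) with positive area and uses the standard GIoU definition, IoU minus the fraction of the smallest enclosing box not covered by the union, with the loss taken as 1 − GIoU. The box format and the weights passed to `box_loss` are assumptions; the source does not specify them.

```python
def l1_loss(pred, target):
    """Sum of absolute coordinate deviations."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou_loss(pred, target):
    """1 - GIoU for two boxes in (x1, y1, x2, y2) format with positive area."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    union = box_area(pred) + box_area(target) - inter
    # Smallest axis-aligned box enclosing both boxes.
    enclosing = (max(pred[2], target[2]) - min(pred[0], target[0])) * (
        max(pred[3], target[3]) - min(pred[1], target[1])
    )
    giou = inter / union - (enclosing - union) / enclosing
    return 1.0 - giou


def box_loss(pred, target, gamma_l1, gamma_giou):
    """Linear combination of both losses; the weights are hyperparameters."""
    return gamma_l1 * l1_loss(pred, target) + gamma_giou * giou_loss(pred, target)


pred = [0.0, 0.0, 2.0, 2.0]
target = [1.0, 1.0, 3.0, 3.0]
print(l1_loss(pred, target))    # 4.0
print(giou_loss(pred, target))  # 68/63, roughly 1.079
print(giou_loss(pred, pred))    # 0.0
```

Unlike plain IoU, the GIoU term still changes when two boxes do not overlap at all, because the enclosing box grows as they move apart.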

Since the adjacency matrix A_m (and, by extension, A_p) is in general a sparse matrix with many 0 entries and few 1 entries, there is a risk that, during learning, the adjacency network 202 converges to the convenient local minimum of an all-zeros matrix. This is a significant problem when using element-wise ℓ1 and ℓ2 losses, since they do not penalize a false 0 substantially. The problem is exacerbated when trying to learn the entire A_p, padded with zeros or ones for the rows and columns that are not matched. Thus, according to various embodiments, a binary element-wise cross-entropy loss over A_m is used according to

$L_{adj} = -\sum_{i,j \in A_{m}} \left(A_{ij}^{gt} \log A_{ij} + (1 - A_{ij}^{gt}) \log(1 - A_{ij})\right),$

where A_ij is the (i, j)-th element of A_m and A_ij^gt is the ground truth of this element. This potentially leads to slower learning, but avoids the local minimum of an all-zeros adjacency matrix.
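The effect described here can be checked numerically with the matrices of the Figure 4 example. The sketch compares an all-zeros prediction with the estimated matrix under the binary cross-entropy and under an element-wise ℓ1 loss. Clipping the probabilities with a small `eps` to avoid log(0) is an assumption of this sketch, as are the function names.

```python
import math


def adjacency_bce(a_pred, a_gt, eps=1e-7):
    """Element-wise binary cross-entropy, summed over all matrix entries."""
    loss = 0.0
    for row_p, row_g in zip(a_pred, a_gt):
        for p, g in zip(row_p, row_g):
            p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
            loss -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return loss


def adjacency_l1(a_pred, a_gt):
    return sum(abs(p - g) for rp, rg in zip(a_pred, a_gt) for p, g in zip(rp, rg))


A_GT = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
A_EST = [[0.0, 0.9, 0.6], [0.2, 0.0, 0.7], [0.0, 0.0, 0.0]]
A_ZERO = [[0.0] * 3 for _ in range(3)]

print(adjacency_l1(A_ZERO, A_GT), adjacency_l1(A_EST, A_GT))
print(adjacency_bce(A_ZERO, A_GT), adjacency_bce(A_EST, A_GT))
```

Under ℓ1 the all-zeros prediction is about three times worse than the estimate (3 versus roughly 1), whereas under the cross-entropy it is about forty times worse, so predicting only zeros is no longer an attractive solution.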

The total loss function is a weighted sum of all individual loss functions:

$L_{total} = \gamma_{c} L_{c} + \gamma_{l_{1}} L_{l_{1}} + \gamma_{GIoU} L_{GIoU} + \gamma_{adj} L_{adj},$

where the γ are all hyperparameters.

The adjacency network 202 is trained, for example, in two phases.

In the first phase, only the detector (classification head and bounding box head) is pre-trained by omitting L_adj, i.e., by setting γ_adj = 0. For example, the super-convergence learning rate scheduling technique with a maximum learning rate of 0.01 is used in conjunction with the AdamW optimizer with a weight decay of 1×10⁻⁶. The super-convergence technique enables faster training and serves as a network regularization technique for better generalization.

In the second phase, L_adj is reintroduced into the loss function by setting γ_adj to a value greater than zero, and training is continued, e.g., using super-convergence and AdamW with a maximum learning rate of 0.001.
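The two phases differ only in the loss weights and the maximum learning rate, which can be written down as a small configuration sketch. Only γ_adj = 0 in the first phase, γ_adj > 0 in the second, and the two maximum learning rates come from the text. The remaining γ values, the example loss values, and all names are hypothetical, and the optimizer and learning rate schedule themselves are not implemented here.

```python
def total_loss(losses, gammas):
    """Weighted sum of the individual losses."""
    return sum(gammas[name] * value for name, value in losses.items())


PHASES = [
    {
        "name": "detector pre-training",
        "max_lr": 0.01,
        "gammas": {"c": 1.0, "l1": 1.0, "giou": 1.0, "adj": 0.0},
    },
    {
        "name": "joint training with adjacency loss",
        "max_lr": 0.001,
        "gammas": {"c": 1.0, "l1": 1.0, "giou": 1.0, "adj": 1.0},
    },
]

# Hypothetical loss values for one training batch.
losses = {"c": 0.3, "l1": 0.2, "giou": 0.5, "adj": 1.2}

for phase in PHASES:
    print(phase["name"], phase["max_lr"], total_loss(losses, phase["gammas"]))
```

In the first phase the adjacency loss is computed but does not contribute to the total, so the adjacency head receives no gradient from it until the second phase.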

In summary, according to various embodiments, a method for controlling a robotic device is provided, as illustrated in Figure 5.

Figure 5 shows a flowchart 500 illustrating a method for controlling a robotic device according to an embodiment.

In 501, for each image of a plurality of training images,

• feature vectors are determined from the training image by an encoder of a transformer encoder-decoder network, in 502;
• for each of a plurality of object bounding boxes, bounding box coordinates are determined from the feature vectors by a decoder of the encoder-decoder network and a bounding box head, in 503;
• object hierarchy information is determined, in 504, from the bounding box coordinates determined for the plurality of objects and from features for the bounding boxes derived from the feature vectors, the object hierarchy information specifying stacking relationships of the plurality of objects with respect to one another;
• classification information about objects corresponding to the object bounding boxes is determined by a classification head, in 505.
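The hierarchy step in 504 can be sketched as follows, following the scheme in which a per-box feature vector is concatenated with the box coordinates and several processing channels each produce an adjacency matrix component. This is a minimal illustration only: the pairwise scoring function inside each channel, the fixed weights, and the element-wise combination of the components are assumptions, since the text here does not specify them. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def build_input_vectors(box_features, box_coords):
    """Concatenate each box's feature vector with its bounding box coordinates."""
    return [list(f) + list(c) for f, c in zip(box_features, box_coords)]


def channel_component(inputs, weights):
    """Hypothetical channel: score every ordered pair (i, j) of boxes.

    The score is a sigmoid of a weighted difference of the two input vectors;
    a real channel would be a learned sub-network.
    """
    n = len(inputs)
    return [
        [
            sigmoid(sum(w * (a - b) for w, a, b in zip(weights, inputs[i], inputs[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


def adjacency_matrix(inputs, channel_weights):
    """Multiply the channel components (element-wise, by assumption)."""
    n = len(inputs)
    result = [[1.0] * n for _ in range(n)]
    for weights in channel_weights:
        component = channel_component(inputs, weights)
        result = [
            [result[i][j] * component[i][j] for j in range(n)] for i in range(n)
        ]
    return result


if __name__ == "__main__":
    features = [[0.2, -0.1], [0.4, 0.3]]          # per-box features (made up)
    coords = [[0.1, 0.1, 0.5, 0.5], [0.2, 0.4, 0.6, 0.9]]  # box coordinates (made up)
    inputs = build_input_vectors(features, coords)
    weights = [[1.0] * 6, [-0.5] * 6]             # two illustrative channels
    for row in adjacency_matrix(inputs, weights):
        print(row)
```

Each entry of the resulting matrix lies between 0 and 1 and can be read as a soft indication of a stacking relationship between two boxes.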

In 506, the neural network is trained to reduce a loss between ground truth for the hierarchy information, for the bounding box coordinates and for the classification information, and the hierarchy information, the bounding box coordinates and the classification information, respectively, determined for the training images.

This means that the neural network is trained with the sum of a loss between the ground truth for the hierarchy information and the derived hierarchy information, a loss between the ground truth for the bounding box coordinates and the derived bounding box coordinates, and a loss between the ground truth for the classification information and the derived classification information.
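The summed loss can be illustrated with a small sketch in plain Python. It assumes that predictions have already been matched to ground-truth objects, that the binary element-wise loss for the hierarchy is a binary cross-entropy, and that the generalized overlap loss is the generalized intersection over union (GIoU) loss; these choices, the weights `w_l1` and `w_giou`, the dictionary layout and all function names are illustrative assumptions, not details confirmed by the text.

```python
import math

EPS = 1e-12  # numerical floor that keeps log() finite


def cross_entropy(class_probs, target_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(class_probs[target_class], EPS))


def binary_elementwise_loss(pred_adj, true_adj):
    """Mean binary cross-entropy over all adjacency matrix entries."""
    total, count = 0.0, 0
    for pred_row, true_row in zip(pred_adj, true_adj):
        for p, t in zip(pred_row, true_row):
            p = min(max(p, EPS), 1.0 - EPS)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def l1_loss(box, gt):
    """Sum of absolute coordinate deviations."""
    return sum(abs(a - b) for a, b in zip(box, gt))


def generalized_iou_loss(box, gt):
    """1 - GIoU for non-degenerate boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    inter_h = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = inter_w * inter_h
    union = box_area(box) + box_area(gt) - inter
    hull = (max(box[2], gt[2]) - min(box[0], gt[0])) * \
           (max(box[3], gt[3]) - min(box[1], gt[1]))
    giou = inter / union - (hull - union) / hull
    return 1.0 - giou


def combined_loss(pred, truth, gamma_adj=1.0, w_l1=1.0, w_giou=1.0):
    """Sum of classification, bounding box and hierarchy losses."""
    n = len(truth["classes"])
    loss_cls = sum(cross_entropy(p, c)
                   for p, c in zip(pred["class_probs"], truth["classes"])) / n
    loss_box = sum(w_l1 * l1_loss(b, g) + w_giou * generalized_iou_loss(b, g)
                   for b, g in zip(pred["boxes"], truth["boxes"])) / n
    loss_adj = binary_elementwise_loss(pred["adj"], truth["adj"])
    return loss_cls + loss_box + gamma_adj * loss_adj
```

Setting `gamma_adj=0.0` removes the hierarchy term from the sum, which corresponds to training without the hierarchy loss.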

While in the above embodiments the approach of FIG. 5 is applied to the control of a robot arm, it can be applied to compute a control signal for controlling any technical system in a scenario in which an object hierarchy plays a role, e.g. a computer-controlled machine such as a robot, a vehicle, a household appliance, a power tool, a manufacturing machine, a personal assistant, or an access control system.
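As one illustration of how an object hierarchy could enter a control decision, the following sketch selects the objects that have nothing stacked on top of them, which a robot arm might grasp first. The convention that entry `adjacency[i][j]` indicates that object i lies on top of object j, the threshold of 0.5, and the function name are assumptions made for this example only.

```python
def unobstructed_objects(adjacency, threshold=0.5):
    """Return indices of objects that no other object lies on top of.

    Assumed convention: adjacency[i][j] >= threshold means that
    object i is stacked on object j.
    """
    n = len(adjacency)
    return [
        j for j in range(n)
        if all(adjacency[i][j] < threshold for i in range(n) if i != j)
    ]


if __name__ == "__main__":
    # Object 0 lies on object 1, which in turn lies on object 2.
    stack = [
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0],
    ]
    print(unobstructed_objects(stack))
```

A controller could repeat this selection after each grasp, re-running the detection on a fresh camera image so that the hierarchy reflects the changed scene.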

According to one embodiment, the method is computer-implemented.

Although specific embodiments have been shown and described herein, those of ordinary skill in the art will recognize that a variety of alternative and/or equivalent implementations may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. Therefore, this invention is intended to be limited only by the claims and their equivalents.

CITATIONS INCLUDED IN THE DESCRIPTION

This list of documents cited by the applicant was generated automatically and is included solely for the reader's information. The list is not part of the German patent or utility model application. The DPMA assumes no liability for any errors or omissions.

Cited non-patent literature

N. Carion et al., “End-to-end object detection with transformers,” in the European Conference on Computer Vision, Springer, 2020, pages 213-229 [0002]

Claims

1. Method for training a neural network to recognize objects in image data, comprising: for each image of a plurality of training images, determining feature vectors from the training image by an encoder of a transformer encoder-decoder network; determining, for each of a plurality of object bounding boxes, bounding box coordinates from the feature vectors by a decoder of the encoder-decoder network and a bounding box head; determining object hierarchy information from the bounding box coordinates determined for the plurality of objects and from features for the bounding boxes derived from the feature vectors, the object hierarchy information specifying stacking relationships of the plurality of objects with respect to one another; determining classification information about objects corresponding to the object bounding boxes by a classification head; and training the neural network to reduce a loss between ground truths for the hierarchy information, for the bounding box coordinates and for the classification information, and the hierarchy information, the bounding box coordinates and the classification information, respectively, determined for the training images.

2. Method according to claim 1, comprising transforming the feature vectors into a feature vector for each object bounding box, concatenating, for each object bounding box, the feature vector for the bounding box with the bounding box coordinates for the object bounding box into an input vector, and determining the object hierarchy information from the input vectors.

3. Method according to claim 2, comprising determining the object hierarchy information from the input vectors by a plurality of processing channels, each processing channel receiving the input vectors for the object bounding boxes and generating an adjacency matrix component, and multiplying the adjacency matrix components generated by the channels to generate an adjacency matrix specifying the object hierarchy information.

4. Method according to any one of claims 1 to 3, wherein the neural network is trained using a combined loss of a cross-entropy loss for the classification information, a binary element-wise loss for the object hierarchy information, and a linear combination of an absolute deviation loss and a generalized overlap loss for the bounding box coordinates.

5. Method for controlling a robotic device, comprising: training a machine learning model according to any one of claims 1 to 4; capturing a camera image showing one or more objects; feeding the camera image into the neural network to recognize the one or more objects; and controlling the robotic device taking into account the recognized one or more objects.

6. Controller configured to carry out a method according to any one of claims 1 to 5.

7. Computer program comprising instructions that, when executed by a computer, cause the computer to carry out a method according to any one of claims 1 to 5.

8. Computer-readable medium storing instructions that, when executed by a computer, cause the computer to carry out a method according to any one of claims 1 to 5.