DE102022201161A1 - Object classification with a single-stage meta-based object detector using class prototypes

Info

Publication number: DE102022201161A1
Application number: DE102022201161.9A
Authority: DE
Inventors: Eduardo Monari; Matthias Kayser; Karim Guirguis; Mohamed Abdelsamad
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2022-02-03
Filing date: 2022-02-03
Publication date: 2023-08-03

Abstract

Method (100) for detecting and classifying instances of objects (2) in an input image (1), comprising the steps of:

• extracting (110) a set of input image features (3) from the input image (1) by a feature extraction network (10);
• evaluating (120), from the input image features (3), by a localization network (20), locations (2a) of instances of objects (2) in the input image (1);
• obtaining (130), for each of a plurality of classes (4a-4c), a set of prototype features (5a-5c) in the domain of the input image features (3a-3c) that represents the respective class (4a-4c); and
• determining (140), from the input image features (3) in combination with the prototype features (5a-5c), classes (4a-4c) to which the instances of objects (2) in the input image (1) belong.

Description

The invention relates to single-stage object detectors that detect object instances in a query image and determine to which classes those object instances belong.

Background

Object detectors determine, for a given input image, which instances of objects are present where in the input image and to which class of a set of available classes each instance belongs. Such object detectors are mostly trained in a supervised manner on training images that are labeled as to which objects are present in them and where those objects are located. The training requires a large number of training images with objects belonging to each of the different classes to be recognized. A large amount of computing power is also required, usually in the form of graphics processing units (GPUs).

The classes of objects that the object detector must recognize may need to be updated after the initial training. In this case, it is desirable to be able to train the object detector only on the new objects that need to be recognized, without having to start the training from scratch.

Disclosure of the Invention

The invention provides a method for detecting and classifying instances of objects in an input image. The detecting may in particular include localizing. The input image can be of any suitable type. For example, it can be a camera image, a video image, a thermal image, a radar image, a LIDAR image, or a sonar image.

In the course of the method, a set of input image features is extracted from the input image by a feature extraction network. Here, a feature can comprise, for example, a compressed representation of the content in a specific receptive field. For example, processing the input image with a convolutional layer as a feature extractor produces a feature map of the input image. The feature extraction network may, for example, be a neural network that has been trained in any suitable manner to derive salient features from input images. It may have been trained specifically for this purpose in the context of detecting and classifying object instances, but this is not required. For example, an autoencoder arrangement consisting of an encoder that extracts features from input images and a decoder that reconstructs the original input image from the extracted features can be trained in an unsupervised manner. The encoder part of such an arrangement can be used as the feature extraction network.

From the features of the input image, a localization network determines locations of instances of objects in the input image. These locations can be determined as bounding boxes, for example. The localization network may have been trained in any suitable manner. For example, it may have been trained on training input images for which only bounding boxes of instances of objects are known, without regard to which class such an instance of an object might belong.

For each of a plurality of classes, a set of prototype features is obtained that represents the respective class. These prototype features are in the same domain as the input image features. That is, a comparison between the prototype features and the input image features is meaningful.

The prototype features can be obtained from any suitable source. For example, the prototype features can be pre-stored in association with the respective classes as the result of a previous feature extraction from support images with objects known to belong to certain classes. However, the prototype features can also be obtained over a network from an online source, for example. For example, a provider can continually evaluate support images representing new classes and create class prototypes, and end users wishing to classify given input images can dynamically retrieve all available class prototypes.

From the input image features in combination with the prototype features, the classes to which the instances of objects belong are determined. Any meaningful comparison between input image features on the one hand and class prototype features on the other, or any suitable processing that takes both input image features and class prototype features into account, may be used for this purpose. For example, such processing can be performed using a classification network, which can be trained on training input images with known instances of objects belonging to known classes.

A major advantage of this method is that it clearly separates work that is generic with respect to the classes to which object instances may belong from work that is class-specific. The extraction of features by the feature extraction network is the same regardless of the particular classes to which object instances might belong. It is a "core skill" that can be applied to any type of object. Likewise, a localization of object instances that was trained without regard to the classes to which localized instances of objects might belong does not have to be changed if the method is to be extended to the detection of objects of new classes.

To accommodate new classes, only the determination of classes based on input image features in combination with prototype features needs to be changed. Depending on how this determination is made, it may be sufficient to add a new set of prototype features for a new class to an existing library or database of associations between classes and prototype features. This can be done using just a few examples of the new class. Even if the determining is performed by a trainable entity, such as a classification network, and that classification network requires further training to accommodate new classes of objects, this can be done using just a few examples of objects belonging to the new classes. It may not even be necessary to physically acquire new images for this further training. For example, if existing training input images already show instances of objects belonging to the new classes, support images showing only these instances can be cropped out of the existing training input images for the further training.

That is, the present method is particularly suited to few-shot object detection, where the ability to detect object instances of new classes is to be added using only a few example object instances of those classes.

One exemplary application is updating the traffic sign recognition in an at least partially automated vehicle. From time to time, new traffic signs are created and existing traffic signs are modified or abolished. In most cases, an update comprises only a few new traffic signs. Moreover, since these signs have yet to become widespread, only a few training examples of traffic situations with these new traffic signs are available. With the method described here, the image classifier and/or object detector can be trained to recognize the new traffic signs.

Another exemplary application is updating the object recognition of a robot that is to handle different objects. For example, such a robot can be used on a pick-and-place production line. When a new part is introduced, the robot must be taught to handle this new part.

In particular, since the new classes are tied to new class prototypes in a transparent manner, the risk that further training will unintentionally impair the ability to recognize previously learned classes of objects is very small. That is, adding a new traffic sign to a traffic sign recognition will not cause the recognition system to miss older but important signs, such as a stop sign. Likewise, adding a new part to the repertoire of a pick-and-place robot does not cause it to forget how to handle parts it already knows.

Another advantage of the present method is that it can be easily modularized. The extracting of features by the feature extraction network, the evaluating of locations by the localization network, and the determining of the classes need not be performed by the same entity. For example, some or all of these tasks can be outsourced to a cloud platform. For example, the cloud platform can offer, as a first microservice, the extraction of features from a given input image, which as such is usable for many purposes. The localization of object instances based on input image features can then be offered as a second microservice, which as such is likewise usable for many purposes. A functionality for comparing two sets of features in the domain of extracted input image features can be offered as a third microservice and used to compare input image features with class prototype features.

The present method implements a single-stage detector that operates on the entire image and directly predicts locations (e.g., bounding box offsets) and classes (e.g., class probabilities or other classification scores). In contrast, two-stage detectors first use a region proposal network to generate class-agnostic proposals of regions that most likely contain an object, and then refine the box proposals and classify them as either object or background. Single-stage detectors are much easier to implement in embedded systems and also offer faster inference times than two-stage detectors.

In a particularly advantageous embodiment, obtaining prototype features comprises extracting the prototype features, by the feature extraction network, from at least one support image with an object belonging to the respective class. Ideally, such a support image shows exactly one object instance of the respective class and no objects of other classes. This one object instance should also fill the support image as completely as possible.

For example, prototype features can be extracted from multiple support images belonging to one and the same class and aggregated into a set of prototype features that represents that class. For example, feature maps obtained by extracting features from support images may be processed by region-of-interest (ROI) pooling and/or global average pooling (GAP). Multiple feature maps thus obtained from multiple support images can then be aggregated by averaging.
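The aggregation described above can be illustrated with a minimal NumPy sketch. It assumes that each support feature map has the layout (channels, height, width) and that only GAP followed by averaging is used; the function names and the toy values are hypothetical and not taken from the application.

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse a (channels, height, width) feature map to a (channels,) vector."""
    return feature_map.mean(axis=(1, 2))

def build_prototype(support_feature_maps):
    """Average the pooled feature vectors of all support images of one class."""
    pooled = np.stack([global_average_pool(fm) for fm in support_feature_maps])
    return pooled.mean(axis=0)

# Two toy support feature maps (2 channels, 2x2 spatial) for the same class.
fm_a = np.array([[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 4.0]]])
fm_b = np.array([[[2.0, 2.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]]])

prototype = build_prototype([fm_a, fm_b])
print(prototype)
```

In this sketch each support image contributes equally to the prototype. Whether pooling precedes averaging, and whether ROI pooling is applied first, is left open by the text.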

In a particularly advantageous embodiment, in response to determining that input image features corresponding to an instance of an object are similar, according to a predetermined criterion, to the prototype features of a class, that instance of an object is determined to belong to the class represented by that prototype. This similarity can be evaluated using any suitable function or trainable model. In particular, the similarity between input image features and prototype features can be evaluated according to a distance between the input image features and the prototype features measured in the feature space. When a fixed, non-trainable function is used to measure the distance, adding a new object class can be reduced to adding a new class prototype to an existing library or database of prototypes.
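As an illustration of a fixed, non-trainable distance function, the following sketch assigns an instance to the class of the nearest prototype. The Euclidean distance, the threshold `max_distance` used as the predetermined criterion, and all class names and feature values are assumptions made for this example only.

```python
import numpy as np

def classify_by_distance(instance_features, prototypes, max_distance):
    """Return the label of the closest prototype, or None if none is close enough."""
    best_label, best_distance = None, float("inf")
    for label, prototype in prototypes.items():
        distance = float(np.linalg.norm(instance_features - prototype))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance <= max_distance else None

# Hypothetical prototype library with two known classes.
prototypes = {
    "stop_sign": np.array([1.0, 0.0]),
    "yield_sign": np.array([0.0, 1.0]),
}
unknown = np.array([3.0, 3.1])
label_before = classify_by_distance(unknown, prototypes, max_distance=1.0)

# Adding a new class only means adding one entry; nothing is retrained.
prototypes["new_sign"] = np.array([3.0, 3.0])
label_after = classify_by_distance(unknown, prototypes, max_distance=1.0)
label_known = classify_by_distance(np.array([0.9, 0.1]), prototypes, max_distance=1.0)
```

Because the existing entries of the library are not modified, instances of previously known classes are classified exactly as before the update.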

In a further particularly advantageous embodiment, determining the classes comprises evaluating, by a classification network, fusions of input image features with prototype features that represent different classes. For example, if the input image features are in some way similar to the class prototype features, this may cause the fusion of those features to have certain properties that can then be recognized by the classification network. This is roughly analogous to detecting a chemical substance in a sample by means of an indicator reagent for the substance being sought that is brought into contact with the sample. If the substance being sought is present in the sample, the indicator reagent changes color. Preferably, the localization network still receives only the input image features, but not the prototype features. It has been found experimentally that feeding fusions of input image features with prototype features into the localization network degrades the accuracy of the localization.

One exemplary way of computing the fusion for each class is a Hadamard product (element-wise product) of the input image features with the prototype features representing that class. A feature component of the fusion will be non-zero only if both the respective component of the input image features and the respective component of the prototype features are non-zero.
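A small NumPy sketch of such a fusion follows. It assumes that the input image features form a (channels, height, width) map and that each prototype is a (channels,) vector that is broadcast over all spatial positions; this layout and the toy values are assumptions for illustration.

```python
import numpy as np

def fuse(image_features, prototype):
    """Channel-wise Hadamard product of a (C, H, W) map with a (C,) prototype."""
    return image_features * prototype[:, None, None]

# Toy input image features with C=3 channels and a 1x2 spatial grid.
image_features = np.array([[[1.0, 2.0]], [[0.0, 3.0]], [[4.0, 0.0]]])
prototypes = {
    "class_a": np.array([1.0, 0.0, 2.0]),
    "class_b": np.array([0.0, 1.0, 0.0]),
}

# One fused feature map per class; each would be passed to the classification network.
fusions = {name: fuse(image_features, p) for name, p in prototypes.items()}
print(fusions["class_a"][:, 0, :])
```

The fused map for a class vanishes wherever either the image feature or the prototype component is zero, so only channels that are active in both survive.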

In a particularly advantageous embodiment, the feature extraction network comprises at least

• a plurality of convolutional layers that generate feature maps of their respective input, these feature maps having a lower dimensionality than the respective input;
• a plurality of upsampling layers that upsample feature maps to a higher dimensionality; and
• at least one lateral connection that transfers information from a feature map generated by a convolutional layer into an upsampled feature map of the same dimensionality generated by an upsampling layer.

Such a feature extraction network generates multi-scale features, so that object instances of a variety of sizes can be detected. In particular, the lateral connection preserves information that would otherwise be lost when the dimensionality of the input image is first reduced by the convolutions and then increased again by the upsampling.
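The data flow of such a network can be sketched as follows. The sketch makes several simplifying assumptions that are not stated in the text: 2x2 average pooling stands in for a learned strided convolution, upsampling is nearest-neighbour, and the lateral connection is realized as an element-wise addition.

```python
import numpy as np

def downsample(x):
    """Halve the spatial size by 2x2 average pooling (stand-in for a strided convolution)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Double the spatial size by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

image = np.arange(16, dtype=float).reshape(1, 4, 4)

# Bottom-up path: feature maps of decreasing dimensionality.
c1 = downsample(image)   # shape (1, 2, 2)
c2 = downsample(c1)      # shape (1, 1, 1)

# Top-down path with a lateral connection at the 2x2 scale.
p2 = c2
p1 = upsample(p2) + c1   # lateral connection re-injects the finer detail of c1

multi_scale_features = [p1, p2]
```

Without the addition of `c1`, the upsampled map would be constant over each 2x2 block, which illustrates the information the lateral connection preserves.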

The ultimate purpose of the object detection method is to automatically control technical systems based on the localized and classified object instances. Therefore, in a further advantageous embodiment, at least one control signal is computed from locations and classes of object instances determined in the course of the method. A vehicle, a robot, a monitoring system, a medical imaging system and/or a quality inspection system can then be controlled with the control signal.
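How a control signal might be derived from the detections is sketched below for the vehicle case. The `Detection` structure, the class names, the score threshold and the signal values "brake" and "continue" are all hypothetical; the text does not specify any particular mapping.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # class determined for the object instance
    box: tuple    # location as (x_min, y_min, x_max, y_max) in pixels
    score: float  # classification score in [0, 1]

def control_signal(detections, min_score=0.5):
    """Return "brake" if a sufficiently confident stop sign was detected, else "continue"."""
    for detection in detections:
        if detection.label == "stop_sign" and detection.score >= min_score:
            return "brake"
    return "continue"

detections = [
    Detection("speed_limit", (10, 10, 40, 40), 0.9),
    Detection("stop_sign", (100, 20, 160, 80), 0.8),
]
signal = control_signal(detections)
```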

As discussed above, one advantage of the method described above is that it can easily be updated to detect object instances of new classes when the need arises. The invention therefore also provides a method for training a combination of at least a localization network and a classification network for use in the method described above.

In the course of this method, at least one set of query images is provided. Each such query image comprises one or more instances of objects, and each such instance is labeled with a class label and a location of the object instance.

In addition, at least one set of support images is provided. Each support image comprises an instance of an object belonging to a class. For example, as discussed above, support images may be cropped from query images and scaled up.

For each class, prototype features representing that class are extracted by a feature extraction network from one or more support images with an object belonging to that class. Likewise, a set of query image features is extracted from each query image with the feature extraction network. The extraction of features from support images on the one hand and from query images on the other hand can share one and the same instance of the feature extraction network. However, the two tasks can also use separate instances of the feature extraction network that operate on identical model parameters (e.g. neural network weights). That is, the separate instances can be regarded as Siamese instances.
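The two variants of sharing the feature extraction network can be illustrated with a minimal sketch. The toy linear extractor, the class name `FeatureExtractor` and all numeric values are assumptions made purely for illustration; the description does not prescribe any particular network architecture.

```python
from typing import List

Vector = List[float]


class FeatureExtractor:
    """Toy linear feature extractor; a hypothetical stand-in for a real backbone."""

    def __init__(self, weights: List[Vector]) -> None:
        # The list is stored by reference, so several instances built from
        # the same list operate on identical model parameters.
        self.weights = weights

    def __call__(self, image: Vector) -> Vector:
        # One output feature per weight row (matrix-vector product).
        return [sum(w * x for w, x in zip(row, image)) for row in self.weights]


shared_weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

# Variant 1: a single instance processes both support and query images.
extractor = FeatureExtractor(shared_weights)

# Variant 2: two "Siamese" instances operating on identical parameters.
support_branch = FeatureExtractor(shared_weights)
query_branch = FeatureExtractor(shared_weights)

image = [2.0, 4.0, 6.0]  # flattened toy "image"
features_single = extractor(image)
features_support = support_branch(image)
features_query = query_branch(image)
```

In both variants the same input yields the same features, and an update of the shared parameters affects the support branch and the query branch alike.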

From the query image features, the localization network to be trained evaluates locations of objects in the query image. From the query image features in combination with prototype features, the classification network determines classes to which the instances of objects in the query image belong.

By means of at least one predetermined loss function, it is rated

• how well the evaluated locations of object instances correspond to the locations with which object instances are labeled in the query image, and
• how well the determined classes of the instances of objects correspond to the class labels with which the instances of objects are labeled.
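A loss function that rates both aspects could, for example, be formed as a weighted sum of a localization term and a classification term. The following sketch assumes a smooth L1 term on box coordinates and a cross-entropy term on class probabilities; these concrete terms, the box format `(x1, y1, x2, y2)` and the parameter `loc_weight` are illustrative assumptions and are not specified in the description.

```python
import math
from typing import Sequence


def smooth_l1(pred_box: Sequence[float], true_box: Sequence[float]) -> float:
    """Smooth L1 distance between two boxes, summed over their coordinates."""
    total = 0.0
    for p, t in zip(pred_box, true_box):
        diff = abs(p - t)
        # Quadratic near zero, linear for larger deviations.
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def cross_entropy(class_probs: Sequence[float], true_class: int) -> float:
    """Negative log-probability assigned to the labeled class."""
    return -math.log(max(class_probs[true_class], 1e-12))


def detection_loss(
    pred_box: Sequence[float],
    true_box: Sequence[float],
    class_probs: Sequence[float],
    true_class: int,
    loc_weight: float = 1.0,  # hypothetical weighting between the two terms
) -> float:
    """Rates location agreement and class agreement in one scalar."""
    loc_term = smooth_l1(pred_box, true_box)
    cls_term = cross_entropy(class_probs, true_class)
    return loc_weight * loc_term + cls_term
```

A lower value indicates that both the evaluated location and the determined class are closer to the labels of the query image.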

Parameters that characterize the behavior of the localization network and of the classification network are optimized such that the rating by the at least one loss function is likely to improve. In principle, it is possible to first fully optimize the localization network and then to optimize the classification network. However, it is more advantageous to combine the optimization of both networks, since they depend on each other. If the localization of object instances is made more reliable, e.g. if bounding boxes correspond more exactly to the actual locations of object instances, this also improves the performance of the classification.

Training the classification, which is based on query image features in combination with prototype features, effectively learns a distance measure in the feature space, even if such a distance measure is not explicitly formulated. This qualifies the training method as a meta-learning method.

The feature extraction network can be used as it is, in an already trained state. In a particularly advantageous embodiment, however, the training also covers the feature extraction network. That is, parameters that characterize the behavior of the feature extraction network are likewise optimized such that the rating by the loss function is likely to improve.

In one example, the loss function combines the max-margin loss and the RetinaNet focal loss in order to help the feature extraction network be a representation learner instead of being distracted by learning anchor-related information.

In a further particularly advantageous embodiment, the training is carried out in two stages. In a first stage, a first, larger set of query images and a first, larger set of support images with object instances from a set of base classes C_b are used. In this first stage, the parameters that characterize the behavior of the feature extraction network are optimized together with the parameters that characterize the behavior of the localization network and of the classification network. In a second stage, a second, smaller set of query images and a second, smaller set of support images with object instances from a set of new classes C_n are used. In this second stage, the parameters that characterize the behavior of the feature extraction network are frozen. Only the parameters that characterize the behavior of the localization network and of the classification network are trained further.

In this way, if the feature extraction network was previously trained on a large set of examples, the credible knowledge learned from that large set of examples is not overturned on the basis of a much smaller set of examples.

Optionally, the parameters that characterize the behavior of the localization network can also be frozen in the second stage. As previously discussed, the localization of object instances is a fairly general capability that does not depend on the classes to which the object instances belong. Therefore, if this capability has been trained on a large set of input images, the result of this training can be considered credible. It may then not be plausible to change properties of the localization on the basis of a much smaller, later set of images without a specific reason.
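The two-stage schedule, including the optional freezing of the localization network in the second stage, can be summarized as a small lookup of which parameter groups are trainable. The function name `trainable_flags` and the group names are hypothetical and serve only as a sketch.

```python
from typing import Dict


def trainable_flags(stage: int, freeze_localization: bool = False) -> Dict[str, bool]:
    """Return which parameter groups are optimized in the given training stage.

    Stage 1 (base classes C_b): all three networks are trained jointly.
    Stage 2 (new classes C_n): the feature extraction network is frozen;
    the localization network may optionally be frozen as well.
    The flag freeze_localization is ignored in stage 1.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    if stage == 1:
        return {
            "feature_extractor": True,
            "localization": True,
            "classification": True,
        }
    return {
        "feature_extractor": False,
        "localization": not freeze_localization,
        "classification": True,
    }
```

In a concrete framework, such flags would typically decide which parameters are handed to the optimizer in each stage.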

Optionally, a foreground re-weighting scheme can be used in the second stage in order to increase the weight of foreground regions. This can be achieved, for example, by using a higher alpha in the focal loss component of the loss function while applying size jittering to the input images so as to cover all levels of the feature pyramid network. This multi-scale treatment yields more foreground samples, and a higher alpha therefore concentrates more on these foreground samples than on the background samples.
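The effect of a higher alpha can be illustrated with the binary form of the focal loss for a single sample. The sketch below uses the usual formulation FL = -alpha_t (1 - p_t)^gamma log(p_t); the default values alpha = 0.25 and gamma = 2.0 are commonly used defaults and are assumptions here, since the description does not state concrete values.

```python
import math


def focal_loss(p: float, is_foreground: bool, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one sample.

    p is the predicted foreground probability. Foreground samples are
    weighted by alpha, background samples by (1 - alpha).
    """
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    p_t = p if is_foreground else 1.0 - p
    alpha_t = alpha if is_foreground else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Raising alpha shifts weight from background samples to foreground samples.
fg_low_alpha = focal_loss(0.3, True, alpha=0.25)
fg_high_alpha = focal_loss(0.3, True, alpha=0.5)
bg_low_alpha = focal_loss(0.3, False, alpha=0.25)
bg_high_alpha = focal_loss(0.3, False, alpha=0.5)
```

With the higher alpha, the same foreground sample contributes more to the loss and the same background sample contributes less.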

In a further advantageous embodiment, the set of support images for use with at least one query image is constructed such that it contains object instances belonging to only a subset of the classes with which object instances in the query image are labeled. For example, an image of a complex traffic situation can comprise object instances belonging to a great many classes. Limiting the number of classes to be used for the training can reduce the hardware and computing time requirements of the training. For example, the subset of classes to be considered can be selected at random for each iteration of the training.
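The random selection of a class subset for one training iteration could look as follows. The function name `sample_class_subset` and the parameter `max_classes` are hypothetical; the sketch only assumes that the class labels of the query image are available as strings.

```python
import random
from typing import Iterable, List, Optional


def sample_class_subset(
    query_classes: Iterable[str],
    max_classes: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly pick at most max_classes of the classes labeled in a query image."""
    rng = rng or random.Random()
    classes = sorted(set(query_classes))  # de-duplicate, fix the order
    k = min(max_classes, len(classes))
    return rng.sample(classes, k)


# A fresh subset can be drawn for every training iteration.
labels_in_query = ["car", "person", "bicycle", "truck", "car"]
subset = sample_class_subset(labels_in_query, max_classes=2, rng=random.Random(0))
```

Only support images of the classes in `subset` would then be used together with this query image in the respective iteration.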

In another advantageous embodiment, the set of support images for use with at least one query image is constructed such that it contains object instances belonging to a union of the classes with which object instances in the query image are labeled and additional, randomly selected classes. For example, the set of support images to be used with each query image can be constructed such that it contains support images from a given number N of different classes with a given number K of support images per class. This new support-query construction algorithm supports the detection of multiple classes per single query image. This in turn leads to more foreground detections with fewer background samples, which improves the overall performance of the detector.
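One possible way to construct such an N-class, K-image support set is sketched below. The function name `build_support_set`, the dictionary `support_pool` and the file names are hypothetical. The sketch further assumes that every class in the pool offers at least K support images, that all classes of the query image occur in the pool, and that all query classes are kept even if there are more than N of them.

```python
import random
from typing import Dict, Iterable, List, Optional


def build_support_set(
    query_classes: Iterable[str],
    support_pool: Dict[str, List[str]],
    n_classes: int,
    k_shots: int,
    rng: Optional[random.Random] = None,
) -> Dict[str, List[str]]:
    """Select K support images for each of the query classes plus random extras."""
    rng = rng or random.Random()
    chosen = sorted(set(query_classes))
    # Candidate classes that do not occur in the query image.
    others = sorted(set(support_pool) - set(chosen))
    # Fill up to N classes with randomly selected additional classes.
    n_extra = min(max(0, n_classes - len(chosen)), len(others))
    chosen += rng.sample(others, n_extra)
    return {cls: rng.sample(support_pool[cls], k_shots) for cls in chosen}


pool = {
    cls: [f"{cls}_{i}.png" for i in range(4)]
    for cls in ["car", "person", "truck", "bus", "dog"]
}
support_set = build_support_set(
    {"car", "person"}, pool, n_classes=4, k_shots=2, rng=random.Random(1)
)
```

The resulting set always covers the classes present in the query image, so that several classes can be detected in one and the same query image.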

Using sets of support images with multiple classes trains the object detector better for multi-class tasks in which many objects belonging to different classes have to be distinguished from one another. The applications mentioned above, namely the assessment of traffic situations and the control of a pick-and-place robot, are typical examples of such multi-class problems.

The methods can be computer-implemented in whole or at least in part. The invention therefore also relates to a computer program with machine-readable instructions which, when executed by one or more computers or computing instances, cause the one or more computers to carry out a method described above. Computing instances include, for example, virtual machines, serverless computing environments, and other computing resources available in a cloud.

The invention also relates to a non-volatile storage medium and/or a download product with the computer program. A download product is a product that can be sold in an online shop for immediate fulfillment by download. The invention also provides one or more computers and/or computing instances with the one or more computer programs and/or with the one or more non-volatile machine-readable storage media and/or download products.

In the following, the description is illustrated with reference to figures, without any intention to limit the scope of the invention.

List of figures

The figures show:

Figure 1: an exemplary embodiment of the method 100 for detecting and classifying instances of objects in an input image;
Figure 2: an exemplary embodiment of the method 200 for training a combination of at least one localization network and one classification network;
Figure 3: an exemplary illustration of an application of the method 200.

Figure 1 is a schematic flow diagram of an embodiment of the method 100 for detecting and classifying instances of objects 2 in an input image 1.

In step 110, a feature extraction network 10 extracts a set of input image features 3 from the input image 1.

In step 120, a localization network 20 evaluates locations 2a of instances of objects 2 in the input image 1 from the input image features 3.

In step 130, for each of a plurality of classes 4a-4c, a set of prototype features 5a-5c in the domain of the input image features 3a-3c is obtained that represents the respective class 4a-4c.

In particular, according to block 131, the feature extraction network 10 can extract the prototype features 5a-5c from at least one support image 6 with an object 2 belonging to the respective class 4a-4c. According to block 131a, the prototype features 5a-5c can be extracted from multiple support images 6 belonging to one and the same class 4a-4c and aggregated into a set of prototype features 5a-5c representing this class 4a-4c.
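The aggregation according to block 131a can be sketched as follows. Element-wise averaging is assumed here as the aggregation rule; this choice, the function name `aggregate_prototypes` and the toy feature vectors are illustrative assumptions, since the description leaves the concrete aggregation open.

```python
from typing import Dict, List


def aggregate_prototypes(
    support_features: Dict[str, List[List[float]]],
) -> Dict[str, List[float]]:
    """Average the feature vectors of all support images of each class."""
    prototypes: Dict[str, List[float]] = {}
    for cls, vectors in support_features.items():
        n = len(vectors)
        # zip(*vectors) iterates over feature dimensions across all support images.
        prototypes[cls] = [sum(column) / n for column in zip(*vectors)]
    return prototypes


# Features extracted from two support images of one class (toy values).
features_per_class = {"airplane": [[1.0, 2.0], [3.0, 4.0]]}
prototypes = aggregate_prototypes(features_per_class)
```

Each class is then represented by a single set of prototype features, independent of the number of support images available for it.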

In step 140, classes 4a-4c to which the instances of objects 2 in the input image 1 belong are determined from the input image features 3 in combination with the prototype features 5a-5c.

According to block 141, it can be checked whether input image features 3 corresponding to an instance of an object 2 are similar, according to a predetermined criterion, to prototype features 5a-5c of a class 4a-4c. If this is the case (truth value 1), it is determined according to block 142 that this instance of an object 2 belongs to the class 4a-4c represented by the prototype 5a-5c.
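A similarity check of this kind can be sketched with an explicit similarity measure. The sketch assumes cosine similarity with a fixed threshold as the predetermined criterion and, as a further assumption, selects the most similar class if several prototypes pass the threshold. The function names and the threshold value 0.8 are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 for zero vectors)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def classify_by_similarity(
    features: Sequence[float],
    prototypes: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # hypothetical similarity criterion
) -> Optional[str]:
    """Return the most similar class passing the threshold, or None."""
    best_cls, best_sim = None, threshold
    for cls, proto in prototypes.items():
        sim = cosine_similarity(features, proto)
        if sim >= best_sim:
            best_cls, best_sim = cls, sim
    return best_cls
```

If no prototype is sufficiently similar, the function returns `None`, which corresponds to the case that the object instance is not assigned to any of the represented classes.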

According to block 143, fusions 7a-7c of input image features 3 with prototype features 5a-5c representing different classes 4a-4c can be evaluated by a classification network 30 in order to determine the classes 4a-4c. According to block 143a, the fusion 7a-7c for each class 4a-4c can be computed as a Hadamard product of the input image features 3 with the prototype features 5a-5c representing that class 4a-4c.
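The Hadamard product according to block 143a is an element-wise multiplication of two feature vectors of equal length. The following minimal sketch computes one fusion per class; the function names and the toy values are assumptions, and the subsequent evaluation by the classification network 30 is not modeled.

```python
from typing import Dict, List, Sequence


def hadamard(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise (Hadamard) product of two equally long vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def fuse_with_prototypes(
    features: Sequence[float],
    prototypes: Dict[str, Sequence[float]],
) -> Dict[str, List[float]]:
    """Compute one fusion of the image features per class prototype."""
    return {cls: hadamard(features, proto) for cls, proto in prototypes.items()}


fusions = fuse_with_prototypes(
    [1.0, 2.0, 3.0],
    {"airplane": [1.0, 0.0, 2.0], "person": [0.5, 0.5, 0.5]},
)
```

Each fusion emphasizes those feature dimensions in which the image features and the prototype of the respective class are both strongly expressed.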

In step 150, at least one control signal 150a is calculated from the determined locations 2a and classes 4a-4c of instances of objects 2.

In step 160a, a vehicle 50, a robot 60, a monitoring system 70, a medical imaging system 80 and/or a quality inspection system 90 is controlled with the control signal 150a.

Figure 2 is a schematic flow diagram of an embodiment of the method 200 for training a combination of at least one localization network 20 and one classification network 30 for use in the method 100 described above.

In step 210, at least one set of query images 8 is provided. Each query image 8 comprises one or more instances of objects 2. Each such instance is labeled with a class label 2b* and a location 2a* of the instance of the object 2.

In step 220, at least one set of support images 6 is provided. Each support image 6 comprises an instance of an object 2 belonging to a class 4a-4c.

According to block 221, the set of support images 6 for use with at least one query image 8 can be constructed such that it contains instances of objects 2 belonging to only a subset of the classes 4a-4c for which instances of objects 2 in the query image 8 have a class label 2b*.

According to block 222, the set of support images 6 for use with at least one query image 8 can be constructed such that it contains instances of objects 2 belonging to a union of the classes 4a-4c for which instances of objects 2 in the query image 8 have a class label 2b* and additional, randomly selected classes 4a-4c.

In step 230, a feature extraction network 10 extracts, for each class 4a-4c, prototype features 5a-5c representing that class 4a-4c from one or more support images 6 with an object 2 belonging to that class 4a-4c.

In step 240, the feature extraction network 10 extracts a set of query image features 9 from each query image 8.

In step 250, the localization network 20 evaluates locations 2a of instances of objects 2 in the query image 8 from the query image features 9.

In step 260, the classification network 30 determines classes 4a-4c to which the instances of objects 2 in the query image 8 belong from query image features 9 in combination with prototype features 5a-5c.

In step 270, it is rated by means of at least one predetermined loss function L

• how well the evaluated locations 2a of instances of objects 2 correspond to the locations 2a* with which instances of objects 2 are labeled in the query image 8, and
• how well the determined classes 4a-4c of the object instances correspond to the class labels 2b* with which the object instances 2 are labeled.

In step 280, parameters 20a, 30a that characterize the behavior of the localization network 20 and of the classification network 30 are optimized such that the rating 270a by the at least one loss function L is likely to improve. The training can be continued based on further query image features 9 and prototype features 5a-5c until a predetermined termination criterion is met. The finally optimized states of the parameters 20a, 30a are labeled with the reference signs 20a* and 30a*, respectively.

In step 290, parameters 10a that characterize the behavior of the feature extraction network 10 are optimized such that the rating 270a by the loss function L is likely to improve. This training can be continued based on further query images 8 and support images 6 until a predetermined termination criterion is met. The finally optimized state of the parameters 10a is labeled with the reference sign 10a*.

In particular, according to block 290a, in a first stage of the training, the parameters 10a can be optimized based on a first, larger set of query images 8 and a first, larger set of support images 6 with instances of objects 2 from a set of base classes C_b. According to block 290b, in a second stage of the training, which is based on a second, smaller set of query images 8 and a second, smaller set of support images 6 with instances of objects 2 from a set of new classes C_n, the parameters 10a can be frozen.

Figure 3 illustrates the application of the method 200 using a simple example. An exemplary query image 8 shows a scene at an airport comprising an airplane and two persons. Two exemplary support images 6 for the class "airplane" each comprise only an airplane as the sole object 2.

The feature extraction network 10 extracts query image features 9 from the query image 8. From these query image features 9, the localization network 20 determines the location 2a of the airplane in the form of a bounding box.

The feature extraction network 10 also extracts prototype features 5a-5c from the support images 6. These prototype features 5a-5c are condensed in the pooling layers P before being fused with the query image features 9 into fusions 7a-7c. The fusions 7a-7c are fed into a classification network 30, which determines the class 4a-4c to which each instance of an object 2 in the query image 8 belongs. In Figure 3, this is shown only for the example of the airplane.

Claims

Method (100) for recognizing and classifying instances of objects (2) in an input image (1), comprising the steps of: • extracting (110) a set of input image features (3) from the input image (1) by a feature extraction network (10); • evaluating (120), from the input image features (3), by a localization network (20), locations (2a) of instances of objects (2) in the input image (1); • obtaining (130), for each of a plurality of classes (4a-4c), a set of prototype features (5a-5c) in the domain of the input image features (3a-3c) representing the respective class (4a-4c); and • determining (140), from the input image features (3) in combination with the prototype features (5a-5c), classes (4a-4c) to which the instances of objects (2) in the input image (1) belong.

Method (100) according to claim 1, wherein obtaining (130) the prototype features (5a-5c) comprises extracting (131), by the feature extraction network (10), the prototype features (5a-5c) from at least one support image (6) with an object (2) belonging to the respective class (4a-4c).

Method (100) according to claim 2, wherein prototype features (5a-5c) are extracted (131a) from a plurality of support images (6) belonging to one and the same class (4a-4c) and are aggregated into a set of prototype features (5a-5c) representing this class (4a-4c).

Method (100) according to any one of claims 1 to 3, wherein, in response to determining (141) that input image features (3) corresponding to an instance of an object (2) are similar to prototype features (5a-5c) of a class (4a-4c) according to a predetermined criterion, it is determined (142) that this instance of an object (2) belongs to the class (4a-4c) represented by the prototype features (5a-5c).

Method (100) according to claim 4, wherein the similarity between input image features (3) and prototype features (5a-5c) is evaluated (141a) according to a distance measured between the input image features (3) and the prototype features (5a-5c) in a feature space.

Method (100) according to any one of claims 1 to 5, wherein determining (140) the classes (4a-4c) comprises evaluating (143), by a classification network (30), fusions (7a-7c) of input image features (3) with prototype features (5a-5c) representing different classes (4a-4c).

Method (100) according to claim 6, wherein the fusion (7a-7c) is computed (143a) for each class (4a-4c) as a Hadamard product of the input image features (3) with the prototype features (5a-5c) representing that class (4a-4c).

Method (100) according to any one of claims 1 to 7, wherein the feature extraction network (10) comprises at least • a plurality of convolutional layers generating feature maps of their respective input, said feature maps having a lower dimensionality than the respective input; • a plurality of upsampling layers that upsample feature maps to a higher dimensionality; and • at least one lateral connection that transfers information from a feature map generated by a convolutional layer into an upsampled feature map of the same dimensionality generated by an upsampling layer.

Method (100) according to any one of claims 1 to 8, further comprising: • calculating (150) at least one control signal (150a) from the determined locations (2a) and classes (4a-4c) of instances of objects (2); and • actuating (160), with the control signal (150a), a vehicle (50), a robot (60), a monitoring system (70), a medical imaging system (80) and/or a quality inspection system (90).

A method (200) for training a combination of at least one localization network (20) and a classification network (30) for use in the method (100) according to any one of claims 1 to 9, comprising the steps of: • providing (210) at least one set of query images (8), each query image (8) comprising one or more instances of objects (2), each such instance being labelled with a class label (2b*) and a location (2a*) of the instance of the object (2); • providing (220) at least one set of support images (6), each support image (6) comprising an instance of an object (2) belonging to a class (4a-4c); • extracting (230), for each class (4a-4c), by a feature extraction network (10), from one or more support images (6) with an object (2) belonging to that class (4a-4c), prototype features (5a-5c) representing this class (4a-4c); • extracting (240), by the feature extraction network (10), a set of query image features (9) from each query image (8); • evaluating (250), from these query image features (9), by the localization network (20), locations (2a) of instances of objects (2) in the query image (8); • determining (260), from query image features (9) in combination with prototype features (5a-5c), by the classification network (30), classes (4a-4c) to which the instances of objects (2) in the query image (8) belong; • evaluating (270), using at least one predetermined loss function (L), ◯ how well the evaluated locations (2a) of instances of objects (2) correspond to the locations (2a*) with which the instances of objects (2) are labelled in the query image (8), and ◯ how well the determined classes (4a-4c) of the instances of objects (2) correspond to the class labels (2b*) with which the instances of objects (2) are labelled; and • optimizing (280) parameters (20a, 30a) characterizing the behavior of the localization network (20) and the classification network (30) such that the assessment (270a) by the at least one loss function (L) is likely to improve.

Method (200) according to claim 10, further comprising optimizing (290) parameters (10a) characterizing the behavior of the feature extraction network (10) such that the assessment (270a) by the loss function (L) is likely to improve.

Method (200) according to claim 11, wherein • in a first stage, based on a first, larger set of query images (8) and a first, larger set of support images (6) with instances of objects (2) from a set of base classes C_b, the parameters (10a) characterizing the behavior of the feature extraction network (10) are optimized (290a), and • in a second stage, based on a second, smaller set of query images (8) and a second, smaller set of support images (6) with instances of objects (2) from a set of new classes C_n, the parameters (10a) characterizing the behavior of the feature extraction network (10) are frozen.

Method (200) according to any one of claims 10 to 12, wherein the set of support images (6) for use with at least one query image (8) is constructed (221) to contain instances of objects (2) that belong only to a subset of the classes (4a-4c) with which instances of objects (2) in the query image (8) are labelled (2b*).

Method (200) according to any one of claims 10 to 13, wherein the set of support images (6) for use with at least one query image (8) is constructed (222) to contain instances of objects (2) belonging to a union of the classes (4a-4c) with which instances of objects (2) in the query image (8) are labelled (2b*) and additional, randomly selected classes (4a-4c).

A computer program comprising machine-readable instructions which, when executed by one or more computers or compute instances, cause the one or more computers or compute instances to perform a method (100, 200) according to any one of claims 1 to 14.

Non-volatile storage medium with the computer program according to claim 15.

One or more computers or compute instances with the computer program according to claim 15 and/or with the non-volatile storage medium according to claim 16.
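The two ways of constructing the set of support images (6) recited in the training claims (construction 221 from a subset of the classes labelled in the query image, and construction 222 from their union with additional, randomly selected classes) can be sketched as a class-selection step. This is a minimal sketch under assumptions: the function names, the parameters `k` and `n_extra`, and the example class names other than “airplane” and “person” are hypothetical, and the actual sampling of support images for the selected classes is omitted.

```python
import random

def support_classes_subset(query_labels, k, rng):
    """Construction 221: pick k of the classes labelled in the query image."""
    present = sorted(set(query_labels))
    return set(rng.sample(present, min(k, len(present))))

def support_classes_union(query_labels, all_classes, n_extra, rng):
    """Construction 222: all classes labelled in the query image plus n_extra random others."""
    present = set(query_labels)
    others = sorted(set(all_classes) - present)
    return present | set(rng.sample(others, min(n_extra, len(others))))

rng = random.Random(0)
all_classes = ["airplane", "person", "car", "dog", "bicycle"]  # hypothetical class list
query_labels = ["airplane", "person", "person"]  # class labels 2b* of one query image 8

subset = support_classes_subset(query_labels, k=1, rng=rng)
union = support_classes_union(query_labels, all_classes, n_extra=2, rng=rng)
```

With the first construction, some objects in the query image have no matching prototype; with the second, some prototypes have no matching object in the query image.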