DE102019217444A1

DE102019217444A1 - Method and device for classifying digital image data

Info

Publication number: DE102019217444A1
Application number: DE102019217444.2A
Authority: DE
Inventors: Stefan Schmid; Claudia Blaiotta; Benedikt Sebastian Staffler
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2019-11-12
Filing date: 2019-11-12
Publication date: 2021-05-12

Abstract

Method and device for classifying digital image data, in particular for object detection, characterized by receiving data of a digital image (102) and context information (104) for the digital image (102), determining a representation (108) of the context information for the digital image at an embedding arrangement (106) as a function of the context information (104), the embedding arrangement (106) being trained to determine representations of context information of digital images from context information for the digital images, determining (208, 408, 608, 808) a class (112) for the digital image at a classification arrangement (110) as a function of the data of the digital image (102) and as a function of the representation (108) of the context information (104) for the digital image (102), the classification arrangement (110) being trained to determine classes for digital images as a function of data of digital images and representations of context information for digital images, wherein at least one state variable (114) of the classification arrangement (110) is determined as a function of the representation (108) of the context information (104), and wherein the class (112) of the digital image (102) is determined as a function of the at least one state variable (114).

Description

State of the art

The invention relates to a method and a device for classifying digital image data, used in particular for object recognition.

It is desirable to provide a highly robust and efficient method and a corresponding device for classifying digital image data for object recognition.

Summary of the invention

A method and a device according to the independent claims achieve this.

A method for classifying digital image data, in particular for object detection, comprises receiving data of a digital image and context information for the digital image, determining a representation of the context information for the digital image at an embedding arrangement as a function of the context information, the embedding arrangement being trained to determine representations of context information of digital images from context information for the digital images, and determining a class for the digital image at a classification arrangement as a function of the data of the digital image and as a function of the representation of the context information for the digital image, the classification arrangement being trained to determine classes for digital images as a function of data of digital images and representations of context information for digital images, wherein at least one state variable of the classification arrangement is determined as a function of the representation of the context information, and wherein the class of the digital image is determined as a function of the at least one state variable.

The system consists of two main components: the context embedding arrangement, which is used to encode information from metadata of the digital image, and the classification arrangement, which performs the actual recognition task and receives as input both the image data and the context embedding computed by the embedding arrangement. The context embedding can be based on metadata, context data or a priori knowledge about the domain/scene. This method allows robust classification of digital images based on their context.

The context information is preferably defined by metadata that describes the context of the digital image, in particular a location, a time, a date and sensor data. Such metadata, context data or a priori knowledge about the domain/scene is readily available for images and significantly improves the classification.

The embedding arrangement preferably comprises a representation of a graph that maps context information for digital images onto elements that represent context information, wherein a subgraph defining elements that represent the context information for the digital image is determined as a function of the context information for the digital image, and wherein the state variable of the classification arrangement is determined as a function of the elements defined by the subgraph. This provides effective classification for object recognition.

The representation of the context information is preferably a vector, in particular with a predetermined dimension, in a vector space, the embedding arrangement comprising an artificial neural network that is trained to determine the vector as a function of the context information. This representation of the context information is easy to integrate into an artificial neural network.

The classification arrangement preferably comprises an artificial neural network with an input layer for data of digital images and an output layer for the class, the state variable defining a state of a hidden layer in a state space, the hidden layer being arranged between the input layer and the output layer of the artificial neural network.

At least one attribute for the digital image is preferably determined as a function of the data of the digital image and the representation of the context information for the digital image, the classification arrangement being trained to determine attributes for digital images as a function of data of digital images and context information for digital images. This ensures that the computed feature representation of the input image is consistent with the information contained in the knowledge graph.

The classification arrangement preferably comprises an artificial neural network with an input layer for data of digital images and an output layer for outputting the class, the state variable defining a state of the input layer of the artificial neural network in the state space. This is a simple way of integrating the context information into the neural network.

The classification arrangement may comprise an artificial neural network with an input layer for data of digital images and an output layer for outputting the class, the state variable defining a state of the output layer of the artificial neural network in the state space. This artificial neural network can be retrained by training only the parameters of the output layer.

A method for training the arrangements comprises providing training data, the training data comprising training data points, each of the training data points comprising data of a digital image, context information for the digital image and information about the class for the digital image. For each training data point, the method comprises determining, at an embedding arrangement, the representation of the context information as a function of the context information and determining, at the classification arrangement, a class for the digital image as a function of the data of the digital image and the representation of the context information, wherein at least one state variable of the classification arrangement is determined as a function of the representation of the context information, and wherein the class of the digital image is determined as a function of the at least one state variable. A parameter for the classification arrangement and/or the embedding arrangement is determined in a gradient method as a function of several training data points and of the classes determined for those training data points.
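The gradient-based parameter determination can be illustrated with a minimal sketch. It assumes, purely for illustration, a binary classifier with a single weight vector, a state variable formed by multiplying image features element-wise with a fixed context representation, and a cross-entropy loss. The names `predict` and `train`, the learning rate and the number of epochs are hypothetical and are not taken from the disclosure; only parameters of the classification arrangement are updated here.

```python
import math
from typing import List, Sequence, Tuple

# One training data point: (image features, context representation, class label 0 or 1).
DataPoint = Tuple[Sequence[float], Sequence[float], int]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: Sequence[float], features: Sequence[float],
            representation: Sequence[float]) -> float:
    """Probability of class 1; the state variable depends on the representation."""
    state = [f * r for f, r in zip(features, representation)]
    return sigmoid(sum(w * s for w, s in zip(weights, state)))


def train(data: List[DataPoint], dim: int, lr: float = 0.5, epochs: int = 200) -> List[float]:
    """Batch gradient descent on the mean cross-entropy over all training data points."""
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for features, representation, label in data:
            state = [f * r for f, r in zip(features, representation)]
            # Derivative of the cross-entropy with respect to the pre-sigmoid score.
            error = predict(weights, features, representation) - label
            for i, s in enumerate(state):
                grad[i] += error * s
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad)]
    return weights
```

In this sketch the embedding is kept fixed; training the embedding arrangement as well would require propagating the same error signal into its parameters.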

A corresponding device for classifying digital image data, in particular for object recognition, comprises a classification arrangement and an embedding arrangement that are adapted to carry out the method.

Further advantageous aspects are disclosed in the following description and the drawings. In the drawings:

Fig. 1 shows aspects of a device for object recognition,
Fig. 2 shows steps in a method for object recognition,
Fig. 3 shows further aspects of the device for object recognition,
Fig. 4 shows aspects of the method for object recognition,
Fig. 5 shows further aspects of the device for object recognition,
Fig. 6 shows further aspects of the method for object recognition,
Fig. 7 shows further aspects of the device for object recognition,
Fig. 8 shows further aspects of the method for object recognition,
Fig. 9 shows aspects of a training method.

Fig. 1 shows one aspect of a device 100 for classifying digital image data, in particular for object recognition. The device 100 is adapted to process image data of a digital image 102 and context information 104 for the digital image. The context information 104 is metadata, context data or a priori knowledge about the domain/scene of the digital image 102. This data is preferably stored in a data file, e.g. according to the JPEG File Interchange Format, that also comprises the digital image data of the digital image 102. The metadata, context data or a priori knowledge about the domain/scene can describe the context of the digital image 102, in particular a location, a time, a date or sensor data, for example in the Exchangeable Image File Format (EXIF).

The device 100 comprises an embedding arrangement 106 adapted to receive the context information 104 for the digital image 102, specifically in the form of the data file. The embedding arrangement 106 is adapted to determine a representation 108 of the context information 104 for the digital image 102 as a function of the context information 104. The embedding arrangement 106 is trained, in particular according to a method described below, to determine representations of context information of digital images from context information for the digital images.

The device 100 comprises a classification arrangement 110 adapted to determine a class 112 for the digital image 102 as a function of the data of the digital image 102 and as a function of the representation 108 of the context information 104 for the digital image 102. The classification arrangement 110 is trained, in particular according to a method described below, to determine classes for digital images as a function of data of digital images and representations of context information for digital images.

In the example, the image data is processed as a tensor comprising intensity values of each pixel of the digital image 102. The context information 104 in the example comprises values of the context information for each pixel. In the example, these values are added to the tensor as further entries, so that each pixel is described by three intensity values for red, green and blue, respectively, and by all values of the context information. The embedding arrangement 106 and the classification arrangement 110 can each be adapted to process this tensor as input. The embedding arrangement 106 can be adapted to process only the context information 104 as input. The classification arrangement 110 can be adapted to process only the intensity values as input. The class 112 in the example is a vector of dimension N, where N is the number of classes that the classification arrangement 110 is trained to distinguish. The classification arrangement 110 can be adapted to determine the vector using one-hot encoding, in which the element of the vector that has the value one, or that has the highest value of all vector elements, identifies the class assigned to the digital image 102.
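The input tensor and the class vector described above can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the same context values are assumed to be attached to every pixel, nested Python lists stand in for a tensor, and the function names `build_input_tensor`, `split_input_tensor` and `decode_class` are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[List[float]]]  # height x width x values per pixel


def build_input_tensor(rgb_image: Image, context_values: Sequence[float]) -> Image:
    """Attach the context values to every pixel: H x W x (3 + number of context values)."""
    return [[list(pixel) + list(context_values) for pixel in row] for row in rgb_image]


def split_input_tensor(tensor: Image, num_intensity: int = 3) -> Tuple[Image, List[float]]:
    """Recover the intensity values (classifier input) and the context values (embedding input)."""
    intensities = [[pixel[:num_intensity] for pixel in row] for row in tensor]
    context = tensor[0][0][num_intensity:]
    return intensities, context


def decode_class(class_vector: Sequence[float]) -> int:
    """Index of the element with the highest value, which identifies the class."""
    return max(range(len(class_vector)), key=lambda i: class_vector[i])
```

The same `decode_class` call covers both cases named in the text: a strict one-hot vector and a score vector whose largest element marks the class.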

The classification arrangement 110 comprises at least one state variable 114. The classification arrangement 110 is adapted to weight the at least one state variable 114 as a function of the representation 108 of the context information 104. The classification arrangement 110 is adapted to determine the class 112 of the digital image 102 as a function of the at least one state variable 114.

A method for classifying digital image data, in particular for object detection, is explained below with reference to Fig. 2.

After the start, a step 202 is executed.

In step 202, data of the digital image 102 and context information 104 for the digital image 102 are received. The context information 104 is defined by metadata, context data or a priori knowledge about the domain/scene that describes the context of the digital image 102, in particular the location, the time, the date and/or sensor data, e.g. of a sensor of the camera that was used to capture the digital image.

Afterwards, a step 204 is executed.

In step 204, the representation 108 of the context information for the digital image 102 is determined at the embedding arrangement 106 as a function of the context information 104. The embedding arrangement 106 is trained to determine representations of context information of digital images from context information for the digital images.

Afterwards, a step 206 is executed.

In step 206, at least one state variable 114 of the classification arrangement 110 is determined as a function of the representation 108 of the context information 104.

Afterwards, a step 208 is executed.

In step 208, the class 112 for the digital image 102 is determined at the classification arrangement 110 as a function of the data of the digital image 102 and as a function of the representation 108 of the context information 104 for the digital image 102. The class 112 of the digital image 102 is determined as a function of the at least one state variable 114. The classification arrangement 110 is trained to determine classes for digital images as a function of data of digital images and representations of context information for digital images.

The method then ends.
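Steps 202 to 208 can be read as a small processing pipeline. The sketch below mirrors that order with toy stand-ins for both trained arrangements; it assumes a lookup table as the embedding, an element-wise product as the way the state variable depends on the representation, and a single linear output layer. The names `Received`, `embed_context`, `state_variable` and `classify` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Received:
    """Step 202: image data together with context information for the same image."""
    image_features: List[float]   # stand-in for the data of the digital image
    context: Dict[str, float]     # context information, e.g. metadata values by name


def embed_context(context: Dict[str, float], table: Dict[str, List[float]], dim: int) -> List[float]:
    """Step 204: representation of the context as a vector of fixed dimension."""
    representation = [0.0] * dim
    for key, value in context.items():
        for i, weight in enumerate(table.get(key, [0.0] * dim)):
            representation[i] += weight * value
    return representation


def state_variable(image_features: Sequence[float], representation: Sequence[float]) -> List[float]:
    """Step 206: state variable determined as a function of the representation."""
    return [f * r for f, r in zip(image_features, representation)]


def classify(received: Received, table: Dict[str, List[float]],
             output_weights: List[List[float]]) -> int:
    """Step 208: class determined as a function of the state variable."""
    representation = embed_context(received.context, table, len(received.image_features))
    state = state_variable(received.image_features, representation)
    scores = [sum(w * s for w, s in zip(row, state)) for row in output_weights]
    return max(range(len(scores)), key=lambda i: scores[i])
```

With identical image features, changing only the context entry changes the state variable and can therefore change the resulting class, which is the effect the method relies on.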

Fig. 3 schematically shows further aspects of the device 100.

The embedding arrangement 106 according to this aspect comprises a representation of a graph 302 that maps context information for digital images onto elements that represent context information. The graph 302 can be a knowledge graph.

The embedding arrangement 106 is adapted to determine a subgraph 304, which defines elements that represent the context information 104 for the digital image 102, as a function of the context information 104 for the digital image 102.

The embedding arrangement 106 can comprise an artificial neural network 306 that is trained to determine a vector 308 as a function of the context information 104. The artificial neural network 306 can be a graph neural network that is adapted to determine the vector 308 based on the elements of the subgraph 304.
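One plausible way to determine a subgraph such as subgraph 304 is sketched below. It assumes, as one option among several, an undirected graph stored as adjacency sets and a selection rule that keeps the nodes named in the context information together with their direct neighbours. The node labels and the helper names `build_graph` and `extract_subgraph` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[str, Set[str]]  # node -> set of neighbouring nodes


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected graph from a list of edges."""
    graph: Graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def extract_subgraph(graph: Graph, context_nodes: Iterable[str]) -> Graph:
    """Keep the context nodes found in the graph plus all nodes connected to them."""
    seeds = {node for node in context_nodes if node in graph}
    selected = set(seeds)
    for node in seeds:
        selected |= graph[node]
    # Restrict every adjacency set to the selected nodes.
    return {node: graph[node] & selected for node in selected}
```

The resulting subgraph would then be passed to the graph neural network, which is outside the scope of this sketch.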

This system consists of two main components: a graph embedding arrangement, which is used to encode information contained in a (task-specific) knowledge graph, and the classification arrangement, e.g. a convolutional neural network, which performs the actual recognition task and receives as input both the image data and the context embedding computed by the graph embedding arrangement. To train for object recognition or to perform object recognition, the following computations are carried out. The system receives as input an image, e.g. of a road sign, together with metadata, context data or a priori knowledge about the domain/scene that describes the context of the image, e.g. location, time, date, sensor data. The context data are then used to extract the subgraph from the knowledge graph, for example by selecting all nodes that are connected to objects present in the metadata, context data or a priori knowledge about the domain/scene. As an example, in a road sign recognition use case, the metadata, context data or a priori knowledge about the domain/scene could consist of the country and the road type, e.g. city street, motorway. The subgraph is then encoded in a vector space of fixed dimension using the graph embedding arrangement, for example a graph neural network.

The image data serve as input to the convolutional neural network. In addition to the regular convolutions, an attention mechanism is used to modify the representations in the hidden layers based on the context vector. For example, the hidden layers are weighted by the output of a linear transformation of the context vector followed by a softmax non-linearity; this weighting is performed after every regular convolution in the network. The convolutional network and the weights of the attention mechanism are trained in the standard manner using backpropagation, as described below. In the case of graph neural networks, the graph embedding can be learned end-to-end in addition to the convolutional network for the image data.
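The subgraph extraction mentioned above (selecting all nodes connected to objects present in the context data) can be illustrated with a minimal sketch. The toy knowledge graph, its node names and the helper `extract_subgraph` are hypothetical and chosen only for illustration; the description does not prescribe a data structure or a selection rule beyond the example given.

```python
# Hypothetical toy knowledge graph: each node maps to the set of its neighbours.
KNOWLEDGE_GRAPH = {
    "Germany": {"speed_limit_50", "motorway_sign"},
    "city_street": {"speed_limit_50", "pedestrian_crossing"},
    "motorway": {"motorway_sign", "no_speed_limit"},
    "speed_limit_50": {"Germany", "city_street"},
    "pedestrian_crossing": {"city_street"},
    "motorway_sign": {"Germany", "motorway"},
    "no_speed_limit": {"motorway"},
}


def extract_subgraph(graph, context_entities):
    """Keep the context entities found in the graph plus their direct neighbours."""
    seeds = {entity for entity in context_entities if entity in graph}
    nodes = set(seeds)
    for seed in seeds:
        nodes |= graph[seed]
    # Restrict every adjacency set to the selected nodes.
    return {node: graph[node] & nodes for node in nodes}


# Context of the image: country and road type (assumed example values).
subgraph = extract_subgraph(KNOWLEDGE_GRAPH, ["Germany", "city_street"])
print(sorted(subgraph))
```

In this sketch the context "Germany, city street" keeps the signs linked to either entity and drops nodes that are only reachable through "motorway".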

The representation 108 of the context information 104 in this example is the vector 308, in particular with a predetermined dimension, in a vector space.

The classification arrangement 110 is adapted to weight the state variable 114 of the classification arrangement 110 as a function of the elements defined by the subgraph 304. In this example, the classification arrangement is adapted to weight the state variable 114 of the classification arrangement 110 by the vector 308.

In the example, the vector 308 is passed through a linear layer followed by a softmax non-linearity 310, i.e. a sigmoid non-linearity. In the example, the state variable 114 is a feature map of a hidden layer 312, e.g. an attention layer, of a deep convolutional neural network 316. In one example, an element-wise product 314 of the vector 308, after it has passed through the linear layer followed by the softmax non-linearity 310, and the feature map is applied in order to determine an output vector for the class 112. "Element-wise" describes the case of a product with a vector, e.g. the output of a fully connected layer, where each element of the vector is typically called a "feature". This means that the size of the vector equals the number of features. In convolutional networks, however, the hidden states contain features for each position, which are called channels. That is, hidden-layer tensors are used whose shape is width x height x number of features. For such convolutional neural networks a "channel-wise" product can be used: each element of the output of the non-linearity 310 is multiplied by all elements in a single channel of a feature map of the convolutional neural network at that stage. The deep convolutional neural network 316 in this example comprises an input layer 318 and can contain several hidden layers 320 between the input layer 318 and the attention layer 312. In the special case that a feature map of the hidden layer 320 is a vector, this corresponds to the element-wise product of the output of the non-linearity 310 with a vector of the hidden layer 320. The length of the output of the non-linearity 310 must equal the number of channels of the feature map in the hidden layer 320. There can be several of these attention layers, e.g. one at each hidden layer of the convolutional neural network 316.
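The channel-wise product described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `channel_gate`, the channels-last layout, the random weights and the choice of a sigmoid as the non-linearity are assumptions made for the example, not details fixed by the description.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_gate(feature_map, context_vector, weight, bias):
    """Scale each channel of feature_map by a gate derived from the context.

    feature_map:    (height, width, channels)
    context_vector: (d,)
    weight:         (channels, d), bias: (channels,)
    """
    gate = sigmoid(weight @ context_vector + bias)  # one value per channel
    if gate.shape[0] != feature_map.shape[-1]:
        raise ValueError("gate length must equal the number of channels")
    # Broadcasting multiplies every position of channel c by gate[c].
    return feature_map * gate


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 3))   # assumed toy feature map
context = rng.normal(size=5)            # assumed context vector
W = rng.normal(size=(3, 5))             # untrained, random linear layer
b = np.zeros(3)
gated = channel_gate(features, context, W, b)
print(gated.shape)  # (4, 4, 3)
```

The explicit length check mirrors the requirement that the output of the non-linearity has as many entries as the feature map has channels.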

Optionally, the classification arrangement 110 can comprise an attribute output 322 that is adapted to output an attribute for the digital image 102. The attribute output 322 can result from an optional additional hidden layer 324, which in the example is designed as an attention layer for determining the attribute output 322. The state variable 114 can also be used instead of the optional additional hidden layer 324.

The artificial neural network 316 has the input layer 318 for data of digital images and an output layer 326 for the class 112. The output layer 326 in the example is a classification layer comprising a linear layer followed by a softmax non-linearity 328. The hidden layer 312 in the example is arranged between the input layer 318 and the output layer 326 of the artificial neural network 316. The state variable 114 defines the state of the hidden layer 312 in the state space.

The artificial neural network 316 can optionally have an additional output layer 330 for the attribute 322. The output layer 330 in the example is a further classification layer comprising a linear layer followed by a softmax non-linearity 330. The hidden layer 324 in the example is arranged between the input layer 318 and the output layer 330 of the artificial neural network 316. The state variable 114 defines the state of the hidden layer 324 in the state space.

A method for this device according to Figure 3 is described with reference to Figure 4.

After the start, a step 402 is executed.

In step 402, data of the digital image 102 and context information 104 for the digital image 102 are received. The context information 104 is defined by the metadata, context data or a priori knowledge about the domain/scene that describe the context of the digital image 102, in particular the location, the time, the date and/or the sensor data.

Then a step 404 is executed.

In step 404, the representation 108 of the context information for the digital image 102 is determined at the embedding arrangement 106 as a function of the context information 104. The embedding arrangement 106 is trained to determine representations of context information of digital images from context information for the digital images. The elements that represent the context information are determined from the representation of the graph 302 by mapping the context information for digital images to elements that represent context information. In the example, the subgraph 304, which defines the elements that represent the context information 104 for the digital image, is determined as a function of the context information 104 for the digital image 102. From the input of elements that represent the context information 104, the artificial neural network 306 determines the vector 308 as the representation 108 of the context information 104.
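How a subgraph of varying size can be mapped to a vector of fixed dimension may be easier to see in a small sketch. The following is not the architecture of the description, which only names a graph neural network as one option; it assumes a single round of mean-neighbour message passing followed by a mean readout, with untrained random weights and hypothetical names (`embed_subgraph`, `node_features`, `adjacency`).

```python
import numpy as np


def embed_subgraph(node_features, adjacency, weight):
    """One round of mean-neighbour message passing, then a mean readout.

    node_features: (n_nodes, d_in)
    adjacency:     (n_nodes, n_nodes) matrix with 0/1 entries
    weight:        (d_in, d_out)
    Returns a vector of shape (d_out,), independent of n_nodes.
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])      # add self-loops
    a_norm = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalise
    hidden = np.tanh(a_norm @ node_features @ weight)   # per-node states
    return hidden.mean(axis=0)                          # pool over nodes


rng = np.random.default_rng(0)
weight = rng.normal(size=(3, 4))  # untrained weights, for illustration only

# A subgraph with two connected nodes.
adj_small = np.array([[0.0, 1.0], [1.0, 0.0]])
vec_small = embed_subgraph(rng.normal(size=(2, 3)), adj_small, weight)

# A subgraph with five nodes arranged in a chain.
adj_large = np.zeros((5, 5))
for i in range(4):
    adj_large[i, i + 1] = adj_large[i + 1, i] = 1.0
vec_large = embed_subgraph(rng.normal(size=(5, 3)), adj_large, weight)

print(vec_small.shape, vec_large.shape)  # (4,) (4,)
```

Both subgraphs yield a vector of the same length, which is what allows the classification arrangement to consume the context representation regardless of how many elements the subgraph contains.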

Then step 406 is executed.

In step 406, the state variable of the classification arrangement 110 is determined as a function of the input of data of the digital image 102 and the input of the representation 108 of the context information 104, in this example the vector 308.

In the example, the vector 308 is passed through the linear layer followed by a softmax non-linearity 310, i.e. a sigmoid non-linearity. In the example, the state variable 114 is the feature map of the attention layer 312 of the deep convolutional neural network 316. In the example, the element-wise product 314 of the vector 308, after it has passed through the linear layer followed by the softmax non-linearity 310, and the feature map determined from the input data of the digital image 102 is applied in order to determine the output vector for the class 112.

This means that the state variable 114 of the classification arrangement 110 is determined as a function of the representation 108 of the context information 104. The state variable 114 of the classification arrangement 110 is determined as a function of the elements defined by the subgraph 304.

Then a step 408 is executed.

In step 408, the class 112 for the digital image 102 is determined as the output of the classification arrangement 110 as a function of the data of the digital image 102 and as a function of the representation 108 of the context information 104 for the digital image 102.

Optionally, a step 410 is executed.

In step 410, at least one attribute for the digital image 102 is determined as a function of the data of the digital image 102 and the representation 108 of the context information 104 for the digital image 102.

Then the method ends.

Figure 5 shows further aspects of the device 100 for object recognition.

This offers a structurally simpler alternative to the previous arrangement for providing the context information to the convolutional neural network. In this alternative, each entry of the context vector is replicated over the height and width of the input image. This replicated context vector is concatenated to the input of the convolutional neural network as additional input channels.

The embedding arrangement 106 according to this aspect comprises the representation of the graph 302, e.g. the knowledge graph, which maps context information for digital images to elements that represent context information.

The embedding arrangement 106 is adapted to determine the subgraph 304, which defines the elements that represent the context information 104 for the digital image 102, as a function of the context information 104 for the digital image 102. The embedding arrangement 106 can comprise the artificial neural network 306, e.g. the graph neural network, that is trained to determine the vector 308 as a function of the context information 104.

The classification arrangement 110 comprises an artificial neural network 502 with an input layer 504 for data of digital images and an output layer 506 for an output of the class 112.

The state variable 114 defines a state of the input layer 504. The input layer 504 is a context integration layer of the artificial neural network 502 in the state space. The artificial neural network 502 comprises hidden layers 512. The artificial neural network 502 can be a convolutional neural network.

The classification arrangement 110 is adapted to feed the context embedding directly into the artificial neural network 502 by stacking it together with the image data. In the example, the classification arrangement 110 comprises an expansion arrangement 508 that is adapted to expand the vector 308 to form a tensor that matches the dimension of the image data in the state space. In the example, the classification arrangement 110 comprises a concatenation arrangement 510 that is adapted to concatenate the tensor and the image data in the state space. The output layer 506 in the example comprises a linear layer for receiving the output of the hidden layers 512, followed by a softmax non-linearity 514 for determining the class 112.
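The expansion and concatenation just described can be sketched in NumPy. The function name `stack_context`, the channels-last layout and the example sizes are assumptions made for illustration; the description only states that the vector is expanded to the image dimensions and concatenated with the image data.

```python
import numpy as np


def stack_context(image, context_vector):
    """Replicate context_vector over height and width and append it as channels.

    image:          (height, width, channels)
    context_vector: (d,)
    Returns an array of shape (height, width, channels + d).
    """
    height, width, _ = image.shape
    # Expansion: every spatial position receives a copy of the context vector.
    tiled = np.broadcast_to(context_vector, (height, width, context_vector.shape[0]))
    # Concatenation along the channel axis.
    return np.concatenate([image, tiled], axis=-1)


image = np.zeros((32, 32, 3))       # assumed RGB input
context = np.array([0.5, -1.0])     # assumed two-dimensional context vector
net_input = stack_context(image, context)
print(net_input.shape)  # (32, 32, 5)
```

With this layout the first layer of the network only needs more input channels; the remaining architecture is unchanged, which is why this variant is structurally simpler than per-layer gating.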

Figure 6 shows further aspects of the method for object recognition.

After the start, a step 602 is executed.

In step 602, data of the digital image 102 and context information 104 for the digital image 102 are received, as described above in step 202.

Then a step 604 is executed.

In step 604, the representation 108 of the context information for the digital image 102 is determined at the embedding arrangement 106, as described above in step 304.

Then a step 606 is executed.

In step 606, at least one state variable 114 of the classification arrangement 110 is determined as a function of the representation 108 of the context information 104. According to this aspect, the state variable 114 defines the state of the input layer, e.g. the context integration layer, of the artificial neural network 502 in the state space. The context embedding is fed directly into the artificial neural network 502 by stacking it together with the image data. In the example, the vector 308 is expanded to form the tensor that matches the dimension of the image data in the state space. The tensor and the image data are then concatenated in the state space to form an input for the hidden layers 512 of the artificial neural network 502.

Then a step 608 is executed.

In step 608, the class 112 for the digital image 102 is determined at the classification arrangement 110 as a function of the data of the digital image 102 and as a function of the representation 108 of the context information 104 for the digital image 102. The class 112 of the digital image 102 is determined as the output of the output layer 506 of the artificial neural network 502. In the example, the class 112 is determined as a function of the output of the linear layer followed by the softmax non-linearity 514.

Then the method ends.

Figure 7 shows further aspects of the device 100 for object recognition. The previous arrangements and methods enable the convolutional neural network to adapt to the context. However, the changes required to incorporate the context are not part of regular convolutional neural network architectures and require the convolutional neural network extended in this way to be retrained completely.

One approach to integrating the context without having to retrain the feature extractor of the convolutional neural network is to first compute a feature representation of the input image using a pre-trained convolutional neural network backbone and to integrate the context vector only in the final classification layer. To this end, a linear transformation is applied to the context vector, followed by a pointwise sigmoid non-linearity, and the result is multiplied by the feature vector of the input image computed by the convolutional neural network (gating mechanism). With this approach, only the weights in the classification layer need to be trained.

The classification arrangement 110 comprises an artificial neural network 702. In the example, the artificial neural network 702 comprises an input layer, one or more hidden layers and an output layer. The input layer and the one or more hidden layers are shown in Figure 7 with reference numeral 704. The output layer comprises a classification layer 706 for outputting the class 112. The artificial neural network 702 determines the state variable 114 for the classification layer 706 from the digital image input. The state variable 114 can be a feature map of a hidden layer of the artificial neural network 702. In the example, a weight vector 710 is determined from the vector 308 by a linear layer followed by a softmax non-linearity 708, i.e. a sigmoid non-linearity.

The classification arrangement 110 comprises an element-wise product determination arrangement 712 that is adapted to determine the result of an element-wise multiplication of a state variable 114' and the weight vector 710 in order to determine the state variable 114. The weight vector 710 in this example has the same dimension as the state variable 114'. The classification arrangement 110 comprises a linear flat layer followed by a softmax non-linearity 714, i.e. sigmoid, which is adapted to determine the output, i.e. the class 112, from the result.
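The gating in the final classification layer can be sketched as follows. This is a minimal illustration, assuming that the backbone features are already available as a vector, that a sigmoid produces the weight vector and that a softmax produces class probabilities; the function name `gated_head`, the parameter names and all sizes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()


def gated_head(features, context_vector, w_gate, b_gate, w_cls, b_cls):
    """Gate backbone features with the context, then classify.

    features:       (f,) feature vector from a frozen, pre-trained backbone
    context_vector: (d,)
    w_gate: (f, d), b_gate: (f,)   -> weight vector of length f
    w_cls:  (k, f), b_cls:  (k,)   -> scores for k classes
    """
    # Linear layer on the context followed by a pointwise sigmoid.
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ context_vector + b_gate)))
    gated = features * gate  # element-wise product, same length as features
    # Linear classification layer followed by a softmax.
    return softmax(w_cls @ gated + b_cls)


rng = np.random.default_rng(0)
features = rng.normal(size=8)    # assumed backbone output
context = rng.normal(size=4)     # assumed context vector
probs = gated_head(
    features,
    context,
    rng.normal(size=(8, 4)), np.zeros(8),
    rng.normal(size=(3, 8)), np.zeros(3),
)
print(probs.shape)  # (3,)
```

Only `w_gate`, `b_gate`, `w_cls` and `b_cls` would be trainable in such a setup, which reflects the point that the feature extractor itself does not have to be retrained.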

The artificial neural network 702 can be a convolutional neural network.

Figure 8 shows further aspects of the method for object recognition.

After the start, a step 802 is executed.

In step 802, data of the digital image 102 and context information 104 for the digital image 102 are received, as described above in step 202.

Then a step 804 is executed.

In step 804, the representation 108 of the context information for the digital image 102 is determined at the embedding arrangement 106, as described above in step 304.

Then a step 806 is executed.

In step 806, the weight vector 710 is determined from the vector 308, in particular as the output of the linear layer followed by a softmax non-linearity 708, i.e. sigmoid, in response to the vector 308 as input to the linear layer.

Afterwards, a step 808 is executed.

In step 808, at least one state variable 114 of the classification arrangement 110 is determined as a function of the representation 108 of the context information 104. According to this aspect, the weight vector 710 weights the state variable 114. In the example, the result of an element-wise product is determined by an element-wise multiplication of the state variable 114 and the weight vector 710. The result is input to a linear layer followed by a softmax non-linearity 714, i.e. sigmoid, in order to determine the class 112 from the result.

The method then ends.

A method for training is described below with reference to FIG. 9.

After the start, training data are provided in a step 902. The training data comprise training data points. Each training data point comprises data of a digital image, context information for the digital image and information about the class of the digital image. The training data are provided, for example, from a database that comprises labeled images.
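
One possible in-memory layout of such a training data point is sketched below. The field types are assumptions made for illustration only: the image is taken to be a nested list of pixel values, the context information a string-keyed dictionary, and the class an integer index. The names `TrainingDataPoint` and `split_fields` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class TrainingDataPoint:
    image: List[List[float]]   # data of the digital image (assumed pixel grid)
    context: Dict[str, str]    # context information, e.g. location or time
    label: int                 # information about the class (assumed index)

def split_fields(points: List[TrainingDataPoint]) -> Tuple[list, list, list]:
    """Separate a list of data points into images, contexts and labels."""
    images = [p.image for p in points]
    contexts = [p.context for p in points]
    labels = [p.label for p in points]
    return images, contexts, labels
```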

Afterwards, the method comprises a step 904 for each training data point.

In step 904, for each training data point, the representation 108 of the context information 104 is determined at the embedding arrangement 106 as a function of the context information 104.

In step 904, for each training data point, the class 112 for the digital image 102 is determined at the classification arrangement 110 as a function of the data of the digital image 102 and of the representation 108 of the context information 104.

In step 904, for each training data point, the state variable 114 of the classification arrangement 110 is determined as a function of the representation 108 of the context information 104. The class 112 of the digital image 102 is determined as a function of the state variable 114.

Afterwards, in a step 906, a parameter for the classification arrangement 110 and/or the embedding arrangement 106 is determined in a gradient method as a function of a plurality of training data points and of the classes determined for the plurality of training data points. The convolutional neural network of the classification arrangement 110 can be pre-trained without the weighting by the representation 108 of the context information 104. As mentioned above, in this case either all parameters can be retrained with the weighting for the examples shown in FIGS. 3 and 5, or the pre-trained convolutional neural network is used and only the parameters of the classification layer 706 are retrained for the example of FIG. 7.
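
The second option, in which only the classification layer is retrained on features from a pre-trained network, can be sketched as follows. The description only specifies a gradient method; this sketch assumes full-batch gradient descent on a cross-entropy loss, and it treats the already weighted features as fixed inputs. The names `train_classifier` and `predict` and the hyperparameters `lr` and `epochs` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scores(x, W, b):
    """Linear layer: one score per class."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def train_classifier(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit only the linear classification layer; the features stay frozen."""
    dim, n = len(features[0]), len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(x, W, b))
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. the score of class k.
                err = probs[k] - (1.0 if k == y else 0.0)
                grad_b[k] += err / n
                for j in range(dim):
                    grad_W[k][j] += err * x[j] / n
        for k in range(num_classes):
            b[k] -= lr * grad_b[k]
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j]
    return W, b

def predict(x, W, b):
    s = scores(x, W, b)
    return max(range(len(s)), key=s.__getitem__)
```

Because the pre-trained network is not updated, its features can be computed once per training data point and reused in every epoch.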

To ensure that the feature representation of the input image computed by the convolutional neural network is consistent with the information contained in the knowledge graph, the convolutional neural network is preferably trained to predict not only the specific object type but also attributes of the object that are represented in the knowledge graph, as described for the example of FIG. 3. Attributes can be, for example, color or shape. This can be achieved by adding and training additional classification layers for these attributes, i.e. linear transformations followed by a softmax non-linearity.
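
A minimal sketch of such a joint objective is given below. It assumes that every classification layer, for the object type and for each attribute, reads the same feature vector and that the individual cross-entropy losses are simply summed with equal weight; the description does not specify how the losses are combined. The names `log_softmax`, `multitask_loss` and the head names used as dictionary keys are hypothetical.

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def multitask_loss(features, heads, targets):
    """Sum of cross-entropy losses over all classification layers.

    features -- shared feature vector from the convolutional network
    heads    -- dict: head name -> (W, b) of a linear transformation
    targets  -- dict: head name -> index of the correct class or attribute value
    """
    total = 0.0
    for name, (W, b) in heads.items():
        logits = [sum(w * f for w, f in zip(row, features)) + b_k
                  for row, b_k in zip(W, b)]
        total += -log_softmax(logits)[targets[name]]
    return total
```

Minimizing the summed loss pushes the shared features to carry the attribute information as well as the object type.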

The method then ends.

Any of the artificial neural networks described above can be trained by this method. The artificial neural network trained in this way can be used as a trained arrangement as described above.

Claims

A method for classifying digital image data, in particular for object detection, characterized by receiving (202, 402, 602, 802) data of a digital image (102) and context information (104) for the digital image (102), determining (204, 404, 604, 804) a representation (108) of the context information for the digital image at an embedding arrangement (106) as a function of the context information (104), the embedding arrangement (106) being trained to determine representations of context information of digital images from context information for the digital images, determining (208, 408, 608, 808) a class (112) for the digital image at a classification arrangement (110) as a function of the data of the digital image (102) and as a function of the representation (108) of the context information (104) for the digital image (102), wherein the classification arrangement (110) is trained to determine classes for digital images as a function of data of digital images and representations of context information for digital images, wherein at least one state variable (114) of the classification arrangement (110) is determined (206, 406, 606, 806) as a function of the representation (108) of the context information (104), and wherein the class (112) of the digital image (102) is determined (208, 408, 608, 808) as a function of the at least one state variable (114).

Method according to claim 1, characterized in that the context information (104) is defined by metadata, context data or a priori knowledge that describe the context of the digital image (102), in particular a location, a time, a date and sensor data.

Method according to claim 1 or 2, characterized in that the embedding arrangement (106) comprises a representation of a graph (302) that maps context information for digital images onto elements that represent context information, wherein a subgraph (304), which defines elements that represent the context information (104) for the digital image, is determined (404) as a function of the context information (104) for the digital image (102), and wherein the state variable (114) of the classification arrangement (110) is determined (406) as a function of the elements defined by the subgraph (304).

Method according to one of claims 1 to 3, characterized in that the representation (108) of the context information (104) is a vector (308), in particular with a predetermined dimension, in a vector space, the embedding arrangement (106) comprising an artificial neural network (306) that is trained to determine the vector (308) as a function of the context information (104).

Method according to one of the preceding claims, characterized in that the classification arrangement (110) comprises an artificial neural network (316) with an input layer (318) for data of digital images and an output layer (328) for the class (112), the state variable (114) defining a state of a hidden layer (312) in a state space, the hidden layer (312) being arranged between the input layer (318) and the output layer (328) of the artificial neural network (316).

Method according to one of the preceding claims, characterized in that at least one attribute (322) for the digital image is determined (410) as a function of the data of the digital image (102) and of the representation (108) of the context information (104) for the digital image (102), wherein the classification arrangement (110) is trained to determine attributes for digital images as a function of data of digital images and context information for digital images.

Method according to one of claims 1 to 4, characterized in that the classification arrangement (110) comprises an artificial neural network (502) with an input layer (504) for data of digital images and an output layer (506) for outputting the class (112), the state variable (114) defining a state of the input layer (504) of the artificial neural network (502) in the state space.

Method according to one of claims 1 to 4, characterized in that the classification arrangement (110) comprises an artificial neural network (704) with an input layer for data of digital images and an output layer (706) for outputting the class (112), the state variable (114) defining a state of the output layer (706) of the artificial neural network (702) in the state space.

Method according to one of the preceding claims, characterized in that training data are provided (902), the training data comprising training data points, each of the training data points comprising data of a digital image, context information for the digital image and information about the class of the digital image, wherein the method comprises, for each training data point, determining (904) at an embedding arrangement the representation of the context information as a function of the context information and determining (904) at the classification arrangement a class for the digital image as a function of the data of the digital image and of the representation of the context information, wherein at least one state variable (114) of the classification arrangement is determined as a function of the representation of the context information, wherein the class of the digital image is determined as a function of the at least one state variable, and wherein a parameter for the classification arrangement and/or the embedding arrangement is determined (906) in a gradient method as a function of a plurality of training data points and of the classes determined for the plurality of training data points.

Device (100) for classifying digital image data, in particular for object recognition, characterized in that the device comprises a classification arrangement (110) and an embedding arrangement (106) which are adapted to execute the method according to one of claims 1 to 9.

Computer program, characterized in that the computer program comprises computer-readable instructions which, when executed by a computer, cause the computer to execute the method according to one of claims 1 to 10.

Computer program product, characterized by a non-volatile machine-readable memory that contains the computer program according to claim 11.