DE102019127306A1

DE102019127306A1 - System and method for detecting objects in a three-dimensional environment of a carrier vehicle

Info

Publication number: DE102019127306A1
Application number: DE102019127306.4A
Authority: DE
Inventors: Mahmoud Elkhateeb; Mohamed Zahran; Ahmad El-Sallab
Original assignee: Valeo Schalter und Sensoren GmbH
Current assignee: Valeo Schalter und Sensoren GmbH
Priority date: 2019-10-10
Filing date: 2019-10-10
Publication date: 2021-04-15

Abstract

A system for detecting an object in a three-dimensional environment of a carrier vehicle, the system comprising: a sensor unit (11) configured to provide a point cloud (15) representing a three-dimensional environment of the vehicle; and a processing unit (12) for performing at least the task of object recognition, wherein the processing unit (12) comprises an encoder (13) and a decoder (14) for the task of object recognition, the encoder (13) being configured to receive the point cloud (15) as input data, to extract from the input data, based on a deep neural network, the features required to perform the object detection task, and to supply the extracted features to the decoder (14), the decoder (14) being a 3D proposal network configured to receive point cloud data (14) as input data for generating 3D object proposals. The invention also relates to a method for detecting an object in a three-dimensional environment of a carrier vehicle and to a computer program product.

Description

The present invention relates to a system for detecting an object in a three-dimensional environment of a carrier vehicle. The present invention also relates to a method for detecting an object in a three-dimensional environment of a carrier vehicle. Furthermore, the present invention relates to a computer program product.

In automotive applications such as obstacle detection and avoidance in autonomous driving, or adaptive front lighting, the three-dimensional environment of a vehicle is monitored. For this purpose the vehicle is typically equipped with suitable sensors in the form of 3D scanners, for example lidar (light detection and ranging) sensors or radar sensors. In light detection and ranging, the distance to objects is determined by illuminating the environment, and thus the objects in it, with pulsed laser light and detecting the reflected laser light. The round-trip time of the laser light is a measure of the distance to the surface of an object in the environment. The intensity of the reflection can be processed to provide further information about the surface that reflects the laser light.
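
The relation between round-trip time and distance described above can be illustrated with a minimal sketch. The function name `tof_distance` is hypothetical and not part of the described system; the only physics assumed is that the pulse travels to the surface and back at the speed of light.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact by definition of the metre


def tof_distance(round_trip_time_s: float) -> float:
    """Return the distance (in metres) to a reflecting surface.

    Hypothetical helper: the pulse covers the distance twice
    (sensor -> surface -> sensor), hence the division by two.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


# A pulse that returns after 200 ns was reflected roughly 30 m away.
print(f"{tof_distance(200e-9):.2f} m")  # prints "29.98 m"
```

A real sensor would additionally have to handle multiple returns, noise and timing resolution, none of which is modelled here.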

A 3D scanner produces a set of data points in three-dimensional space, referred to as a point cloud. A point cloud is a geometric data structure. Each (data) point of the point cloud corresponds to a physical point on the outer surface of an object in the environment of a vehicle and typically has the X, Y and Z coordinates of that physical point in a three-dimensional Cartesian coordinate system, plus optional additional features such as color, surface normal, etc. A 3D scanner typically outputs the measured point cloud as a data structure or data file. Fig. 1 shows an example of a point cloud obtained when the three-dimensional environment of a vehicle is scanned by a lidar sensor. In general, point clouds are not limited to a three-dimensional coordinate system but can have a higher or a lower dimension.
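
As an illustration of such a data structure, the following sketch stores each point as X, Y and Z coordinates plus one optional feature channel. The names `Point`, `PointCloud` and `bounds`, and the choice of `intensity` as the extra channel, are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Point:
    """One measured point: Cartesian coordinates plus an optional feature."""
    x: float
    y: float
    z: float
    intensity: float = 0.0  # hypothetical extra channel (could be color, normal, ...)


PointCloud = List[Point]  # an unordered collection; the list order carries no meaning


def bounds(cloud: PointCloud) -> Tuple[Tuple[float, float, float], Tuple[float, float, float]]:
    """Return the axis-aligned minimum and maximum corner of the cloud."""
    xs, ys, zs = zip(*((p.x, p.y, p.z) for p in cloud))
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


cloud = [Point(1.0, 2.0, 0.5, 0.8), Point(-3.0, 0.0, 1.5), Point(4.0, -1.0, 0.0, 0.2)]
print(bounds(cloud))  # prints ((-3.0, -1.0, 0.0), (4.0, 2.0, 1.5))
```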

Object recognition is a critical task for autonomous driving. Since the car shares the road with many other vehicles and pedestrians, it has to take the entire environment into account. This awareness is necessary for many automotive tasks such as collision avoidance.

In order to understand the environment, it is important to detect the objects located in it, to semantically segment each point of an object, and to classify the objects. Object detection, semantic segmentation and classification are known as three fundamental problems/tasks of scene understanding in computer vision. The task of object detection is to identify all objects of predefined categories in a point cloud and to localize/enclose them with oriented bounding boxes (three-dimensional oriented bounding boxes, 3D OBBs). Semantic segmentation works at a finer scale than object detection. Its goal is to parse each object and assign a class label to every point of the object. For example, while object detection places a box around a detected motorcyclist and the motorcycle, semantic segmentation assigns one class label (motorcycle) to the points representing the motorcycle and a different class label (motorcyclist) to the points representing the rider. Classification, on the other hand, aims to identify objects and assign a class label to each object, such as tree or car. In computer vision, object detection, semantic segmentation and classification are treated as three different tasks that are usually solved with entirely different approaches.
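
The difference between a box-level detection result and point-level segmentation labels can be sketched as follows. The class `OrientedBox3D`, its parametrization (center, size, yaw about the vertical axis) and the sample points are assumptions for illustration; the source does not prescribe a particular box encoding.

```python
import math
from dataclasses import dataclass


@dataclass
class OrientedBox3D:
    """Hypothetical 3D oriented bounding box, rotated about the vertical z axis."""
    cx: float
    cy: float
    cz: float
    length: float  # extent along the box's own x axis
    width: float   # extent along the box's own y axis
    height: float  # extent along z
    yaw: float     # rotation about z, in radians

    def contains(self, x: float, y: float, z: float) -> bool:
        # Rotate the offset by -yaw to express the point in the box frame.
        dx, dy = x - self.cx, y - self.cy
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        lx = c * dx + s * dy
        ly = -s * dx + c * dy
        lz = z - self.cz
        return (abs(lx) <= self.length / 2
                and abs(ly) <= self.width / 2
                and abs(lz) <= self.height / 2)


# Detection: one box encloses rider and motorcycle together.
box = OrientedBox3D(10.0, 0.0, 1.0, length=4.0, width=2.0, height=2.0, yaw=math.pi / 2)

# Segmentation: every point carries its own class label.
labelled_points = [
    ((10.0, 1.5, 1.0), "motorcycle"),
    ((10.0, -0.5, 1.8), "motorcyclist"),
    ((11.5, 0.0, 1.0), "road"),
]
inside = [label for (p, label) in labelled_points if box.contains(*p)]
print(inside)  # prints ['motorcycle', 'motorcyclist']
```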

Owing to the typical structure of a vehicle environment, the environment point clouds output by 3D scanners usually do not have a regular shape. Deep neural networks, such as the convolutional neural networks commonly used to analyze visual images, typically require input data in highly regular formats, such as image grids or three-dimensional voxels, in order to perform operations such as weight sharing and other kernel optimizations. A deep neural network (DNN) is an artificial neural network with several hidden layers between the input layer and the output layer. A convolutional neural network (CNN) is a specific type of deep feedforward artificial neural network that uses a variation of multilayer perceptrons designed to require minimal preprocessing. The hidden layers of a convolutional neural network typically include convolutional layers, pooling layers, fully connected layers, normalization layers and the like. To analyze a point cloud with a deep neural network architecture, the set of points of the point cloud is therefore typically converted into regular 3D voxel grids or collections of images, also referred to as views, before being fed to the input layer of the deep neural network. Such a conversion, however, leads to unnecessarily voluminous data sets and, in addition, introduces quantization artifacts that can obscure natural invariances of the set of points of the point cloud.
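
The quantization effect of converting points into a voxel grid can be seen in a small sketch. The function `voxelize` and the voxel size are hypothetical and only illustrate the general idea of such a conversion, not any particular prior-art implementation.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point3 = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


def voxelize(points: List[Point3], voxel_size: float) -> Dict[VoxelKey, List[Point3]]:
    """Group points by the integer index of the cubic voxel that contains them."""
    grid: Dict[VoxelKey, List[Point3]] = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        grid[key].append((x, y, z))
    return dict(grid)


points = [(0.10, 0.10, 0.10), (0.40, 0.45, 0.20), (0.90, 0.10, 0.10), (-0.10, 0.10, 0.10)]
grid = voxelize(points, voxel_size=0.5)

# Four distinct points collapse into three occupied voxels: the exact
# positions of the first two points are no longer distinguishable.
print(len(points), "points ->", len(grid), "occupied voxels")
```

A dense grid would also have to store every empty voxel inside the scanned volume, which is one way the data volume grows.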

"PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" by Charles R. Qi, Hao Su, Kaichun Mo and Leonidas J. Guibas, arXiv preprint arXiv:1612.00593, December 2, 2016, describes a novel type of deep neural network, called PointNet, which is able to consume point clouds directly, such as those generated by a lidar sensor, and which properly respects the inherent permutation invariance of the points of the point cloud in its input data. Fig. 2, taken from the aforementioned reference, shows the typical architecture of a PointNet 1. As can be seen from Fig. 2, the PointNet 1 has one network for each task, the tasks in this specific case being classification and semantic segmentation. That is, the PointNet 1 has a classification network 2 and a segmentation network 3, which share a large part of their structure. The classification network 2 receives n points of a point cloud as input data, applies input and feature transformations 4, 5 to the input data, and then aggregates point features by max pooling, i.e. by a max-pooling layer 6 acting as a symmetric function that aggregates information from all n points into the global features 7. Max pooling uses the maximum value from each cluster of neurons in the previous layer. The output scores of the classification network 2 are classification scores for k different classes. The segmentation network 3 of the PointNet 1 is in fact an extension of the classification network 2: it concatenates global features 7 and local features 8 (per-point features), extracts new per-point features 9 based on the combined point features, and outputs per-point scores, i.e. n × m scores for each of the n points and each of m predefined semantic subcategories. The new per-point features 9 contain both local and global information.
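
Why max pooling makes the global features independent of the point order can be shown with a toy sketch. The fixed matrix `WEIGHTS` is a hypothetical stand-in for a trained shared multilayer perceptron, and the feature sizes are far smaller than in the referenced architecture; only the structure (shared per-point function, channel-wise maximum, concatenation of local and global features) follows the description above.

```python
import random
from typing import List, Sequence

# Hypothetical fixed weights standing in for a trained shared MLP (3 inputs -> 4 features).
WEIGHTS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, -0.5, 0.25],
]


def shared_mlp(point: Sequence[float]) -> List[float]:
    """Apply the same linear map followed by ReLU to a single point."""
    return [max(0.0, sum(w * v for w, v in zip(row, point))) for row in WEIGHTS]


def global_feature(points: Sequence[Sequence[float]]) -> List[float]:
    """Channel-wise maximum over all points: a symmetric (order-independent) function."""
    per_point = [shared_mlp(p) for p in points]
    return [max(channel) for channel in zip(*per_point)]


def segmentation_input(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate each point's local feature with the shared global feature."""
    g = global_feature(points)
    return [shared_mlp(p) + g for p in points]


points = [(1.0, 2.0, 0.5), (-3.0, 0.0, 1.5), (4.0, -1.0, 0.0)]
shuffled = list(points)
random.Random(0).shuffle(shuffled)

print(global_feature(points))                               # prints [4.0, 2.0, 1.5, 2.5]
print(global_feature(points) == global_feature(shuffled))   # prints True
print(len(segmentation_input(points)[0]))                   # prints 8
```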

In Fig. 2 the abbreviation "mlp" stands for multilayer perceptron, and the numbers in brackets indicate the layer sizes. The abbreviation "T-Net" denotes a mini-network that predicts an affine transformation matrix, which is applied directly to the coordinates of the input points of the point cloud (cf. the input transformation 4 in Fig. 2), or that predicts a feature transformation matrix used to align features from different input point clouds (cf. the feature transformation 5 in Fig. 2). The T-Net itself resembles the large network and is composed of basic modules of point-independent feature extraction, max pooling and fully connected layers. Batch normalization is used for all layers, with a rectified linear unit (ReLU) as the activation function. In the classification network 2, dropout layers are used for the last multilayer perceptron (mlp). In general, the PointNet 1 has three main modules: a max-pooling layer as a symmetric function for aggregating information from all points of a point set, a structure for combining local and global information, and two joint alignment networks that align both input points and point features. See "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" by Charles R. Qi, Hao Su, Kaichun Mo and Leonidas J. Guibas, arXiv preprint arXiv:1612.00593, December 2, 2016, which is incorporated herein by reference for details regarding the network architecture of the PointNet 1.

In order to also capture local structures induced by the metric space in which the points of a point cloud live, "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space" by Charles R. Qi, Li Yi, Hao Su and Leonidas J. Guibas, arXiv preprint arXiv:1706.02413, June 7, 2017, proposes the use of a hierarchical neural network called PointNet++ for classification and semantic segmentation. PointNet++ applies the PointNet described above recursively to a nested partitioning of the input point set. As a result, PointNet++ is able to learn local features at increasing contextual scales by exploiting metric space distances. In PointNet++, features from multiple scales can be combined adaptively.
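
One level of such a nested partitioning can be sketched with farthest point sampling followed by a ball query, the sampling and grouping steps described in the PointNet++ reference. The code below is a simplified, hypothetical illustration (pure Python, arbitrary start point, fixed radius); it is not taken from the referenced implementation and omits the PointNet applied to each group.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def farthest_point_sampling(points: Sequence[Point3], k: int) -> List[int]:
    """Pick k point indices so that each new pick is farthest from all previous picks."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [0]  # starting at index 0 is an arbitrary choice
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def ball_query(points: Sequence[Point3], center: Point3, radius: float) -> List[int]:
    """Return the indices of all points within `radius` of `center`."""
    return [i for i, p in enumerate(points) if math.dist(p, center) <= radius]


points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 0.2, 0.0), (0.0, 4.0, 0.0)]
centroids = farthest_point_sampling(points, k=3)
groups: Dict[int, List[int]] = {c: ball_query(points, points[c], radius=0.5) for c in centroids}

print(centroids)  # prints [0, 3, 4]
print(groups)     # prints {0: [0, 1], 3: [2, 3], 4: [4]}
```

Repeating the same two steps on the resulting centroids, with a larger radius, would give the next, coarser level of the hierarchy.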

In both approaches, the original PointNet and PointNet++, however, the training for classification and semantic segmentation is not carried out in parallel. That is, neither PointNet nor PointNet++ benefits from the inductive transfer of information obtained through multi-task learning.

Recently, many object recognition algorithms have focused mainly on 2D object recognition in images, where very high accuracy is achieved but an understanding of the 3D structure of the objects is missing. At the same time, there are some image-based methods that typically first generate 3D box proposals and then perform region-based recognition with the Fast R-CNN pipeline.

The document "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" by Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun (2015), IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 10.1109/TPAMI.2016.2577031, notes that state-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances such as SPPnet and Fast R-CNN have reduced the running time of these detection networks and exposed the computation of region proposals as a bottleneck. The document introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network and thus enables nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.
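
To make "predicting at each position" concrete, the following sketch enumerates the reference boxes (anchors) that the Faster R-CNN paper places at every feature-map position, using three scales and three aspect ratios. Anchors are not mentioned in the text above; the stride of 16, the half-cell center offset and the function name `generate_anchors` are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in image pixels


def generate_anchors(feat_h: int, feat_w: int, stride: int = 16,
                     scales: Sequence[float] = (128, 256, 512),
                     ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Enumerate anchor boxes centered on every feature-map cell.

    Each anchor has area scale**2 and height/width ratio `ratio`.
    """
    anchors: List[Box] = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride  # assumed convention: center of the cell
            cy = (row + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # prints 54 (2 * 3 positions, 9 anchors each)
```

An RPN would then predict, for each anchor, an objectness score and an offset that refines the box; that learned part is not shown.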

More recent lidar-based methods place 3D windows in 3D voxel grids to score the point cloud, or apply convolutional networks to the front-view map in a dense box prediction scheme. Methods based on the lidar point cloud generally achieve more accurate 3D positions, while image-based methods achieve higher accuracy in 2D box evaluation. The more demanding task of 3D object recognition, however, requires a well-designed model that exploits the strengths of the 3D point cloud data. Because of this complexity, some approaches, such as PointNet or PointNet++, use connected components with segmentation scores to obtain object proposals in scenes.

It is an object of the present invention to provide a system and a method for detecting objects in a three-dimensional environment of a carrier vehicle with improved 3D object recognition. This object is achieved by the independent claims. Advantageous embodiments are given in the dependent claims.

In particular, the present invention provides a system for detecting an object in a three-dimensional environment of a carrier vehicle, the system comprising: a sensor unit configured to provide a point cloud representing a three-dimensional environment of the vehicle; and a processing unit for performing at least the task of object recognition, wherein the processing unit comprises an encoder and a decoder for the task of object recognition, the encoder being configured to receive the point cloud as input data, to extract from the input data, based on a deep neural network, the features required to perform the object detection task, and to supply the extracted features to the decoder, wherein the decoder is a 3D proposal network configured to receive point cloud data as input data for generating 3D object proposals.

The sensor unit is set up to supply the processing unit with the point cloud. The processing unit comprises an encoder and at least one decoder for the task of object recognition. The encoder is configured to receive the point cloud as input data, to extract from the input data, based on a deep neural network, the features required to perform the object recognition task, and to supply the extracted features to the decoder. The decoder comprises a neural network dedicated to its respective task.

The three-dimensional environment to be analyzed by the present invention is in particular the environment of a vehicle, which may be a moving or a stationary vehicle. To capture such a three-dimensional environment, the input unit of the system according to the invention preferably comprises or consists of one or more lidar sensors and/or radar sensors. Any other type of sensor that provides a three-dimensional point cloud can also be used.

As stated above, a point cloud representing a three-dimensional environment is a set of (data) points in a three-dimensional Cartesian coordinate system. That is, a point cloud is typically represented as a set of 3D points {P_i | i = 1, ..., n}, where each point P_i is a vector containing the X, Y and Z coordinates of the point plus optional feature channels such as color, surface normal and the like. The task of the encoder of the system according to the invention is to take unordered/irregular point sets as input (data) and to extract from them rich, abstract features that contain all the information necessary to perform accurate object detection.

The basic idea of the invention is to use a single-stage training algorithm that jointly learns to classify object proposals and to refine their spatial positions. The network consumes point sets as input. The resulting method can train a very deep detection network.
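The description does not state which loss function is used for the joint training. As a hedged illustration only, the sketch below combines a cross-entropy term for classification with a smooth L1 term for box refinement, a combination that is common in detection networks. The function names and the weighting parameter `reg_weight` are assumptions, not part of the source.

```python
import math
from typing import Sequence


def cross_entropy(class_probs: Sequence[float], true_class: int) -> float:
    """Classification term: negative log-probability of the true class."""
    return -math.log(class_probs[true_class])


def smooth_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Regression term over box parameters: quadratic near zero, linear beyond."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def joint_loss(class_probs, true_class, box_pred, box_target, reg_weight=1.0):
    """Single objective so that classification and refinement are learned together."""
    return cross_entropy(class_probs, true_class) + reg_weight * smooth_l1(box_pred, box_target)
```

Minimizing one combined objective in a single stage means that the gradients of both terms update the shared feature extractor at the same time.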

The network consists of an encoder, which acts as a feature extractor for the point cloud data, and a 3D proposal network. The 3D proposal network uses the point cloud to generate high-precision 3D candidate boxes. The advantage of 3D object proposals is that they can be projected onto any view in 3D space.

In 3D object detection, a point cloud has several advantages over the front-view, bird's-eye-view and image planes. Objects keep their physical dimensions and therefore show no size variation, which is not the case in the front-view, bird's-eye-view or image plane. Objects in the point cloud occupy distinct regions of space, which avoids the occlusion problem. In road scenes, since objects typically lie on the ground plane, the point cloud position is more important for obtaining accurate 3D bounding boxes.

Hence, using point clouds as input makes 3D location prediction more feasible.
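The size argument can be made concrete with a small worked example. It assumes an ideal pinhole camera with a hypothetical focal length of 1000 pixels; neither the camera model nor the numbers come from the description.

```python
def projected_width_px(width_m: float, depth_m: float, focal_px: float = 1000.0) -> float:
    """Apparent width in an image under an ideal pinhole camera model."""
    return focal_px * width_m / depth_m


car_width_m = 1.8  # metric width, identical wherever the car is in the point cloud

near = projected_width_px(car_width_m, depth_m=10.0)  # apparent width at 10 m
far = projected_width_px(car_width_m, depth_m=40.0)   # apparent width at 40 m
```

In the image plane the same car appears four times narrower at 40 m than at 10 m, whereas its extent in the point cloud stays at 1.8 m in both cases.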

According to a preferred embodiment, the encoder of the system according to the invention comprises a convolutional neural network whose input layer receives the point cloud directly as input data. That is, the point cloud is not converted; the point set of the point cloud is received directly by the input layer of the encoder's convolutional neural network. In other words, the output data of the sensor unit are supplied as raw point cloud data to the convolutional neural network that forms the encoder. The convolutional neural network of the encoder is preferably based on a PointNet, in particular a PointNet as described in "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" by Charles R. Qi, Hao Su, Kaichun Mo and Leonidas J. Guibas, arXiv preprint arXiv:1612.00593, December 2, 2016, or in "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space" by Charles R. Qi, Li Yi, Hao Su and Leonidas J. Guibas, arXiv preprint arXiv:1706.02413, June 7, 2017. In particular, the convolutional neural network of the encoder can contain, for example, the first eight layers of the PointNet described in either of these publications. The layers of the PointNet on which the encoder is based are essentially a group of shared layers that are all trained to detect, in the input data given by the point cloud, features common to all three tasks, i.e. object detection, semantic segmentation and classification.
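How a network can consume a raw, unordered point set directly can be sketched as follows. This is a strongly simplified illustration of the PointNet principle (one layer shared across all points, followed by max pooling), with untrained random weights. It is not the eight-layer encoder referred to above, and the function names are assumptions.

```python
import random
from typing import List, Sequence

Vector = List[float]


def shared_layer(point: Sequence[float], weights: List[Vector], bias: Vector) -> Vector:
    """Apply the same linear layer plus ReLU to one point (weights shared by all points)."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, point)) + b)
        for row, b in zip(weights, bias)
    ]


def encode(cloud: Sequence[Sequence[float]], weights: List[Vector], bias: Vector) -> Vector:
    """Per-point features followed by channel-wise max pooling give a global feature."""
    per_point = [shared_layer(p, weights, bias) for p in cloud]
    return [max(channel) for channel in zip(*per_point)]


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 inputs -> 4 channels
bias = [0.0] * 4

cloud = [[1.0, 0.5, 0.2], [4.2, -1.0, 0.1], [7.5, 2.0, 1.4]]
shuffled = [cloud[2], cloud[0], cloud[1]]

global_feature = encode(cloud, weights, bias)
shuffled_feature = encode(shuffled, weights, bias)
```

Because max pooling is a symmetric function, `global_feature` and `shuffled_feature` are identical, which is why no voxelization or projection of the point cloud is needed before the input layer.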

According to a preferred embodiment of the invention, the sensor unit comprises a lidar sensor and/or a radar sensor. The system of the invention is preferably implemented in a vehicle. That is, according to a further aspect of the invention, a vehicle is provided that comprises a system according to the invention. The vehicle can in particular be designed for autonomous or semi-autonomous driving.

According to a further aspect of the invention, a method for detecting an object in a three-dimensional environment of a carrier vehicle is provided. The method comprises the steps of: providing a system for detecting an object according to one of the preceding claims; supplying a point cloud representing the three-dimensional environment of the vehicle as input data to an encoder; extracting from the input data, by the encoder and on the basis of a deep neural network, features of an object that are required for performing the task of object recognition; supplying the extracted features of the object to a decoder for the task of object recognition, the decoder being a 3D proposal network; generating 3D object proposals from the point cloud by the 3D proposal network; and generating a 3D oriented bounding box for each object.

According to a modified embodiment of the invention, a convolutional neural network is used as the encoder, an input layer of which receives the point cloud directly as input data.

According to a further modified embodiment of the invention, the convolutional neural network of the encoder is based on a PointNet.

According to a modified embodiment of the invention, each 3D oriented bounding box is parameterized by the center, the size, the orientation and the object class of the 3D box in the coordinate system of the sensor unit.
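This parameterization can be sketched as a small data structure. The sketch assumes that the orientation is a single yaw angle about the vertical z axis, which is a common convention for road scenes but is not stated in the description; the class name `OrientedBox3D` and the example values are likewise illustrative.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OrientedBox3D:
    center: Tuple[float, float, float]  # (x, y, z) in the sensor frame, meters
    size: Tuple[float, float, float]    # (length, width, height), meters
    yaw: float                          # assumed: rotation about the vertical z axis, radians
    label: str                          # object class, e.g. "car"

    def corners(self) -> List[Tuple[float, float, float]]:
        """Return the 8 box corners in the sensor coordinate system."""
        cx, cy, cz = self.center
        l, w, h = self.size
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        out = []
        for dx in (-l / 2, l / 2):
            for dy in (-w / 2, w / 2):
                for dz in (-h / 2, h / 2):
                    # Rotate the local offset about z, then translate to the center.
                    out.append((cx + c * dx - s * dy, cy + s * dx + c * dy, cz + dz))
        return out


box = OrientedBox3D(center=(10.0, 2.0, 0.8), size=(4.0, 2.0, 1.6), yaw=math.pi / 2, label="car")
corners = box.corners()
```

With a yaw of 90 degrees the 4 m length of the example box extends along the y axis of the sensor frame instead of the x axis, while its height is unaffected.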

According to a further modified embodiment of the invention, supplying a point cloud representing the three-dimensional environment as input data to the encoder comprises supplying data from a lidar sensor and/or a radar sensor.

According to a further aspect of the invention, a computer program product is provided which comprises instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method.

These and other aspects of the invention will become apparent from, and are explained with reference to, the embodiments described below. Individual features disclosed in the embodiments can, alone or in combination, constitute an aspect of the present invention. Features of the various embodiments can be transferred from one embodiment to another.

In the drawings:

FIG. 1 shows an exemplary point cloud, output by a lidar sensor, representing the surroundings of a vehicle;
FIG. 2 shows the PointNet as described and illustrated in "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" by Qi, C.R., Su, H., Mo, K. and Guibas, L.J., 2017, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 652-660);
FIG. 3 shows a schematic diagram illustrating a system for detecting an object in a three-dimensional environment of a carrier vehicle according to a first, preferred embodiment of the invention; and
FIG. 4 shows a flowchart illustrating a method for detecting an object in a three-dimensional environment of a carrier vehicle according to the system of the first embodiment of the invention.

FIGS. 1 and 2 were described in the introductory part of the description. Reference is made to the corresponding passages above.

FIG. 3 shows a system 10 for detecting an object 18 in a three-dimensional environment of a carrier vehicle according to a first, preferred embodiment of the invention. The system 10 comprises a sensor unit 11, which is configured to provide a point cloud 15 as a three-dimensional representation of the environment, and a processing unit 12 for performing the task of object recognition on the point cloud 15 representing the three-dimensional environment.

The sensor unit 11 according to the first embodiment comprises a lidar sensor with which it measures/scans the three-dimensional environment. Object detection can be carried out with various sensors such as lidar sensors and/or radar sensors. Lidar and radar sensors can provide range and height information that is useful for detecting and localizing objects in 3D. Both sensor types deliver this information in point cloud format.
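How range measurements become a point cloud can be sketched as follows. The sketch assumes that each return is given as range, azimuth and elevation, and that the sensor frame has x pointing forward, y to the left and z upward. These conventions and the sample returns are assumptions; real sensors and drivers differ.

```python
import math
from typing import Tuple


def polar_to_cartesian(range_m: float, azimuth_rad: float, elevation_rad: float) -> Tuple[float, float, float]:
    """Convert one range/angle measurement into an (x, y, z) point in the sensor frame."""
    horizontal = range_m * math.cos(elevation_rad)  # distance projected onto the ground plane
    x = horizontal * math.cos(azimuth_rad)          # forward
    y = horizontal * math.sin(azimuth_rad)          # left
    z = range_m * math.sin(elevation_rad)           # up (height information)
    return (x, y, z)


# Hypothetical returns: (range in meters, azimuth, elevation)
returns = [(10.0, 0.0, 0.0), (20.0, math.pi / 2, 0.0), (5.0, 0.0, math.pi / 6)]
cloud = [polar_to_cartesian(*r) for r in returns]
```

Each converted return keeps its metric distance from the sensor, so range and height are available directly as coordinates of the resulting points.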

The image on the left-hand side shows an example of a point cloud 15 obtained by scanning the three-dimensional surroundings of a vehicle with the lidar sensor of the sensor unit 11. The output data of the sensor unit 11 take the form of a point cloud, which the sensor unit 11 makes available to the processing unit 12 for further processing. The system 10 with the sensor unit 11 and the processing unit 12 can be implemented in a vehicle, the sensor unit 11 preferably being arranged so that it captures the surroundings in front of the vehicle. The system is part of a driving support system of the vehicle.

The processing unit 12 comprises an encoder 13 and a decoder 14 for the task of object recognition. The encoder 13 of the processing unit 12 receives, as input data, the point cloud 15 output by the sensor unit 11. No additional conversion of the point cloud is performed; the point cloud is fed directly into the encoder 13. The encoder 13 then extracts from the input point cloud the features that are common to, and necessary for, performing the task of object recognition. To this end, the encoder is based on a deep convolutional neural network, preferably the PointNet described in "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" by Charles R. Qi, Hao Su, Kaichun Mo and Leonidas J. Guibas, arXiv preprint arXiv:1612.00593, December 2, 2016, or in "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space" by Charles R. Qi, Li Yi, Hao Su and Leonidas J. Guibas, arXiv preprint arXiv:1706.02413, June 7, 2017. In particular, the encoder 13 comprises the first eight layers of the above-mentioned PointNet. The encoder 13 makes the extracted features available to the downstream decoder 14 as its input data.

The decoder 14 consists of a 3D proposal network inspired by a region proposal network (RPN). The 3D proposal network is a deep neural network, or a set of neural network layers, designed specifically for the task of object recognition. The 3D proposal network uses the point cloud 15 to generate a highly accurate 3D oriented bounding box 17 for each object 18.

The advantage of 3D object proposals is that they can be projected onto any view in 3D space.
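Projection of a 3D proposal onto different views can be sketched with simple orthographic projections. The axis convention (x forward, y left, z up), the view names and the example box are assumptions made for this illustration.

```python
from typing import Iterable, List, Tuple

Point3D = Tuple[float, float, float]

# Assumed axis convention: x forward, y left, z up.
VIEW_AXES = {"bird": (0, 1), "front": (1, 2), "side": (0, 2)}


def project(points: Iterable[Point3D], view: str) -> List[Tuple[float, float]]:
    """Orthographic projection: keep the two axes that span the chosen view."""
    a, b = VIEW_AXES[view]
    return [(p[a], p[b]) for p in points]


def extent_2d(points_2d: List[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Axis-aligned rectangle (min_u, min_v, max_u, max_v) enclosing the projected points."""
    us = [u for u, _ in points_2d]
    vs = [v for _, v in points_2d]
    return (min(us), min(vs), max(us), max(vs))


# Corners of a hypothetical box proposal: 4 m long, 2 m wide, 1.5 m high.
corners = [(x, y, z) for x in (8.0, 12.0) for y in (-1.0, 1.0) for z in (0.0, 1.5)]

bird_rect = extent_2d(project(corners, "bird"))
front_rect = extent_2d(project(corners, "front"))
```

A single 3D proposal thus yields a consistent 2D region in every view, whereas a proposal made in only one 2D view cannot be lifted back to 3D without additional information.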

Given a point cloud, the decoder 14 generates 3D box proposals from the point cloud 15. The decoder 14 then generates a 3D oriented bounding box 17, each 3D oriented bounding box 17 being parameterized by (x, y, z, l, w, h), which are the center and the size (in meters) of the 3D oriented bounding box 17 in the lidar or radar coordinate system.
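As a minimal sketch of how the tuple (x, y, z, l, w, h) relates to the points of the cloud, the function below collects the points that fall inside a box. It assumes that l, w and h are measured along the x, y and z axes of the sensor frame and ignores the orientation of the box; the function name and the sample values are illustrative.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Box = Tuple[float, float, float, float, float, float]  # (x, y, z, l, w, h), meters


def points_in_box(cloud: Sequence[Point3D], box: Box) -> List[Point3D]:
    """Return the points lying inside the box (orientation ignored in this sketch)."""
    x, y, z, l, w, h = box
    return [
        p for p in cloud
        if abs(p[0] - x) <= l / 2 and abs(p[1] - y) <= w / 2 and abs(p[2] - z) <= h / 2
    ]


cloud = [(10.0, 0.0, 0.5), (11.5, 0.5, 1.0), (30.0, 0.0, 0.5)]
box = (10.0, 0.0, 0.8, 4.0, 2.0, 1.6)

inside = points_in_box(cloud, box)
```

Here the first two points belong to the box and the third, 20 m further away, does not.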

The image 16 on the right-hand side of FIG. 3 shows an example of objects 18 detected in the point cloud representing the scanned three-dimensional environment, each object 18 being surrounded by an oriented bounding box 17 as generated by the system 10 according to the first embodiment of the invention.

FIG. 4 shows a flowchart illustrating the method according to the invention for analyzing the three-dimensional environment, for example the environment of a vehicle, as carried out by the system 10 shown in FIG. 3. In a first step S1, a point cloud 15 representing the three-dimensional environment is supplied to the encoder 13 of the processing unit 12. The point cloud 15 is provided by the sensor unit 11 of the system 10, in particular by the lidar sensor provided in the sensor unit 11.

In the subsequent step S2, the encoder 13, which is preferably based on PointNet, extracts from the input data, i.e. from the received point cloud, the features that are common to the object detection tasks and required to carry them out.

In the subsequent step S3, the extracted features are supplied to the decoder 14 for the task of object recognition. The decoder comprises a 3D proposal network 14 inspired by a region proposal network (RPN).

In step S4, the 3D proposal network 14 generates 3D object proposals from the point cloud 15.

In the last step S5, the 3D proposal network generates a 3D oriented bounding box 17 for each object 18, each 3D oriented bounding box 17 being parameterized by (x, y, z, l, w, h), which are the center and the size (in meters) of the 3D box 17 in the lidar or radar coordinate system.
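The data flow of steps S1 to S5 can be sketched as a small pipeline. The encoder and the proposal network are replaced here by trivial stand-ins (a centroid and a bounding extent) so that the sketch runs. They are assumptions that only mimic the interfaces and have nothing to do with the trained networks of the embodiment.

```python
from typing import Callable, List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Features = List[float]
Box = Tuple[float, float, float, float, float, float]  # (x, y, z, l, w, h)


def detect(
    cloud: Sequence[Point3D],
    encoder: Callable[[Sequence[Point3D]], Features],
    proposal_net: Callable[[Features, Sequence[Point3D]], List[Box]],
) -> List[Box]:
    # S1: the raw point cloud is supplied to the encoder without conversion.
    # S2: the encoder extracts features from the point cloud.
    features = encoder(cloud)
    # S3: the extracted features are supplied to the decoder (3D proposal network).
    # S4 and S5: the decoder generates proposals and one box per object.
    return proposal_net(features, cloud)


def toy_encoder(cloud: Sequence[Point3D]) -> Features:
    """Stand-in encoder: uses the centroid of the cloud as its only feature."""
    n = len(cloud)
    return [sum(p[i] for p in cloud) / n for i in range(3)]


def toy_proposal_net(features: Features, cloud: Sequence[Point3D]) -> List[Box]:
    """Stand-in decoder: one box centered on the feature, sized by the cloud extent."""
    cx, cy, cz = features
    l, w, h = (max(p[i] for p in cloud) - min(p[i] for p in cloud) for i in range(3))
    return [(cx, cy, cz, l, w, h)]


boxes = detect([(9.0, -1.0, 0.0), (11.0, 1.0, 1.0)], toy_encoder, toy_proposal_net)
```

Passing the encoder and the decoder as separate callables mirrors the split between the encoder 13 and the decoder 14, so either part could be exchanged without touching the other.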

List of reference symbols

1: PointNet
2: Classification network
3: Segmentation network
4: Input transformation
5: Feature transformation
6: Max pooling layer
7: Global features
8: Local feature / per-point feature
9: New per-point features
10: System according to the invention
11: Sensor unit
12: Processing unit
13: Encoder
14: Decoder
15: Point cloud data of a scene
16: Scene with 3D oriented bounding boxes
17: 3D oriented bounding box
18: Object

CITATIONS INCLUDED IN THE DESCRIPTION

This list of the documents cited by the applicant was generated automatically and is included solely for the better information of the reader. The list is not part of the German patent or utility model application. The DPMA assumes no liability for any errors or omissions.

Non-patent literature cited

"PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" by Charles R. Qi, Hao Su, Kaichun Mo and Leonidas J. Guibas, arXiv preprint arXiv:1612.00593, December 2, 2016 [0008, 0023, 0037]
"PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space" by Charles R. Qi, Li Yi, Hao Su and Leonidas J. Guibas, arXiv preprint arXiv:1706.02413, June 7, 2017 [0009, 0023, 0037]
"PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" by Qi, C.R., Su, H., Mo, K. and Guibas, L.J., 2017, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 652-660) [0032]

Claims

A system for detecting an object in a three-dimensional environment of a carrier vehicle, the system comprising: a sensor unit (11) configured to provide a point cloud (15) representing a three-dimensional environment of the vehicle; and a processing unit (12) for performing at least the task of object recognition, the processing unit (12) comprising an encoder (13) and a decoder (14) for the task of object recognition, wherein the encoder (13) is configured to receive the point cloud (15) as input data, to extract from the input data, on the basis of a deep neural network, features required for performing the task of object detection, and to supply the extracted features to the decoder (14), and wherein the decoder (14) is a 3D proposal network configured to receive point cloud data (15) as input data for generating 3D object proposals.

System according to claim 1, wherein the encoder (13) comprises a convolutional neural network, an input layer of which receives the point cloud (15) directly as input data.

System according to claim 2, wherein the convolutional neural network of the encoder (13) is based on a PointNet.

System according to one of claims 1 to 3, wherein the sensor unit (11) comprises a lidar sensor and/or a radar sensor.

A method for detecting an object in a three-dimensional environment of a carrier vehicle, the method comprising the following steps: providing a system (10) for detecting an object according to one of the preceding claims; supplying a point cloud (15), which represents the three-dimensional environment of the vehicle, as input data to an encoder (13); extracting, by the encoder (13) and on the basis of a deep neural network, features of an object that are required for performing the task of object recognition; supplying the extracted features of the object to a decoder (14) for the task of object recognition, the decoder being a 3D proposal network; generating 3D object proposals from the point cloud (15) by the 3D proposal network; and generating a 3D oriented bounding box (17) for each object (18).

Method according to claim 5, wherein a convolutional neural network is used as the encoder (13), an input layer of which receives the point cloud (15) directly as input data.

Method according to claim 6, wherein the convolutional neural network of the encoder (13) is based on a PointNet.

Method according to one of claims 5 to 7, wherein each 3D oriented bounding box (17) is parameterized by the center, the size, the orientation and the object class of the 3D box in the coordinate system of the sensor unit (11).

Method according to one of claims 5 to 7, wherein supplying a point cloud (15), which represents the three-dimensional environment, as input data to the encoder (13) comprises supplying data from a lidar sensor and/or a radar sensor.

Computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method according to one of claims 5 to 9.