DE102019117559A1

DE102019117559A1 - Method and system for merging two-dimensional semantic information from images with a three-dimensional point cloud

Info

Publication number: DE102019117559A1
Application number: DE102019117559.3A
Authority: DE
Inventors: Catherine Enright; Nagarjan Balmukundan; Jonathan Horgan; Olivia Donnellan; Srinidhi Simha; Ciaran Hughes; Senthil Kumar Yogamani
Original assignee: Connaught Electronics Ltd
Current assignee: Connaught Electronics Ltd
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2020-12-31

Abstract

The present invention relates to a method and system for fusing two-dimensional semantic information from images with a three-dimensional point cloud. The method typically begins with receiving multiple sequential images from one or more vision camera image sensors. A three-dimensional point cloud is then generated from the multiple sequential images using motion stereo techniques. The three-dimensional point cloud is clustered to separate distinct objects that are not close in the depth plane, and a tentative label is applied to each distinct object. At the same time, the received images are processed using a trained convolutional neural network to produce two-dimensional bounding boxes with labels and surface maps for one or more objects detected in the received images. Furthermore, the semantic information, i.e. the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects, is fused with the three-dimensional point cloud.

Description

Technical field

The present disclosure relates to fusing two-dimensional semantic information from images with a three-dimensional point cloud. In particular, the present disclosure relates to computer vision techniques for improving object detection confidence.

Prior art

Self-driving vehicles use image processing, and more specifically the detection of objects and obstacles on the road, traffic lights, signs and so on, to maneuver the vehicle in three-dimensional space.

Current techniques enable detection and/or classification of objects in a two-dimensional image. Such techniques usually provide a bounding box that encloses the detected and/or classified object. Object detection and/or classification involves a two-stage approach: first, an object detection technique provides a group of bounding boxes around the objects; second, a post-processing algorithm such as NMS (non-maximum suppression) merges the redundant, overlapping boxes into one bounding box per object.
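The second stage described above can be illustrated with a minimal sketch of conventional greedy two-dimensional NMS. This is the widely used textbook variant, which keeps the highest-scoring box and suppresses overlapping ones; the function names, the box layout `(x_min, y_min, x_max, y_max)` and the default threshold of 0.5 are assumptions made for illustration and are not taken from the disclosure.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou_2d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 2D boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_2d(boxes: List[Box], scores: List[float],
           iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a better box too strongly.
        if all(iou_2d(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Because the overlap test uses only image-plane coordinates, two objects at different depths that overlap in the image can suppress one another, which is the kind of limitation the disclosure attributes to purely two-dimensional NMS.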

Existing NMS algorithms fail with multiple objects or groups of objects, especially under poor lighting conditions, and they operate only in two-dimensional space. Furthermore, existing approaches to detecting and classifying objects are not immune to reflections on the surfaces of buildings. A number of patents in the art have attempted to solve this problem, such as US9,697,606 (Waymo LLC) and US10,210,401 (The Boeing Company). However, these publications use complicated laser point cloud systems and require lidar and/or IR sensors to work accurately.

What is needed, therefore, is a method and system that provides reliable detection and classification of objects in an image, and a computer vision technique for improving object detection confidence.

Summary

Embodiments of the present invention relate to a system and a method for fusing two-dimensional semantic information from images with a three-dimensional point cloud, as set out in the appended claims.

In one embodiment, a method for fusing two-dimensional semantic information from images with a three-dimensional point cloud is provided. The method typically begins with receiving multiple sequential images from one or more image sensors. A three-dimensional point cloud is then generated from the multiple sequential images using motion stereo techniques. The three-dimensional point cloud is clustered to separate distinct objects that are not close in the depth plane, and a tentative label is applied to each distinct object. At the same time, the received images are processed using a trained convolutional neural network to produce two-dimensional bounding boxes with labels and surface maps for one or more objects detected in the received images. Furthermore, the semantic information, i.e. the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects, is fused with the three-dimensional point cloud.

In one embodiment, the fusing comprises the following: the two-dimensional bounding boxes with labels and surface maps are back-projected onto the three-dimensional point cloud, and a filtered list of objects whose posterior probability exceeds a predetermined threshold is obtained using conditional random fields.

The method further comprises generating a three-dimensional bounding box that encloses each of the filtered distinct detected objects, wherein generating the three-dimensional bounding box comprises:

for each filtered distinct detected object, generating a plurality of candidate three-dimensional bounding boxes that at least partially enclose the filtered distinct detected object in the three-dimensional point cloud, the candidate three-dimensional boxes being generated on the basis of one or more two-dimensional bounding boxes back-projected onto the distinct detected object; and

selecting a three-dimensional bounding box from the plurality of candidate three-dimensional bounding boxes on the basis of an optimization technique that maximizes the intersection over union of the selected three-dimensional bounding box with the filtered distinct detected object, minimizes the distance between the selected three-dimensional bounding box and the center of the cluster of detected objects, and takes into account the confidence associated with the two-dimensional box on the basis of which the three-dimensional bounding box is generated.

The system for fusing two-dimensional semantic information from images with a three-dimensional point cloud comprises: at least one image sensor for capturing multiple sequential images; a motion stereo unit for generating a three-dimensional point cloud from the multiple sequential images using motion stereo techniques; a 3D clustering unit for clustering the three-dimensional point cloud into separate distinct objects that are not close in the depth plane; a 3D object extraction unit for applying a tentative label to each distinct object; a CNN classification unit for simultaneously processing the received images using a trained convolutional neural network to produce two-dimensional bounding boxes with labels and surface maps for one or more objects detected in the received images; and a 3D model filtering unit for fusing the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects in the three-dimensional point cloud.

In addition, a computer program is provided that comprises program instructions for causing a computer to carry out the above method. The program can be embodied on a recording medium, a carrier signal or a read-only memory.

List of figures

The invention will be better understood from the following description of an embodiment, given by way of example only, with reference to the accompanying drawings, in which:

Fig. 1 shows a flowchart of an embodiment of the method or process according to the invention for fusing two-dimensional semantic information from images with a three-dimensional point cloud;
Fig. 2 shows a functional block diagram of the components of a system that carries out the method or process according to the invention for fusing two-dimensional semantic information from images with a three-dimensional point cloud; and
Fig. 3 shows an architectural representation of the device that carries out the method or process according to the invention for fusing two-dimensional semantic information from images with a three-dimensional point cloud, according to one of the embodiments of the claimed invention.

Detailed description of the drawings

Fig. 1 is a flowchart of an embodiment of the method or process according to the invention for fusing two-dimensional semantic information from images with a three-dimensional point cloud, according to some of the embodiments of the present invention. The method typically begins with receiving (101) multiple sequential images from one or more image sensors. The image sensors can be a standard vision camera built into a vehicle. A three-dimensional point cloud is then generated (102) from the multiple sequential images using motion stereo techniques. The three-dimensional point cloud is clustered (103) to separate distinct objects that are not close in the depth plane, and a tentative label is applied (103) to each distinct object. At the same time, the received images are processed (104) using a trained convolutional neural network (CNN) to produce two-dimensional bounding boxes with labels and surface maps for one or more objects detected in the received images. Furthermore, the semantic information, i.e. the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects, is fused (105) with the three-dimensional point cloud. In other words, the semantic information, i.e. the objects detected and classified by the trained CNN operating on the two-dimensional images captured by the image sensors, is fused with the corresponding three-dimensional objects obtained from the three-dimensional point cloud. The label of each bounding box comprises the class of the detected object together with the detection confidence and the classification confidence of the detected object.
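The clustering step can be sketched as follows. The disclosure does not name a clustering algorithm, so this example substitutes a simple one-dimensional gap rule: points are sorted by depth and a new cluster starts whenever the depth gap to the previous point exceeds a limit. The vehicle coordinate frame (x forward as depth, y left, z up), the `max_gap` parameter and the `object_<n>` label format are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed vehicle frame for this sketch: (x forward = depth, y left, z up), in meters.
Point = Tuple[float, float, float]


def cluster_by_depth(points: List[Point], max_gap: float = 1.0) -> List[List[Point]]:
    """Group points whose depths form a chain with gaps of at most max_gap."""
    clusters: List[List[Point]] = []
    for p in sorted(points, key=lambda q: q[0]):
        # The last point of the last cluster has the largest depth seen so far.
        if clusters and p[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def tentative_labels(clusters: List[List[Point]]) -> Dict[str, List[Point]]:
    """Attach a placeholder label to each cluster, to be refined after fusion."""
    return {f"object_{i}": cluster for i, cluster in enumerate(clusters)}
```

A practical system would more likely cluster in all three dimensions; the depth-only rule is used here because it is the property the text singles out.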

In one embodiment, the fusing comprises back-projecting the two-dimensional objects detected by the CNN, i.e. the two-dimensional bounding boxes with labels and surface maps, onto the three-dimensional point cloud. Furthermore, a filtered list of objects whose posterior probability exceeds a predetermined threshold is obtained using conditional random fields.

In an exemplary embodiment, automotive scene context models are used to extract objects. For example, the automotive scene context models are trained on the assumption that a road is almost flat and that most objects stand vertically on it. Thus, the three-dimensional road plane is extracted first, and other three-dimensional objects located above the road plane are extracted afterwards. The surface maps (road surface) provided by the CNN therefore provide an accurate localization of the detected objects in the three-dimensional point cloud. Furthermore, the bounding boxes provided by the CNN, which carry object information, are back-projected onto the detected three-dimensional objects. This back-projection fuses the semantic information, i.e. the bounding boxes with labels provided by the CNN are fused with the distinct three-dimensional objects obtained from the clustered three-dimensional point cloud. Furthermore, a filtered list of objects is obtained by passing the fused information through conditional random fields.
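The idea of extracting the road plane first and then the objects above it can be sketched as below. The example assumes that a road plane has already been estimated and is given in the form n·p + d = 0; how the plane is obtained is not covered here. The vehicle frame (x forward, y left, z up), the tolerance value and the function names are assumptions made for illustration.

```python
import math
from typing import List, Tuple

# Assumed vehicle frame for this sketch: (x forward, y left, z up), in meters.
Point = Tuple[float, float, float]


def signed_distance(p: Point, normal: Point, d: float) -> float:
    """Signed distance of point p from the plane normal . p + d = 0."""
    norm = math.sqrt(sum(c * c for c in normal))
    return (sum(pc * nc for pc, nc in zip(p, normal)) + d) / norm


def split_road_and_objects(points: List[Point],
                           normal: Point = (0.0, 0.0, 1.0),
                           d: float = 0.0,
                           tolerance: float = 0.1
                           ) -> Tuple[List[Point], List[Point]]:
    """Separate points lying on the road plane from points above it."""
    road: List[Point] = []
    above: List[Point] = []
    for p in points:
        height = signed_distance(p, normal, d)
        if abs(height) <= tolerance:
            road.append(p)       # within the tolerance band: road surface
        elif height > tolerance:
            above.append(p)      # candidate object points standing on the road
        # Points clearly below the plane are treated as noise and dropped.
    return road, above
```

The points returned in `above` would then be the input to clustering and labeling, while the `road` points give the surface onto which ground-level detections can be projected.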

Consider, for example, the case in which a road marking is detected by the CNN. The road marking can be back-projected onto the three-dimensional ground surface, which gives a much more accurate location, especially in areas with sloping ground. It also improves the classification confidence of the ground marking, because projecting it onto the known ground surface increases the probability that it really is a ground marking. Using the conditional random field, the detection and classification confidence of the ground marking is therefore increased. In contrast, a detected ground marking that is back-projected onto another surface, e.g. a building, would have its classification confidence reduced by the conditional random field, since it is likely to be a reflection rather than a real road marking. The overall process thus shows a synergistic improvement in the accuracy of detecting and classifying objects. Objects with a confidence below the threshold are discarded; for example, the road marking back-projected onto a building can be discarded.
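The effect described in this example can be illustrated with a deliberately simplified sketch. It is not a conditional random field: it substitutes a two-hypothesis Bayesian update in which the CNN confidence is combined with a likelihood that depends on the surface the detection lands on. The likelihood table, its values, the neutral default of 0.5 and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical likelihoods that a detection of a given class is genuine when it
# is back-projected onto a given surface type. The values are illustrative only.
SURFACE_LIKELIHOOD: Dict[Tuple[str, str], float] = {
    ("road_marking", "road"): 0.9,
    ("road_marking", "building"): 0.1,
}


def update_confidence(cnn_conf: float, surface_likelihood: float) -> float:
    """Combine the CNN confidence with the surface evidence (Bayes' rule)."""
    numerator = cnn_conf * surface_likelihood
    denominator = numerator + (1.0 - cnn_conf) * (1.0 - surface_likelihood)
    return numerator / denominator if denominator > 0 else 0.0


def filter_detections(detections: List[Tuple[str, float, str]],
                      threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Keep detections (label, cnn_conf, surface) whose updated confidence
    exceeds the threshold; returns (label, surface, updated_conf) tuples."""
    kept: List[Tuple[str, str, float]] = []
    for label, cnn_conf, surface in detections:
        # Unknown (label, surface) pairs use 0.5, which leaves cnn_conf unchanged.
        likelihood = SURFACE_LIKELIHOOD.get((label, surface), 0.5)
        posterior = update_confidence(cnn_conf, likelihood)
        if posterior > threshold:
            kept.append((label, surface, posterior))
    return kept
```

With a CNN confidence of 0.7, a road marking that lands on the road surface ends up with a higher confidence, whereas the same detection landing on a building drops below a 0.5 threshold and is discarded.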

The method further comprises generating a three-dimensional bounding box that encloses each of the filtered distinct detected objects. Generating the three-dimensional bounding box begins with generating a plurality of candidate three-dimensional bounding boxes that at least partially enclose the filtered distinct detected object in the three-dimensional point cloud. Each candidate three-dimensional box is generated on the basis of one or more two-dimensional bounding boxes back-projected onto the distinct detected object. For example, the CNN provides several bounding boxes that enclose the object, which supply its height and width. The depth is obtained from the three-dimensional point cloud of the object, as discussed previously. Together these provide a candidate three-dimensional bounding box that encloses the detected object. Those skilled in the art will recognize that the CNN outputs multiple bounding boxes for the same object, and therefore multiple three-dimensional bounding boxes are obtained for the same object.
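One way to picture the construction of a candidate box is the sketch below, which assumes a pinhole camera model and works in the camera frame (x right, y down, z forward as depth). The width and height come from the two-dimensional box, scaled to metric units at the mean depth of the object, and the depth extent comes from the nearest and farthest points of the object's cluster. The intrinsics `fx`, `fy`, `cx`, `cy`, the use of the mean depth as the scaling reference and the function names are assumptions, not details from the disclosure.

```python
from typing import List, Tuple

Box2D = Tuple[float, float, float, float]                # (u_min, v_min, u_max, v_max), pixels
Box3D = Tuple[float, float, float, float, float, float]  # (x_min, y_min, z_min, x_max, y_max, z_max), meters


def candidate_box_3d(box_2d: Box2D, depth_near: float, depth_far: float,
                     fx: float, fy: float, cx: float, cy: float) -> Box3D:
    """Lift one 2D box to a 3D box using the depth range of the object's cluster."""
    u_min, v_min, u_max, v_max = box_2d
    z_ref = 0.5 * (depth_near + depth_far)   # reference depth for metric scaling
    # Pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x_min = (u_min - cx) * z_ref / fx
    x_max = (u_max - cx) * z_ref / fx
    y_min = (v_min - cy) * z_ref / fy
    y_max = (v_max - cy) * z_ref / fy
    return (x_min, y_min, depth_near, x_max, y_max, depth_far)


def candidate_boxes_3d(boxes_2d: List[Box2D], depth_near: float, depth_far: float,
                       fx: float, fy: float, cx: float, cy: float) -> List[Box3D]:
    """One 3D candidate per 2D box that the CNN produced for the same object."""
    return [candidate_box_3d(b, depth_near, depth_far, fx, fy, cx, cy)
            for b in boxes_2d]
```

Because the CNN typically emits several overlapping two-dimensional boxes for one object, `candidate_boxes_3d` returns several slightly different three-dimensional candidates, from which a single box still has to be selected.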

An optimal three-dimensional bounding box is selected from the multiple three-dimensional bounding boxes. The selection of a three-dimensional bounding box from the plurality of candidate three-dimensional bounding boxes is based on an optimization technique that maximizes the intersection over union of the selected three-dimensional bounding box with the filtered distinct detected object, minimizes the distance between the selected three-dimensional bounding box and the center of the cluster of detected objects, and takes into account the confidence associated with the two-dimensional bounding box on the basis of which the three-dimensional bounding box is generated. The optimization technique is one of ant colony optimization, Gauss-Newton and Levenberg-Marquardt.
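The three selection criteria can be combined into a single score, as in the following sketch. It replaces the named optimization techniques with an exhaustive search over a small candidate list, approximates the detected object by an axis-aligned box around its cluster, and treats a higher two-dimensional confidence as better. The weights `w_dist` and `w_conf`, the linear form of the score and the function names are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Box3D = Tuple[float, float, float, float, float, float]  # (x_min, y_min, z_min, x_max, y_max, z_max)
Point = Tuple[float, float, float]


def volume(b: Box3D) -> float:
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0
        inter *= overlap
    return inter / (volume(a) + volume(b) - inter)


def center(b: Box3D) -> Point:
    return (0.5 * (b[0] + b[3]), 0.5 * (b[1] + b[4]), 0.5 * (b[2] + b[5]))


def select_box(candidates: List[Box3D], confidences: List[float],
               object_box: Box3D, cluster_center: Point,
               w_dist: float = 0.1, w_conf: float = 0.5) -> int:
    """Return the index of the candidate with the best combined score."""
    def score(i: int) -> float:
        c = candidates[i]
        return (iou_3d(c, object_box)                            # reward overlap
                - w_dist * math.dist(center(c), cluster_center)  # penalize offset
                + w_conf * confidences[i])                       # reward 2D confidence
    return max(range(len(candidates)), key=score)
```

Unlike the two-dimensional case, the overlap and distance terms here are evaluated in three dimensions, so candidates that overlap in the image but belong to objects at different depths are not confused with one another.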

Fig. 2 is a functional block diagram of the components of a system that carries out the method or process according to the invention for fusing two-dimensional semantic information from images with a three-dimensional point cloud. The system comprises at least one image sensor 201, such as a standard vision camera, for capturing multiple sequential images. A motion stereo unit 202 generates a three-dimensional point cloud from the multiple sequential images using motion stereo techniques. A 3D clustering unit 203 clusters points in the point cloud to separate distinct objects in the three-dimensional point cloud that are not close in the depth plane, and a 3D object extraction unit 204 applies a tentative label to each distinct object. An object recognition unit 207 comprises a CNN classification unit 208 and a 3D back-projection unit 209. The CNN classification unit 208 simultaneously processes the received images using a trained convolutional neural network (CNN) to produce two-dimensional bounding boxes with labels and surface maps for one or more objects detected in the received images. The 3D back-projection unit 209 back-projects the two-dimensional bounding boxes with labels and surface maps onto the three-dimensional point cloud.

A 3D model filtering unit 205 fuses the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects on the three-dimensional point cloud. The 3D model filtering unit 205 further filters the distinct objects on the basis of the back-projection using conditional random fields to provide a list of objects whose a posteriori probability is greater than a predetermined threshold.
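The final thresholding step of the filtering can be sketched as follows. The sketch does not implement conditional random field inference; it assumes each object already carries a posterior probability from such a model. `ScoredObject` and `filter_by_posterior` are hypothetical names.

```python
from typing import List, NamedTuple


class ScoredObject(NamedTuple):
    object_id: str
    label: str
    posterior: float  # assumed to be the output of CRF inference, which is not shown here


def filter_by_posterior(objects: List[ScoredObject], threshold: float) -> List[ScoredObject]:
    """Keep only objects whose posterior probability is strictly greater than the threshold."""
    return [o for o in objects if o.posterior > threshold]


objects = [
    ScoredObject("object_0", "car", 0.92),
    ScoredObject("object_1", "pedestrian", 0.35),
]
kept = filter_by_posterior(objects, threshold=0.5)
```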

A global smoothing unit 206 performs global smoothing using a plane sweep algorithm to remove holes in the estimate.

A fusion-based NMS (non-maximum suppression) unit 210 generates a three-dimensional bounding box enclosing each of the filtered distinct detected objects. Generating the three-dimensional bounding box comprises generating a plurality of candidate three-dimensional bounding boxes that at least partially enclose the filtered distinct detected object in the three-dimensional point cloud. Each candidate three-dimensional bounding box is generated on the basis of one or more two-dimensional bounding boxes back-projected onto the distinct detected object. For example, the CNN provides several bounding boxes that enclose the object, which supply the height and width. The depth is obtained from the three-dimensional point cloud of the object, as discussed above. Together these yield a candidate three-dimensional bounding box enclosing the detected object. Those skilled in the art will recognize that the CNN outputs multiple bounding boxes for the same object, so that multiple three-dimensional bounding boxes are obtained for the same object.
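The construction of one candidate can be sketched as follows, assuming a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and an axis-aligned box in camera coordinates. None of these choices is stated in the disclosure, and `candidate_box` and `Box3D` are hypothetical names. Width and height come from the two-dimensional bounding box; depth comes from the cluster of points.

```python
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in camera coordinates, z = depth


class Box3D(NamedTuple):
    x_min: float
    y_min: float
    z_min: float
    x_max: float
    y_max: float
    z_max: float


def candidate_box(box2d: Tuple[float, float, float, float], cluster: List[Point],
                  fx: float, fy: float, cx: float, cy: float) -> Box3D:
    """Build a candidate 3D box from a 2D box (u_min, v_min, u_max, v_max) and a point cluster."""
    depths = [p[2] for p in cluster]
    z_near, z_far = min(depths), max(depths)
    # Assumption: lateral extent is evaluated at the mid depth of the cluster.
    z_ref = 0.5 * (z_near + z_far)
    u_min, v_min, u_max, v_max = box2d
    return Box3D(
        (u_min - cx) * z_ref / fx, (v_min - cy) * z_ref / fy, z_near,
        (u_max - cx) * z_ref / fx, (v_max - cy) * z_ref / fy, z_far,
    )


cluster = [(0.0, 0.0, 2.0), (0.1, 0.1, 4.0)]
box = candidate_box((40.0, 40.0, 60.0, 60.0), cluster, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

Calling `candidate_box` once per two-dimensional bounding box reported for the same object yields the plurality of candidates from which one box is later selected.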

An optimal three-dimensional bounding box is selected from the multiple three-dimensional bounding boxes. The selection of a three-dimensional bounding box from the plurality of candidate three-dimensional bounding boxes is based on an optimization technique that maximizes the intersection over union of the selected three-dimensional bounding box with the filtered distinct detected object and minimizes the distance between the selected three-dimensional bounding box and the center of the cluster of detected objects, and on the confidence associated with the two-dimensional bounding box from which the selected three-dimensional bounding box is generated. The optimization technique is one of Ant Colony, Gauss-Newton, and Levenberg-Marquardt.
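The three selection criteria can be illustrated with a minimal sketch. It does not use Ant Colony, Gauss-Newton or Levenberg-Marquardt; it substitutes an exhaustive scoring of all candidates. The weighted-sum score, the weights `w_dist` and `w_conf`, the axis-aligned box representation and the use of the object's bounding extent for the intersection over union are all assumptions made for illustration.

```python
import math
from typing import List, NamedTuple, Tuple

Vec3 = Tuple[float, float, float]


class Box3D(NamedTuple):
    lo: Vec3  # minimum corner
    hi: Vec3  # maximum corner


class Candidate(NamedTuple):
    box: Box3D
    confidence: float  # confidence of the 2D bounding box the candidate was built from


def volume(b: Box3D) -> float:
    return math.prod(max(0.0, h - l) for l, h in zip(b.lo, b.hi))


def iou(a: Box3D, b: Box3D) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = Box3D(tuple(max(p, q) for p, q in zip(a.lo, b.lo)),
                  tuple(min(p, q) for p, q in zip(a.hi, b.hi)))
    i = volume(inter)
    u = volume(a) + volume(b) - i
    return i / u if u > 0 else 0.0


def center(b: Box3D) -> Vec3:
    return tuple((l + h) / 2 for l, h in zip(b.lo, b.hi))


def select_box(candidates: List[Candidate], object_box: Box3D, cluster_center: Vec3,
               w_dist: float = 1.0, w_conf: float = 1.0) -> Candidate:
    """Pick the candidate with high IoU, small center distance and high confidence."""
    def score(c: Candidate) -> float:
        return (iou(c.box, object_box)
                - w_dist * math.dist(center(c.box), cluster_center)
                + w_conf * c.confidence)
    return max(candidates, key=score)


object_box = Box3D((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
tight = Candidate(Box3D((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)), confidence=0.6)
shifted = Candidate(Box3D((0.5, 0.0, 0.0), (1.5, 1.0, 1.0)), confidence=0.9)
best = select_box([tight, shifted], object_box, cluster_center=(0.5, 0.5, 0.5))
```

In this example the tightly fitting candidate is chosen even though the shifted candidate has the higher two-dimensional confidence, because its intersection over union and center distance are better. An iterative optimizer would be of interest mainly when box parameters are refined continuously rather than chosen from a finite list.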

Fig. 3 is an architecture diagram of the apparatus that executes the method or process according to the invention for fusing two-dimensional semantic information from images with a three-dimensional point cloud, according to one of the embodiments of the claimed invention. The apparatus comprises an image capture device 303 for capturing multiple sequential images, a memory 302 for storing instructions, and a processor 301 operatively coupled to the image capture device 303 and the memory 302, wherein the stored instructions cause the processor 301 to: generate a three-dimensional point cloud based on the multiple sequential images using motion stereo techniques; cluster the point cloud to separate distinct objects in the three-dimensional point cloud that are not close in the depth plane, and apply a tentative label to each distinct object; concurrently process the received images using a trained convolutional neural network to produce two-dimensional bounding boxes with labels and surface maps for one or more detected objects in the received images; and fuse the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects on the three-dimensional point cloud.

The fusing comprises back-projecting the two-dimensional bounding boxes with labels and surface maps onto the three-dimensional point cloud and filtering the distinct objects based on the back-projection using conditional random fields to provide a list of objects whose a posteriori probability is greater than a predetermined threshold.
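The association between two-dimensional bounding boxes and points of the cloud can be sketched as follows. The sketch establishes the correspondence by projecting each three-dimensional point into the image with an assumed pinhole model and testing whether it falls inside a labelled box; this is an assumed, simplified stand-in for the back-projection described above, and surface maps are not modelled. `Box2D`, `project` and `label_points` are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in camera coordinates, z = depth


class Box2D(NamedTuple):
    u_min: float
    v_min: float
    u_max: float
    v_max: float
    label: str
    confidence: float


def project(p: Point, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p
    return fx * x / z + cx, fy * y / z + cy


def label_points(points: List[Point], boxes: List[Box2D],
                 fx: float, fy: float, cx: float, cy: float) -> List[Optional[str]]:
    """Give each point the label of the most confident 2D box containing its projection."""
    labels: List[Optional[str]] = []
    for p in points:
        best: Optional[Box2D] = None
        if p[2] > 0:  # only points in front of the camera can be seen
            u, v = project(p, fx, fy, cx, cy)
            for b in boxes:
                inside = b.u_min <= u <= b.u_max and b.v_min <= v <= b.v_max
                if inside and (best is None or b.confidence > best.confidence):
                    best = b
        labels.append(best.label if best else None)
    return labels


boxes = [Box2D(40.0, 40.0, 60.0, 60.0, "car", 0.9)]
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
labels = label_points(points, boxes, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

Points whose projection lies outside every box, or that lie behind the camera, remain unlabelled and keep only their tentative cluster label.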

The stored instructions further cause the processor 301 to generate a three-dimensional bounding box enclosing each of the filtered distinct detected objects. Generating the three-dimensional bounding box comprises the following:

for each filtered distinct detected object, generating a plurality of candidate three-dimensional bounding boxes that at least partially enclose the filtered distinct detected object in the three-dimensional point cloud, the candidate three-dimensional bounding boxes being generated on the basis of one or more two-dimensional bounding boxes back-projected onto the distinct detected object; and
selecting a three-dimensional bounding box from the plurality of candidate three-dimensional bounding boxes based on an optimization technique that maximizes the intersection over union of the selected three-dimensional bounding box with the filtered distinct detected object and minimizes the distance between the selected three-dimensional bounding box and the center of the cluster of detected objects, and on the confidence associated with the two-dimensional bounding box from which the three-dimensional bounding box is generated. The optimization technique is one of Ant Colony, Gauss-Newton, and Levenberg-Marquardt.

Further, those of ordinary skill in the art will recognize that the various exemplary logical or functional blocks, modules, circuits, units, and process steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware or as a combination of hardware and software. To clearly illustrate this interchangeability of hardware and a combination of hardware and software, various exemplary components, units, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as a combination of hardware and software depends on the design choices of those of ordinary skill in the art. Such skilled persons can implement the described functionality in a variety of ways for each particular application, but such obvious design choices should not be interpreted as a departure from the scope of the present invention.

The process described in the present disclosure can be implemented using various means. For example, the process described in the present disclosure can be implemented in hardware, firmware, software, or any combination thereof. In a hardware implementation, the processing units or processor(s) may be implemented in one or more ASICs (application-specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field-programmable gate arrays), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.

In a firmware and/or software implementation, software code can be stored in a memory and executed by a processor. The memory can be implemented within the processor unit or external to the processor unit. As used herein, the term “memory” refers to any type of volatile memory or non-volatile memory.

In the specification, the terms “comprise, comprises, comprised and comprising” or any variation thereof and the terms “include, includes, included and including” or any variation thereof are considered to be totally interchangeable, and they should all be afforded the widest possible interpretation and vice versa.

The invention is not limited to the embodiments described above, but may be varied in both construction and detail.

CITATIONS INCLUDED IN THE DESCRIPTION

This list of documents cited by the applicant was generated automatically and is included solely for the reader's information. The list is not part of the German patent or utility model application. The DPMA assumes no liability for any errors or omissions.

Patent literature cited

US 9697606 [0004]
US 10210401 [0004]

Claims

A method for fusing two-dimensional semantic information from images with a three-dimensional point cloud, comprising: receiving multiple sequential images from one or more vision camera image sensors; generating a three-dimensional point cloud based on the multiple sequential images using motion stereo techniques; clustering to separate distinct objects in the three-dimensional point cloud that are not close in the depth plane, and applying a tentative label to each distinct object; concurrently processing the received images using a trained convolutional neural network to produce two-dimensional bounding boxes with labels and surface maps for one or more detected objects in the received images; and fusing the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects on the three-dimensional point cloud.

Method according to claim 1, wherein the fusing comprises: back-projecting the two-dimensional bounding boxes with labels and surface maps onto the three-dimensional point cloud; and filtering the distinct objects based on the back-projection using conditional random fields to provide a list of objects whose a posteriori probability is greater than a predetermined threshold.

Method according to claim 1 or 2, further comprising: generating a three-dimensional bounding box enclosing each of the filtered distinct detected objects, wherein generating the three-dimensional bounding box comprises: for each filtered distinct detected object, generating a plurality of candidate three-dimensional bounding boxes that at least partially enclose the filtered distinct detected object in the three-dimensional point cloud, wherein the candidate three-dimensional bounding boxes are generated based on one or more two-dimensional bounding boxes back-projected onto the distinct detected object; and selecting a three-dimensional bounding box from the plurality of candidate three-dimensional bounding boxes based on an optimization technique that maximizes the intersection over union of the selected three-dimensional bounding box with the filtered distinct detected object and minimizes the distance between the selected three-dimensional bounding box and the center of the cluster of detected objects, and on the confidence associated with the two-dimensional bounding box from which the three-dimensional bounding box is generated.

Method according to claim 3, wherein the optimization technique is one of Ant Colony, Gauss-Newton, and Levenberg-Marquardt.

Method according to any one of the preceding claims, wherein the label of each bounding box comprises the class of the detected object and the confidence of the detection and classification of the detected object.

Method according to one of the preceding claims, wherein the surface maps provide precise localization of the detected objects on the three-dimensional point cloud.

Method according to any one of the preceding claims, wherein fusing the two-dimensional bounding boxes with labels for one or more detected objects on the three-dimensional point cloud provides a semantic three-dimensional point cloud.

System for fusing two-dimensional semantic information from images with a three-dimensional point cloud, comprising: at least one vision camera image sensor for capturing multiple sequential images; a motion stereo unit for generating a three-dimensional point cloud based on the multiple sequential images using motion stereo techniques; a 3D clustering unit for clustering to separate distinct objects in the three-dimensional point cloud that are not close in the depth plane and for applying a tentative label to each distinct object; a CNN classification unit for concurrently processing the received images using a trained convolutional neural network to produce two-dimensional bounding boxes with labels and surface maps for one or more detected objects in the received images; and a 3D model filtering unit for fusing the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects on the three-dimensional point cloud.

System according to claim 8, further comprising: a 3D back-projection unit for back-projecting the two-dimensional bounding boxes with labels and surface maps onto the three-dimensional point cloud; wherein the 3D model filtering unit filters the distinct objects based on the back-projection using conditional random fields to provide a list of objects whose a posteriori probability is greater than a predetermined threshold.

System according to claim 8 or 9, further comprising: a fusion-based NMS (non-maximum suppression) unit for generating a three-dimensional bounding box enclosing each of the filtered distinct detected objects, wherein generating the three-dimensional bounding box comprises: for each filtered distinct detected object, generating a plurality of candidate three-dimensional bounding boxes that at least partially enclose the filtered distinct detected object in the three-dimensional point cloud, the candidate three-dimensional bounding boxes being generated based on one or more two-dimensional bounding boxes back-projected onto the distinct detected object; and selecting a three-dimensional bounding box from the plurality of candidate three-dimensional bounding boxes based on an optimization technique that maximizes the intersection over union of the selected three-dimensional bounding box with the filtered distinct detected object and minimizes the distance between the selected three-dimensional bounding box and the center of the cluster of detected objects, and on the confidence associated with the two-dimensional bounding box from which the three-dimensional bounding box is generated.

System according to claim 10, wherein the optimization technique is one of Ant Colony, Gauss-Newton, and Levenberg-Marquardt.

System according to any one of the preceding claims, wherein the label of each bounding box comprises the class of the detected object and the confidence of the detection and classification of the detected object.

Apparatus for fusing two-dimensional semantic information from images with a three-dimensional point cloud, comprising: a vision camera image sensor for capturing multiple sequential images; a memory for storing instructions; and a processor operatively coupled to the image sensor and the memory, the stored instructions causing the processor to: generate a three-dimensional point cloud based on the multiple sequential images using motion stereo techniques; cluster to separate distinct objects in the three-dimensional point cloud that are not close in the depth plane, and apply a tentative label to each distinct object; concurrently process the received images using a trained convolutional neural network to produce two-dimensional bounding boxes with labels and surface maps for one or more detected objects in the received images; and fuse the two-dimensional bounding boxes with labels and surface maps with one or more detected distinct objects on the three-dimensional point cloud.

Apparatus according to claim 13, wherein the fusing comprises: back-projecting the two-dimensional bounding boxes with labels and surface maps onto the three-dimensional point cloud; and filtering the distinct objects based on the back-projection using conditional random fields to provide a list of objects whose a posteriori probability is greater than a predetermined threshold.

Apparatus according to claim 13 or 14, wherein the instructions further cause the processor to generate a three-dimensional bounding box enclosing each of the filtered distinct detected objects, wherein generating the three-dimensional bounding box comprises: for each filtered distinct detected object, generating a plurality of candidate three-dimensional bounding boxes that at least partially enclose the filtered distinct detected object in the three-dimensional point cloud, the candidate three-dimensional bounding boxes being generated based on one or more two-dimensional bounding boxes back-projected onto the distinct detected object; and selecting a three-dimensional bounding box from the plurality of candidate three-dimensional bounding boxes based on an optimization technique that maximizes the intersection over union of the selected three-dimensional bounding box with the filtered distinct detected object and minimizes the distance between the selected three-dimensional bounding box and the center of the cluster of detected objects, and on the confidence associated with the two-dimensional bounding box from which the three-dimensional bounding box is generated.

Apparatus according to claim 15, wherein the optimization technique is one of Ant Colony, Gauss-Newton, and Levenberg-Marquardt.

Apparatus according to any one of the preceding claims, wherein the label of each bounding box comprises the class of the detected object and the confidence of the detection and classification of the detected object.

Computer program comprising instructions which, when the program is executed by a data processing device or computer, cause the data processing device or computer to carry out the steps of the method according to any one of claims 1 to 7.

Computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the steps of the method according to any one of claims 1 to 7.