WO2021242132A1 - Processing a data stream of scans containing spatial information provided by a 2d or 3d sensor configured to measure distance by using a convolutional neural network (cnn) - Google Patents

Processing a data stream of scans containing spatial information provided by a 2d or 3d sensor configured to measure distance by using a convolutional neural network (cnn)

Info

Publication number
WO2021242132A1
WO2021242132A1 (Application PCT/RU2020/000256)
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
cnn
current
data
layer
Prior art date
Application number
PCT/RU2020/000256
Other languages
French (fr)
Inventor
Dmitrii Akhmirovich KHIZBULLIN
Mikhail Viktorovich PIKHLETSKY
Sergey Valerevich MOROZOV
Xinli HAN
Zuguang WU
Peng Zhou
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/RU2020/000256 (WO2021242132A1)
Priority to CN202080101444.8A (CN115836299A)
Publication of WO2021242132A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • the present disclosure relates to a device for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, wherein the device is configured to employ a Convolutional Neural Network (CNN) in an inference phase.
  • the present disclosure further relates to an ego-vehicle comprising such a device and one or more 2D or 3D sensors configured to measure distance.
  • the present disclosure furthermore relates to a hardware implementation of a Convolutional Neural Network (CNN) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • the present disclosure relates to a method of employing a Convolutional Neural Network (CNN) in an inference phase for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • the present disclosure relates to a computer program comprising program code for performing such a method, to a non-transitory storage medium storing executable program code which, when executed by a processor, causes such a method to be performed, and to a computer comprising a memory and a processor, which are configured to store and execute program code to perform such a method.
  • a data stream of scans containing spatial information provided by a 2D or 3D distance sensor can be processed by means of a neural network.
  • the spatial information is spatial information about an environment or vicinity of the 2D or 3D sensor. Objects in the environment of the 2D or 3D sensor can be detected on the basis of a processing result of the processing of the data stream.
  • the 2D or 3D sensor may be installed on an ego-vehicle.
  • the spatial information provided by the 2D or 3D sensor corresponds to spatial information about the environment of the ego-vehicle and, thus, on the basis of the processing result of the processing of the data stream, objects such as cars, pedestrians, bicyclists, motorcyclists, etc. in the environment of the ego-vehicle may be detected.
  • ego vehicle refers to a vehicle that is equipped with one or more sensors (e.g. one or more 2D or 3D distance sensors) for sensing an environment of the vehicle and which operates based on data from those sensors and not necessarily based on any other data about its environment. In other words, an ego vehicle operates based on its own “view” of its environment.
  • Figure 1 shows an example of an operation of a Convolutional Neural Network (CNN) 3 at a current time point ti.
  • The term “present” may be used as a synonym for the term “current”. That is, e.g. the term “present time point” may be used as a synonym for the term “current time point”.
  • the CNN 3 according to Figure 1 comprises four layers L1, L2a, L2b and L2c, wherein at each layer one or more convolutional operations are performed.
  • data pc(ti) of a current scan are provided by the 2D or 3D sensor and one or more convolutional operations are performed at each layer of the CNN 3 on the basis of a current tensor originating from the data pc(ti) of the current scan (scan provided at time point ti) and a previous tensor originating from data pc(ti-1) of a previous scan (scan provided at time point ti-1) inputted to the CNN 3 directly before the data of the current scan, wherein the current tensor and the previous tensor are provided to the respective layer.
  • the one or more convolutional operations of a layer are indicated in Figure 1 by two arrows originating from the respective two tensors on the basis of which the one or more convolutional operations are performed.
  • the data pc(ti) of the scan of the time point ti is provided as the data of the current scan by the 2D or 3D sensor and the tensor epc(ti) originating from the data pc(ti) of the current scan is input to the CNN 3 and, thus, is provided to the first layer L1.
  • the previous tensor epc(ti-1) originating from the data pc(ti-1) of the previous scan provided by the 2D or 3D sensor at the previous time point ti-1 directly before the current scan of the time point ti is provided to the first layer L1, wherein at the first layer L1 one or more convolutional operations of the first layer L1 are performed on the basis of the current tensor epc(ti) and the previous tensor epc(ti-1).
  • In order to be able to use, at the current time point ti, the previous tensor epc(ti-1) for the one or more convolutional operations of the first layer L1, the previous tensor epc(ti-1) has to be generated again, i.e. re-computed, at the current time point ti. In the case of the first layer L1 this requires performing a voxelization VX on the basis of the data pc(ti-1) of the previous scan.
  • the voxelization VX is indicated in Figure 1 by a single arrow originating from the respective data on the basis of which the voxelization is performed.
  • the following re-computation has to be done: a voxelization VX of data pc(ti-1), pc(ti-2), pc(ti-3) and pc(ti-4) of four directly subsequent previous scans that were provided by the 2D or 3D sensor at the previous time points ti-1 to ti-4, as well as the re-computation of the corresponding previous output tensors of the layers L1, L2a and L2b from these voxelized tensors.
  • the CNN 3 has four layers L1, L2a, L2b and L2c. This is only an example. According to an embodiment of the invention, the CNN 3 may have more than four layers. The greater the number of layers of the CNN, the greater the number of the above-mentioned re-processing, respectively re-computation, steps that have to be performed on the basis of previous data of previous scans when processing the data of a current scan at a current time point ti using the CNN 3.
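  • As an illustration only (not part of the patent text), the re-computation burden of the scheme of Figure 1 can be sketched as follows: at every time point, all raw scans still inside the temporal receptive field are re-voxelized and every layer is re-run on them. Function names and tensor contents below are placeholders.

```python
# Hypothetical sketch of the prior-art operation of Figure 1: the tensors of
# previous time points are not buffered, so they are re-computed from the raw
# scans at every time step. 'scans' holds the raw scans kept for this purpose.

def voxelize(scan):
    # placeholder voxelization VX; a real implementation would rasterize the
    # point cloud into a regular grid (see the voxelization sketch further below)
    return list(scan)

def conv(current, previous):
    # placeholder for the one or more convolutional operations of a layer,
    # combining the tensors of the current and the previous time point
    return [c + p for c, p in zip(current, previous)]

def run_naive(scans, num_layers=4):
    """Return the network output for the newest scan, re-computing everything."""
    tensors = [voxelize(s) for s in scans]            # re-voxelize all kept scans
    for _ in range(num_layers):
        tensors = [conv(tensors[i], tensors[i - 1])   # re-run the layer for every
                   for i in range(1, len(tensors))]   # kept time point
    return tensors[-1]                                # output for the current scan

# usage: five kept scans are needed to evaluate a four-layer network once
print(run_naive([[1.0], [2.0], [3.0], [4.0], [5.0]]))
```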
  • a CNN for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance has the disadvantage of requiring a high amount of computational resources. This is especially the case when the 2D or 3D sensor provides scans at a frame rate of at least 10 to 20 frames per second.
  • The terms “frame” and “scan” may be used as synonyms.
  • embodiments of the present invention aim to improve the processing of a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance with respect to the amount of computational resources required.
  • An objective is to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • a first aspect of the present disclosure provides a device for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, wherein the device is configured to employ a Convolutional Neural Network (CNN) in an inference phase.
  • the CNN comprises a first layer and one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer.
  • the device is configured to input data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN. That is, the device is configured to input the data of the current scan in the form of the current tensor into the CNN.
  • the device is configured to perform, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer. That is, the device is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer and the previous output tensor of the preceding layer.
  • the previous output tensor of the preceding layer is the newest tensor stored in the respective first buffer of the one or more first buffers and originates from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the device is configured to store the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
  • the device makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN comprises one or more first buffers for storing the output tensor of a respective preceding layer, the need for re-computation of one or more previous tensors, in particular output tensors, of respective preceding layers for performing one or more convolutional operations at each further layer at a current time point is overcome.
  • the device is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer originating from the data of the current scan and the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer.
  • the inference time corresponds to the time required by the CNN for providing output data (the output tensor of the last layer of the CNN) starting from the current tensor originating from the data of the current scan and being input into the CNN at the current time.
  • the computational costs may be reduced from K²/2, wherein K is the number of aggregated scans, respectively sweeps, provided by the 2D or 3D sensor, e.g. the number of aggregated LIDAR sweeps. Therefore, the device according to the first aspect allows real-time inference for high values of K, e.g. between 10 and 100 scans.
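  • As a back-of-the-envelope count (an illustrative assumption, not a formula stated in the patent): if, without buffers, every one of the K aggregated scans has to be re-processed for each newer scan, while the buffered scheme processes each scan only once, the work compares roughly as:

```latex
% Illustrative operation count over K aggregated scans:
% without buffers, processing scan k re-computes on the order of k older tensors;
% with the buffers of the first aspect, every scan is processed exactly once.
\[
\begin{aligned}
W_{\text{without buffers}} &\propto \sum_{k=1}^{K} k = \frac{K(K+1)}{2} \approx \frac{K^{2}}{2},\\
W_{\text{with buffers}} &\propto K.
\end{aligned}
\]
```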
  • the passage “a tensor input into the CNN” may be understood as “a tensor that is input into the CNN”.
  • the passage “a tensor input to the CNN” may be used as a synonym for the passage “a tensor input into the CNN”.
  • the first layer and the one or more further layers of the CNN are the layers of the CNN. Descriptions made herein with respect to a layer referred to by the general term “layer” are valid for the first layer as well as the one or more further layers.
  • the CNN comprises one or more optional additional layers, wherein the device is configured to perform, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN. Descriptions made herein with respect to a layer referred to by the general term “layer” may also be valid for the one or more optional additional layers.
  • the 2D or 3D sensor configured to measure distance may comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the 2D or 3D sensor may comprise or correspond to one or more visual-depth-capable sensors. The 2D or 3D sensor may provide scans at a frame rate of at least 10 to 20 frames per second.
  • the term “2D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with two spatial dimensions (2-dimensional scans respectively frames) and to measure distance.
  • the term “3D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with three spatial dimensions (3-dimensional scans respectively frames) and to measure distance.
  • the CNN is a feed-forward neural network that may be described with a directed acyclic graph (DAG).
  • An advantage of a DAG neural network is that it has a finite impulse response operator and relates to finite impulse response filters (FIR filters).
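  • To make the FIR analogy concrete (illustrative equation, not quoted from the patent): an FIR filter of order M computes its output from only the last M+1 input samples, just as a layer with temporal kernel size T combines the current tensor with the T-1 buffered tensors of the directly preceding time points.

```latex
% FIR filter of order M: the output depends only on the last M+1 input samples.
\[
y[n] = \sum_{k=0}^{M} h[k]\, x[n-k]
\]
```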
  • a buffer is configured to buffer respectively store data, such as tensors.
  • a buffer may be a data structure for buffering data, in particular tensors.
  • the terms “buffer storage”, “rolling buffer” and “rolling buffer storage” may be used to refer to a buffer.
  • one or more optional further operations such as one or more normalization operations and/or one or more activation operations, may be performed.
  • The terms “activation of a layer” or “layer output” may be used as synonyms for the output tensor of a layer.
  • the CNN comprises a second buffer for storing a tensor input to the CNN
  • the device is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan.
  • the previous tensor is the newest tensor stored in the second buffer and corresponds to the data of the previous scan.
  • the device is configured to store the current tensor as the newest tensor in the second buffer.
  • the device makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • Since the CNN comprises a second buffer for storing the tensor input to the CNN, the need for re-computation of one or more previous tensors input to the CNN on the basis of the corresponding data of the corresponding previous scan for performing one or more convolutional operations at the first layer at a current time point is overcome.
  • the device is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and the previous tensor being the newest tensor stored in the second buffer.
  • Descriptions made herein with respect to a buffer referred to by the general term “buffer” are valid for the one or more first buffers as well as for the second buffer.
  • the number of first buffers of the CNN corresponds to the number of further layers of the CNN.
  • the number of buffers of the CNN corresponds to the number of one or more first buffers and the second buffer.
  • each buffer is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.
  • each buffer is a serial-in parallel-out (SIPO) shift register.
  • the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
  • one or more buffers have a different temporal size.
  • the term “convolutional matrix” may be used to refer to a convolutional kernel.
  • the convolutional kernel is used at a layer of the CNN for performing the one or more convolutional operations of the layer.
  • the temporal size of a first buffer corresponds to the number of consecutive, respectively directly subsequent, time points for which the first buffer is configured to store the output tensor of the respective layer (preceding layer). Therefore, the temporal size may also be defined in terms of the number of directly subsequent scans of the 2D or 3D sensor for which the first buffer may store the output tensor of the respective layer (preceding layer). For example, if the temporal size of a first buffer is one (“1”) then the first buffer may only store one output tensor of a respective layer originating from data of one scan of one time point. This reduces the storage consumption, such as RAM consumption, to a minimum.
  • In this case, when storing the current output tensor (originating from data of a current scan of a current time point) of a layer (preceding layer) as the newest tensor in the respective first buffer, the previous output tensor of the layer (originating from data of the previous scan inputted to the CNN directly before the data of the current scan) already stored in the respective first buffer is dropped, respectively deleted, because the first buffer is already full. For example, if the temporal size of a first buffer is three (“3”) then the first buffer may store three output tensors of a respective layer (preceding layer) originating from data of three consecutive scans of three consecutive time points.
  • the temporal size of a buffer corresponds to the number of directly subsequent time points for which the buffer may store a tensor. Accordingly, the temporal size of a buffer corresponds to the number of directly subsequent scans of the 2D or 3D sensor for which the buffer may store a corresponding tensor.
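  • A minimal sketch of such a serial-in parallel-out rolling buffer follows (the class name, the deque-based implementation and the tensor shapes are assumptions made for illustration): the temporal size equals the temporal kernel size of the layer minus one, so that the buffered tensors plus the current tensor form the full temporal window of the convolution.

```python
from collections import deque

import numpy as np


class RollingTensorBuffer:
    """Serial-in parallel-out buffer: pushing a new tensor as the newest one
    automatically drops the oldest tensor once the temporal size is reached."""

    def __init__(self, temporal_size):
        # temporal_size = temporal kernel size of the preceding layer minus one
        self.slots = deque(maxlen=temporal_size)

    def push(self, tensor):
        self.slots.append(tensor)               # serial in, oldest dropped

    def window(self, current_tensor):
        # parallel out: all buffered tensors plus the current one,
        # oldest first, i.e. the temporal window seen by the convolution
        return list(self.slots) + [current_tensor]


# usage: a layer with temporal kernel size 2 needs a buffer of temporal size 1
buf = RollingTensorBuffer(temporal_size=1)
buf.push(np.zeros((64, 64, 32)))                # previous output tensor (H, W, C)
window = buf.window(np.ones((64, 64, 32)))      # current output tensor
print(len(window))                              # -> 2 tensors for the convolution
```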
  • each tensor is a tensor with two or more dimensions.
  • each tensor is a tensor with one or more spatial dimensions and one channel dimension.
  • the number of spatial dimensions of a tensor may be equal to the number of spatial dimensions of the data of a corresponding scan provided by the 2D or 3D sensor configured to measure distance from which the tensor originates.
  • the channel dimension corresponds to the number of channels and is greater than or equal to one (channel dimension ≥ 1).
  • the data of a scan corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions.
  • the number N of spatial dimensions may correspond to one spatial dimension, two spatial dimensions or three spatial dimensions.
  • the data with the two spatial dimensions may correspond to a bird’s eye view (BEV) representation of a point cloud.
  • the data with the three spatial dimensions may correspond to a volumetric representation of a point cloud, e.g. for a flying capable vehicle, such as a flying capable robot, drone, aircraft, etc.
  • a point cloud with three spatial dimensions may be produced by a 2D or 3D sensor configured to measure distance, such as a LIDAR sensor, wherein a wave or a beam, such as a light beam, bounces back from an obstacle in the environment of the 2D or 3D sensor to produce a point with three spatial dimensions with its location in meters and a scalar reflection brightness attribute.
  • a point cloud comprising a plurality, e.g. thousands, of points with three spatial dimensions and a scalar reflection brightness attribute may correspond to a tensor with four dimensions. The four dimensions correspond to the three spatial dimensions and one channel dimension for the scalar reflection brightness attribute.
  • each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension.
  • each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension.
  • each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
  • the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
  • the device is configured to generate output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, wherein the device is configured to input, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
  • ego-vehicle may be understood as a mobile platform, in particular a mobile robotic platform, bearing one or more sensors, such as one or more 2D or 3D sensors configured to measure distance, that computes, respectively operates, and from the perspective of which the world, respectively the environment, is perceived.
  • the ego-vehicle may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
  • the ego-vehicle may comprise a localization unit configured to determine the current location of the ego-vehicle and, thus, the current location data of the ego-vehicle.
  • the localization unit may be configured for a short-term localization, e.g. at the scope of 1s to 10s.
  • the localization unit may comprise or correspond to one or more inertial measurement units (IMU).
  • the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
  • An odometry process is a process of understanding an ego location, i.e. the location of the ego-vehicle, based on sensory (e.g. wheel, inertial) information.
  • the term “local navigational frame” may be understood as a coordinate system tied to the ground.
  • the local navigational frame may correspond to a 2-dimensional coordinate system with top-down view on the ground surface (as shown in Figure 7).
  • the grid of the local navigational frame is a regular grid composed of a plurality of cells.
  • the term “local navigational coordinate frame ” may be used as a synonym for the local navigational frame.
  • the device is configured to zero-pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
  • the grid of the local navigational frame is a regular grid composed of a plurality of cells and in case the ego-vehicle moves within the same cell of the grid, there is no padding and cropping performed by the device.
  • the device does not perform padding and cropping.
  • the device is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
  • the device is configured to store the location data in a location field of one or more of the one or more first buffers and the second buffer, and update the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
  • the grid of the local navigational frame is a regular grid composed of a plurality of cells and in case the ego-vehicle moves within the same cell of the grid, there is no updating of the location field of the respective one or more buffers.
  • in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan, but the current location data and the previous location data of the ego-vehicle lie, respectively are located, within the same cell of the grid, the device does not update the location field of the respective one or more buffers with the current location data.
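  • The padding and cropping can be pictured as shifting every buffered tensor so that it stays registered to the grid of the local navigational frame when the ego-vehicle enters a new cell. A minimal sketch under the assumption of a bird's eye view tensor of shape (H, W, C) and a displacement given in whole grid cells (all names are illustrative):

```python
import numpy as np


def realign_buffered_tensor(tensor, shift_cells):
    """Zero-pad and crop a buffered tensor of shape (H, W, C) so that it stays
    aligned with the local navigational grid after the ego-vehicle has moved
    by shift_cells = (rows, cols) whole grid cells. The same operation would
    be applied to the newest tensor stored in every buffer."""
    dy, dx = shift_cells
    h, w, _ = tensor.shape
    out = np.zeros_like(tensor)                       # zero padding by default
    src_y, dst_y = slice(max(0, dy), min(h, h + dy)), slice(max(0, -dy), min(h, h - dy))
    src_x, dst_x = slice(max(0, dx), min(w, w + dx)), slice(max(0, -dx), min(w, w - dx))
    out[dst_y, dst_x, :] = tensor[src_y, src_x, :]    # crop the overlapping region
    return out


# usage: the ego-vehicle moved by one cell along the first grid axis
shifted = realign_buffered_tensor(np.random.rand(128, 128, 8), shift_cells=(1, 0))
```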
  • the device is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
  • the device is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
  • the device is configured to generate on the basis of the data of the current scan the current tensor input to the CNN by performing a voxelization.
  • the data of the current scan corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions.
  • the device may be configured to generate on the basis of the point cloud of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the point cloud of the current scan into the local navigational frame.
  • the current tensor input to the CNN may correspond to an encoded point cloud.
  • An encoded point cloud may be understood as the result of the transformation of a raw respectively unordered point cloud into a voxelized respectively ordered format.
  • Voxelization is a transformation of an unordered set of points, such as the unordered points of a point cloud into a regular grid.
  • voxelization is a transformation of an unordered set of points with N spatial dimensions, such as the unordered points of a point cloud with N spatial dimensions, into a regular N-dimensional grid.
  • Voxelization may be performed by pillar encoding or voxel feature encoding.
  • the number N of spatial dimensions may correspond to one spatial dimension, two spatial dimensions or three spatial dimensions.
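  • A minimal voxelization sketch, assuming a point cloud of shape (N, 4) with three spatial coordinates and one reflectance attribute and a simple mean-reflectance encoding per voxel (this stands in for pillar encoding or voxel feature encoding; the grid parameters and names are illustrative assumptions):

```python
import numpy as np


def voxelize(points, grid_shape=(128, 128, 16), cell_size=0.5, origin=(0.0, 0.0, 0.0)):
    """Transform an unordered point cloud of shape (N, 4) = (x, y, z, reflectance)
    into a regular grid: a tensor with three spatial dimensions and one channel
    dimension holding the mean reflectance of the points falling into each voxel."""
    grid = np.zeros(grid_shape + (1,), dtype=np.float32)
    counts = np.zeros(grid_shape, dtype=np.int32)
    idx = np.floor((points[:, :3] - np.asarray(origin)) / cell_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    for (ix, iy, iz), refl in zip(idx[inside], points[inside, 3]):
        grid[ix, iy, iz, 0] += refl
        counts[ix, iy, iz] += 1
    grid[..., 0] = grid[..., 0] / np.maximum(counts, 1)   # mean reflectance per voxel
    return grid                                            # encoded point cloud (X, Y, Z, 1)


# usage: 1000 random points with a reflectance attribute
cloud = np.random.rand(1000, 4) * np.array([64.0, 64.0, 8.0, 1.0])
print(voxelize(cloud).shape)                               # -> (128, 128, 16, 1)
```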
  • the device is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the device is configured to perform a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the device is configured to perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the device is configured to detect weak target objects in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • a weak target object is an object with low spatial information but significant temporal information.
  • a weak target object is a moving object, such as a car, pedestrian including a child, bicyclist including a child, motorcyclist, etc., that has few points, in particular between 1 and 5 points, more particularly between 1 and 10 points, detectable by the 2D or 3D sensor.
  • a weak target object is an object that has few LIDAR echo points, in particular between 1 and 5 LIDAR echo points, more particularly between 1 and 10 LIDAR echo points.
  • In fog, rain and snow, as well as in a case of heavy occlusions, even objects that are normally well detectable or visible (by the 2D or 3D sensor, such as a LIDAR sensor) may become weak target objects.
  • a weak target object may be defined as a moving object, e.g. from a list of known object categories (classes), that has between 1 and 5, in particular between 1 and 10, LIDAR echo points falling on it.
  • a second aspect of the present disclosure provides a hardware implementation of a Convolutional Neural Network (CNN) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • the CNN comprises a first layer and at least one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer.
  • the hardware implementation of the CNN is configured to input data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN.
  • the hardware implementation of the CNN is configured to perform, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer.
  • the previous output tensor of the preceding layer is the newest tensor stored in the respective first buffer of the one or more first buffers and originates from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the hardware implementation of the CNN is configured to store the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
  • the CNN comprises one or more optional additional layers, wherein the hardware implementation of the CNN is configured to perform, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN.
  • the CNN comprises a second buffer for storing a tensor input to the CNN
  • the hardware implementation of the CNN is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan.
  • the previous tensor is the newest tensor stored in the second buffer and corresponds to the data of the previous scan.
  • the hardware implementation of the CNN is configured to store the current tensor as the newest tensor in the second buffer.
  • each buffer is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.
  • the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
  • one or more buffers have a different temporal size.
  • each tensor is a tensor with two or more dimensions.
  • each tensor is a tensor with one or more spatial dimensions and one channel dimension.
  • each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension.
  • each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension.
  • each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
  • the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
  • the hardware implementation of the CNN is configured to generate output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, wherein the hardware implementation of the CNN is configured to input, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
  • the hardware implementation of the CNN is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
  • the hardware implementation of the CNN is configured to store the location data in a location field of one or more of the one or more first buffers and the second buffer, and update the location field of the respective one or more buffers with the current location data in case the current location data of the ego- vehicle do not match the previous location data.
  • the hardware implementation of the CNN is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
  • the hardware implementation of the CNN is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
  • the hardware implementation of the CNN is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the hardware implementation of the CNN is configured to perform a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the hardware implementation of the CNN is configured to perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the hardware implementation of the CNN of the second aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features.
  • the implementation forms and optional features of the device according to the first aspect are correspondingly valid for the hardware implementation of the CNN according to the second aspect.
  • a third aspect of the present disclosure provides an ego-vehicle comprising one or more 2D or 3D sensors configured to measure distance, and a device according to the first aspect or any implementation form thereof.
  • the one or more 2D or 3D sensors are configured to provide scans containing spatial information of the vicinity of the ego-vehicle in the form of a data stream to the device and the device is configured to process the data stream.
  • ego-vehicle may be understood as a mobile platform, in particular a mobile robotic platform, bearing one or more sensors, such as one or more 2D or 3D sensors configured to measure distance, that computes, respectively operates, and from the perspective of which the world, respectively the environment, is perceived.
  • the ego-vehicle may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
  • the ego-vehicle may comprise a localization unit configured to determine the current location of the ego-vehicle and, thus, the current location data of the ego-vehicle.
  • the localization unit may be configured for a short-term localization, e.g. at the scope of 1s to 10s.
  • the localization unit may comprise or correspond to one or more inertial measurement units (IMU).
  • the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
  • the one or more 2D or 3D sensors configured to measure distance may each comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the one or more 2D or 3D sensors may each comprise or correspond to one or more visual- depth-capable sensors.
  • the 2D or 3D sensor may provide scans at a frame rate of at least 10 to 20 frames per second.
  • the term “2D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with two spatial dimensions (2-dimensional scans respectively frames) and to measure distance.
  • the term “3D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with three spatial dimensions (3-dimensional scans respectively frames) and to measure distance.
  • the device is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
  • the ego-vehicle of the third aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features.
  • a fourth aspect of the present disclosure provides a method of employing a Convolutional Neural Network, CNN, in an inference phase, for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • the CNN comprises a first layer and at least one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer.
  • the method comprises the steps of inputting data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN, performing, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer, the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer of the one or more first buffers and originating from data of a previous scan inputted to the CNN directly before the data of the current scan, and storing the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
  • the CNN comprises one or more optional additional layers
  • the method comprises the step of performing, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN.
  • the CNN comprises a second buffer for storing a tensor input to the CNN
  • the method comprises the steps of performing, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan, the previous tensor being the newest tensor stored in the second buffer and corresponding to the data of the previous scan, and storing the current tensor as the newest tensor in the second buffer.
  • each buffer is a serial-in parallel-out buffer and the method comprises the steps of storing a new tensor as newest tensor, and in case the buffer is full, simultaneously dropping the oldest tensor stored in the buffer.
  • the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
  • one or more buffers have a different temporal size.
  • each tensor is a tensor with two or more dimensions.
  • each tensor is a tensor with one or more spatial dimensions and one channel dimension.
  • each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension.
  • each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension.
  • each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
  • the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
  • output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged are generated, wherein the method comprises the steps of inputting, together with the data of the current scan, current location data of the ego- vehicle in a grid of a local navigational frame to the CNN, and padding and cropping in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
  • the method comprises the step of controlling autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
  • the method comprises the steps of storing the location data in a location field of one or more of the one or more first buffers and the second buffer, and updating the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
  • the method comprises the step of generating on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
  • the method comprises the step of generating on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
  • the method comprises the step of detecting targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the method comprises the step of performing a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the method comprises the step of performing a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • a fifth aspect of the present disclosure provides a computer program comprising program code for performing the method according to the fourth aspect or any of its implementation forms.
  • the fifth aspect of the present disclosure provides a computer program comprising program code for performing, when implemented on a processor, the method according to the fourth aspect or any of its implementation forms.
  • a sixth aspect of the present disclosure provides a computer program product comprising program code for performing, when implemented on a processor, a method according to the fourth aspect or any implementation form thereof.
  • a seventh aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the fourth aspect or any of its implementation forms to be performed.
  • An eighth aspect of the present disclosure provides a computer comprising a memory and a processor, which are configured to store and execute program code to perform a method according to the fourth aspect or any implementation form thereof.
  • the memory may be distributed over a plurality of physical devices.
  • a plurality of processors that co-operate in executing the program code may be referred to as a processor. It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities.
  • Figure 1 shows an example of an operation of a Convolutional Neural Network (CNN) at a current time point ti.
  • Figure 2 shows a device according to an embodiment of the invention and an ego-vehicle according to an embodiment of the invention.
  • Figure 3A shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 3B shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 4 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 5 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 6 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 7 shows a grid of a local navigational frame with a location of an ego-vehicle according to an embodiment of the invention for two directly subsequent time points.
  • Figure 8 shows a diagram of a method according to an embodiment of the invention with respect to storing tensors originating from data of a current scan of a current time point ti in buffers of a Convolutional Neural Network (CNN) according to an embodiment of the invention.
  • Figure 9 shows a diagram of a method of employing a Convolutional Neural Network (CNN) in an inference phase according to an embodiment of the invention for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • Figure 2 shows on the left side a device 1 according to an embodiment of the invention and on the right side an ego-vehicle 5 according to an embodiment of the invention.
  • the device 1 comprises a Convolutional Neural Network (CNN) 3 and is configured to employ the CNN in an inference phase.
  • the device 1 is configured to receive and process a data stream of scans containing spatial information provided by a 2D or 3D sensor 2 configured to measure distance.
  • the CNN 3 of the device 1 comprises a first layer and one or more further layers following the first layer (not shown in Figure 2).
  • the CNN 3 of the device 1 further comprises one or more first buffers 4a for storing an output tensor of a respective preceding layer.
  • the device 1 is configured to input data of a current scan, which is provided by the 2D or 3D sensor 2, in the form of a current tensor into the CNN 3, perform, at each further layer of the CNN 3, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer, the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer 4a of the one or more first buffers 4a and originating from data of a previous scan inputted to the CNN 3 directly before the data of the current scan, and store the current output tensor of the preceding layer as the newest tensor in the respective first buffer 4a.
  • Embodiments of the CNN 3, in particular an operation of embodiments of the CNN 3, are shown in the Figures 3A, 3B, 4, 5 and 6.
  • the device 1 of Figure 2 makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by the 2D or 3D sensor 2. Since the CNN 3 comprises one or more first buffers 4a for storing the output tensor of a respective preceding layer, the need for re-computation of one or more previous tensors, in particular output tensors, of respective preceding layers for performing one or more convolutional operations at each further layer at a current time point is overcome.
  • the device 1 is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer originating from the data of the current scan and the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer 4a.
  • the inference time of the CNN 3 in the inference phase is reduced.
  • the device 1 and the 2D or 3D sensor 2 may be part of the ego-vehicle 5. That is, the ego-vehicle 5 comprises the 2D or 3D sensor 2 and the device 1. The ego-vehicle 5 may also comprise more than one 2D or 3D sensor 2. Therefore, the ego-vehicle 5 comprises one or more 2D or 3D sensors 2 configured to measure distance and the device 1, wherein the one or more 2D or 3D sensors 2 are configured to provide scans containing spatial information of the vicinity respectively environment of the ego-vehicle 5 in the form of a data stream to the device 1 and the device 1 is configured to process the data stream.
  • the one or more 2D or 3D sensors 2 configured to measure distance may each comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the one or more 2D or 3D sensors 2 may each comprise or correspond to one or more visual- depth-capable sensors. The one or more 2D or 3D sensors 2 may provide scans at a frame rate of at least 10 to 20 frames per second.
  • the ego-vehicle 5 may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
  • the ego-vehicle 5 may comprise a localization unit (not shown in Figure 2) configured to determine the current location of the ego-vehicle 5 and, thus, the current location data of the ego-vehicle 5.
  • the localization unit may be configured for a short-term localization, e.g. on a time scale of 1 s to 10 s.
  • the localization unit may comprise or correspond to one or more inertial measurement units (IMU).
  • the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
  • Figure 3A shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 3A.
  • the CNN 3 of Figure 3A may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 comprises a first layer L1, one further layer L2a and one first buffer 4a, wherein the further layer L2a is a consecutive layer following the first layer L1. That is, according to the embodiment of Figure 3A, the first layer L1 is the preceding layer of the further layer L2a. As described already above, the CNN 3 may comprise more than one further layer. In addition or alternatively, the CNN 3 may comprise one or more optional additional layers (not shown in Figure 3A).
  • one or more convolutional operations are performed on the basis of a current output tensor a1(ti) of the preceding layer L1 (which is the first layer L1), wherein the current output tensor a1(ti) originates from the data of the current scan of the current time point ti, and a previous output tensor a1(ti-1) of the preceding layer L1, wherein the previous output tensor a1(ti-1) of the preceding layer L1 is the newest tensor stored in the first buffer 4a and originates from data of a previous scan (not shown in Figure 3A) inputted to the CNN directly before the data of the current scan.
  • the previous scan is inputted at a previous time point ti-1 directly before the current time point ti and, thus, the previous scan may be referred to as the scan of the previous time point ti-1.
  • the current output tensor a1(ti) of the preceding layer L1 may be stored as the newest tensor in the first buffer 4a (not shown in Figure 3A). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent output tensor a1(ti+1) of the preceding layer L1 originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current output tensor a1(ti) stored as the newest tensor in the first buffer 4a at the directly subsequent time point ti+1.
  • the CNN 3 of Figure 3A makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by the 2D or 3D sensor 2. Since the CNN 3 comprises the first buffer 4a for storing the output tensor of the first layer L1, the need for re-computation of a previous output tensor a1(ti-1) of the first layer L1 for performing one or more convolutional operations at the further layer L2a at a current time point ti is overcome.
  • one or more convolutional operations are performed on the basis of the current output tensor a1(ti) of the first layer L1 originating from the data of the current scan and the previous output tensor a1(ti-1) of the first layer L1 being the newest tensor stored in the first buffer 4a.
  • the inference time of the CNN 3 in the inference phase is reduced.
  • the temporal size of the first buffer 4a equals one (“1”), because the first buffer 4a may store the output tensor of the first layer L1 for only one time point.
  • the previous output tensor a1(ti-1) of the preceding layer L1, which was stored as the newest tensor in the first buffer 4a at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti.
  • the temporal size of the first buffer 4a is one less than the temporal size of a convolutional kernel of the first layer L1.
  • the temporal size of the convolutional kernel of the first layer L1 may correspond to two (“2”).
  • the temporal size of the first buffer may alternatively correspond to a number greater than one (“1”) and, thus, the first buffer may be configured to store the output tensor of the preceding layer L1 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
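As a small worked example of the relation stated above (the temporal size of a first buffer is one less than the temporal size of the convolutional kernel of the respective preceding layer), the following lines compute buffer sizes from assumed kernel sizes; apart from the kernel temporal size of two for L1, the values are illustrative assumptions.

```python
# Temporal size of each first buffer = kernel temporal size of the preceding layer - 1.
kernel_temporal_sizes = {"L1": 2, "L2a": 3}                       # "2" as for Figure 3A; "3" is an assumption
buffer_temporal_sizes = {k: v - 1 for k, v in kernel_temporal_sizes.items()}
print(buffer_temporal_sizes)                                      # {'L1': 1, 'L2a': 2}
```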
  • Figure 3B shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the CNN 3 of Figure 3B may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 of Figure 3B differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 3B comprises a second buffer 4b for storing a tensor input to the CNN 3. Therefore, the description of the CNN 3 of Figure 3A is also valid for the CNN 3 of Figure 3B.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 3B.
  • the additional feature(s) of the CNN 3 of Figure 3B respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 3B are described.
  • the CNN 3 comprises, besides the first layer L1, the one further layer L2a and the first buffer 4a, a second buffer 4b for storing a tensor input to the CNN 3.
  • at the first layer L1, one or more convolutional operations are performed on the basis of the current tensor epc(ti) originating from the data of the current scan of the current time point ti and a previous tensor epc(ti-1), generating a current output tensor a1(ti) of the first layer L1 originating from the data of the current scan, wherein the previous tensor epc(ti-1) is the newest tensor stored in the second buffer 4b at the time point ti-1 and corresponds to the data of the previous scan inputted to the CNN 3 at the time point ti-1 directly before the data of the current scan input to the CNN 3 at the current time point ti.
  • the current tensor epc(ti) may be stored as the newest tensor in the second buffer 4b (not shown in Figure 3B). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent tensor epc(ti+1) originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current tensor epc(ti) stored as the newest tensor in the second buffer 4b at the directly subsequent time point ti+1.
  • the CNN 3 of Figure 3B makes it possible to further reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN 3 comprises a second buffer 4b for storing the tensor input to the CNN 3, the need for re-computation of a previous tensor epc(ti-1) input to the CNN 3 on the basis of the corresponding data of the corresponding previous scan for performing one or more convolutional operations at the first layer L1 at the current time point ti is overcome.
  • one or more convolutional operations are performed on the basis of the current tensor epc(ti) and the previous tensor epc(ti-1) being the newest tensor stored in the second buffer 4b.
  • the temporal size of the second buffer 4b equals one (“1”), because the second buffer 4b may store the tensor input to the CNN 3 for only one time point respectively for one scan.
  • the previous tensor epc(ti-1), which was stored as the newest tensor in the second buffer 4b at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti.
  • the temporal size of the second buffer 4b may alternatively correspond to a number greater than one (“1”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
  • the first buffer 4a and the second buffer 4b may have a different temporal size. Alternatively, they may have the same temporal size.
  • Figure 4 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the CNN 3 of Figure 4 may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 of Figure 4 differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 4 comprises a first buffer 4a with a temporal size of two (“2”). Therefore, the description of the CNN 3 of Figure 3A is also valid for the CNN 3 of Figure 4.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 4.
  • the additional feature(s) of the CNN 3 of Figure 4 respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 4 are described.
  • the temporal size of the first buffer 4a equals two (“2”), because the first buffer 4a may store the output tensor of the first layer L1 (which is the preceding layer of the further layer L2a) for two directly subsequent time points respectively for two directly subsequent scans.
  • the first buffer 4a stores as the newest tensor the previous output tensor a1(ti-1) of the first layer L1 originating from data of a previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data of the current scan, which are input at the current time point ti to the CNN 3 in the form of the current tensor epc(ti), and the second previous output tensor a1(ti-2) of the first layer L1 originating from data of a second previous scan inputted to the CNN 3 at the second previous time point ti-2 directly before the previous time point ti-1.
  • the second previous output tensor a1(ti-2) of the first layer L1 is simultaneously dropped respectively deleted at the current time point ti (This is indicated in the dashed box on the left side of Figure 4).
  • the temporal size of the first buffer 4a is only two and, thus, the first buffer 4a may store the output tensor of the first layer L1 for only two directly subsequent time points, such as for the two directly subsequent time points ti-2 and ti-1 or ti-1 and ti.
  • the temporal size of the first buffer may alternatively correspond to a number greater than two (“2”) and, thus, the first buffer may be configured to store the output tensor of the first layer L1 for more than two time points, i.e. for three or more directly subsequent time points, respectively for more than two scans, i.e. for three or more directly subsequent scans.
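A first buffer of temporal size two behaves like a serial-in parallel-out register of length two: storing the newest tensor simultaneously drops the oldest one once the buffer is full. The following illustrative sketch uses Python's deque; the placeholder strings stand in for the stored output tensors.

```python
from collections import deque

first_buffer_4a = deque(maxlen=2)                 # temporal size of two ("2")
for t, tensor in enumerate(["a1(t0)", "a1(t1)", "a1(t2)", "a1(t3)"]):
    first_buffer_4a.append(tensor)                # store the newest tensor; the oldest is dropped when full
    print("t%d:" % t, list(first_buffer_4a))
# after t2 the buffer holds ['a1(t1)', 'a1(t2)']; 'a1(t0)' has been dropped simultaneously
```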
  • Figure 5 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the CNN 3 of Figure 5 may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 of Figure 5 differs from the CNN 3 of Figure 4 in that the CNN 3 of Figure 5 comprises a second buffer 4b for storing a tensor input to the CNN 3. Therefore, the description of the CNN 3 of Figure 3B and 4 is also valid for the CNN 3 of Figure 5.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 5.
  • the additional feature(s) of the CNN 3 of Figure 5 respectively the differences between the CNN 3 of Figure 4 and the CNN 3 of Figure 5 are described.
  • the CNN 3 comprises, besides the first layer L1, the one further layer L2a and the first buffer 4a, a second buffer 4b for storing a tensor input to the CNN 3.
  • the temporal size of the second buffer 4b equals two (“2”), because the second buffer 4b may store a tensor that is input to the CNN 3 for two directly subsequent time points.
  • the second buffer 4b stores as the newest tensor the previous tensor epc(ti-1) corresponding to data of a previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data of the current scan, which are input at the current time point ti to the CNN 3 in the form of the current tensor epc(ti), and the second previous tensor epc(ti-2) corresponding to data of a second previous scan inputted to the CNN 3 at the second previous time point ti-2 directly before the previous time point ti-1.
  • the second previous tensor epc(ti-2) is simultaneously dropped respectively deleted at the current time point ti (This is indicated in the dashed box on the left side of Figure 5).
  • the temporal size of the second buffer 4b is only two and, thus, the second buffer 4b may store the tensor that is input to the CNN 3 for only two directly subsequent time points, such as for the two directly subsequent time points ti-2 and ti-1 or ti-1 and ti.
  • the temporal size of the second buffer 4b may alternatively correspond to a number greater than two (“2”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than two time points, i.e. for three or more directly subsequent time points.
  • Figure 6 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the CNN 3 of Figure 6 may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 of Figure 6 differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 6 comprises three further layers L2a, L2b and L2c and an optional second buffer 4b for storing a tensor input to the CNN 3 and in that an optional voxelization VX is performed. Therefore, the description of the CNN 3 of Figures 3A and 3B is also valid for the CNN 3 of Figure 6.
  • the CNN 3 comprises a first layer L1, three further layers L2a, L2b and L2c, three first buffers 4a and one optional second buffer 4b.
  • the first further layer L2a is a consecutive layer following the first layer L1
  • the second further layer L2b is a consecutive layer following the first further layer L2a
  • the third further layer L2c is a consecutive layer following the second further layer L2b. That is, according to the embodiment of Figure 6, the first layer L1 is the preceding layer of the first further layer L2a
  • the first further layer L2a is the preceding layer of the second further layer L2b
  • the second further layer L2b is the preceding layer of the third further layer L2c.
  • the CNN 3 may comprise only one further layer or only two further layers or more than three further layers.
  • the CNN 3 may comprise one or more optional additional layers (not shown in Figure 6).
  • one or more of the one or more optional additional layers may be arranged between further layers and/or between the first layer and the first further layer.
  • data pc(ti) of a current scan provided by a 2D or 3D sensor configured to measure distance are input in the form of a current tensor epc(ti) of the current time point ti into the CNN 3.
  • the current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan by optionally performing a voxelization VX.
  • the current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan and the current location data of an ego-vehicle, such as the ego-vehicle 5 of Figure 2, in the grid of a local navigational frame by performing a voxelization VX, in case the 2D or 3D sensor providing the data pc(ti) of the current scan is arranged on the ego-vehicle.
  • the data of the current scan pc(ti) corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions.
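The voxelization VX mentioned above can be pictured as binning the points of the point cloud into the cells of a regular grid, yielding a tensor of shape [C, H, W]. The following sketch is a simplified illustration with only two spatial point dimensions; the ROI extent, cell size, channel layout and all names are assumptions of this sketch and not the claimed voxelization.

```python
# Illustrative voxelization: map points (x, y, attribute) into grid cells.
import numpy as np

def voxelize(points, roi_size=140.0, cell_size=0.5, origin=(0.0, 0.0)):
    h = w = int(roi_size / cell_size)                   # e.g. 280 x 280 cells
    epc = np.zeros((2, h, w), dtype=np.float32)         # C = 2 channels: point count, summed attribute
    ix = ((points[:, 0] - origin[0]) / cell_size).astype(int)
    iy = ((points[:, 1] - origin[1]) / cell_size).astype(int)
    ok = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)    # keep only points inside the ROI
    np.add.at(epc[0], (iy[ok], ix[ok]), 1.0)            # occupancy / point count per cell
    np.add.at(epc[1], (iy[ok], ix[ok]), points[ok, 2])  # e.g. summed reflection brightness
    return epc

pc = np.random.rand(1000, 3) * [140.0, 140.0, 1.0]      # M x 3: x, y, attribute (simplified point cloud)
print(voxelize(pc).shape)                               # (2, 280, 280)
```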
  • at each further layer Lj (j may be 2a, 2b or 2c), one or more convolutional operations are performed on the basis of a current output tensor aw(ti) of the preceding layer Lw (w may be 1, 2a or 2b), wherein the current output tensor aw(ti) originates from the data pc(ti) of the current scan of the current time point ti, and a previous output tensor aw(ti-1) of the preceding layer Lw, wherein the previous output tensor aw(ti-1) of the preceding layer Lw is the newest tensor stored in the respective first buffer 4a and originates from data of a previous scan (not shown in Figure 6) inputted to the CNN directly before the data of the current scan.
  • the previous scan is inputted at a previous time point ti-1 directly before the current time point ti and, thus, the previous scan may be referred to as the scan of the previous time point ti-1.
  • the current output tensor aw(ti) of the preceding layer Lw may be stored as the newest tensor in the corresponding first buffer 4a (shown in the dashed box on the left side of Figure 6). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, at each further layer Lj (j may be 2a, 2b or 2c) one or more convolutional operations are performed on the basis of the directly subsequent output tensor aw(ti+1) of the preceding layer Lw originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current output tensor aw(ti) stored as the newest tensor in the corresponding first buffer 4a at the directly subsequent time point ti+1.
  • the temporal size of the first buffers 4a equals one (“1”), because the first buffers 4a may store the output tensor of the corresponding layer (preceding layer) for only one time point respectively one scan.
  • the current output tensor aw(ti) of the corresponding layer Lw (preceding layer) is stored as the newest tensor in the corresponding first buffer 4a
  • the temporal size of a first buffer 4a is one less than the temporal size of a convolutional kernel of the corresponding layer (preceding layer).
  • the temporal size of the convolutional kernel of the layers L1, L2a and L2b may correspond to two (“2”).
  • the temporal size of one or more first buffers 4a may alternatively correspond to a number greater than one (“1”) and, thus, the one or more first buffers 4a may be configured to store the output tensor of a corresponding preceding layer for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
  • one or more convolutional operations are performed on the basis of the current tensor epc(ti) originating from the data pc(ti) of the current scan of the current time point ti and a previous tensor epc(ti-1) generating a current output tensor a1(ti) of the first layer L1 originating from the data pc(ti) of the current scan.
  • the previous tensor epc(ti-1) is the newest tensor stored in the optional second buffer 4b and corresponds to the data of the previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data pc(ti) of the current scan.
  • the current tensor epc(ti) may be stored as the newest tensor in the optional second buffer 4b (shown in the dashed box on the left of Figure 6). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent tensor epc(ti+1) originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current tensor epc(ti) stored as the newest tensor in the optional second buffer 4b at the directly subsequent time point ti+1.
  • the temporal size of the optional second buffer 4b equals one (“1”), because the optional second buffer 4b may store the tensor input to the CNN 3 for only one time point respectively one scan.
  • the previous tensor epc(ti-1), which was stored as the newest tensor in the second buffer 4b at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti (shown in the dashed box on the left of Figure 6).
  • the temporal size of the optional second buffer 4b may alternatively correspond to a number greater than one (“1”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
  • one or more of the first buffers 4a and the optional second buffer 4b may have a different temporal size.
  • the first buffers 4a and the optional second buffer 4b may have the same temporal size.
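Putting the pieces together for the Figure 6 topology, one inference step could look as in the sketch below: the optional second buffer 4b holds the previous input tensor epc(ti-1) for the first layer L1, and one first buffer 4a per further layer holds the previous output of the respective preceding layer. The placeholder convolution, the zero-initialized buffers and the class name are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

def conv_pair(prev, cur):
    return 0.5 * (prev + cur)                       # stand-in for a convolution with temporal kernel size two

class Figure6LikeCNN:
    def __init__(self, shape):
        self.second_buffer_4b = np.zeros(shape)     # previous input tensor epc(t_{i-1})
        self.first_buffers_4a = [np.zeros(shape) for _ in range(3)]  # previous outputs of L1, L2a, L2b

    def step(self, epc_ti):
        a = conv_pair(self.second_buffer_4b, epc_ti)    # first layer L1
        self.second_buffer_4b = epc_ti                  # epc(ti) becomes the newest buffered input
        for k in range(3):                              # further layers L2a, L2b, L2c
            out = conv_pair(self.first_buffers_4a[k], a)
            self.first_buffers_4a[k] = a                # store the current output of the preceding layer
            a = out
        return a                                        # output tensor of the CNN for the current scan

net = Figure6LikeCNN(shape=(4, 16, 16))
out_ti = net.step(np.random.rand(4, 16, 16))
```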
  • Figure 7 shows a grid G of a local navigational frame LNF with a location of an ego-vehicle according to an embodiment of the invention for two directly subsequent time points.
  • the ego-vehicle (not shown in Figure 7) may correspond to the ego-vehicle 5 of Figure 2.
  • the above description with respect to the device according to the first aspect of the invention and its implementation forms and the above description of the ego-vehicle according to the third aspect of the invention and its implementation forms are correspondingly valid for the description of Figure 7, in particular for the description of the ego-vehicle 5 of Figure 7.
  • the ego-vehicle 5 may be configured for autonomous movement.
  • the device 1 of the ego-vehicle 5 is configured to control autonomous movement of the ego-vehicle 5 on the basis of the processing of the data stream of scans provided by the one or more 2D or 3D sensors 2 configured to measure distance and arranged on the ego-vehicle 5.
  • the local navigational frame LNF shown in Figure 7 is a coordinate system of two spatial dimensions x, y with a top-down view on the ground surface, which is relative to a position of the ego-vehicle 5. That is, the location respectively position of the ego-vehicle 5 is within the local navigational frame LNF.
  • the terms “location” and “position” may be used as synonyms.
  • the region of interest (ROI) for processing the data stream of scans provided by the one or more 2D or 3D sensors 2 is moving with the location of the ego-vehicle 5 in the local navigational frame LNF, when the ego-vehicle 5 is moving.
  • the ROI corresponds to a predefined regular grid respectively area of the local navigational frame LNF with the ego-vehicle 5 in the center wherein the regular grid is composed of a plurality of cells.
  • Figure 7 shows the grid Gi (area) of the current ROI Ri at a current time point ti and the grid Gi-1 (area) of the previous ROI Ri-1 of a previous time point ti-1 that is directly before the current time point ti.
  • the ROI may be chosen to correspond to a grid in the local navigational frame LNF of size 140 m x 140 m.
  • the cells may each be of size 0.5 m x 0.5 m.
  • the ROI may be chosen to correspond to a grid in the local navigational frame LNF of size between 70 m x 70 m and 250 m x 250 m.
  • the cells may each be of size between 0.05 m x 0.05 m and 0.5 m x 0.5 m.
  • the data of a current scan (e.g. all points of a point cloud, such as a LIDAR point cloud in case of the 2D or 3D sensor being a LIDAR sensor) are mapped into the regular grid G of the local navigational frame LNF. That is, the device 1 is configured to generate on the basis of the data of the current scan and the current location data pxi and pyi of the ego-vehicle in the grid G of the local navigational frame LNF the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame LNF.
  • the current location data pxi and pyi describe the current location pi respectively position of the ego-vehicle 5 within the grid G of the local navigational frame LNF at a current time point ti.
  • the tensors stored in the buffers of the CNN 3 correspond to the corresponding cells of the regular grid G of the local navigational frame LNF.
  • the grid Gi (area) of Figure 7 corresponds to a current tensor that is stored in a corresponding buffer of the CNN 3 as the newest tensor at the current time point ti.
  • the grid Gi corresponds to the predefined grid (area) of the current ROI Ri at the current time point ti.
  • the grid Gi is made of the plain cells without any pattern and the cells with a dotted pattern within the corresponding bold frame.
  • the grid Gi-1 of Figure 7 corresponds to a previous tensor that is stored in the same buffer of the CNN 3 as the newest tensor at a previous time point ti-1 directly before the current time point ti.
  • the grid Gi-1 corresponds to the predefined grid (area) of the previous ROI Ri-1 at the previous time point ti-1.
  • the grid Gi-1 is made of the cells with a diagonally striped pattern and the plain cells without any pattern within the corresponding bold frame.
  • the grid Gi of the current time point ti is not congruent to the grid Gi-1 of the previous time point ti-1, because the ego-vehicle 5 has moved from the previous location pi-1 of the previous time point ti-1 to the current location pi of the current time point ti. That is, at the previous time point ti-1 the ego-vehicle 5 was at the previous location pi-1 and at the current time point ti the ego-vehicle 5 is at the current position pi. As a result of this movement, the ROI changes from the previous ROI Ri-1 to the current ROI Ri.
  • the previous location pi-1 of the ego-vehicle 5 is described by the previous location data pxi-1, pyi-1 and the current location pi of the ego-vehicle 5 is described by the current location data pxi, pyi.
  • the device 1 of the ego-vehicle 5 may be configured to input, together with the data of the current scan, current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF to the CNN 3 (the CNN is not shown in Figure 7), and pad and crop (in particular zero-pad and crop) in each buffer of the CNN the newest tensor being stored, in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-1, pyi-1 inputted together with the data of the previous scan.
  • the plain cells within the bold frame correspond to the data of the newest tensor being stored in a buffer that may be re-used despite the location change of the ego-vehicle 5 and, thus, despite the change of the ROI from the previous ROI Ri-1 to the current ROI Ri.
  • the cells with the diagonally striped pattern correspond to the data of the newest tensor stored in the buffer that are cropped.
  • the terms “dropped” or “deleted” may be used as synonyms for the term “cropped”.
  • the cells with the dotted pattern correspond to the data of the newest tensor stored in the buffer that are padded, in particular that are zero-padded.
  • Padding data may be understood as overwriting the values of the data by a predefined value.
  • zero-padding data may be understood as overwriting the values of the data with zeros.
  • the device 1 of the ego-vehicle 5 may be configured to generate on the basis of the data of the current scan and the current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF the current tensor that is input to the CNN 3 by performing a change of coordinates of the data of the current scan into the local navigational frame LNF.
  • the device 1 of the ego-vehicle 5 may be configured to generate on the basis of the data of the current scan and the current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF the current tensor that is input to the CNN 3 by additionally performing a voxelization (not shown in Figure 7).
  • the location data may be stored in a location field of one or more of the one or more first buffers 4a and the optional second buffer of the CNN, and the location field of the respective one or more buffers is updated (not shown in Figure 7) with the current location data pxi, pyi, in case the current location data pxi, pyi of the ego-vehicle 5 do not match the previous location data pxi-1, pyi-1.
  • in case the ego-vehicle 5 moves within the same cell of the grid G, there is no updating of the location field of the respective one or more buffers.
  • in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-1, pyi-1 inputted together with the data of the previous scan, but the current location data pxi, pyi and the previous location data pxi-1, pyi-1 of the ego-vehicle 5 lie respectively are located within the same cell of the grid G, the location field of the respective one or more buffers is not updated with the current location data pxi, pyi.
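The pad-and-crop of a buffered tensor described for Figure 7 can be sketched as shifting the tensor by the cell offset between the previous and the current location, filling newly entered cells with zeros. The cell size, the sign convention of the shift and the function name are assumptions of this sketch.

```python
# Illustrative zero-pad and crop of the newest buffered tensor when the
# ego-vehicle moves to a different cell of the grid G.
import numpy as np

def pad_and_crop(buffered, px_prev, py_prev, px_cur, py_cur, cell_size=0.5):
    dx = int(round((px_cur - px_prev) / cell_size))   # shift in cells along x (W axis)
    dy = int(round((py_cur - py_prev) / cell_size))   # shift in cells along y (H axis)
    if dx == 0 and dy == 0:
        return buffered                               # same cell: nothing to pad or crop
    c, h, w = buffered.shape
    shifted = np.zeros_like(buffered)                 # newly entered cells are zero-padded
    src = buffered[:, max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]   # cells that stay in the ROI
    shifted[:, max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)] = src
    return shifted                                    # cells that left the ROI have been cropped
```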
  • Figure 8 shows a diagram of a method according to an embodiment of the invention with respect to storing tensors originating from data of a current scan of a current time point ti in buffers of a CNN according to an embodiment of the invention.
  • the device 1 of the ego-vehicle 5 obtains data of a current scan provided by the one or more 2D or 3D sensors 2 in the form of a current tensor of shape [C, H, W] together with current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF at a current time point ti.
  • C is the number of channels.
  • the current tensor is generated by performing a change of coordinates of the data of the current scan into the local navigational frame LNF.
  • “H” is the size of the current tensor along the y coordinate
  • “W” is the size of the current tensor along the x coordinate.
  • in a second step S82, the device 1 determines whether the current location data pxi, pyi do not match the previous location data pxi-1, pyi-1 of the ego-vehicle 5. That is, the device 1 determines whether the location of the ego-vehicle 5 has changed. In particular, the device 1 determines whether the current location data pxi, pyi do not match the previous location data pxi-1, pyi-1 of the ego-vehicle 5 and whether the current location data pxi, pyi and the previous location data pxi-1, pyi-1 do not lie within the same cell of the grid G of the local navigational frame LNF.
  • the device 1 in particular determines whether the location of the ego-vehicle 5 has changed such that the current location of the ego-vehicle is in a different cell compared to the cell of the previous location.
  • in case the determination of the second step S82 yields a “YES” (the location of the ego-vehicle has changed such that the current location is in a different cell than the previous location), the method proceeds to the third step S83.
  • in case the determination of the second step S82 yields a “NO” (the location of the ego-vehicle has not changed or the location of the ego-vehicle is still in the same cell), the method proceeds to the fifth step S85.
  • in the third step S83, the device 1 pads and crops, in particular zero-pads and crops, in each buffer of the CNN 3 the newest tensor being stored to match the current location data pxi, pyi and, thus, the current ROI Ri. That is, the tensor that is stored at the previous time point ti-1 directly before the current time point ti in a respective buffer as the newest tensor and that is still stored at the third step S83 as the newest tensor of the respective buffer is padded and cropped.
  • the device 1 updates the location field with the current location data pxi, pyi of the ego-vehicle 5 in the buffers comprising a location field.
  • in the fifth step S85, the device 1 stores in each buffer the corresponding tensor originating from the data of the current scan as the newest tensor and simultaneously drops respectively deletes the oldest tensor in the buffers that are full.
  • the device 1 stores in each first buffer 4a the output tensor of the corresponding layer (preceding layer) originating from the data of the current scan as the newest tensor.
  • in case the CNN 3 of the device 1 comprises a second buffer, the device 1 stores in the second buffer the current tensor originating from the data of the current scan as the newest tensor.
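A hedged end-to-end sketch of the Figure 8 flow is given below, reusing the pad_and_crop helper sketched above: the location change is checked (second step S82), the buffered tensors are zero-padded and cropped if the cell changed (third step S83), the location field is updated, and the tensors originating from the current scan are stored as the newest tensors while full buffers drop their oldest tensor (fifth step S85). The data structures (a dict of collections.deque buffers and a dict location field) are illustrative assumptions.

```python
def update_buffers(buffers, location_field, cur_loc, prev_loc, new_tensors, cell_size=0.5):
    """buffers: dict of collections.deque(maxlen=temporal size); new_tensors: dict of tensors per buffer."""
    # Second step S82: did the ego-vehicle move to a different cell of the grid G?
    moved_cell = (int(cur_loc[0] // cell_size) != int(prev_loc[0] // cell_size)
                  or int(cur_loc[1] // cell_size) != int(prev_loc[1] // cell_size))
    if moved_cell:
        # Third step S83: zero-pad and crop the newest tensor stored in each buffer.
        for name, buf in buffers.items():
            if buf:
                buf[-1] = pad_and_crop(buf[-1], prev_loc[0], prev_loc[1],
                                       cur_loc[0], cur_loc[1], cell_size)
        # Update the location field with the current location data (step between S83 and S85).
        location_field["px"], location_field["py"] = cur_loc
    # Fifth step S85: store the tensor originating from the current scan as the newest tensor;
    # deque(maxlen=...) simultaneously drops the oldest tensor once a buffer is full.
    for name, buf in buffers.items():
        buf.append(new_tensors[name])
    return buffers, location_field
```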
  • Figure 9 shows a diagram of a method of employing a Convolutional Neural Network (CNN) in an inference phase according to an embodiment of the invention for processing a data stream of scans containing spatial information provided by one or more 2D or 3D sensors configured to measure distance.
  • the method of Figure 9 may be performed by the device 1 of Figure 2.
  • the above description of the device according to the first aspect and its implementation forms, the above description of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the method of Figure 9.
  • the method step S101 and the optional method step S102 may be performed on the basis of data pc(ti) of a current scan provided by a 2D or 3D sensor configured to measure distance, such as a LIDAR sensor, at a current time point ti.
  • the 2D or 3D sensor is arranged on an ego-vehicle, such as the ego-vehicle 5 of Figure 2.
  • the current tensor epc(ti), that is input to the Convolutional Neural Network 3 (CNN), is generated on the basis of the data pc(ti) of the current scan and the current location data pxi, pyi of the ego-vehicle in a grid of a local navigational frame, such as the local navigational frame of Figure 7, by performing a change of coordinates of the data pc(ti) into the local navigational frame.
  • the current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan and the current location data pxi, pyi of the ego-vehicle in the grid of the local navigational frame by additionally performing the voxelization of the optional method step S102.
  • the CNN 3 is configured to generate on the basis of the current tensor epc(ti) the current output data OUT(ti), on the basis of which a navigation process of the ego-vehicle may be performed.
  • the CNN 3 shown in Figure 9 may correspond to any CNN 3 of Figures 2, 3A, 3B, 4, 5 and 6.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 9.
  • the current output data OUT(ti) may be decoded and a non-maximum suppression may be performed thereon.
  • detected object boxes may be provided.
  • the size of the data pc(ti) being a point cloud is “M * 4”, wherein “M” is the number of points in the point cloud and the “4” indicates that each point in the point cloud comprises three spatial dimensions (point cloud with three spatial dimensions) and one attribute, such as a scalar reflection brightness attribute.
  • the size of the current location data pxi, pyi is two (“2”) because the location of the ego-vehicle in the local navigational frame, being a 2-dimensional coordinate system according to the embodiment of Figure 9, is described by two coordinates (x and y coordinate).
  • the size of the current tensor epc(ti) being an encoded point cloud is “C*H*W”.
  • “C” is the number of channels of the current tensor epc(ti) (number of channels in the encoded point cloud), wherein C is greater than or equal to one (C ≥ 1).
  • “H” is the size of the current tensor epc(ti) along the y coordinate of the local navigational frame and
  • “W” is the size of the current tensor epc(ti) along the x coordinate of the local navigational frame.
  • “H” and “W” define the spatial resolution of the area of the region of interest (ROI) of the grid of the local navigational frame.
  • “H” and “W” may each be 280 cells of the grid of the local navigational frame, such that the ROI corresponds to an area of 140 m x 140 m, in case the cells of the regular grid of the local navigational frame each are of size 0.5 m x 0.5 m.
  • the size of the output data OUT(ti) is “C2*H*W”, wherein “H” and “W” are described as above.
  • C2 is the number of channels of the output data OUT(ti) that may be the same or different than the number of channels (“C”) of the current tensor epc(ti).
  • “B” is the number of detected objects and “S” is the size of the metadata related to one object.
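The data shapes along the Figure 9 pipeline can be summarized in the small sketch below: the point cloud of size M * 4 and the location data of size 2 are encoded into the tensor of size C*H*W, the CNN produces output data of size C2*H*W, and decoding plus non-maximum suppression yields B detected boxes with S metadata values each. Apart from H = W = 280, all concrete numbers are illustrative, and the CNN, the decoding and the NMS are stubs; only the shapes are meant to match the text.

```python
import numpy as np

M, C, C2, H, W = 20000, 8, 16, 280, 280          # M, C, C2 are example values; H, W as in the text

pc_ti = np.random.rand(M, 4)                     # point cloud: M points x (3 spatial dims + 1 attribute)
p_ti = np.random.rand(2)                         # current location data (pxi, pyi), size 2

epc_ti = np.random.rand(C, H, W)                 # stand-in for the voxelized, encoded point cloud (C*H*W)
out_ti = np.random.rand(C2, H, W)                # stand-in for the CNN output data OUT(ti) (C2*H*W)

def decode_and_nms(out):
    """Stub for decoding and non-maximum suppression, returning B boxes of metadata size S."""
    B, S = 5, 7                                  # illustrative values for the number of boxes and metadata size
    return np.zeros((B, S))

boxes = decode_and_nms(out_ti)                   # B*S detected object boxes
print(pc_ti.shape, epc_ti.shape, out_ti.shape, boxes.shape)
```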

Abstract

The present invention relates to a device (1) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor (2) configured to measure distance, wherein the device (1) is configured to employ a Convolutional Neural Network (3), CNN, in an inference phase. The CNN comprises a first layer (L1) and one or more further layers (L2a, L2b) following the first layer (L1) and one or more first buffers (4a) for storing an output tensor of a respective preceding layer (L1; L2). The device (1) is further configured to input data of a current scan provided by the 2D or 3D sensor (2) in the form of a current tensor (epc(ti)) into the CNN (3); and perform, at each further layer (L2a; L2b), one or more convolutional operations on the basis of a current output tensor (a1(ti); a2(ti)) of the preceding layer (L1; L2a) originating from the data of the current scan and a previous output tensor (a1(ti-1); a2(ti-1)) of the preceding layer (L1, L2a). The previous output tensor (a1(ti-1), a2(ti-1)) of the preceding layer (L1, L2a) is the newest tensor stored in the respective first buffer (4a) of the one or more first buffers (4a) and originates from data of a previous scan inputted to the CNN directly before the data of the current scan. Furthermore, the device (1) is configured to store the current output tensor (a1(ti), a2(ti)) of the preceding layer (L1, L2a) as the newest tensor in the respective first buffer (4a).

Description

PROCESSING A DATA STREAM OF SCANS CONTAINING SPATIAL INFORMATION PROVIDED BY A 2D OR 3D SENSOR CONFIGURED TO MEASURE DISTANCE BY USING A CONVOLUTIONAL NEURAL NETWORK
(CNN)
TECHNICAL FIELD
The present disclosure relates to a device for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, wherein the device is configured to employ a Convolutional Neural Network (CNN) in an inference phase. The present disclosure further relates to an ego-vehicle comprising such a device and one or more 2D or 3D sensors configured to measure distance. The present disclosure furthermore relates to a hardware implementation of a Convolutional Neural Network (CNN) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Moreover, the present disclosure relates to a method of employing a Convolutional Neural Network (CNN) in an inference phase for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Additionally, the present disclosure relates to a computer program comprising program code for performing such a method, to a non-transitory storage medium storing executable program code which, when executed by a processor, causes such a method to be performed, and to a computer comprising a memory and a processor, which are configured to store and execute program code to perform such a method.
BACKGROUND
A data stream of scans containing spatial information provided by a 2D or 3D distance sensor can be processed by means of a neural network. The spatial information is spatial information about an environment or vicinity of the 2D or 3D sensor. Objects in the environment of the 2D or 3D sensor can be detected on the basis of a processing result of the processing of the data stream. For example, the 2D or 3D sensor may be installed on an ego-vehicle. In this case, the spatial information provided by the 2D or 3D sensor corresponds to spatial information about the environment of the ego-vehicle and, thus, on the basis of the processing result of the processing of the data stream, objects such as cars, pedestrians, bicyclists, motorcyclists etc. in the environment of the ego-vehicle may be detected.
The term “ego vehicle ” refers to a vehicle that is equipped with one or more sensors (e.g. one or more 2D or 3D distance sensors) for sensing an environment of the vehicle and which operates based on data from those sensors and not necessarily based on any other data about its environment. In other words, an ego vehicle operates based on its own “view” of its environment.
SUMMARY
Embodiments of the invention are also based on the following considerations made by the inventors:
In order to process a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, a Convolutional Neural Network (CNN) may be used.
Figure 1 shows an example of an operation of a Convolutional Neural Network 3 (CNN) at a current time point ti.
The term “present” may be used as a synonym for the term “current”. That is, e.g. the term “present time point” may be used as a synonym for the term “current time point”.
The CNN 3 according to Figure 1 comprises four layers L1, L2a, L2b and L2c, wherein at each layer one or more convolutional operations are performed.
At a current time point ti, data pc(ti) of a current scan are provided by the 2D or 3D sensor and one or more convolutional operations are performed at each layer of the CNN 3 on the basis of a current tensor originating from the data pc(ti) of the current scan (scan provided at time point ti) and a previous tensor originating from data pc(ti-1) of a previous scan (scan provided at time point ti-1) inputted to the CNN 3 directly before the data of the current scan, wherein the current tensor and the previous tensor are provided to the respective layer.
The one or more convolutional operations of a layer are indicated in Figure 1 by two arrows originating from the respective two tensors on the basis of which the one or more convolutional operations are performed.
For example, as shown in Figure 1, at the current time point ti the data pc(ti) of the scan of the time point ti is provided as the data of the current scan by the 2D or 3D sensor and the tensor epc(ti) originating from the data pc(ti) of the current scan is input to the CNN 3 and, thus, is provided to the first layer L1. At the same time the previous tensor epc(ti-1) originating from the data pc(ti-1) of the previous scan provided by the 2D or 3D sensor at the previous time point ti-1 directly before the current scan of the time point ti, is provided to the first layer L1, wherein at the first layer L1 one or more convolutional operations of the first layer L1 are performed on the basis of the current tensor epc(ti) and the previous tensor epc(ti-1). In order to be able to use at the current time point ti the previous tensor epc(ti-1) for the one or more convolutional operations of the first layer L1, the previous tensor epc(ti-1) has to be generated again respectively has to be re-computed at the current time point ti. In the case of the first layer L1 this requires performing a voxelization VX on the basis of the data pc(ti-1) of the previous scan. The voxelization VX is indicated in Figure 1 by a single arrow originating from the respective data on the basis of which the voxelization is performed.
However, in case of performing the one or more convolutional operations of a layer deeper in the CNN 3 respectively further away from the first layer L1, e.g. of the last layer L2c of the CNN 3, besides performing one or more convolutional operations at each preceding layer L1, L2a and L2b on the basis of the respective current tensor provided to the respective layer at the current time point ti (current tensor epc(ti) provided to the first layer L1, current tensor a1(ti) being an output tensor of the first layer L1 and provided to the second layer L2a, current tensor a2(ti) being an output tensor of the second layer L2a and provided to the third layer L2b), the following re-computation has to be done: a voxelization VX of data pc(ti-1), pc(ti-2), pc(ti-3) and pc(ti-4) of four directly subsequent previous scans that were provided by the 2D or 3D sensor directly before the current scan of the time point ti has to be re-computed, and a plurality of one or more convolutional operations of respective layers L1, L2a, L2b on the basis of previous tensors originating from the previous data pc(ti-1), pc(ti-2), pc(ti-3) and pc(ti-4) has to be re-computed for processing the tensors epc(ti), a1(ti), a2(ti) and a3(ti) originating from the data pc(ti) of the current scan in order to generate the output tensor a4(ti) of the CNN 3 shown in Figure 1.
The term “consecutive” may be used as a synonym for the term “directly subsequent”.
According to Figure 1, the CNN 3 has four layers L1, L2a, L2b and L2c. This is only an example. According to an embodiment of the invention, the CNN 3 may have more than four layers. The greater the number of layers of the CNN, the greater the number of the above-mentioned re-processing respectively re-computation steps, which have to be performed on the basis of previous data of previous scans when processing the data of a current scan at a current time point ti using the CNN 3.
As a result, using a CNN for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance has the disadvantage of requiring a high amount of computational resources. This is especially the case when the 2D or 3D sensor provides scans at a frame rate of at least 10 to 20 frames per second. The terms “frame” and “scan” may be used as synonyms.
In view of the above-mentioned problems and disadvantages, embodiments of the present invention aim to address the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. An objective is to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
The objective is achieved by the embodiments of the invention as described in the enclosed independent claims. Advantageous implementations of the embodiments of the invention are further defined in the dependent claims.
A first aspect of the present disclosure provides a device for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, wherein the device is configured to employ a Convolutional Neural Network (CNN) in an inference phase. The CNN comprises a first layer and one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer. The device is configured to input data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN. That is, the device is configured to input the data of the current scan in the form of the current tensor into the CNN. Further, the device is configured to perform, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer. That is, the device is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer and the previous output tensor of the preceding layer. The previous output tensor of the preceding layer is the newest tensor stored in the respective first buffer of the one or more first buffers and originates from data of a previous scan inputted to the CNN directly before the data of the current scan. Furthermore, the device is configured to store the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
The device according to the first aspect makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN comprises one or more first buffers for storing the output tensor of a respective preceding layer, the need for re-computation of one or more previous tensors, in particular output tensors, of respective preceding layers for performing one or more convolutional operations at each further layer at a current time point is overcome. Namely, the device is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer originating from the data of the current scan and the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer. As a result of using one or more first buffers the inference time of the CNN in the inference phase is reduced. The inference time corresponds to the time required by the CNN for providing output data (the output tensor of the last layer of the CNN) starting from the current tensor originating from the data of the current scan and being input into the CNN at the current time. In particular, the computational costs may be reduced from K²/2 (i.e. O(K²)), in case of no buffers being used (as it is the case in the CNN of Figure 1), to K (i.e. O(K)), in case of the CNN of the device according to the first aspect comprising buffers. K is the number of aggregated scans respectively sweeps provided by the 2D or 3D sensor. In case the 2D or 3D sensor is a LIDAR sensor, K is the number of aggregated LIDAR sweeps. Therefore, the device according to the first aspect allows a real-time inference for high values of K, e.g. between 10 and 100 scans.
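As a small worked example of the cost estimate above (on the order of K²/2 layer-level evaluations per output without buffers versus K with buffers, K being the number of aggregated scans), the counts for a few values of K can be printed; the counting itself is only illustrative.

```python
for K in (10, 20, 100):                          # number of aggregated scans / sweeps
    print(f"K={K}: ~{K * K // 2} evaluations without buffers vs {K} with buffers")
```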
The passage “a tensor input into the CNN” may be understood as “a tensor that is input into the CNN”. The passage “a tensor input to the CNN” may be used as a synonym for the passage “a tensor input into the CNN”.
In particular, the first layer and the one or more further layers of the CNN are the layers of the CNN. Descriptions made herein with respect to a layer referred to by the general term “layer” are valid for the first layer as well as the one or more further layers.
In an implementation form of the first aspect, the CNN comprises one or more optional additional layers, wherein the device is configured to perform, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan. In particular, the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN. Descriptions made herein with respect to a layer referred to by the general term “layer” may also be valid for the one or more optional additional layers.
The 2D or 3D sensor configured to measure distance may comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the 2D or 3D sensor may comprise or correspond to one or more visual-depth-capable sensors. The 2D or 3D sensor may provide scans at a frame rate of at least 10 to 20 frames per second.
The term “2D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with two spatial dimensions (2-dimensional scans respectively frames) and to measure distance. The term “3D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with three spatial dimensions (3-dimensional scans respectively frames) and to measure distance.
In particular, the CNN is a feed-forward neural network that may be described with a directed acyclic graph (DAG). An advantage of a DAG neural network is that it has a finite impulse response operator and relates to finite impulse response filters (FIR filters).
In particular, a buffer is configured to buffer respectively store data, such as tensors. A buffer may be a data structure for buffering data, in particular tensors. The terms “buffer storage”, “rolling buffer” and “rolling buffer storage” may be used to refer to a buffer.
In an implementation form, at one or more of the one or more further layers of the CNN, besides the one or more convolutional operations, one or more optional further operations, such as one or more normalization operations and/or one or more activation operations, may be performed.
The terms “activation of a layer” or “layer output” may be used as synonyms for the output tensor of a layer.
In an implementation form of the first aspect, the CNN comprises a second buffer for storing a tensor input to the CNN, and the device is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan. The previous tensor is the newest tensor stored in the second buffer and corresponds to the data of the previous scan. Further, the device is configured to store the current tensor as the newest tensor in the second buffer.
Therefore, the device according to the first aspect makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN comprises a second buffer for storing the tensor input to the CNN, the need for re-computation of one or more previous tensors input to the CNN on the basis of the corresponding data of the corresponding previous scan for performing one or more convolutional operations at the first layer at a current time point is overcome. Namely, the device is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and the previous tensor being the newest tensor stored in the second buffer.
Descriptions made herein with respect to a buffer referred to by the general term “buffer” are valid for the one or more first buffers as well as for the second buffer.
In particular, the number of first buffers of the CNN corresponds to the number of further layers of the CNN. In particular, the number of buffers of the CNN corresponds to the number of one or more first buffers and the second buffer.
In an implementation form of the first aspect, each buffer is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.
In particular, each buffer is a serial-in parallel-out (SIPO) shift register.
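Purely for illustration, such a serial-in parallel-out buffer may be modelled in Python as a fixed-length deque; the class below is a software sketch and not the claimed hardware register:

```python
from collections import deque

class RollingBuffer:
    """Serial-in parallel-out buffer: pushing a new tensor as the newest entry
    simultaneously drops the oldest entry once the buffer is full."""

    def __init__(self, temporal_size):
        self.temporal_size = temporal_size
        self._slots = deque(maxlen=temporal_size)

    def push(self, tensor):
        # a deque with maxlen discards its oldest element automatically when full
        self._slots.append(tensor)

    def read_all(self):
        # parallel-out: all stored tensors, oldest first
        return list(self._slots)

    @property
    def newest(self):
        return self._slots[-1] if self._slots else None
```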
In an implementation form of the first aspect, the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
In particular, one or more buffers have a different temporal size.
The term “convolutional matrix” may be used to refer to a convolutional kernel. The convolutional kernel is used at a layer of the CNN for performing the one or more convolutional operations of the layer.
The temporal size of a first buffer corresponds to the number of consecutive respectively directly subsequent time points for which the first buffer is configured to store the output tensor of the respective layer (preceding layer). Therefore, the temporal size may also be defined in terms of the number of directly subsequent scans of the 2D or 3D sensor for which the first buffer may store the output tensor of the respective layer (preceding layer). For example, if the temporal size of a first buffer is one (“1”) then the first buffer may only store one output tensor of a respective layer originating from data of one scan of one time point. This reduces the storage consumption, such as RAM consumption, to a minimum. Therefore, in this case, when storing the current output tensor (originating from data of a current scan of a current time point) of a layer (preceding layer) as the newest tensor in the respective first buffer, the previous output tensor of the layer (originating from data of the previous scan inputted to the CNN directly before the data of the current scan) already stored in the respective first buffer is dropped respectively deleted, because the first buffer is already full. For example, if the temporal size of a first buffer is three (“3”) then the first buffer may store three output tensors of a respective layer (preceding layer) originating from data of three consecutive scans of three consecutive time points.
Therefore, the temporal size of a buffer corresponds to the number of directly subsequent time points for which the buffer may store a tensor. Accordingly, the temporal size of a buffer corresponds to the number of directly subsequent scans of the 2D or 3D sensor for which the first buffer may store a corresponding tensor.
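The relation between the temporal size of a convolutional kernel and the temporal size of the corresponding buffer may be expressed, purely for illustration, as in the following Python snippet; the buffer names and kernel sizes are assumptions of the example:

```python
# Hypothetical temporal kernel sizes of the convolutions that consume the buffered
# tensors; names and values are chosen purely for illustration.
temporal_kernel_sizes = {"input_buffer": 2, "buffer_after_L1": 2, "buffer_after_L2a": 3}

# Each buffer keeps one tensor fewer than the temporal kernel size, because the
# newest tensor of the current time point is computed (or received) on the fly.
buffer_temporal_sizes = {name: k - 1 for name, k in temporal_kernel_sizes.items()}

assert buffer_temporal_sizes["buffer_after_L2a"] == 2   # holds the tensors of ti-2 and ti-1
```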
The term “consecutive” may be used as a synonym for the term “directly subsequent”.
In an implementation form of the first aspect, each tensor is a tensor with two or more dimensions.
In particular, each tensor is a tensor with one or more spatial dimensions and one channel dimension. The number of spatial dimensions of a tensor may be equal to the number of spatial dimensions of the data of the corresponding scan provided by the 2D or 3D sensor configured to measure distance from which the tensor originates.
The channel dimension corresponds to the number of channels and is greater than or equal to one (channel dimension ≥ 1).
In particular, the data of a scan (e.g. current scan) corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions. In particular, the number N of spatial dimensions may correspond to one spatial dimension, two spatial dimensions or three spatial dimensions. For example, in case the data of a scan (e.g. current scan) corresponds to a point cloud with two spatial dimensions, then the data with the two spatial dimensions may correspond to a bird’s eye view (BEV) representation of a point cloud. In case the data of a scan (e.g. current scan) corresponds to a point cloud with three spatial dimensions, then the data with the three spatial dimensions may correspond to a volumetric representation of a point cloud, e.g. for a flying capable vehicle, such as a flying capable robot, drones, aircraft etc.
For example, a point cloud with three spatial dimensions may be produced by a 2D or 3D sensor configured to measure distance, such as a LIDAR sensor, wherein a wave or a beam, such as a light beam, bounces back from an obstacle in the environment of the 2D or 3D sensor to produce a point with three spatial dimensions with its location in meters and a scalar reflection brightness attribute. Such a point cloud comprising a plurality, e.g. thousands, of points with three spatial dimensions and a scalar reflection brightness attribute may correspond to a tensor with four dimensions. The four dimensions correspond to the three spatial dimensions and one channel dimension for the scalar reflection brightness attribute.
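As a purely illustrative example, such a four-dimensional tensor (three spatial dimensions and one channel dimension for the reflection brightness) may be rasterized from a point cloud as in the following Python sketch; the grid extent and the voxel size are assumed values:

```python
import numpy as np

def points_to_voxel_tensor(points,
                           extent=((-40.0, 40.0), (-40.0, 40.0), (-3.0, 3.0)),
                           voxel_size=0.5):
    """points: (N, 4) array of x, y, z in meters plus a reflection brightness value.
    Returns a tensor of shape (X, Y, Z, 1): three spatial dimensions, one channel."""
    shape = tuple(int((hi - lo) / voxel_size) for lo, hi in extent)
    tensor = np.zeros(shape + (1,), dtype=np.float32)
    for x, y, z, brightness in points:
        indices = []
        for value, (lo, hi), dim in zip((x, y, z), extent, shape):
            if not (lo <= value < hi):
                break                      # point lies outside the grid, skip it
            indices.append(min(int((value - lo) / voxel_size), dim - 1))
        else:
            ix, iy, iz = indices
            # keep the strongest reflection per voxel in the single channel
            tensor[ix, iy, iz, 0] = max(tensor[ix, iy, iz, 0], brightness)
    return tensor
```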
In an implementation form of the first aspect, each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension. Alternatively, each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension. Alternatively, each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
In an implementation form of the first aspect, the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
In an implementation form of the first aspect, the device is configured to generate output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, wherein the device is configured to input, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.

The term “ego-vehicle” may be understood as a mobile platform, in particular a mobile robotic platform, bearing one or more sensors, such as one or more 2D or 3D sensors configured to measure distance, that computes respectively operates, and from the perspective of which the world respectively the environment is perceived.
The ego-vehicle may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
The ego-vehicle may comprise a localization unit configured to determine the current location of the ego-vehicle and, thus, the current location data of the ego-vehicle. The localization unit may be configured for a short-term localization, e.g. at the scope of 1 s to 10 s. The localization unit may comprise or correspond to one or more inertial measurement units (IMU). In particular, the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc. An odometry process is a process of understanding an ego location, i.e. the location of the ego-vehicle, based on sensory (e.g. wheel, inertial) information.
The term “local navigational frame” may be understood as a coordinate system tied to the ground. In particular, the local navigational frame may correspond to a 2-dimensional coordinate system with top-down view on the ground surface (as shown in Figure 7).

In particular, the grid of the local navigational frame is a regular grid composed of a plurality of cells. The term “local navigational coordinate frame” may be used as a synonym for the local navigational frame.
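As a small illustration (the cell size and the origin of the grid are assumed values), a location in the local navigational frame may be mapped to a cell of such a regular grid as follows:

```python
def location_to_cell(x_m, y_m, cell_size_m=0.5, origin=(0.0, 0.0)):
    """Map a 2D location in the local navigational frame (meters, top-down view)
    to the integer indices of the grid cell that contains it."""
    col = int((x_m - origin[0]) // cell_size_m)
    row = int((y_m - origin[1]) // cell_size_m)
    return row, col

# Two ego locations that fall into the same cell yield identical indices; this is
# the case in which no padding and cropping of the buffered tensors is performed.
assert location_to_cell(1.10, 2.20) == location_to_cell(1.30, 2.40)
```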
In particular, the device is configured to zero-pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
In particular, the grid of the local navigational frame is a regular grid composed of a plurality of cells and in case the ego-vehicle moves within the same cell of the grid, there is no padding and cropping performed by the device. In other words, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan, but the current location data and the previous location data of the ego-vehicle lie respectively are located within the same cell of the grid, the device does not perform padding and cropping.
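The zero-padding and cropping of a buffered tensor upon a change of the ego cell may be sketched, purely for illustration, as an integer shift of the tensor content with zero fill; the (C, H, W) layout and the sign convention of the shift are assumptions of the example:

```python
import numpy as np

def pad_and_crop(buffered_tensor, shift_rows, shift_cols):
    """Shift a buffered tensor of shape (C, H, W) by a whole number of grid cells so
    that it stays aligned with the new ego cell; content that moves out of the grid
    is cropped away and newly uncovered regions are zero-padded."""
    channels, height, width = buffered_tensor.shape
    shifted = np.zeros_like(buffered_tensor)
    src_rows = slice(max(0, shift_rows), min(height, height + shift_rows))
    dst_rows = slice(max(0, -shift_rows), min(height, height - shift_rows))
    src_cols = slice(max(0, shift_cols), min(width, width + shift_cols))
    dst_cols = slice(max(0, -shift_cols), min(width, width - shift_cols))
    shifted[:, dst_rows, dst_cols] = buffered_tensor[:, src_rows, src_cols]
    return shifted

# Example: the ego cell index increased by one row and one column between two scans.
aligned = pad_and_crop(np.ones((16, 8, 8), dtype=np.float32), shift_rows=1, shift_cols=1)
```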
In an implementation form of the first aspect, the device is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
In an implementation form of the first aspect, the device is configured to store the location data in a location field of one or more of the one or more first buffers and the second buffer, and update the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
In particular, the grid of the local navigational frame is a regular grid composed of a plurality of cells and in case the ego-vehicle moves within the same cell of the grid, there is no updating of the location field of the respective one or more buffers. In other words, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan, but the current location data and the previous location data of the ego-vehicle lie respectively are located within the same cell of the grid, the device does not update the location field of the respective one or more buffers with the current location data.

In an implementation form of the first aspect, the device is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
In an implementation form of the first aspect, the device is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
In an implementation form of the first aspect, the device is configured to generate on the basis of the data of the current scan the current tensor input to the CNN by performing a voxelization.
In particular, the data of the current scan corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions. Thus, the device may be configured to generate on the basis of the point cloud of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the point cloud of the current scan into the local navigational frame.
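For example, assuming the ego pose in the local navigational frame is available as a 2D position and a heading angle (an assumption made only for this sketch), the change of coordinates of the points of the current scan may be written as:

```python
import numpy as np

def sensor_points_to_local_frame(points_xy, ego_x, ego_y, ego_yaw_rad):
    """points_xy: (N, 2) point coordinates in the sensor/ego frame (meters).
    Returns the same points expressed in the local navigational frame."""
    c, s = np.cos(ego_yaw_rad), np.sin(ego_yaw_rad)
    rotation = np.array([[c, -s], [s, c]])
    # rotate each point by the ego heading and translate by the ego position
    return points_xy @ rotation.T + np.array([ego_x, ego_y])
```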
The current tensor input to the CNN may correspond to an encoded point cloud. An encoded point cloud may be understood as the result of the transformation of a raw respectively unordered point cloud into a voxelized respectively ordered format.
Voxelization is a transformation of an unordered set of points, such as the unordered points of a point cloud, into a regular grid. In particular, voxelization is a transformation of an unordered set of points with N spatial dimensions, such as the unordered points of a point cloud with N spatial dimensions, into a regular N-dimensional grid. Voxelization may be performed by pillar encoding or voxel feature encoding. In particular, the number N of spatial dimensions may correspond to one spatial dimension, two spatial dimensions or three spatial dimensions.

In an implementation form of the first aspect, the device is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the device is configured to perform a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the device is configured to perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
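Purely as an illustration of a pillar-style encoding as mentioned above, a strongly simplified variant with two spatial dimensions and three hand-chosen per-cell channels may look as follows; the grid extent, the cell size and the channel definitions are assumptions of the example and do not represent any particular embodiment:

```python
import numpy as np

def pillar_encode(points, extent=((-40.0, 40.0), (-40.0, 40.0)), cell_size=0.5):
    """points: (N, 4) array of x, y, z, reflection brightness. Returns a BEV tensor
    of shape (H, W, 3): point count, highest point, strongest reflection per cell."""
    h = int((extent[0][1] - extent[0][0]) / cell_size)
    w = int((extent[1][1] - extent[1][0]) / cell_size)
    bev = np.zeros((h, w, 3), dtype=np.float32)
    for x, y, z, brightness in points:
        if not (extent[0][0] <= x < extent[0][1] and extent[1][0] <= y < extent[1][1]):
            continue                                   # point outside the grid
        i = int((x - extent[0][0]) / cell_size)
        j = int((y - extent[1][0]) / cell_size)
        bev[i, j, 0] += 1.0                            # number of points in the pillar
        if bev[i, j, 0] == 1.0:
            bev[i, j, 1] = z                           # first point defines the height
        else:
            bev[i, j, 1] = max(bev[i, j, 1], z)        # keep the highest point
        bev[i, j, 2] = max(bev[i, j, 2], brightness)   # keep the strongest reflection
    return bev
```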
In particular, the device is configured to detect weak target objects in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN. A weak target object is an object with low spatial information but significant temporal information. In particular, a weak target object is a moving object, such as a car, pedestrian including a child, bicyclist including a child, motorcyclist, etc., that has few points, in particular between 1 and 5 points, more particularly between 1 and 10 points, detectable by the 2D or 3D sensor. For example, in case the 2D or 3D sensor is a LIDAR sensor, a weak target object is an object that has few LIDAR echo points, in particular between 1 and 5 LIDAR echo points, more particularly between 1 and 10 LIDAR echo points. In fog, rain and snow as well as in a case of heavy occlusions even normally well detectable or visible (by the 2D or 3D sensor such as a LIDAR sensor) objects may become weak target objects.
In an implementation form of the first aspect, in case the 2D or 3D sensor is a LIDAR sensor a weak target object may be defined as a moving object, e.g. from a list of known object categories (classes), that has between 1 and 5, in particular between 1 and 10, LIDAR echo points falling on it.
In order to achieve the device according to the first aspect of the present disclosure, some or all of the implementation forms and optional features of the first aspect, as described above, may be combined with each other.
A second aspect of the present disclosure provides a hardware implementation of a Convolutional Neural Network (CNN) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. The CNN comprises a first layer and at least one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer. The hardware implementation of the CNN is configured to input data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN. Further, the hardware implementation of the CNN is configured to perform, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer. The previous output tensor of the preceding layer is the newest tensor stored in the respective first buffer of the one or more first buffers and originates from data of a previous scan inputted to the CNN directly before the data of the current scan. The hardware implementation of the CNN is configured to store the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
In an implementation form of the second aspect, the CNN comprises one or more optional additional layers, wherein the hardware implementation of the CNN is configured to perform, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan. In particular, the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN.
In an implementation form of the second aspect, the CNN comprises a second buffer for storing a tensor input to the CNN, and the hardware implementation of the CNN is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan. The previous tensor is the newest tensor stored in the second buffer and corresponds to the data of the previous scan. Further, the hardware implementation of the CNN is configured to store the current tensor as the newest tensor in the second buffer.
In an implementation form of the second aspect, each buffer is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.

In an implementation form of the second aspect, the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
In particular, one or more buffers have a different temporal size.
In an implementation form of the second aspect, each tensor is a tensor with two or more dimensions.
In particular, each tensor is a tensor with one or more spatial dimensions and one channel dimension.
In an implementation form of the second aspect, each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension. Alternatively, each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension. Alternatively, each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
In an implementation form of the second aspect, the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to generate output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, wherein the hardware implementation of the CNN is configured to input, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.

In an implementation form of the second aspect, the hardware implementation of the CNN is configured to store the location data in a location field of one or more of the one or more first buffers and the second buffer, and update the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the hardware implementation of the CNN is configured to perform a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the hardware implementation of the CNN is configured to perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
The hardware implementation of the CNN of the second aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features. The implementation forms and optional features of the device according to the first aspect are correspondingly valid for the hardware implementation of the CNN according to the second aspect.
In order to achieve the hardware implementation of the CNN according to the second aspect of the present disclosure, some or all of the implementation forms and optional features of the second aspect, as described above, may be combined with each other.
A third aspect of the present disclosure provides an ego-vehicle comprising one or more 2D or 3D sensors configured to measure distance, and a device according to the first aspect or any implementation form thereof. The one or more 2D or 3D sensors are configured to provide scans containing spatial information of the vicinity of the ego-vehicle in the form of a data stream to the device and the device is configured to process the data stream.
The term “ego-vehicle” may be understood as a mobile platform, in particular a mobile robotic platform, bearing one or more sensors, such as one or more 2D or 3D sensors configured to measure distance, that computes respectively operates, and from the perspective of which the world respectively the environment is perceived.
The ego-vehicle may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
The ego-vehicle may comprise a localization unit configured to determine the current location of the ego-vehicle and, thus, the current location data of the ego-vehicle. The localization unit may be configured for a short-term localization, e.g. at the scope of 1 s to 10 s. The localization unit may comprise or correspond to one or more inertial measurement units (IMU). In particular, the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
The one or more 2D or 3D sensors configured to measure distance may each comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the one or more 2D or 3D sensors may each comprise or correspond to one or more visual- depth-capable sensors. The 2D or 3D sensor may provide scans at a frame rate of at least 10 to 20 frames per second.
The term “2D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with two spatial dimensions (2-dimensional scans respectively frames) and to measure distance. The term “3D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with three spatial dimensions (3-dimensional scans respectively frames) and to measure distance.
In an implementation form of the third aspect, the device is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
The ego-vehicle of the third aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features.
In order to achieve the ego-vehicle according to the third aspect of the present disclosure, some or all of the implementation forms and optional features of the third aspect, as described above, may be combined with each other.
A fourth aspect of the present disclosure provides a method of employing a Convolutional Neural Network, CNN, in an inference phase, for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. The CNN comprises a first layer and at least one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer. The method comprises the steps of inputting data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN, performing, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer, the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer of the one or more first buffers and originating from data of a previous scan inputted to the CNN directly before the data of the current scan, and storing the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
In an implementation form of the fourth aspect, the CNN comprises one or more optional additional layers, wherein the method comprises the step of performing, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan.
In particular, the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN.
In an implementation form of the fourth aspect, the CNN comprises a second buffer for storing a tensor input to the CNN, and the method comprises the steps of performing, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan, the previous tensor being the newest tensor stored in the second buffer and corresponding to the data of the previous scan, and storing the current tensor as the newest tensor in the second buffer.

In an implementation form of the fourth aspect, each buffer is a serial-in parallel-out buffer and the method comprises the steps of storing a new tensor as newest tensor, and in case the buffer is full, simultaneously dropping the oldest tensor stored in the buffer.
In an implementation form of the fourth aspect, the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
In particular, one or more buffers have a different temporal size.
In an implementation form of the fourth aspect, each tensor is a tensor with two or more dimensions.
In particular, each tensor is a tensor with one or more spatial dimensions and one channel dimension.
In an implementation form of the fourth aspect, each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension. Alternatively, each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension. Alternatively, each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
In an implementation form of the fourth aspect, the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
In an implementation form of the fourth aspect, output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, are generated, wherein the method comprises the steps of inputting, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and padding and cropping in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
In an implementation form of the fourth aspect, the method comprises the step of controlling autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
In an implementation form of the fourth aspect, the method comprises the steps of storing the location data in a location field of one or more of the one or more first buffers and the second buffer, and updating the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
In an implementation form of the fourth aspect, the method comprises the step of generating on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
In an implementation form of the fourth aspect, the method comprises the step of generating on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
In an implementation form of the fourth aspect, the method comprises the step of detecting targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the method comprises the step of performing a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the method comprises the step of performing a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.

The method of the fourth aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features.
The implementation forms and optional features of the device according to the first aspect are correspondingly valid for the method according to the fourth aspect.
In order to achieve the method according to the fourth aspect of the present disclosure, some or all of the implementation forms and optional features of the fourth aspect, as described above, may be combined with each other.
A fifth aspect of the present disclosure provides a computer program comprising program code for performing the method according to the fourth aspect or any of its implementation forms.
In particular, the fifth aspect of the present disclosure provides a computer program comprising program code for performing, when implemented on a processor, the method according to the fourth aspect or any of its implementation forms.
A sixth aspect of the present disclosure provides a computer program product comprising program code for performing, when implemented on a processor, a method according to the fourth aspect or any implementation form thereof.
A seventh aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the fourth aspect or any of its implementation forms to be performed.
An eighth aspect of the present disclosure provides a computer comprising a memory and a processor, which are configured to store and execute program code to perform a method according to the fourth aspect or any implementation form thereof.
The memory may be distributed over a plurality of physical devices. A plurality of processors that co-operate in executing the program code may be referred to as a processor.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
Figure 1 shows an example of an operation of a Convolutional Neural Network (CNN) at a current time point ti.
Figure 2 shows a device according to an embodiment of the invention and an ego- vehicle according to an embodiment of the invention.
Figure 3A shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
Figure 3B shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
Figure 4 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.

Figure 5 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
Figure 6 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
Figure 7 shows a grid of a local navigational frame with a location of an ego-vehicle according to an embodiment of the invention for two directly subsequent time points.
Figure 8 shows a diagram of a method according to an embodiment of the invention with respect to storing tensors originating from data of a current scan of a current time point ti in buffers of a Convolutional Neural Network (CNN) according to an embodiment of the invention.
Figure 9 shows a diagram of a method of employing a Convolutional Neural Network (CNN) in an inference phase according to an embodiment of the invention for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
DETAILED DESCRIPTION OF EMBODIMENTS
In the Figures 1 to 10 corresponding elements are labelled by the same reference signs.
Figure 2 shows on the left side a device 1 according to an embodiment of the invention and on the right side an ego-vehicle 5 according to an embodiment of the invention.
The above description with respect to the device according to the first aspect and its implementation forms is correspondingly valid for the device 1 of Figure 2. The above description with respect to the ego-vehicle according to the third aspect of the invention and its implementation forms is correspondingly valid for the ego-vehicle 5 of Figure 2.
As shown on the left side of Figure 2, the device 1 comprises a Convolutional Neural Network 3 (CNN) and is configured to employ the CNN in inference phase. The device 1 is configured to receive and process a data stream of scans containing spatial information provided by a 2D or 3D sensor 2 configured to measure distance.
The CNN 3 of the device 1 comprises a first layer and one or more further layers following the first layer (not shown in Figure 2). The CNN 3 of the device 1 further comprises one or more first buffers 4a for storing an output tensor of a respective preceding layer.
The device 1 is configured to input data of a current scan, which is provided by the 2D or 3D sensor 2, in the form of a current tensor into the CNN 3, perform, at each further layer of the CNN 3, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer, the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer 4a of the one or more first buffers 4a and originating from data of a previous scan inputted to the CNN 3 directly before the data of the current scan, and store the current output tensor of the preceding layer as the newest tensor in the respective first buffer 4a.
Embodiments of the CNN 3, in particular an operation of embodiments of the CNN 3, are shown in the Figures 3A, 3B, 4, 5 and 6.
The device 1 of Figure 2 allows to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by the 2D or 3D sensor 2. Since the CNN 3 comprises one or more first buffers 4a for storing the output tensor of a respective preceding layer, the need of re-computation of one or more previous tensors, in particular output tensors, of respective preceding layers for performing one or more convolutional operations at each further layer at a current time point is overcome. Namely, the device 1 is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer originating from the data of the current scan and the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer 4a. As a result of using one or more first buffers 4a the inference time of the CNN 3 in the inference phase is reduced.
As shown on the right side of Figure 2 the device 1 and the 2D or 3D sensor 2 may be part of the ego-vehicle 5. That is, the ego-vehicle 5 comprises the 2D or 3D sensor 2 and the device 1. The ego-vehicle 5 may also comprise more than one 2D or 3D sensor 2. Therefore, the ego-vehicle 5 comprises one or more 2D or 3D sensors 2 configured to measure distance and the device 1, wherein the one or more 2D or 3D sensors 2 are configured to provide scans containing spatial information of the vicinity respectively environment of the ego-vehicle 5 in the form of a data stream to the device 1 and the device 1 is configured to process the data stream.
The one or more 2D or 3D sensors 2 configured to measure distance may each comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the one or more 2D or 3D sensors 2 may each comprise or correspond to one or more visual- depth-capable sensors. The one or more 2D or 3D sensors 2 may provide scans at a frame rate of at least 10 to 20 frames per second.
The ego-vehicle 5 may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
The ego-vehicle 5 may comprise a localization unit (not shown in Figure 2) configured to determine the current location of the ego-vehicle 5 and, thus, the current location data of the ego-vehicle 5. The localization unit may be configured for a short-term localization, e.g. at the scope of 1 s to 10 s. The localization unit may comprise or correspond to one or more inertial measurement units (IMU). In particular, the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
Figure 3A shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 3A. The CNN 3 of Figure 3A may be an embodiment of the CNN 3 of the device 1 of Figure 2.
According to the embodiment of Figure 3A, the CNN 3 comprises a first layer L1, one further layer L2a and one first buffer 4a, wherein the further layer L2a is a consecutive layer following the first layer L1. That is, according to the embodiment of Figure 3A, the first layer L1 is the preceding layer of the further layer L2a. As described already above, the CNN 3 may comprise more than one further layer. In addition or alternatively, the CNN 3 may comprise one or more optional additional layers (not shown in Figure 3A).
As shown in Figure 3A, at a current time point ti (the time T equals to the time point ti) data of a current scan provided by a 2D or 3D sensor 2 configured to measure distance (not shown in Figure 3A) are input in the form of a current tensor epc(ti) of the current time point ti into the CNN 3. At the further layer L2a one or more convolutional operations are performed on the basis of a current output tensor a1(ti) of the preceding layer L1 (which is the first layer L1), wherein the current output tensor a1(ti) originates from the data of the current scan of the current time point ti, and a previous output tensor a1(ti-1) of the preceding layer L1, wherein the previous output tensor a1(ti-1) of the preceding layer L1 is the newest tensor stored in the first buffer 4a and originates from data of a previous scan (not shown in Figure 3A) inputted to the CNN directly before the data of the current scan. The previous scan is inputted at a previous time point ti-1 directly before the current time point ti and, thus, the previous scan may be referred to as the scan of the previous time point ti-1.
The current output tensor a1(ti) of the preceding layer L1 may be stored as the newest tensor in the first buffer 4a (not shown in Figure 3A). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent output tensor a1(ti+1) of the preceding layer L1 originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current output tensor a1(ti) stored as the newest tensor in the first buffer 4a at the directly subsequent time point ti+1.
The CNN 3 of Figure 3A allows to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by the 2D or 3D sensor 2. Since the CNN 3 comprises the first buffer 4a for storing the output tensor of the first layer L1, the need of re-computation of a previous output tensor a1(ti-1) of the first layer L1 for performing one or more convolutional operations at the further layer L2a at a current time point ti is overcome. Namely, at the further layer L2a one or more convolutional operations are performed on the basis of the current output tensor a1(ti) of the first layer L1 originating from the data of the current scan and the previous output tensor a1(ti-1) of the first layer L1 being the newest tensor stored in the first buffer 4a. As a result of using the first buffer 4a the inference time of the CNN 3 in the inference phase is reduced.
According to Figure 3A the temporal size of the first buffer 4a equals to one (“1”), because the first buffer 4a may store the output tensor of the first layer L1 for only one time point. As a result, when the current output tensor a1(ti) of the preceding layer L1 is stored at the current time point ti as the newest tensor in the first buffer 4a, the previous output tensor a1(ti-1) of the preceding layer L1, which was stored as the newest tensor in the first buffer 4a at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti. In particular, the temporal size of the first buffer 4a is one less than the temporal size of a convolutional kernel of the first layer L1. Thus, according to the embodiment of Figure 3A, the temporal size of the convolutional kernel of the first layer L1 may correspond to two (“2”). The temporal size of the first buffer may alternatively correspond to a number greater than one (“1”) and, thus, the first buffer may be configured to store the output tensor of the preceding layer L1 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
Figure 3B shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The CNN 3 of Figure 3B may be an embodiment of the CNN 3 of the device 1 of Figure 2. The CNN 3 of Figure 3B differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 3B comprises a second buffer 4b for storing a tensor input to the CNN 3. Therefore, the description of the CNN 3 of Figure 3A is also valid for the CNN 3 of Figure 3B. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 3B. In the following mainly the additional feature(s) of the CNN 3 of Figure 3B respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 3B are described.
As shown in Figure 3B, the CNN 3 comprises, besides the first layer L1, the one further layer L2a and the first buffer 4a, a second buffer 4b for storing a tensor input to the CNN 3. At the first layer L1, one or more convolutional operations are performed on the basis of the current tensor epc(ti) originating from the data of the current scan of the current time point ti and a previous tensor epc(ti-1) generating a current output tensor a1(ti) of the first layer L1 originating from the data of the current scan, wherein the previous tensor epc(ti-1) is the newest tensor stored in the second buffer 4b at the time point ti-1 and corresponds to the data of the previous scan inputted to the CNN 3 at the time point ti-1 directly before the data of the current scan input to the CNN 3 at the current time point ti. The current tensor epc(ti) may be stored as the newest tensor in the second buffer 4b (not shown in Figure 3B). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent tensor epc(ti+1) originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current tensor epc(ti) stored as the newest tensor in the second buffer 4b at the directly subsequent time point ti+1.
Therefore, compared to the CNN 3 of Figure 3A, the CNN 3 of Figure 3B allows to further reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN 3 comprises a second buffer 4b for storing the tensor input to the CNN 3, the need of re-computation of a previous tensor epc(ti-1) input to the CNN 3 on the basis of the corresponding data of the corresponding previous scan for performing one or more convolutional operations at the first layer L1 at the current time point ti is overcome. Namely, at the first layer L1, one or more convolutional operations are performed on the basis of the current tensor epc(ti) and the previous tensor epc(ti-1) being the newest tensor stored in the second buffer 4b.
According to Figure 3B the temporal size of the second buffer 4b equals to one (“1”), because the second buffer 4b may store the tensor input to the CNN 3 for only one time point respectively for one scan. As a result, when the current tensor epc(ti) is stored at the current time point ti as the newest tensor in the second buffer 4b, the previous tensor epc(ti-1), which was stored as the newest tensor in the second buffer 4b at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti.
The temporal size of the second buffer 4b may alternatively correspond to a number greater than one (“1”) and, thus, the second buffer 4b may be configured to store the tensor of the input to the CNN 3 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans. In an embodiment of the present invention, the first buffer 4a and the second buffer 4b have a different temporal size. Alternatively, they may have the same temporal size.
Figure 4 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The CNN 3 of Figure 4 may be an embodiment of the CNN 3 of the device 1 of Figure 2. The CNN 3 of Figure 4 differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 4 comprises a first buffer 4a with a temporal size of two (“2”). Therefore, the description of the CNN 3 of Figure 3A is also valid for the CNN 3 of Figure 4. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 4. In the following mainly the additional feature(s) of the CNN 3 of Figure 4 respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 4 are described.
According to Figure 4 the temporal size of the first buffer 4a equals to two (“2”), because the first buffer 4a may store the output tensor of the first layer L1 (which is the preceding layer of the further layer L2a) for two directly subsequent time points respectively for two directly subsequent scans. As shown in Figure 4, at the current time point ti the first buffer 4a stores as the newest tensor the previous output tensor a1(ti-1) of the first layer L1 originating from data of a previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data of the current scan, which are input at the current time point ti to the CNN 3 in the form of the current tensor epc(ti), and the second previous output tensor a1(ti-2) of the first layer L1 originating from data of a second previous scan inputted to the CNN 3 at the second previous time point ti-2 directly before the previous time point ti-1.
As a result, when the current output tensor a1(ti) of the first layer L1 is stored at the current time point ti as the newest tensor in the first buffer 4a, the second previous output tensor a1(ti-2) of the first layer L1 is simultaneously dropped respectively deleted at the current time point ti (this is indicated in the dashed box on the left side of Figure 4). Namely, the temporal size of the first buffer 4a is only two and, thus, the first buffer 4a may store the output tensor of the first layer L1 for only two directly subsequent time points, such as for the two directly subsequent time points ti-2 and ti-1 or ti-1 and ti.
The temporal size of the first buffer may alternatively correspond to a number greater than two (“2”) and, thus, the first buffer may be configured to store the output tensor of the first layer L1 for more than two time points, i.e. for three or more directly subsequent time points, respectively for more than two scans, i.e. for three or more directly subsequent scans.
Figure 5 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The CNN 3 of Figure 5 may be an embodiment of the CNN 3 of the device 1 of Figure 2. The CNN 3 of Figure 5 differs from the CNN 3 of Figure 4 in that the CNN 3 of Figure 5 comprises a second buffer 4b for storing a tensor input to the CNN 3. Therefore, the description of the CNN 3 of Figure 3B and 4 is also valid for the CNN 3 of Figure 5. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 5. In the following mainly the additional feature(s) of the CNN 3 of Figure 5 respectively the differences between the CNN 3 of Figure 4 and the CNN 3 of Figure 5 are described.
As shown in Figure 5, the CNN 3 comprises, besides the first layer L1, the one further layer L2a and the first buffer 4a, a second buffer 4b for storing a tensor input to the CNN 3.
According to Figure 5 the temporal size of the second buffer 4b equals to two (“2”), because the second buffer 4b may store a tensor that is input to the CNN 3 for two directly subsequent time points. As shown in Figure 5, at the current time point ti the second buffer 4b stores as the newest tensor the previous tensor epc(ti-1) corresponding to data of a previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data of the current scan, which are input at the current time point ti to the CNN 3 in the form of the current tensor epc(ti), and the second previous tensor epc(ti-2) corresponding to data of a second previous scan inputted to the CNN 3 at the second previous time point ti-2 directly before the previous time point ti-1.
As a result, when the current tensor epc(ti) is stored at the current time point ti as the newest tensor in the second buffer 4b, the second previous tensor epc(ti-2) is simultaneously dropped respectively deleted at the current time point ti (this is indicated in the dashed box on the left side of Figure 5). Namely, the temporal size of the second buffer 4b is only two and, thus, the second buffer 4b may store the tensor that is input to the CNN 3 for only two directly subsequent time points, such as for the two directly subsequent time points ti-2 and ti-1 or ti-1 and ti.
The temporal size of the second buffer 4b may alternatively correspond to a number greater than two (“2”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than two time points, i.e. for three or more directly subsequent time points.
Figure 6 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The CNN 3 of Figure 6 may be an embodiment of the CNN 3 of the device 1 of Figure 2. The CNN 3 of Figure 6 differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 6 comprises three further layers L2a, L2b and L2c and an optional second buffer 4b for storing a tensor input to the CNN 3 and in that an optional voxelization VX is performed. Therefore, the description of the CNN 3 of Figures 3A and 3B is also valid for the CNN 3 of Figure 6. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 6. In the following mainly the additional feature(s) of the CNN 3 of Figure 6 respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 6 are described.
According to the embodiment of Figure 6, the CNN 3 comprises a first layer LI, three further layers L2a, L2b and L2c, three first buffers 4a and one optional second buffer 4b. The first further layer L2a is a consecutive layer following the first layer LI, the second further layer L2b is a consecutive layer following the first further layer L2a and the third further layer L2c is a consecutive layer following the second further layer L2b. That is, according to the embodiment of Figure 6, the first layer LI is the preceding layer of the first further layer L2a, the first further layer L2a is the preceding layer of the second further layer L2b and the second further layer L2b is the preceding layer of the third further layer L2c.
As described already above, the CNN 3 may comprise only one further layer or only two further layers or more than three further layers. In addition or alternatively, the CNN 3 may comprise one or more optional additional layers (not shown in Figure 6). In an embodiment one or more of the one or more optional additional layers may be arranged between further layers and/or between the first layer and the first further layer.
As shown in Figure 6, at the current time point ti data pc(ti) of a current scan provided by a 2D or 3D sensor configured to measure distance (not shown in Figure 6), such as the 2D or 3D sensor 2 of Figure 2, are input in the form of a current tensor epc(ti) of the current time point ti into the CNN 3. The current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan by optionally performing a voxelization VX.
In an embodiment of the invention, the current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan and the current location data of an ego-vehicle, such as the ego-vehicle 5 of Figure 2, in the grid of a local navigational frame by performing a voxelization VX, in case the 2D or 3D sensor providing the data pc(ti) of the current scan is arranged on the ego-vehicle.
In particular, the data of the current scan pc(ti) corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions. At each further layer Lj (j may be 2a, 2b or 2c) one or more convolutional operations are performed on the basis of a current output tensor aw(ti) of the preceding layer Lv, wherein the current output tensor aw(ti) originates from the data pc(ti) of the current scan of the current time point ti, and a previous output tensor aw(ti-i) of the preceding layer Lv, wherein the previous output tensor aw(ti-i) of the preceding layer Lv is the newest tensor stored in the respective first buffer 4a and originates from data of a previous scan (not shown in Figure 6) inputted to the CNN directly before the data pc(ti) of the current scan (w = 1 and v = 1, in case j = 2a; w = 2 and v = 2a, in case j = 2b; and w = 3 and v = 2b, in case j = 2c).
The previous scan is inputted at a previous time point ti-i directly before the current time point ti and, thus, the previous scan may be referred to as the scan of the previous time point ti-i.
The current output tensor aw(ti) of the preceding layer Lv may be stored as the newest tensor in the corresponding first buffer 4a (shown in the dashed box on the left side of Figure 6). Therefore, at the directly subsequent time point ti+i directly after the current time point ti, at each further layer Lj (j may be 2a, 2b or 2c) one or more convolutional operations are performed on the basis of the directly subsequent output tensor aw(ti+i) of the preceding layer Lv originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+i and the current output tensor aw(ti) stored as the newest tensor in the corresponding first buffer 4a at the directly subsequent time point ti+i.
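As an illustration of how a further layer may combine the buffered previous output tensor with the current one, the following sketch uses PyTorch and a 3D convolution whose temporal kernel size is two; the channel counts and spatial sizes are placeholder assumptions of this sketch and are not taken from the embodiments:

import torch
import torch.nn as nn

C_in, C_out, H, W = 32, 64, 280, 280  # assumed sizes, for illustration only

# A further layer Lj with a convolutional kernel of temporal size 2.
layer_j = nn.Conv3d(C_in, C_out, kernel_size=(2, 3, 3), padding=(0, 1, 1))

a_prev = torch.zeros(1, C_in, H, W)  # aw(ti-1), the newest tensor in the first buffer 4a
a_curr = torch.zeros(1, C_in, H, W)  # aw(ti), originating from the data of the current scan

# Stack both time points along a temporal dimension of size 2; the temporal kernel
# consumes both, so the result corresponds to the current time point only.
x = torch.stack([a_prev, a_curr], dim=2)  # shape [1, C_in, 2, H, W]
a_out = layer_j(x).squeeze(2)             # shape [1, C_out, H, W]

At the directly subsequent time point, a_curr would be taken from the first buffer and a newly computed tensor would take its place, so that no recomputation over older scans is needed.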
According to Figure 6 the temporal size of the first buffers 4a equals one (“1”), because the first buffers 4a may store the output tensor of the corresponding layer (preceding layer) for only one time point respectively one scan. As a result, when at the current time point ti the current output tensor aw(ti) of the corresponding layer Lv (preceding layer) is stored as the newest tensor in the corresponding first buffer 4a, the previous output tensor aw(ti-i) of the corresponding layer Lv, which was stored as the newest tensor in the corresponding first buffer 4a at the previous time point ti-i directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti (this is indicated in the dashed box on the left side; w = 1, in case v = 1; w = 2, in case v = 2a; and w = 3, in case v = 2b). In particular, the temporal size of a first buffer 4a is one less than the temporal size of a convolutional kernel of the corresponding layer (preceding layer). Thus, according to the embodiment of Figure 6, the temporal size of the convolutional kernel of the layers LI, L2a and L2b may correspond to two (“2”).
The temporal size of one or more first buffers 4a may alternatively correspond to a number greater than one (“1”) and, thus, the one or more first buffers 4a may be configured to store the output tensor of a corresponding preceding layer for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
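The relation between the temporal size of a first buffer and the temporal size of the convolutional kernel of the corresponding preceding layer can be written down directly; the function name below is illustrative only:

def first_buffer_temporal_size(kernel_temporal_size):
    # The current output tensor arrives fresh at every time point, so only the
    # remaining kernel_temporal_size - 1 older tensors need to be buffered.
    return kernel_temporal_size - 1

assert first_buffer_temporal_size(2) == 1  # Figure 6: kernel temporal size 2, buffer size 1
assert first_buffer_temporal_size(3) == 2  # by the same relation, a buffer of temporal size 2 (as in Figure 4) would pair with a temporal kernel size of 3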
According to the embodiment of Figure 6, at the first layer LI one or more convolutional operations are performed on the basis of the current tensor epc(ti) originating from the data pc(ti) of the current scan of the current time point ti and a previous tensor epc(ti-i), generating a current output tensor al(ti) of the first layer LI originating from the data pc(ti) of the current scan. The previous tensor epc(ti-i) is the newest tensor stored in the optional second buffer 4b and corresponds to the data of the previous scan inputted to the CNN 3 at the previous time point ti-i directly before the data pc(ti) of the current scan.
The current tensor epc(ti) may be stored as the newest tensor in the optional second buffer 4b (shown in the dashed box on the left of Figure 6). Therefore, at the directly subsequent time point ti+i directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent tensor epc(ti+i) originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+i and the current tensor epc(ti) stored as the newest tensor in the optional second buffer 4b at the directly subsequent time point ti+i.
According to Figure 6 the temporal size of the optional second buffer 4b equals one (“1”), because the optional second buffer 4b may store the tensor input to the CNN 3 for only one time point respectively one scan. As a result, when the current tensor epc(ti) is stored at the current time point ti as the newest tensor in the second buffer 4b, the previous tensor epc(ti-i), which was stored as the newest tensor in the second buffer 4b at the previous time point ti-i directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti (shown in the dashed box on the left of Figure 6).
The temporal size of the optional second buffer 4b may alternatively correspond to a number greater than one (“1”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
In an embodiment of the present invention, one or more of the first buffers 4a and the optional second buffer 4b have a different temporal size. Alternatively, the first buffers 4a and the optional second buffer 4b may have the same temporal size.
Figure 7 shows a grid G of a local navigational frame LNF with a location of an ego-vehicle according to an embodiment of the invention for two directly subsequent time points.
The ego-vehicle (not shown in Figure 7) may correspond to the ego-vehicle 5 of Figure 2. The above description with respect to the device according to the first aspect of the invention and its implementation forms and the above description of the ego-vehicle according to the third aspect of the invention and its implementation forms are correspondingly valid for the description of Figure 7, in particular for the description of the ego-vehicle 5 of Figure 7.
In the following reference is made to the ego-vehicle 5 shown in Figure 2. The ego-vehicle 5 may be configured for autonomous movement. In particular, the device 1 of the ego-vehicle 5 is configured to control autonomous movement of the ego-vehicle 5 on the basis of the processing of the data stream of scans provided by the one or more 2D or 3D sensors 2 configured to measure distance and arranged on the ego-vehicle 5.
During the autonomous movement of the ego-vehicle 5, detection in the local navigational frame LNF is performed by the one or more 2D or 3D sensors 2. The local navigational frame LNF shown in Figure 7 is a coordinate system of two spatial dimensions x, y with top-down view on the ground surface, which is relative to a position of the ego-vehicle 5. That is, the location respectively position of the ego-vehicle 5 is within the local navigational frame LNF. The terms “location” and “position” may be used as synonyms.
In the local navigational frame LNF, the data (e.g. a point cloud) of a current scan (of the data stream of scans provided by the one or more 2D or 3D sensors 2) is registered according to the current location of the ego-vehicle 5. The region of interest (ROI) for processing the data stream of scans provided by the one or more 2D or 3D sensors 2 moves with the location of the ego-vehicle 5 in the local navigational frame LNF, when the ego-vehicle 5 is moving. In particular, the ROI corresponds to a predefined regular grid respectively area of the local navigational frame LNF with the ego-vehicle 5 in the center, wherein the regular grid is composed of a plurality of cells. Thus, Figure 7 shows the grid Gi (area) of the current ROI Ri at a current time point ti and the grid Gi-i (area) of the previous ROI Ri-i of a previous time point ti-i that is directly before the current time point ti. As an example, the ROI may be chosen to correspond to a grid in the local navigational frame LNF of size 140 m x 140 m. The cells may each be of size 0.5 m x 0.5 m. In particular, the ROI may be chosen to correspond to a grid in the local navigational frame LNF of size between 70 m x 70 m and 250 m x 250 m. In particular, the cells may each be of size between 0.05 m x 0.05 m and 0.5 m x 0.5 m.
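For the example values above, the spatial resolution of the ROI grid follows directly from the ROI extent and the cell size (a minimal arithmetic sketch; the function name is an assumption of this illustration):

def roi_cells_per_side(roi_size_m, cell_size_m):
    # e.g. a 140 m x 140 m ROI with 0.5 m x 0.5 m cells gives a 280 x 280 grid
    return int(round(roi_size_m / cell_size_m))

assert roi_cells_per_side(140.0, 0.5) == 280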
According to Figure 7, the data of a current scan (e.g. all points of a point cloud, such as a LIDAR point cloud in case of the 2D or 3D sensor being a LIDAR sensor) are mapped into the regular grid G of the local navigational frame LNF. That is, the device 1 is configured to generate on the basis of the data of the current scan and the current location data pxi and pyi of the ego-vehicle in the grid G of the local navigational frame LNF the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame LNF. The current location data pxi and pyi describe the current location pi respectively position of the ego-vehicle 5 within the grid G of the local navigational frame LNF at a current time point ti. Thus, the tensors stored in the buffers of the CNN 3 correspond to the cells of the regular grid G of the local navigational frame LNF.
The grid Gi (area) of Figure 7 corresponds to a current tensor that is stored in a corresponding buffer of the CNN 3 as the newest tensor at the current time point ti. The grid Gi corresponds to the predefined grid (area) of the current ROI Ri at the current time point ti. In Figure 7, the grid Gi is made of the plain cells without any pattern and the cells with a dotted pattern within the corresponding bold frame. The grid Gi-i of Figure 7 corresponds to a previous tensor that is stored in the same buffer of the CNN 3 as the newest tensor at a previous time point ti-i directly before the current time point ti. The grid Gi-i corresponds to the predefined grid (area) of the previous ROI Ri-i at the previous time point ti-i. In Figure 7, the grid Gi-i is made of the cells with a diagonally striped pattern and the plain cells without any pattern within the corresponding bold frame.
In (the grid G of) the local navigational frame LNF the grid Gi of the current time point ti is not congruent to the grid Gi-i of the previous time point ti-i, because the ego-vehicle 5 has moved from the previous location pi-i of the previous time point ti-i to the current location pi of the current time point ti. That is, at the previous time point ti-i the ego-vehicle 5 was at the previous location pi-i and at the current time point ti the ego-vehicle 5 is at the current position pi. As a result of this movement, the ROI changes from the previous ROI Ri-i to the current ROI Ri. The previous location pi-i of the ego-vehicle 5 is described by the previous location data pxi-i, pyi-i and the current location pi of the ego-vehicle 5 is described by the current location data pxi, pyi.
In case the current location data pxi, pyi of the ego-vehicle 5 do not match the previous location data pxi-i, pyi-i inputted together with the data of the previous scan, in each buffer of the CNN the newest tensor being stored is padded and cropped. That is, the device 1 of the ego-vehicle 5 may be configured to input, together with the data of the current scan, current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF to the CNN 3 (the CNN is not shown in Figure 7), and pad and crop (in particular zero-pad and crop) in each buffer of the CNN the newest tensor being stored, in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-i, pyi-i inputted together with the data of the previous scan.
Thus, according to Figure 7, the plain cells within the bold frame correspond to the data of the newest tensor being stored in a buffer that may be re-used despite the location change of the ego-vehicle 5 and, thus, despite the change of the ROI from the previous ROI Ri-i to the current ROI Ri. The cells with the diagonally striped pattern correspond to the data of the newest tensor stored in the buffer that are cropped. The terms “dropped” or “deleted” may be used as synonyms for the term “cropped”. And the cells with the dotted pattern correspond to the data of the newest tensor stored in the buffer that are padded, in particular that are zero-padded. Padding data may be understood as overwriting the values of the data by a predefined value. Thus, zero-padding data may be understood as overwriting the values of the data with zeros.
In case the ego-vehicle 5 moves within the same cell of the grid G (not shown in Figure 7), there is no padding and cropping performed. In other words, in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-i, pyi-i inputted together with the data of the previous scan, but the current location data pxi, pyi and the previous location data pxi-i, pyi-i of the ego-vehicle 5 lie respectively are located within the same cell of the grid G of the local navigational frame LNF, padding and cropping is not performed.
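One possible realisation of the padding and cropping of the newest buffered tensor is to shift its contents by the ego-vehicle's cell offset and zero-fill the cells that newly enter the ROI. The following sketch assumes a [C, H, W] layout with H along y and W along x, and a particular sign convention for the offset; both are assumptions of this illustration and not a literal description of the device 1:

import numpy as np

def pad_and_crop(buffered, dx_cells, dy_cells):
    # Shift the newest buffered tensor of shape [C, H, W] by the cell offset of the
    # ego-vehicle. Cells leaving the ROI are cropped; cells entering it are zero-padded.
    c, h, w = buffered.shape
    shifted = np.zeros_like(buffered)
    src_x = slice(max(0, dx_cells), min(w, w + dx_cells))
    dst_x = slice(max(0, -dx_cells), min(w, w - dx_cells))
    src_y = slice(max(0, dy_cells), min(h, h + dy_cells))
    dst_y = slice(max(0, -dy_cells), min(h, h - dy_cells))
    shifted[:, dst_y, dst_x] = buffered[:, src_y, src_x]
    return shifted

Here dx_cells and dy_cells would be the integer number of grid cells by which the ego-vehicle has moved between the previous and the current time point; if both are zero (movement within the same cell), the returned tensor has the same content as the input, matching the case in which no padding and cropping is performed.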
The device 1 of the ego-vehicle 5 may be configured to generate on the basis of the data of the current scan and the current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF the current tensor that is input to the CNN 3 by performing a change of coordinates of the data of the current scan into the local navigational frame LNF.
Further, the device 1 of the ego-vehicle 5 may be configured to generate on the basis of the data of the current scan and the current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF the current tensor that is input to the CNN 3 by additionally performing a voxelization (not shown in Figure 7).
The location data may be stored in a location field of one or more of the one or more first buffers 4a and the optional second buffer of the CNN, and the location field of the respective one or more buffers is updated (not shown in Figure 7) with the current location data pxi, pyi, in case the current location data pxi, pyi of the ego-vehicle 5 do not match the previous location data pxi-i, pyi-i.
In case the ego-vehicle 5 moves within the same cell of the grid G, there is no updating of the location field of the respective one or more buffers. In other words, in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-i, pyi-i inputted together with the data of the previous scan, but the current location data pxi, pyi and the previous location data pxi-i, pyi-i of the ego-vehicle 5 lie respectively are located within the same cell of the grid G, the location field of the respective one or more buffers is not updated with the current location data pxi, pyi.
Figure 8 shows a diagram of a method according to an embodiment of the invention with respect to storing tensors originating from data of a current scan of a current time point ti in buffers of a CNN according to an embodiment of the invention.
In the following reference is made to the ego-vehicle 5 of Figure 2 and the local navigational frame of Figure 7.
In the first step S81 of the method of Figure 8 the device 1 of the ego-vehicle 5 obtains data of a current scan provided by the one or more 2D or 3D sensors 2 in the form of a current tensor of shape [C, H, W] together with current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF at a current time point ti. “C” is the number of channels. The current tensor is generated by performing a change of coordinates of the data of the current scan into the local navigational frame LNF. Thus, “H” is the size of the current tensor along the y coordinate and “W” is the size of the current tensor along the x coordinate.
In the second step S82 following the first step S81, the device 1 determines whether the current location data pxi, pyi do not match the previous location data pxi-i, pyi-i of the ego-vehicle 5. That is, the device 1 determines whether the location of the ego-vehicle 5 has changed. In particular, the device 1 determines whether the current location data pxi, pyi do not match the previous location data pxi-i, pyi-i of the ego-vehicle 5 and whether the current location data pxi, pyi and the previous location data pxi-i, pyi-i do not lie within the same cell of the grid G of the local navigational frame LNF. That is, the device 1 in particular determines whether the location of the ego-vehicle 5 has changed such that the current location of the ego-vehicle is in a different cell compared to the cell of the previous location. In case the determination of the second step S82 yields a “YES” (the location of the ego-vehicle has changed), the method proceeds to the third step S83. In case the determination of the second step S82 yields a “NO” (the location of the ego-vehicle has not changed or the location of the ego-vehicle is still in the same cell), the method proceeds to the fifth step S85.
In the third step S83, the device 1 pads and crops, in particular zero-pads and crops, in each buffer of the CNN 3 the newest tensor being stored to match the current location data pxi, pyi and, thus, the current ROI Ri. That is, the tensor that is stored at the previous time point ti-i directly before the current time point ti in a respective buffer as the newest tensor and that is still stored at the third step S83 as the newest tensor of the respective buffer is padded and cropped.
In the fourth step S84 following the third step S83, the device 1 updates the location field with the current location data pxi, pyi of the ego-vehicle 5 in the buffers comprising a location field.
In the fifth step S85, the device 1 stores in each buffer the corresponding tensor originating from the data of the current scan as the newest tensor and simultaneously drops respectively deletes the oldest tensor in the buffers that are full. In particular, the device 1 stores in each first buffer 4a the output tensor of the corresponding layer (preceding layer) originating from the data of the current scan as the newest tensor. Moreover, in case the CNN 3 of the device 1 comprises a second buffer, the device 1 stores in the second buffer the current tensor originating from the data of the current scan as the newest tensor.
In particular, the method of Figure 8 is repeated by the device 1 of the ego-vehicle 5 for each new time point respectively new scan.
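Put together, the per-scan handling of Figure 8 could be sketched as follows; the buffer interface (push, slots, location) and the helper pad_and_crop follow the earlier sketches and are assumptions of this illustration, not the literal implementation of the device 1:

def process_scan(buffers, tensors_for_buffers, loc, prev_loc, cell_size):
    # S82: determine whether the ego-vehicle has moved into a different cell of the grid G.
    cell = (int(loc[0] // cell_size), int(loc[1] // cell_size))
    prev_cell = (int(prev_loc[0] // cell_size), int(prev_loc[1] // cell_size))
    if cell != prev_cell:
        dx, dy = cell[0] - prev_cell[0], cell[1] - prev_cell[1]
        for buf in buffers:
            # S83: pad and crop the newest stored tensor to match the current ROI.
            if len(buf.slots) > 0:
                buf.slots[-1] = pad_and_crop(buf.slots[-1], dx, dy)
            # S84: update the location field of buffers that carry one.
            buf.location = loc
    # S85: store the tensor originating from the current scan as the newest tensor;
    # a full buffer drops its oldest tensor automatically (cf. the TemporalBuffer sketch above).
    for buf, tensor in zip(buffers, tensors_for_buffers):
        buf.push(tensor)

Here tensors_for_buffers would contain, per first buffer 4a, the output tensor of its preceding layer for the current scan and, for an optional second buffer 4b, the current input tensor epc(ti).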
Figure 9 shows a diagram of a method of employing a Convolutional Neural Network (CNN) in an inference phase according to an embodiment of the invention for processing a data stream of scans containing spatial information provided by one or more 2D or 3D sensors configured to measure distance. The method of Figure 9 may be performed by the device 1 of Figure 2. The above description of the device according to the first aspect and its implementation forms, the above description of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the method of Figure 9.
According to Figure 9, the method step S101 and the optional method step S102 may be performed on the basis of data pc(ti) of a current scan provided by a 2D or 3D sensor configured to measure distance, such as a LIDAR sensor, at a current time point ti. The 2D or 3D sensor is arranged on an ego-vehicle, such as the ego-vehicle 5 of Figure 2. In the step S101, the current tensor epc(ti), that is input to the Convolutional Neural Network 3 (CNN), is generated on the basis of the data pc(ti) of the current scan and the current location data pxi, pyi of the ego-vehicle in a grid of a local navigational frame, such as the local navigational frame of Figure 7, by performing a change of coordinates of the data pc(ti) into the local navigational frame. On the basis of the data pc(ti) of the current scan and the current location data pxi, pyi of the ego-vehicle in the grid of the local navigational frame, the current tensor epc(ti), that is input to the Convolutional Neural Network 3 (CNN), may be generated by additionally performing a voxelization in the optional method step S102.
The CNN 3 is configured to generate on the basis of the current tensor epc(ti) the current output data OUT(ti), on the basis of which a navigation process of the ego-vehicle may be performed.
The CNN 3 shown in Figure 9 may correspond to any CNN 3 of Figures 2, 3A, 3B, 4, 5 and 6. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 9.
Optionally, in a method step S103, the current output data OUT(ti) may be decoded and a non-maximum suppression may be performed thereon. On the basis of the processing result of the method step S103 detected object boxes may be provided. The size of the data pc(ti) being a point cloud is “M * 4”, wherein “M” is the number of points in the point cloud and the “4” indicates that each point in the point cloud comprises three spatial dimensions (point cloud with three spatial dimensions) and one attribute, such as a scalar reflection brightness attribute.
The size of the current location data pxi, pyi is two (“2”) because the location of the ego-vehicle in the local navigational frame, being a 2-dimensional coordinate system according to the embodiment of Figure 9, is described by two coordinates (x and y coordinate).
The size of the current tensor epc(ti) being an encoded point cloud is “C*H*W”. “C” is the number of channels of the current tensor epc(ti) (number of channels in the encoded point cloud), wherein C is greater than or equal to one (C ≥ 1). “H” is the size of the current tensor epc(ti) along the y coordinate of the local navigational frame and “W” is the size of the current tensor along the x coordinate of the local navigational frame. “H” and “W” define the spatial resolution of the area of the region of interest (ROI) of the grid of the local navigational frame. For example, “H” and “W” may each be 280 cells of the grid of the local navigational frame, such that the ROI corresponds to an area of 140 m x 140 m, in case the cells of the regular grid of the local navigational frame each are of size 0.5 m x 0.5 m.
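A minimal sketch of how a point cloud of size “M * 4” may be voxelized into an encoded tensor of size “C*H*W” over the ROI grid is given below; the two channels chosen here (cell occupancy and mean reflectivity) are assumptions of this illustration and only one of many possible encodings:

import numpy as np

def voxelize(points, roi_origin_xy, cell_size, h, w):
    # points: array of shape [M, 4] with columns x, y, z, reflectivity in the
    # local navigational frame; returns an encoded tensor of shape [2, H, W].
    epc = np.zeros((2, h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    ix = ((points[:, 0] - roi_origin_xy[0]) / cell_size).astype(int)
    iy = ((points[:, 1] - roi_origin_xy[1]) / cell_size).astype(int)
    inside = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)
    for x, y, r in zip(ix[inside], iy[inside], points[inside, 3]):
        epc[0, y, x] = 1.0   # channel 0: cell occupancy
        epc[1, y, x] += r    # channel 1: accumulated reflectivity
        counts[y, x] += 1.0
    # Average the reflectivity per occupied cell.
    epc[1] = np.divide(epc[1], counts, out=np.zeros_like(epc[1]), where=counts > 0)
    return epc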
The size of the output data OUT(ti) is “C2*H*W”, wherein “H” and “W” are described as above. “C2” is the number of channels of the output data OUT(ti), which may be the same as or different from the number of channels (“C”) of the current tensor epc(ti).
“B” is the number of detected objects and “S” is the size of the metadata related to one object.
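The decoding and non-maximum suppression of the optional step S103 is not detailed further above; as a purely illustrative sketch, a standard axis-aligned non-maximum suppression over B candidate boxes could look as follows (the box format and the overlap threshold are assumptions of this illustration):

def iou(a, b):
    # Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring boxes, discarding candidates that overlap a kept box too much.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices of the detected object boxes that are retained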
The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A device (1) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor (2) configured to measure distance, wherein the device (1) is configured to employ a Convolutional Neural Network (3), CNN, in an inference phase, the CNN comprising a first layer (LI) and one or more further layers (L2a, L2b, L2c) following the first layer (LI) and one or more first buffers (4a) for storing an output tensor of a respective preceding layer (LI; L2a; L2b), wherein the device (1) is configured to input data (pc(ti)) of a current scan provided by the 2D or 3D sensor (2) in the form of a current tensor (epc(ti)) into the CNN (3), perform, at each further layer (L2a; L2b; L2c), one or more convolutional operations on the basis of a current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) originating from the data (pc(ti)) of the current scan and a previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b), the previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b) being a newest tensor stored in the respective first buffer (4a) of the one or more first buffers (4a) and originating from data of a previous scan that was input to the CNN directly before the data (pc(ti)) of the current scan, and store the current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) as the newest tensor in the respective first buffer (4a).
2. The device (1) according to claim 1, wherein the CNN (3) comprises a second buffer (4b) for storing a tensor (epc(ti-i), epc(ti)) input to the CNN (3), and the device (1) is configured to perform, at the first layer (LI), one or more convolutional operations on the basis of the current tensor (epc(ti)) and a previous tensor (epc(ti-i)), generating a current output tensor (al(ti)) of the first layer (LI) originating from the data (pc(ti)) of the current scan, the previous tensor (epc(ti-i)) being the newest tensor stored in the second buffer (4b) and corresponding to the data of the previous scan, and store the current tensor (epc(ti)) as the newest tensor in the second buffer (4b).
3. The device (1) according to claim 1 or 2, wherein each buffer (4a, 4b) is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.
4. The device (1) according to any one of the preceding claims, wherein the temporal size of each of the one or more first buffers (4a) for storing the output tensor of a respective preceding layer (LI, L2a, L2b) is one less than the temporal size of a convolutional kernel of the respective preceding layer (LI, L2a, L2b), and in particular one or more buffers (4a, 4b) have a different temporal size.
5. The device (1) according to any one of the preceding claims, wherein each tensor is a tensor with two or more dimensions, in particular with one or more spatial dimensions and one channel dimension.
6. The device (1) according to any one of the preceding claims, wherein each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension; a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension; or a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
7. The device (1) according to any one of the preceding claims, wherein the channel dimension of each first buffer (4a) for storing the output tensor of a respective preceding layer (LI, L2a, L2b) corresponds to the channel dimension of the respective preceding layer (LI, L2a, L2b).
8. The device (1) according to any one of the preceding claims configured to generate output data for a navigation process of an ego-vehicle (5), on which the 2D or 3D sensor (2) is arranged, wherein the device (1) is configured to input, together with the data (pc(ti)) of the current scan, current location data (pxi, pyi) of the ego-vehicle in a grid (G) of a local navigational frame (LNF) to the CNN (3), and pad and crop (S83) in each buffer (4a, 4b) the newest tensor being stored, in case the current location data (pxi, pyi) of the ego-vehicle do not match previous location data (pxi-i, pyi-i) inputted together with the data of the previous scan.
9. The device (1) according to claim 8, wherein the device is configured to store the location data in a location field of one or more of the one or more first buffers (4a) and the second buffer (4b), and update (S84) the location field of the respective one or more buffers with the current location data in case the current location data (pxi, pyi) of the ego-vehicle do not match the previous location data (pxi-i, pyi-i).
10. The device (1) according to claim 8 or 9, wherein the device (1) is configured to generate (S101) on the basis of the data (pc(ti)) of the current scan and the current location data (pxi, pyi) of the ego-vehicle (5) in the grid (G) of the local navigational frame (LNF) the current tensor input (epc(ti)) to the CNN (3) by performing a change of coordinates of the data (pc(ti)) of the current scan into the local navigational frame (LNF).
11. The device (1) according to claim 10, wherein the device (1) is configured to generate on the basis of the data (pc(ti)) of the current scan and the current location data (pxi, pyi) of the ego-vehicle (5) in the grid (G) of the local navigational frame (LNF) the current tensor input (epc(ti)) to the CNN (3) by additionally performing a voxelization (S102).
12. The device (1) according to any one of the preceding claims, wherein the device (1) is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor (2), perform a point cloud semantic segmentation, and/or perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN (3).
13. A device comprising a Convolutional Neural Network (3), CNN, for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor (2) configured to measure distance, the CNN (3) comprising a first layer (LI) and at least one or more further layers (L2a, L2b, L2c) following the first layer (LI) and one or more first buffers (4a) for storing an output tensor of a respective preceding layer (LI; L2a; L2b), wherein the hardware implementation of the CNN is configured to input data (pc(ti)) of a current scan provided by the 2D or 3D sensor (2) in the form of a current tensor (epc(ti)) into the CNN (3), perform, at each further layer (L2a; L2b; L2c), one or more convolutional operations on the basis of a current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) originating from the data (pc(ti)) of the current scan and a previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b), the previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b) being the newest tensor stored in the respective first buffer (4a) of the one or more first buffers (4a) and originating from data of a previous scan that was input to the CNN (3) directly before the data (pc(ti)) of the current scan, and store the current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) as the newest tensor in the respective first buffer (4a).
14. An ego-vehicle (5) comprising one or more 2D or 3D sensors (2) configured to measure distance, and a device (1) according to any one of claims 1 to 12, wherein the one or more 2D or 3D sensors (2) are configured to provide scans containing spatial information of the vicinity of the ego-vehicle (5) in the form of a data stream to the device (1) and the device (1) is configured to process the data stream.
15. The ego-vehicle (5) according to claim 14, wherein the device (1) is configured to control autonomous movement of the ego-vehicle (5) on the basis of the processing of the data stream.
16. A method of employing a Convolutional Neural Network (3), CNN, in an inference phase, for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor (2) configured to measure distance, the CNN (3) comprising a first layer (LI) and at least one or more further layers (L2a, L2b, L2c) following the first layer (LI) and one or more first buffers (4a) for storing an output tensor of a respective preceding layer (LI; L2a; L2b), wherein the method comprises the steps of inputting data (pc(ti)) of a current scan provided by the 2D or 3D sensor (2) in the form of a current tensor (epc(ti)) into the CNN (3), performing, at each further layer (L2a; L2b; L2c), one or more convolutional operations on the basis of a current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) originating from the data (pc(ti)) of the current scan and a previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b), the previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b) being the newest tensor stored in the respective first buffer (4a) of the one or more first buffers (4a) and originating from data of a previous scan that was input to the CNN (3) directly before the data of the current scan, and storing the current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) as the newest tensor in the respective first buffer (4a).
17. A computer program comprising program code for performing, when implemented on a processor, a method according to claim 16.
18. A computer comprising a memory and a processor, which are configured to store and execute program code to perform the method according to claim 16.