WO2021242132A1 - Processing a data stream of scans containing spatial information provided by a 2d or 3d sensor configured to measure distance by using a convolutional neural network (cnn) - Google Patents

Processing a data stream of scans containing spatial information provided by a 2d or 3d sensor configured to measure distance by using a convolutional neural network (cnn)

Info

Publication number
WO2021242132A1
WO2021242132A1 (Application PCT/RU2020/000256)
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
cnn
current
data
layer
Prior art date
Application number
PCT/RU2020/000256
Other languages
French (fr)
Inventor
Dmitrii Akhmirovich KHIZBULLIN
Mikhail Viktorovich PIKHLETSKY
Sergey Valerevich MOROZOV
Xinli HAN
Zuguang WU
Peng Zhou
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/RU2020/000256 (WO2021242132A1)
Priority to CN202080101444.8A (CN115836299A)
Publication of WO2021242132A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • the present disclosure relates to a device for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, wherein the device is configured to employ a Convolutional Neural Network (CNN) in an inference phase.
  • the present disclosure further relates to an ego-vehicle comprising such a device and one or more 2D or 3D sensors configured to measure distance.
  • the present disclosure furthermore relates to a hardware implementation of a Convolutional Neural Network (CNN) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • the present disclosure relates to a method of employing a Convolutional Neural Network (CNN) in an inference phase for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • the present disclosure relates to a computer program comprising program code for performing such a method, to a non-transitory storage medium storing executable program code which, when executed by a processor, causes such a method to be performed, and to a computer comprising a memory and a processor, which are configured to store and execute program code to perform such a method.
  • a data stream of scans containing spatial information provided by a 2D or 3D distance sensor can be processed by means of a neural network.
  • the spatial information is spatial information about an environment or vicinity of the 2D or 3D sensor. Objects in the environment of the 2D or 3D sensor can be detected on the basis of a processing result of the processing of the data stream.
  • the 2D or 3D sensor may be installed on an ego-vehicle.
  • the spatial information provided by the 2D or 3D sensor corresponds to spatial information about the environment of the ego-vehicle and, thus, on the basis of the processing result of the processing of the data stream, objects such as cars, pedestrians, bicyclists, motorcyclists, etc. in the environment of the ego-vehicle may be detected.
  • ego vehicle refers to a vehicle that is equipped with one or more sensors (e.g. one or more 2D or 3D distance sensors) for sensing an environment of the vehicle and which operates based on data from those sensors and not necessarily based on any other data about its environment. In other words, an ego vehicle operates based on its own “view” of its environment.
  • Figure 1 shows an example of an operation of a Convolutional Neural Network (CNN) 3 at a current time point ti.
  • The term “present” may be used as a synonym for the term “current”. That is, e.g. the term “present time point” may be used as a synonym for the term “current time point”.
  • the CNN 3 according to Figure 1 comprises four layers L1, L2a, L2b and L2c, wherein at each layer one or more convolutional operations are performed.
  • data pc(ti) of a current scan are provided by the 2D or 3D sensor and one or more convolutional operations are performed at each layer of the CNN 3 on the basis of a current tensor originating from the data pc(ti) of the current scan (scan provided at time point ti) and a previous tensor originating from data pc(ti-1) of a previous scan (scan provided at time point ti-1) inputted to the CNN 3 directly before the data of the current scan, wherein the current tensor and the previous tensor are provided to the respective layer.
  • the one or more convolutional operations of a layer are indicated in Figure 1 by two arrows originating from the respective two tensors on the basis of which the one or more convolutional operations are performed.
  • the data pc(ti) of the scan of the time point ti is provided as the data of the current scan by the 2D or 3D sensor and the tensor epc(ti) originating from the data pc(ti) of the current scan is input to the CNN 3 and, thus, is provided to the first layer L1.
  • the previous tensor epc(ti-1) originating from the data pc(ti-1) of the previous scan provided by the 2D or 3D sensor at the previous time point ti-1 directly before the current scan of the time point ti is provided to the first layer L1, wherein at the first layer L1 one or more convolutional operations of the first layer L1 are performed on the basis of the current tensor epc(ti) and the previous tensor epc(ti-1).
  • In order to be able to use, at the current time point ti, the previous tensor epc(ti-1) for the one or more convolutional operations of the first layer L1, the previous tensor epc(ti-1) has to be generated again, i.e. re-computed, at the current time point ti. In the case of the first layer L1 this requires performing a voxelization VX on the basis of the data pc(ti-1) of the previous scan.
  • the voxelization VX is indicated in Figure 1 by a single arrow originating from the respective data on the basis of which the voxelization is performed.
  • the following re-computation has to be done: a voxelization VX of data pc(ti-1), pc(ti-2), pc(ti-3) and pc(ti-4) of four directly subsequent previous scans that were provided by the 2D or 3D sensor at the previous time points ti-1 to ti-4, as well as the re-computation of the corresponding previous output tensors of the layers L1, L2a and L2b from these voxelized tensors.
  • the CNN 3 has four layers L1, L2a, L2b and L2c. This is only an example. According to an embodiment of the invention, the CNN 3 may have more than four layers. The greater the number of layers of the CNN, the greater the number of the above-mentioned re-processing, respectively re-computation, steps that have to be performed on the basis of previous data of previous scans when processing the data of a current scan at a current time point ti using the CNN 3.
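  • As an illustration only (not part of the patent text), the re-computation burden of the scheme of Figure 1 can be sketched as follows: at every time point, all raw scans still inside the temporal receptive field are re-voxelized and every layer is re-run on them. Function names and tensor contents below are placeholders.

```python
# Hypothetical sketch of the prior-art operation of Figure 1: the tensors of
# previous time points are not buffered, so they are re-computed from the raw
# scans at every time step. 'scans' holds the raw scans kept for this purpose.

def voxelize(scan):
    # placeholder voxelization VX; a real implementation would rasterize the
    # point cloud into a regular grid (see the voxelization sketch further below)
    return list(scan)

def conv(current, previous):
    # placeholder for the one or more convolutional operations of a layer,
    # combining the tensors of the current and the previous time point
    return [c + p for c, p in zip(current, previous)]

def run_naive(scans, num_layers=4):
    """Return the network output for the newest scan, re-computing everything."""
    tensors = [voxelize(s) for s in scans]            # re-voxelize all kept scans
    for _ in range(num_layers):
        tensors = [conv(tensors[i], tensors[i - 1])   # re-run the layer for every
                   for i in range(1, len(tensors))]   # kept time point
    return tensors[-1]                                # output for the current scan

# usage: five kept scans are needed to evaluate a four-layer network once
print(run_naive([[1.0], [2.0], [3.0], [4.0], [5.0]]))
```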
  • a CNN for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance has the disadvantage of requiring a high amount of computational resources. This is especially the case when the 2D or 3D sensor provides scans at a frame rate of at least 10 to 20 frames per second.
  • The terms “frame” and “scan” may be used as synonyms.
  • embodiments of the present invention aim to improve the processing of a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance with respect to the amount of computational resources required.
  • An objective is to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • a first aspect of the present disclosure provides a device for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, wherein the device is configured to employ a Convolutional Neural Network (CNN) in an inference phase.
  • the CNN comprises a first layer and one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer.
  • the device is configured to input data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN. That is, the device is configured to input the data of the current scan in the form of the current tensor into the CNN.
  • the device is configured to perform, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer. That is, the device is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer and the previous output tensor of the preceding layer.
  • the previous output tensor of the preceding layer is the newest tensor stored in the respective first buffer of the one or more first buffers and originates from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the device is configured to store the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
  • the device makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN comprises one or more first buffers for storing the output tensor of a respective preceding layer, the need for re-computation of one or more previous tensors, in particular output tensors, of respective preceding layers for performing one or more convolutional operations at each further layer at a current time point is overcome.
  • the device is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer originating from the data of the current scan and the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer.
  • the inference time corresponds to the time required by the CNN for providing output data (the output tensor of the last layer of the CNN) starting from the current tensor originating from the data of the current scan and being input into the CNN at the current time.
  • the computational costs may be reduced from K²/2, wherein K is the number of aggregated scans, respectively sweeps, provided by the 2D or 3D sensor, e.g. the number of aggregated LIDAR sweeps. Therefore, the device according to the first aspect allows real-time inference for high values of K, e.g. between 10 and 100 scans.
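  • As a back-of-the-envelope count (an illustrative assumption, not a formula stated in the patent): if, without buffers, every one of the K aggregated scans has to be re-processed for each newer scan, while the buffered scheme processes each scan only once, the work compares roughly as:

```latex
% Illustrative operation count over K aggregated scans:
% without buffers, processing scan k re-computes on the order of k older tensors;
% with the buffers of the first aspect, every scan is processed exactly once.
\[
\begin{aligned}
W_{\text{without buffers}} &\propto \sum_{k=1}^{K} k = \frac{K(K+1)}{2} \approx \frac{K^{2}}{2},\\
W_{\text{with buffers}} &\propto K.
\end{aligned}
\]
```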
  • the passage “a tensor input into the CNN” may be understood as “a tensor that is input into the CNN”.
  • the passage “a tensor input to the CNN” may be used as a synonym for the passage “a tensor input into the CNN”.
  • the first layer and the one or more further layers of the CNN are the layers of the CNN. Descriptions made herein with respect to a layer referred to by the general term “layer” are valid for the first layer as well as the one or more further layers.
  • the CNN comprises one or more optional additional layers, wherein the device is configured to perform, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN. Descriptions made herein with respect to a layer referred to by the general term “layer” may also be valid for the one or more optional additional layers.
  • the 2D or 3D sensor configured to measure distance may comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the 2D or 3D sensor may comprise or correspond to one or more visual-depth-capable sensors. The 2D or 3D sensor may provide scans at a frame rate of at least 10 to 20 frames per second.
  • the term “2D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with two spatial dimensions (2-dimensional scans respectively frames) and to measure distance.
  • the term “3D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with three spatial dimensions (3-dimensional scans respectively frames) and to measure distance.
  • the CNN is a feed-forward neural network that may be described with a directed acyclic graph (DAG).
  • An advantage of a DAG neural network is that it has a finite impulse response operator and relates to finite impulse response filters (FIR filters).
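  • To make the FIR analogy concrete (illustrative equation, not quoted from the patent): an FIR filter of order M computes its output from only the last M+1 input samples, just as a layer with temporal kernel size T combines the current tensor with the T-1 buffered tensors of the directly preceding time points.

```latex
% FIR filter of order M: the output depends only on the last M+1 input samples.
\[
y[n] = \sum_{k=0}^{M} h[k]\, x[n-k]
\]
```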
  • a buffer is configured to buffer respectively store data, such as tensors.
  • a buffer may be a data structure for buffering data, in particular tensors.
  • the terms “buffer storage”, “rolling buffer” and “rolling buffer storage” may be used to refer to a buffer.
  • one or more optional further operations such as one or more normalization operations and/or one or more activation operations, may be performed.
  • The terms “activation of a layer” or “layer output” may be used as synonyms for the output tensor of a layer.
  • the CNN comprises a second buffer for storing a tensor input to the CNN
  • the device is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan.
  • the previous tensor is the newest tensor stored in the second buffer and corresponds to the data of the previous scan.
  • the device is configured to store the current tensor as the newest tensor in the second buffer.
  • the device makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • Since the CNN comprises a second buffer for storing the tensor input to the CNN, the need for re-computation of one or more previous tensors input to the CNN on the basis of the corresponding data of the corresponding previous scan for performing one or more convolutional operations at the first layer at a current time point is overcome.
  • the device is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and the previous tensor being the newest tensor stored in the second buffer.
  • Descriptions made herein with respect to a buffer referred to by the general term “buffer” are valid for the one or more first buffers as well as for the second buffer.
  • the number of first buffers of the CNN corresponds to the number of further layers of the CNN.
  • the number of buffers of the CNN corresponds to the number of one or more first buffers and the second buffer.
  • each buffer is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.
  • each buffer is a serial-in parallel-out (SIPO) shift register.
  • the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
  • one or more buffers have a different temporal size.
  • the term “convolutional matrix” may be used to refer to a convolutional kernel.
  • the convolutional kernel is used at a layer of the CNN for performing the one or more convolutional operations of the layer.
  • the temporal size of a first buffer corresponds to the number of consecutive, respectively directly subsequent, time points for which the first buffer is configured to store the output tensor of the respective layer (preceding layer). Therefore, the temporal size may also be defined in terms of the number of directly subsequent scans of the 2D or 3D sensor for which the first buffer may store the output tensor of the respective layer (preceding layer). For example, if the temporal size of a first buffer is one (“1”) then the first buffer may only store one output tensor of a respective layer originating from data of one scan of one time point. This reduces the storage consumption, such as RAM consumption, to a minimum.
  • In this case, when storing the current output tensor (originating from data of a current scan of a current time point) of a layer (preceding layer) as the newest tensor in the respective first buffer, the previous output tensor of the layer (originating from data of the previous scan inputted to the CNN directly before the data of the current scan) already stored in the respective first buffer is dropped, respectively deleted, because the first buffer is already full. For example, if the temporal size of a first buffer is three (“3”) then the first buffer may store three output tensors of a respective layer (preceding layer) originating from data of three consecutive scans of three consecutive time points.
  • the temporal size of a buffer corresponds to the number of directly subsequent time points for which the buffer may store a tensor. Accordingly, the temporal size of a buffer corresponds to the number of directly subsequent scans of the 2D or 3D sensor for which the buffer may store a corresponding tensor.
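  • A minimal sketch of such a serial-in parallel-out rolling buffer follows (the class name, the deque-based implementation and the tensor shapes are assumptions made for illustration): the temporal size equals the temporal kernel size of the layer minus one, so that the buffered tensors plus the current tensor form the full temporal window of the convolution.

```python
from collections import deque

import numpy as np


class RollingTensorBuffer:
    """Serial-in parallel-out buffer: pushing a new tensor as the newest one
    automatically drops the oldest tensor once the temporal size is reached."""

    def __init__(self, temporal_size):
        # temporal_size = temporal kernel size of the preceding layer minus one
        self.slots = deque(maxlen=temporal_size)

    def push(self, tensor):
        self.slots.append(tensor)               # serial in, oldest dropped

    def window(self, current_tensor):
        # parallel out: all buffered tensors plus the current one,
        # oldest first, i.e. the temporal window seen by the convolution
        return list(self.slots) + [current_tensor]


# usage: a layer with temporal kernel size 2 needs a buffer of temporal size 1
buf = RollingTensorBuffer(temporal_size=1)
buf.push(np.zeros((64, 64, 32)))                # previous output tensor (H, W, C)
window = buf.window(np.ones((64, 64, 32)))      # current output tensor
print(len(window))                              # -> 2 tensors for the convolution
```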
  • each tensor is a tensor with two or more dimensions.
  • each tensor is a tensor with one or more spatial dimensions and one channel dimension.
  • the number of spatial dimensions of a tensor may be equal to the number of spatial dimensions of the data of a corresponding scan provided by the 2D or 3D sensor configured to measure distance from which the tensor originates.
  • the channel dimension corresponds to the number of channels and is greater than or equal to one (channel dimension ≥ 1).
  • the data of a scan corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions.
  • the number N of spatial dimensions may correspond to one spatial dimension, two spatial dimensions or three spatial dimensions.
  • the data with the two spatial dimensions may correspond to a bird’s eye view (BEV) representation of a point cloud.
  • the data with the three spatial dimensions may correspond to a volumetric representation of a point cloud, e.g. for a flying capable vehicle, such as a flying capable robot, drone, aircraft, etc.
  • a point cloud with three spatial dimensions may be produced by a 2D or 3D sensor configured to measure distance, such as a LIDAR sensor, wherein a wave or a beam, such as a light beam, bounces back from an obstacle in the environment of the 2D or 3D sensor to produce a point with three spatial dimensions with its location in meters and a scalar reflection brightness attribute.
  • a point cloud comprising a plurality, e.g. thousands, of points with three spatial dimensions and a scalar reflection brightness attribute may correspond to a tensor with four dimensions. The four dimensions correspond to the three spatial dimensions and one channel dimension for the scalar reflection brightness attribute.
  • each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension.
  • each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension.
  • each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
  • the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
  • the device is configured to generate output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, wherein the device is configured to input, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
  • ego-vehicle may be understood as a mobile platform, in particular a mobile robotic platform, bearing one or more sensors, such as one or more 2D or 3D sensors configured to measure distance, that computes, respectively operates, and from the perspective of which the world, respectively the environment, is perceived.
  • the ego-vehicle may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
  • the ego-vehicle may comprise a localization unit configured to determine the current location of the ego-vehicle and, thus, the current location data of the ego-vehicle.
  • the localization unit may be configured for a short-term localization, e.g. at the scope of 1s to 10s.
  • the localization unit may comprise or correspond to one or more inertial measurement units (IMU).
  • the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
  • An odometry process is a process of understanding an ego location, i.e. the location of the ego-vehicle, based on sensory (e.g. wheel, inertial) information.
  • the term “local navigational frame” may be understood as a coordinate system tied to the ground.
  • the local navigational frame may correspond to a 2-dimensional coordinate system with top-down view on the ground surface (as shown in Figure 7).
  • the grid of the local navigational frame is a regular grid composed of a plurality of cells.
  • the term “local navigational coordinate frame ” may be used as a synonym for the local navigational frame.
  • the device is configured to zero-pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
  • the grid of the local navigational frame is a regular grid composed of a plurality of cells and in case the ego-vehicle moves within the same cell of the grid, there is no padding and cropping performed by the device.
  • the device does not perform padding and cropping.
  • the device is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
  • the device is configured to store the location data in a location field of one or more of the one or more first buffers and the second buffer, and update the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
  • the grid of the local navigational frame is a regular grid composed of a plurality of cells and in case the ego-vehicle moves within the same cell of the grid, there is no updating of the location field of the respective one or more buffers.
  • in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan, but the current location data and the previous location data of the ego-vehicle lie, respectively are located, within the same cell of the grid, the device does not update the location field of the respective one or more buffers with the current location data.
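  • The padding and cropping can be pictured as shifting every buffered tensor so that it stays registered to the grid of the local navigational frame when the ego-vehicle enters a new cell. A minimal sketch under the assumption of a bird's eye view tensor of shape (H, W, C) and a displacement given in whole grid cells (all names are illustrative):

```python
import numpy as np


def realign_buffered_tensor(tensor, shift_cells):
    """Zero-pad and crop a buffered tensor of shape (H, W, C) so that it stays
    aligned with the local navigational grid after the ego-vehicle has moved
    by shift_cells = (rows, cols) whole grid cells. The same operation would
    be applied to the newest tensor stored in every buffer."""
    dy, dx = shift_cells
    h, w, _ = tensor.shape
    out = np.zeros_like(tensor)                       # zero padding by default
    src_y, dst_y = slice(max(0, dy), min(h, h + dy)), slice(max(0, -dy), min(h, h - dy))
    src_x, dst_x = slice(max(0, dx), min(w, w + dx)), slice(max(0, -dx), min(w, w - dx))
    out[dst_y, dst_x, :] = tensor[src_y, src_x, :]    # crop the overlapping region
    return out


# usage: the ego-vehicle moved by one cell along the first grid axis
shifted = realign_buffered_tensor(np.random.rand(128, 128, 8), shift_cells=(1, 0))
```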
  • the device is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
  • the device is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
  • the device is configured to generate on the basis of the data of the current scan the current tensor input to the CNN by performing a voxelization.
  • the data of the current scan corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions.
  • the device may be configured to generate on the basis of the point cloud of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the point cloud of the current scan into the local navigational frame.
  • the current tensor input to the CNN may correspond to an encoded point cloud.
  • An encoded point cloud may be understood as the result of the transformation of a raw respectively unordered point cloud into a voxelized respectively ordered format.
  • Voxelization is a transformation of an unordered set of points, such as the unordered points of a point cloud into a regular grid.
  • voxelization is a transformation of an unordered set of points with N spatial dimensions, such as the unordered points of a point cloud with N spatial dimensions, into a regular N-dimensional grid.
  • Voxelization may be performed by pillar encoding or voxel feature encoding.
  • the number N of spatial dimensions may correspond to one spatial dimension, two spatial dimensions or three spatial dimensions.
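  • A minimal voxelization sketch, assuming a point cloud of shape (N, 4) with three spatial coordinates and one reflectance attribute and a simple mean-reflectance encoding per voxel (this stands in for pillar encoding or voxel feature encoding; the grid parameters and names are illustrative assumptions):

```python
import numpy as np


def voxelize(points, grid_shape=(128, 128, 16), cell_size=0.5, origin=(0.0, 0.0, 0.0)):
    """Transform an unordered point cloud of shape (N, 4) = (x, y, z, reflectance)
    into a regular grid: a tensor with three spatial dimensions and one channel
    dimension holding the mean reflectance of the points falling into each voxel."""
    grid = np.zeros(grid_shape + (1,), dtype=np.float32)
    counts = np.zeros(grid_shape, dtype=np.int32)
    idx = np.floor((points[:, :3] - np.asarray(origin)) / cell_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    for (ix, iy, iz), refl in zip(idx[inside], points[inside, 3]):
        grid[ix, iy, iz, 0] += refl
        counts[ix, iy, iz] += 1
    grid[..., 0] = grid[..., 0] / np.maximum(counts, 1)   # mean reflectance per voxel
    return grid                                            # encoded point cloud (X, Y, Z, 1)


# usage: 1000 random points with a reflectance attribute
cloud = np.random.rand(1000, 4) * np.array([64.0, 64.0, 8.0, 1.0])
print(voxelize(cloud).shape)                               # -> (128, 128, 16, 1)
```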
  • the device is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the device is configured to perform a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the device is configured to perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the device is configured to detect weak target objects in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • a weak target object is an object with low spatial information but significant temporal information.
  • a weak target object is a moving object, such as a car, pedestrian including a child, bicyclist including a child, motorcyclist, etc., that has few points, in particular between 1 and 5 points, more particularly between 1 and 10 points, detectable by the 2D or 3D sensor.
  • a weak target object is an object that has few LIDAR echo points, in particular between 1 and 5 LIDAR echo points, more particularly between 1 and 10 LIDAR echo points.
  • In fog, rain and snow, as well as in a case of heavy occlusions, even objects that are normally well detectable or visible (by the 2D or 3D sensor, such as a LIDAR sensor) may become weak target objects.
  • a weak target object may be defined as a moving object, e.g. from a list of known object categories (classes), that has between 1 and 5, in particular between 1 and 10, LIDAR echo points falling on it.
  • a second aspect of the present disclosure provides a hardware implementation of a Convolutional Neural Network (CNN) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • the CNN comprises a first layer and at least one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer.
  • the hardware implementation of the CNN is configured to input data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN.
  • the hardware implementation of the CNN is configured to perform, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer.
  • the previous output tensor of the preceding layer is the newest tensor stored in the respective first buffer of the one or more first buffers and originates from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the hardware implementation of the CNN is configured to store the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
  • the CNN comprises one or more optional additional layers, wherein the hardware implementation of the CNN is configured to perform, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN.
  • the CNN comprises a second buffer for storing a tensor input to the CNN
  • the hardware implementation of the CNN is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan.
  • the previous tensor is the newest tensor stored in the second buffer and corresponds to the data of the previous scan.
  • the hardware implementation of the CNN is configured to store the current tensor as the newest tensor in the second buffer.
  • each buffer is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.
  • the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
  • one or more buffers have a different temporal size.
  • each tensor is a tensor with two or more dimensions.
  • each tensor is a tensor with one or more spatial dimensions and one channel dimension.
  • each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension.
  • each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension.
  • each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
  • the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
  • the hardware implementation of the CNN is configured to generate output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, wherein the hardware implementation of the CNN is configured to input, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
  • the hardware implementation of the CNN is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
  • the hardware implementation of the CNN is configured to store the location data in a location field of one or more of the one or more first buffers and the second buffer, and update the location field of the respective one or more buffers with the current location data in case the current location data of the ego- vehicle do not match the previous location data.
  • the hardware implementation of the CNN is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
  • the hardware implementation of the CNN is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
  • the hardware implementation of the CNN is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the hardware implementation of the CNN is configured to perform a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the hardware implementation of the CNN is configured to perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the hardware implementation of the CNN of the second aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features.
  • the implementation forms and optional features of the device according to the first aspect are correspondingly valid for the hardware implementation of the CNN according to the second aspect.
  • a third aspect of the present disclosure provides an ego-vehicle comprising one or more 2D or 3D sensors configured to measure distance, and a device according to the first aspect or any implementation form thereof.
  • the one or more 2D or 3D sensors are configured to provide scans containing spatial information of the vicinity of the ego-vehicle in the form of a data stream to the device and the device is configured to process the data stream.
  • ego-vehicle may be understood as a mobile platform, in particular a mobile robotic platform, bearing one or more sensors, such as one or more 2D or 3D sensors configured to measure distance, that computes, respectively operates, and from the perspective of which the world, respectively the environment, is perceived.
  • the ego-vehicle may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
  • the ego-vehicle may comprise a localization unit configured to determine the current location of the ego-vehicle and, thus, the current location data of the ego-vehicle.
  • the localization unit may be configured for a short-term localization, e.g. at the scope of 1s to 10s.
  • the localization unit may comprise or correspond to one or more inertial measurement units (IMU).
  • the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
  • the one or more 2D or 3D sensors configured to measure distance may each comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the one or more 2D or 3D sensors may each comprise or correspond to one or more visual- depth-capable sensors.
  • the 2D or 3D sensor may provide scans at a frame rate of at least 10 to 20 frames per second.
  • the term “2D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with two spatial dimensions (2-dimensional scans respectively frames) and to measure distance.
  • the term “3D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with three spatial dimensions (3-dimensional scans respectively frames) and to measure distance.
  • the device is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
  • the ego-vehicle of the third aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features.
  • a fourth aspect of the present disclosure provides a method of employing a Convolutional Neural Network, CNN, in an inference phase, for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • the CNN comprises a first layer and at least one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer.
  • the method comprises the steps of inputting data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN, performing, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer, the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer of the one or more first buffers and originating from data of a previous scan inputted to the CNN directly before the data of the current scan, and storing the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
  • the CNN comprises one or more optional additional layers
  • the method comprises the step of performing, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan.
  • the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN.
  • the CNN comprises a second buffer for storing a tensor input to the CNN
  • the method comprises the steps of performing, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan, the previous tensor being the newest tensor stored in the second buffer and corresponding to the data of the previous scan, and storing the current tensor as the newest tensor in the second buffer.
  • each buffer is a serial-in parallel-out buffer and the method comprises the steps of storing a new tensor as newest tensor, and in case the buffer is full, simultaneously dropping the oldest tensor stored in the buffer.
  • the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
  • one or more buffers have a different temporal size.
  • each tensor is a tensor with two or more dimensions.
  • each tensor is a tensor with one or more spatial dimensions and one channel dimension.
  • each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension.
  • each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension.
  • each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
  • the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
  • output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged are generated, wherein the method comprises the steps of inputting, together with the data of the current scan, current location data of the ego- vehicle in a grid of a local navigational frame to the CNN, and padding and cropping in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
  • the method comprises the step of controlling autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
  • the method comprises the steps of storing the location data in a location field of one or more of the one or more first buffers and the second buffer, and updating the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
  • the method comprises the step of generating on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
  • the method comprises the step of generating on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
  • the method comprises the step of detecting targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the method comprises the step of performing a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • the method comprises the step of performing a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
  • a fifth aspect of the present disclosure provides a computer program comprising program code for performing the method according to the fourth aspect or any of its implementation forms.
  • the fifth aspect of the present disclosure provides a computer program comprising program code for performing, when implemented on a processor, the method according to the fourth aspect or any of its implementation forms.
  • a sixth aspect of the present disclosure provides a computer program product comprising program code for performing, when implemented on a processor, a method according to the fourth aspect or any implementation form thereof.
  • a seventh aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the fourth aspect or any of its implementation forms to be performed.
  • An eighth aspect of the present disclosure provides a computer comprising a memory and a processor, which are configured to store and execute program code to perform a method according to the fourth aspect or any implementation form thereof.
  • the memory may be distributed over a plurality of physical devices.
  • a plurality of processors that co-operate in executing the program code may be referred to as a processor. It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities.
  • Figure 1 shows an example of an operation of a Convolutional Neural Network (CNN) at a current time point ti.
  • Figure 2 shows a device according to an embodiment of the invention and an ego-vehicle according to an embodiment of the invention.
  • Figure 3A shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 3B shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 4 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 5 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 6 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • Figure 7 shows a grid of a local navigational frame with a location of an ego-vehicle according to an embodiment of the invention for two directly subsequent time points.
  • Figure 8 shows a diagram of a method according to an embodiment of the invention with respect to storing tensors originating from data of a current scan of a current time point ti in buffers of a Convolutional Neural Network (CNN) according to an embodiment of the invention.
  • Figure 9 shows a diagram of a method of employing a Convolutional Neural Network (CNN) in an inference phase according to an embodiment of the invention for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
  • Figure 2 shows on the left side a device 1 according to an embodiment of the invention and on the right side an ego-vehicle 5 according to an embodiment of the invention.
  • the device 1 comprises a Convolutional Neural Network (CNN) 3 and is configured to employ the CNN in an inference phase.
  • the device 1 is configured to receive and process a data stream of scans containing spatial information provided by a 2D or 3D sensor 2 configured to measure distance.
  • the CNN 3 of the device 1 comprises a first layer and one or more further layers following the first layer (not shown in Figure 2).
  • the CNN 3 of the device 1 further comprises one or more first buffers 4a for storing an output tensor of a respective preceding layer.
  • the device 1 is configured to input data of a current scan, which is provided by the 2D or 3D sensor 2, in the form of a current tensor into the CNN 3, perform, at each further layer of the CNN 3, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer, the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer 4a of the one or more first buffers 4a and originating from data of a previous scan inputted to the CNN 3 directly before the data of the current scan, and store the current output tensor of the preceding layer as the newest tensor in the respective first buffer 4a.
  • Embodiments of the CNN 3, in particular an operation of embodiments of the CNN 3, are shown in the Figures 3A, 3B, 4, 5 and 6.
  • the device 1 of Figure 2 makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by the 2D or 3D sensor 2. Since the CNN 3 comprises one or more first buffers 4a for storing the output tensor of a respective preceding layer, the need for re-computation of one or more previous tensors, in particular output tensors, of respective preceding layers for performing one or more convolutional operations at each further layer at a current time point is overcome.
  • the device 1 is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer originating from the data of the current scan and the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer 4a.
  • the inference time of the CNN 3 in the inference phase is reduced.
  • the device 1 and the 2D or 3D sensor 2 may be part of the ego-vehicle 5. That is, the ego-vehicle 5 comprises the 2D or 3D sensor 2 and the device 1. The ego-vehicle 5 may also comprise more than one 2D or 3D sensor 2. Therefore, the ego-vehicle 5 comprises one or more 2D or 3D sensors 2 configured to measure distance and the device 1, wherein the one or more 2D or 3D sensors 2 are configured to provide scans containing spatial information of the vicinity respectively environment of the ego-vehicle 5 in the form of a data stream to the device 1 and the device 1 is configured to process the data stream.
  • the one or more 2D or 3D sensors 2 configured to measure distance may each comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the one or more 2D or 3D sensors 2 may each comprise or correspond to one or more visual- depth-capable sensors. The one or more 2D or 3D sensors 2 may provide scans at a frame rate of at least 10 to 20 frames per second.
  • the ego-vehicle 5 may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
  • the ego-vehicle 5 may comprise a localization unit (not shown in Figure 2) configured to determine the current location of the ego-vehicle 5 and, thus, the current location data of the ego-vehicle 5.
  • the localization unit may be configured for a short-term localization, e.g. on a time scale of 1 s to 10 s.
  • the localization unit may comprise or correspond to one or more inertial measurement units (IMU).
  • the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
  • Figure 3A shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 3A.
  • the CNN 3 of Figure 3A may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 comprises a first layer L1, one further layer L2a and one first buffer 4a, wherein the further layer L2a is a consecutive layer following the first layer L1. That is, according to the embodiment of Figure 3A, the first layer L1 is the preceding layer of the further layer L2a. As described already above, the CNN 3 may comprise more than one further layer. In addition or alternatively, the CNN 3 may comprise one or more optional additional layers (not shown in Figure 3A).
  • one or more convolutional operations are performed on the basis of a current output tensor a1(ti) of the preceding layer L1 (which is the first layer L1), wherein the current output tensor a1(ti) originates from the data of the current scan of the current time point ti, and a previous output tensor a1(ti-1) of the preceding layer L1, wherein the previous output tensor a1(ti-1) of the preceding layer L1 is the newest tensor stored in the first buffer 4a and originates from data of a previous scan (not shown in Figure 3A) inputted to the CNN directly before the data of the current scan.
  • the previous scan is inputted at a previous time point ti-1 directly before the current time point ti and, thus, the previous scan may be referred to as the scan of the previous time point ti-1.
  • the current output tensor a1(ti) of the preceding layer L1 may be stored as the newest tensor in the first buffer 4a (not shown in Figure 3A). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent output tensor a1(ti+1) of the preceding layer L1 originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current output tensor a1(ti) stored as the newest tensor in the first buffer 4a at the directly subsequent time point ti+1.
  • the CNN 3 of Figure 3A makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by the 2D or 3D sensor 2. Since the CNN 3 comprises the first buffer 4a for storing the output tensor of the first layer L1, the need for re-computation of a previous output tensor a1(ti-1) of the first layer L1 for performing one or more convolutional operations at the further layer L2a at a current time point ti is overcome.
  • one or more convolutional operations are performed on the basis of the current output tensor a1(ti) of the first layer L1 originating from the data of the current scan and the previous output tensor a1(ti-1) of the first layer L1 being the newest tensor stored in the first buffer 4a.
  • the inference time of the CNN 3 in the inference phase is reduced.
  • the temporal size of the first buffer 4a equals one (“1”), because the first buffer 4a may store the output tensor of the first layer L1 for only one time point.
  • the previous output tensor a1(ti-1) of the preceding layer L1, which was stored as the newest tensor in the first buffer 4a at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti.
  • the temporal size of the first buffer 4a is one less than the temporal size of a convolutional kernel of the first layer L1.
  • the temporal size of the convolutional kernel of the first layer L1 may correspond to two (“2”).
  • the temporal size of the first buffer may alternatively correspond to a number greater than one (“1”) and, thus, the first buffer may be configured to store the output tensor of the preceding layer L1 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
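As a small worked example of the relation stated above (the temporal size of a first buffer is one less than the temporal size of the convolutional kernel of the respective preceding layer), the following lines compute buffer sizes from assumed kernel sizes; apart from the kernel temporal size of two for L1, the values are illustrative assumptions.

```python
# Temporal size of each first buffer = kernel temporal size of the preceding layer - 1.
kernel_temporal_sizes = {"L1": 2, "L2a": 3}                       # "2" as for Figure 3A; "3" is an assumption
buffer_temporal_sizes = {k: v - 1 for k, v in kernel_temporal_sizes.items()}
print(buffer_temporal_sizes)                                      # {'L1': 1, 'L2a': 2}
```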
  • Figure 3B shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the CNN 3 of Figure 3B may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 of Figure 3B differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 3B comprises a second buffer 4b for storing a tensor input to the CNN 3. Therefore, the description of the CNN 3 of Figure 3A is also valid for the CNN 3 of Figure 3B.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 3B.
  • the additional feature(s) of the CNN 3 of Figure 3B respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 3B are described.
  • the CNN 3 comprises, besides the first layer L1, the one further layer L2a and the first buffer 4a, a second buffer 4b for storing a tensor input to the CNN 3.
  • at the first layer L1, one or more convolutional operations are performed on the basis of the current tensor epc(ti) originating from the data of the current scan of the current time point ti and a previous tensor epc(ti-1), generating a current output tensor a1(ti) of the first layer L1 originating from the data of the current scan, wherein the previous tensor epc(ti-1) is the newest tensor stored in the second buffer 4b at the time point ti-1 and corresponds to the data of the previous scan inputted to the CNN 3 at the time point ti-1 directly before the data of the current scan input to the CNN 3 at the current time point ti.
  • the current tensor epc(ti) may be stored as the newest tensor in the second buffer 4b (not shown in Figure 3B). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent tensor epc(ti+1) originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current tensor epc(ti) stored as the newest tensor in the second buffer 4b at the directly subsequent time point ti+1.
  • the CNN 3 of Figure 3B makes it possible to further reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN 3 comprises a second buffer 4b for storing the tensor input to the CNN 3, the need for re-computation of a previous tensor epc(ti-1) input to the CNN 3 on the basis of the corresponding data of the corresponding previous scan for performing one or more convolutional operations at the first layer L1 at the current time point ti is overcome.
  • one or more convolutional operations are performed on the basis of the current tensor epc(ti) and the previous tensor epc(ti-1) being the newest tensor stored in the second buffer 4b.
  • the temporal size of the second buffer 4b equals one (“1”), because the second buffer 4b may store the tensor input to the CNN 3 for only one time point respectively for one scan.
  • the previous tensor epc(ti-1), which was stored as the newest tensor in the second buffer 4b at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti.
  • the temporal size of the second buffer 4b may alternatively correspond to a number greater than one (“1”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
  • the first buffer 4a and the second buffer 4b may have a different temporal size. Alternatively, they may have the same temporal size.
  • Figure 4 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the CNN 3 of Figure 4 may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 of Figure 4 differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 4 comprises a first buffer 4a with a temporal size of two (“2”). Therefore, the description of the CNN 3 of Figure 3A is also valid for the CNN 3 of Figure 4.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 4.
  • the additional feature(s) of the CNN 3 of Figure 4 respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 4 are described.
  • the temporal size of the first buffer 4a equals two (“2”), because the first buffer 4a may store the output tensor of the first layer L1 (which is the preceding layer of the further layer L2a) for two directly subsequent time points respectively for two directly subsequent scans.
  • the first buffer 4a stores as the newest tensor the previous output tensor a1(ti-1) of the first layer L1 originating from data of a previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data of the current scan, which are input at the current time point ti to the CNN 3 in the form of the current tensor epc(ti), and the second previous output tensor a1(ti-2) of the first layer L1 originating from data of a second previous scan inputted to the CNN 3 at the second previous time point ti-2 directly before the previous time point ti-1.
  • the second previous output tensor a1(ti-2) of the first layer L1 is simultaneously dropped respectively deleted at the current time point ti (This is indicated in the dashed box on the left side of Figure 4).
  • the temporal size of the first buffer 4a is only two and, thus, the first buffer 4a may store the output tensor of the first layer L1 for only two directly subsequent time points, such as for the two directly subsequent time points ti-2 and ti-1 or ti-1 and ti.
  • the temporal size of the first buffer may alternatively correspond to a number greater than two (“2”) and, thus, the first buffer may be configured to store the output tensor of the first layer L1 for more than two time points, i.e. for three or more directly subsequent time points, respectively for more than two scans, i.e. for three or more directly subsequent scans.
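A first buffer of temporal size two behaves like a serial-in parallel-out register of length two: storing the newest tensor simultaneously drops the oldest one once the buffer is full. The following illustrative sketch uses Python's deque; the placeholder strings stand in for the stored output tensors.

```python
from collections import deque

first_buffer_4a = deque(maxlen=2)                 # temporal size of two ("2")
for t, tensor in enumerate(["a1(t0)", "a1(t1)", "a1(t2)", "a1(t3)"]):
    first_buffer_4a.append(tensor)                # store the newest tensor; the oldest is dropped when full
    print("t%d:" % t, list(first_buffer_4a))
# after t2 the buffer holds ['a1(t1)', 'a1(t2)']; 'a1(t0)' has been dropped simultaneously
```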
  • Figure 5 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the CNN 3 of Figure 5 may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 of Figure 5 differs from the CNN 3 of Figure 4 in that the CNN 3 of Figure 5 comprises a second buffer 4b for storing a tensor input to the CNN 3. Therefore, the description of the CNN 3 of Figure 3B and 4 is also valid for the CNN 3 of Figure 5.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 5.
  • the additional feature(s) of the CNN 3 of Figure 5 respectively the differences between the CNN 3 of Figure 4 and the CNN 3 of Figure 5 are described.
  • the CNN 3 comprises, besides the first layer L1, the one further layer L2a and the first buffer 4a, a second buffer 4b for storing a tensor input to the CNN 3.
  • the temporal size of the second buffer 4b equals two (“2”), because the second buffer 4b may store a tensor that is input to the CNN 3 for two directly subsequent time points.
  • the second buffer 4b stores as the newest tensor the previous tensor epc(ti-1) corresponding to data of a previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data of the current scan, which are input at the current time point ti to the CNN 3 in the form of the current tensor epc(ti), and the second previous tensor epc(ti-2) corresponding to data of a second previous scan inputted to the CNN 3 at the second previous time point ti-2 directly before the previous time point ti-1.
  • the second previous tensor epc(ti-2) is simultaneously dropped respectively deleted at the current time point ti (This is indicated in the dashed box on the left side of Figure 5).
  • the temporal size of the second buffer 4b is only two and, thus, the second buffer 4b may store the tensor that is input to the CNN 3 for only two directly subsequent time points, such as for the two directly subsequent time points ti-2 and ti-1 or ti-1 and ti.
  • the temporal size of the second buffer 4b may alternatively correspond to a number greater than two (“2”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than two time points, i.e. for three or more directly subsequent time points.
  • Figure 6 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
  • the CNN 3 of Figure 6 may be an embodiment of the CNN 3 of the device 1 of Figure 2.
  • the CNN 3 of Figure 6 differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 6 comprises three further layers L2a, L2b and L2c and an optional second buffer 4b for storing a tensor input to the CNN 3 and in that an optional voxelization VX is performed. Therefore, the description of the CNN 3 of Figures 3A and 3B is also valid for the CNN 3 of Figure 6.
  • the CNN 3 comprises a first layer L1, three further layers L2a, L2b and L2c, three first buffers 4a and one optional second buffer 4b.
  • the first further layer L2a is a consecutive layer following the first layer L1
  • the second further layer L2b is a consecutive layer following the first further layer L2a
  • the third further layer L2c is a consecutive layer following the second further layer L2b. That is, according to the embodiment of Figure 6, the first layer L1 is the preceding layer of the first further layer L2a
  • the first further layer L2a is the preceding layer of the second further layer L2b
  • the second further layer L2b is the preceding layer of the third further layer L2c.
  • the CNN 3 may comprise only one further layer or only two further layers or more than three further layers.
  • the CNN 3 may comprise one or more optional additional layers (not shown in Figure 6).
  • one or more of the one or more optional additional layers may be arranged between further layers and/or between the first layer and the first further layer.
  • data pc(ti) of a current scan provided by a 2D or 3D sensor configured to measure distance are input in the form of a current tensor epc(ti) of the current time point ti into the CNN 3.
  • the current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan by optionally performing a voxelization VX.
  • the current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan and the current location data of an ego-vehicle, such as the ego-vehicle 5 of Figure 2, in the grid of a local navigational frame by performing a voxelization VX, in case the 2D or 3D sensor providing the data pc(ti) of the current scan is arranged on the ego-vehicle.
  • the data of the current scan pc(ti) corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions.
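The voxelization VX mentioned above can be pictured as binning the points of the point cloud into the cells of a regular grid, yielding a tensor of shape [C, H, W]. The following sketch is a simplified illustration with only two spatial point dimensions; the ROI extent, cell size, channel layout and all names are assumptions of this sketch and not the claimed voxelization.

```python
# Illustrative voxelization: map points (x, y, attribute) into grid cells.
import numpy as np

def voxelize(points, roi_size=140.0, cell_size=0.5, origin=(0.0, 0.0)):
    h = w = int(roi_size / cell_size)                   # e.g. 280 x 280 cells
    epc = np.zeros((2, h, w), dtype=np.float32)         # C = 2 channels: point count, summed attribute
    ix = ((points[:, 0] - origin[0]) / cell_size).astype(int)
    iy = ((points[:, 1] - origin[1]) / cell_size).astype(int)
    ok = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)    # keep only points inside the ROI
    np.add.at(epc[0], (iy[ok], ix[ok]), 1.0)            # occupancy / point count per cell
    np.add.at(epc[1], (iy[ok], ix[ok]), points[ok, 2])  # e.g. summed reflection brightness
    return epc

pc = np.random.rand(1000, 3) * [140.0, 140.0, 1.0]      # M x 3: x, y, attribute (simplified point cloud)
print(voxelize(pc).shape)                               # (2, 280, 280)
```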
  • at each further layer Lj (j may be 2a, 2b or 2c), one or more convolutional operations are performed on the basis of a current output tensor aw(ti) of the preceding layer Lw (w may be 1, 2a or 2b), wherein the current output tensor aw(ti) originates from the data pc(ti) of the current scan of the current time point ti, and a previous output tensor aw(ti-1) of the preceding layer Lw, wherein the previous output tensor aw(ti-1) of the preceding layer Lw is the newest tensor stored in the respective first buffer 4a and originates from data of a previous scan (not shown in Figure 6) inputted to the CNN directly before the data of the current scan.
  • the previous scan is inputted at a previous time point ti-1 directly before the current time point ti and, thus, the previous scan may be referred to as the scan of the previous time point ti-1.
  • the current output tensor aw(ti) of the preceding layer Lw may be stored as the newest tensor in the corresponding first buffer 4a (shown in the dashed box on the left side of Figure 6). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, at each further layer Lj (j may be 2a, 2b or 2c) one or more convolutional operations are performed on the basis of the directly subsequent output tensor aw(ti+1) of the preceding layer Lw originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current output tensor aw(ti) stored as the newest tensor in the corresponding first buffer 4a at the directly subsequent time point ti+1.
  • the temporal size of the first buffers 4a equals one (“1”), because the first buffers 4a may store the output tensor of the corresponding layer (preceding layer) for only one time point respectively one scan.
  • the current output tensor aw(ti) of the corresponding layer Lw (preceding layer) is stored as the newest tensor in the corresponding first buffer 4a
  • the temporal size of a first buffer 4a is one less than the temporal size of a convolutional kernel of the corresponding layer (preceding layer).
  • the temporal size of the convolutional kernel of the layers L1, L2a and L2b may correspond to two (“2”).
  • the temporal size of one or more first buffers 4a may alternatively correspond to a number greater than one (“1”) and, thus, the one or more first buffers 4a may be configured to store the output tensor of a corresponding preceding layer for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
  • one or more convolutional operations are performed on the basis of the current tensor epc(ti) originating from the data pc(ti) of the current scan of the current time point ti and a previous tensor epc(ti-1) generating a current output tensor a1(ti) of the first layer L1 originating from the data pc(ti) of the current scan.
  • the previous tensor epc(ti-1) is the newest tensor stored in the optional second buffer 4b and corresponds to the data of the previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data pc(ti) of the current scan.
  • the current tensor epc(ti) may be stored as the newest tensor in the optional second buffer 4b (shown in the dashed box on the left of Figure 6). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent tensor epc(ti+1) originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current tensor epc(ti) stored as the newest tensor in the optional second buffer 4b at the directly subsequent time point ti+1.
  • the temporal size of the optional second buffer 4b equals one (“1”), because the optional second buffer 4b may store the tensor input to the CNN 3 for only one time point respectively one scan.
  • the previous tensor epc(ti-1), which was stored as the newest tensor in the second buffer 4b at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti (shown in the dashed box on the left of Figure 6).
  • the temporal size of the optional second buffer 4b may alternatively correspond to a number greater than one (“1”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
  • one or more of the first buffers 4a and the optional second buffer 4b may have a different temporal size.
  • the first buffers 4a and the optional second buffer 4b may have the same temporal size.
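Putting the pieces together for the Figure 6 topology, one inference step could look as in the sketch below: the optional second buffer 4b holds the previous input tensor epc(ti-1) for the first layer L1, and one first buffer 4a per further layer holds the previous output of the respective preceding layer. The placeholder convolution, the zero-initialized buffers and the class name are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

def conv_pair(prev, cur):
    return 0.5 * (prev + cur)                       # stand-in for a convolution with temporal kernel size two

class Figure6LikeCNN:
    def __init__(self, shape):
        self.second_buffer_4b = np.zeros(shape)     # previous input tensor epc(t_{i-1})
        self.first_buffers_4a = [np.zeros(shape) for _ in range(3)]  # previous outputs of L1, L2a, L2b

    def step(self, epc_ti):
        a = conv_pair(self.second_buffer_4b, epc_ti)    # first layer L1
        self.second_buffer_4b = epc_ti                  # epc(ti) becomes the newest buffered input
        for k in range(3):                              # further layers L2a, L2b, L2c
            out = conv_pair(self.first_buffers_4a[k], a)
            self.first_buffers_4a[k] = a                # store the current output of the preceding layer
            a = out
        return a                                        # output tensor of the CNN for the current scan

net = Figure6LikeCNN(shape=(4, 16, 16))
out_ti = net.step(np.random.rand(4, 16, 16))
```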
  • Figure 7 shows a grid G of a local navigational frame LNF with a location of an ego-vehicle according to an embodiment of the invention for two directly subsequent time points.
  • the ego-vehicle (not shown in Figure 7) may correspond to the ego-vehicle 5 of Figure 2.
  • the above description with respect to the device according to the first aspect of the invention and its implementation forms and the above description of the ego-vehicle according to the third aspect of the invention and its implementation forms are correspondingly valid for the description of Figure 7, in particular for the description of the ego-vehicle 5 of Figure 7.
  • the ego-vehicle 5 may be configured for autonomous movement.
  • the device 1 of the ego-vehicle 5 is configured to control autonomous movement of the ego-vehicle 5 on the basis of the processing of the data stream of scans provided by the one or more 2D or 3D sensors 2 configured to measure distance and arranged on the ego-vehicle 5.
  • the local navigational frame LNF shown in Figure 7 is a coordinate system of two spatial dimensions x, y with a top-down view on the ground surface, which is relative to a position of the ego-vehicle 5. That is, the location respectively position of the ego-vehicle 5 is within the local navigational frame LNF.
  • the terms “location” and “position” may be used as synonyms.
  • the region of interest (ROI) for processing the data stream of scans provided by the one or more 2D or 3D sensors 2 is moving with the location of the ego-vehicle 5 in the local navigational frame LNF, when the ego-vehicle 5 is moving.
  • the ROI corresponds to a predefined regular grid respectively area of the local navigational frame LNF with the ego-vehicle 5 in the center wherein the regular grid is composed of a plurality of cells.
  • Figure 7 shows the grid Gi (area) of the current ROI Ri at a current time point ti and the grid Gi-1 (area) of the previous ROI Ri-1 of a previous time point ti-1 that is directly before the current time point ti.
  • the ROI may be chosen to correspond to a grid in the local navigational frame LNF of size 140 m x 140 m.
  • the cells may each be of size 0.5 m x 0.5 m.
  • the ROI may be chosen to correspond to a grid in the local navigational frame LNF of size between 70 m x 70 m and 250 m x 250 m.
  • the cells may each be of size between 0.05 m x 0.05 m and 0.5 m x 0.5 m.
  • the data of a current scan (e.g. all points of a point cloud, such as a LIDAR point cloud in case of the 2D or 3D sensor being a LIDAR sensor) are mapped into the regular grid G of the local navigational frame LNF. That is, the device 1 is configured to generate on the basis of the data of the current scan and the current location data pxi and pyi of the ego-vehicle in the grid G of the local navigational frame LNF the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame LNF.
  • the current location data pxi and pyi describe the current location pi respectively position of the ego-vehicle 5 within the grid G of the local navigational frame LNF at a current time point ti.
  • the tensors stored in the buffers of the CNN 3 correspond to the corresponding cells of the regular grid G of the local navigational frame LNF.
  • the grid Gi (area) of Figure 7 corresponds to a current tensor that is stored in a corresponding buffer of the CNN 3 as the newest tensor at the current time point ti.
  • the grid Gi corresponds to the predefined grid (area) of the current ROI Ri at the current time point ti.
  • the grid Gi is made of the plain cells without any pattern and the cells with a dotted pattern within the corresponding bold frame.
  • the grid Gi-1 of Figure 7 corresponds to a previous tensor that is stored in the same buffer of the CNN 3 as the newest tensor at a previous time point ti-1 directly before the current time point ti.
  • the grid Gi-1 corresponds to the predefined grid (area) of the previous ROI Ri-1 at the previous time point ti-1.
  • the grid Gi-1 is made of the cells with a diagonally striped pattern and the plain cells without any pattern within the corresponding bold frame.
  • the grid Gi of the current time point ti is not congruent to the grid Gi-1 of the previous time point ti-1, because the ego-vehicle 5 has moved from the previous location pi-1 of the previous time point ti-1 to the current location pi of the current time point ti. That is, at the previous time point ti-1 the ego-vehicle 5 was at the previous location pi-1 and at the current time point ti the ego-vehicle 5 is at the current position pi. As a result of this movement, the ROI changes from the previous ROI Ri-1 to the current ROI Ri.
  • the previous location pi-1 of the ego-vehicle 5 is described by the previous location data pxi-1, pyi-1 and the current location pi of the ego-vehicle 5 is described by the current location data pxi, pyi.
  • the device 1 of the ego-vehicle 5 may be configured to input, together with the data of the current scan, current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF to the CNN 3 (the CNN is not shown in Figure 7), and pad and crop (in particular zero-pad and crop) in each buffer of the CNN the newest tensor being stored, in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-1, pyi-1 inputted together with the data of the previous scan.
  • the plain cells within the bold frame correspond to the data of the newest tensor being stored in a buffer that may be re-used despite the location change of the ego-vehicle 5 and, thus, despite the change of the ROI from the previous ROI Ri-1 to the current ROI Ri.
  • the cells with the diagonally striped pattern correspond to the data of the newest tensor stored in the buffer that are cropped.
  • the terms “dropped” or “deleted” may be used as synonyms for the term “cropped”.
  • the cells with the dotted pattern correspond to the data of the newest tensor stored in the buffer that are padded, in particular that are zero-padded.
  • Padding data may be understood as overwriting the values of the data by a predefined value.
  • zero-padding data may be understood as overwriting the values of the data with zeros.
  • the device 1 of the ego-vehicle 5 may be configured to generate on the basis of the data of the current scan and the current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF the current tensor that is input to the CNN 3 by performing a change of coordinates of the data of the current scan into the local navigational frame LNF.
  • the device 1 of the ego-vehicle 5 may be configured to generate on the basis of the data of the current scan and the current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF the current tensor that is input to the CNN 3 by additionally performing a voxelization (not shown in Figure 7).
  • the location data may be stored in a location field of one or more of the one or more first buffers 4a and the optional second buffer of the CNN, and the location field of the respective one or more buffers is updated (not shown in Figure 7) with the current location data pxi, pyi, in case the current location data pxi, pyi of the ego-vehicle 5 do not match the previous location data pxi-1, pyi-1.
  • in case the ego-vehicle 5 moves within the same cell of the grid G, there is no updating of the location field of the respective one or more buffers.
  • in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-1, pyi-1 inputted together with the data of the previous scan, but the current location data pxi, pyi and the previous location data pxi-1, pyi-1 of the ego-vehicle 5 lie respectively are located within the same cell of the grid G, the location field of the respective one or more buffers is not updated with the current location data pxi, pyi.
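The pad-and-crop of a buffered tensor described for Figure 7 can be sketched as shifting the tensor by the cell offset between the previous and the current location, filling newly entered cells with zeros. The cell size, the sign convention of the shift and the function name are assumptions of this sketch.

```python
# Illustrative zero-pad and crop of the newest buffered tensor when the
# ego-vehicle moves to a different cell of the grid G.
import numpy as np

def pad_and_crop(buffered, px_prev, py_prev, px_cur, py_cur, cell_size=0.5):
    dx = int(round((px_cur - px_prev) / cell_size))   # shift in cells along x (W axis)
    dy = int(round((py_cur - py_prev) / cell_size))   # shift in cells along y (H axis)
    if dx == 0 and dy == 0:
        return buffered                               # same cell: nothing to pad or crop
    c, h, w = buffered.shape
    shifted = np.zeros_like(buffered)                 # newly entered cells are zero-padded
    src = buffered[:, max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]   # cells that stay in the ROI
    shifted[:, max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)] = src
    return shifted                                    # cells that left the ROI have been cropped
```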
  • Figure 8 shows a diagram of a method according to an embodiment of the invention with respect to storing tensors originating from data of a current scan of a current time point ti in buffers of a CNN according to an embodiment of the invention.
  • the device 1 of the ego-vehicle 5 obtains data of a current scan provided by the one or more 2D or 3D sensors 2 in the form of a current tensor of shape [C, H, W] together with current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF at a current time point ti.
  • C is the number of channels.
  • the current tensor is generated by performing a change of coordinates of the data of the current scan into the local navigational frame LNF.
  • “H” is the size of the current tensor along the y coordinate
  • “W” is the size of the current tensor along the x coordinate.
  • in a second step S82, the device 1 determines whether the current location data pxi, pyi do not match the previous location data pxi-1, pyi-1 of the ego-vehicle 5. That is, the device 1 determines whether the location of the ego-vehicle 5 has changed. In particular, the device 1 determines whether the current location data pxi, pyi do not match the previous location data pxi-1, pyi-1 of the ego-vehicle 5 and whether the current location data pxi, pyi and the previous location data pxi-1, pyi-1 do not lie within the same cell of the grid G of the local navigational frame LNF.
  • the device 1 in particular determines whether the location of the ego-vehicle 5 has changed such that the current location of the ego-vehicle is in a different cell compared to the cell of the previous location.
  • in case the determination of the second step S82 yields a “YES” (the location of the ego-vehicle has changed such that the current location is in a different cell than the previous location), the method proceeds to the third step S83.
  • in case the determination of the second step S82 yields a “NO” (the location of the ego-vehicle has not changed or the location of the ego-vehicle is still in the same cell), the method proceeds to the fifth step S85.
  • in the third step S83, the device 1 pads and crops, in particular zero-pads and crops, in each buffer of the CNN 3 the newest tensor being stored to match the current location data pxi, pyi and, thus, the current ROI Ri. That is, the tensor that is stored at the previous time point ti-1 directly before the current time point ti in a respective buffer as the newest tensor and that is still stored at the third step S83 as the newest tensor of the respective buffer is padded and cropped.
  • the device 1 updates the location field with the current location data pxi, pyi of the ego-vehicle 5 in the buffers comprising a location field.
  • in the fifth step S85, the device 1 stores in each buffer the corresponding tensor originating from the data of the current scan as the newest tensor and simultaneously drops respectively deletes the oldest tensor in the buffers that are full.
  • the device 1 stores in each first buffer 4a the output tensor of the corresponding layer (preceding layer) originating from the data of the current scan as the newest tensor.
  • in case the CNN 3 of the device 1 comprises a second buffer, the device 1 stores in the second buffer the current tensor originating from the data of the current scan as the newest tensor.
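A hedged end-to-end sketch of the Figure 8 flow is given below, reusing the pad_and_crop helper sketched above: the location change is checked (second step S82), the buffered tensors are zero-padded and cropped if the cell changed (third step S83), the location field is updated, and the tensors originating from the current scan are stored as the newest tensors while full buffers drop their oldest tensor (fifth step S85). The data structures (a dict of collections.deque buffers and a dict location field) are illustrative assumptions.

```python
def update_buffers(buffers, location_field, cur_loc, prev_loc, new_tensors, cell_size=0.5):
    """buffers: dict of collections.deque(maxlen=temporal size); new_tensors: dict of tensors per buffer."""
    # Second step S82: did the ego-vehicle move to a different cell of the grid G?
    moved_cell = (int(cur_loc[0] // cell_size) != int(prev_loc[0] // cell_size)
                  or int(cur_loc[1] // cell_size) != int(prev_loc[1] // cell_size))
    if moved_cell:
        # Third step S83: zero-pad and crop the newest tensor stored in each buffer.
        for name, buf in buffers.items():
            if buf:
                buf[-1] = pad_and_crop(buf[-1], prev_loc[0], prev_loc[1],
                                       cur_loc[0], cur_loc[1], cell_size)
        # Update the location field with the current location data (step between S83 and S85).
        location_field["px"], location_field["py"] = cur_loc
    # Fifth step S85: store the tensor originating from the current scan as the newest tensor;
    # deque(maxlen=...) simultaneously drops the oldest tensor once a buffer is full.
    for name, buf in buffers.items():
        buf.append(new_tensors[name])
    return buffers, location_field
```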
  • Figure 9 shows a diagram of a method of employing a Convolutional Neural Network (CNN) in an inference phase according to an embodiment of the invention for processing a data stream of scans containing spatial information provided by one or more 2D or 3D sensors configured to measure distance.
  • the method of Figure 9 may be performed by the device 1 of Figure 2.
  • the above description of the device according to the first aspect and its implementation forms, the above description of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the method of Figure 9.
  • the method step S101 and the optional method step S102 may be performed on the basis of data pc(ti) of a current scan provided by a 2D or 3D sensor configured to measure distance, such as a LIDAR sensor, at a current time point ti.
  • the 2D or 3D sensor is arranged on an ego-vehicle, such as the ego-vehicle 5 of Figure 2.
  • the current tensor epc(ti), that is input to the Convolutional Neural Network 3 (CNN), is generated on the basis of the data pc(ti) of the current scan and the current location data pxi, pyi of the ego-vehicle in a grid of a local navigational frame, such as the local navigational frame of Figure 7, by performing a change of coordinates of the data pc(ti) into the local navigational frame.
  • the current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan and the current location data pxi, pyi of the ego-vehicle in the grid of the local navigational frame by additionally performing the voxelization of the optional method step S102.
  • the CNN 3 is configured to generate on the basis of the current tensor epc(ti) the current output data OUT(ti), on the basis of which a navigation process of the ego-vehicle may be performed.
  • the CNN 3 shown in Figure 9 may correspond to any CNN 3 of Figures 2, 3A, 3B, 4, 5 and 6.
  • the above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 9.
  • the current output data OUT(ti) may be decoded and a non-maximum suppression may be performed thereon.
  • detected object boxes may be provided.
  • the size of the data pc(ti) being a point cloud is “M * 4”, wherein “M” is the number of points in the point cloud and the “4” indicates that each point in the point cloud comprises three spatial dimensions (point cloud with three spatial dimensions) and one attribute, such as a scalar reflection brightness attribute.
  • the size of the current location data pxi, pyi is two (“2”) because the location of the ego-vehicle in the local navigational frame, being a 2-dimensional coordinate system according to the embodiment of Figure 9, is described by two coordinates (x and y coordinate).
  • the size of the current tensor epc(ti) being an encoded point cloud is “C*H*W”.
  • “C” is the number of channels of the current tensor epc(ti) (number of channels in the encoded point cloud), wherein C is greater than or equal to one (C ≥ 1).
  • “H” is the size of the current tensor epc(ti) along the y coordinate of the local navigational frame and
  • “W” is the size of the current tensor epc(ti) along the x coordinate of the local navigational frame.
  • “H” and “W” define the spatial resolution of the area of the region of interest (ROI) of the grid of the local navigational frame.
  • “H” and “W” may each be 280 cells of the grid of the local navigational frame, such that the ROI corresponds to an area of 140 m x 140 m, in case the cells of the regular grid of the local navigational frame each are of size 0.5 m x 0.5 m.
  • the size of the output data OUT(ti) is “C2*H*W”, wherein “H” and “W” are described as above.
  • C2 is the number of channels of the output data OUT(ti) that may be the same or different than the number of channels (“C”) of the current tensor epc(ti).
  • “B” is the number of detected objects and “S” is the size of the metadata related to one object.
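The data shapes along the Figure 9 pipeline can be summarized in the small sketch below: the point cloud of size M * 4 and the location data of size 2 are encoded into the tensor of size C*H*W, the CNN produces output data of size C2*H*W, and decoding plus non-maximum suppression yields B detected boxes with S metadata values each. Apart from H = W = 280, all concrete numbers are illustrative, and the CNN, the decoding and the NMS are stubs; only the shapes are meant to match the text.

```python
import numpy as np

M, C, C2, H, W = 20000, 8, 16, 280, 280          # M, C, C2 are example values; H, W as in the text

pc_ti = np.random.rand(M, 4)                     # point cloud: M points x (3 spatial dims + 1 attribute)
p_ti = np.random.rand(2)                         # current location data (pxi, pyi), size 2

epc_ti = np.random.rand(C, H, W)                 # stand-in for the voxelized, encoded point cloud (C*H*W)
out_ti = np.random.rand(C2, H, W)                # stand-in for the CNN output data OUT(ti) (C2*H*W)

def decode_and_nms(out):
    """Stub for decoding and non-maximum suppression, returning B boxes of metadata size S."""
    B, S = 5, 7                                  # illustrative values for the number of boxes and metadata size
    return np.zeros((B, S))

boxes = decode_and_nms(out_ti)                   # B*S detected object boxes
print(pc_ti.shape, epc_ti.shape, out_ti.shape, boxes.shape)
```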

Abstract

The present invention relates to a device (1) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor (2) configured to measure distance, wherein the device (1) is configured to employ a Convolutional Neural Network (3), CNN, in an inference phase. The CNN comprises a first layer (L1) and one or more further layers (L2a, L2b) following the first layer (L1) and one or more first buffers (4a) for storing an output tensor of a respective preceding layer (L1; L2). The device (1) is further configured to input data of a current scan provided by the 2D or 3D sensor (2) in the form of a current tensor (epc(ti)) into the CNN (3); and perform, at each further layer (L2a; L2b), one or more convolutional operations on the basis of a current output tensor (a1(ti); a2(ti)) of the preceding layer (L1; L2a) originating from the data of the current scan and a previous output tensor (a1(ti-1); a2(ti-1)) of the preceding layer (L1, L2a). The previous output tensor (a1(ti-1), a2(ti-1)) of the preceding layer (L1, L2a) is the newest tensor stored in the respective first buffer (4a) of the one or more first buffers (4a) and originates from data of a previous scan inputted to the CNN directly before the data of the current scan. Furthermore, the device (1) is configured to store the current output tensor (a1(ti), a2(ti)) of the preceding layer (L1, L2a) as the newest tensor in the respective first buffer (4a).

Description

PROCESSING A DATA STREAM OF SCANS CONTAINING SPATIAL INFORMATION PROVIDED BY A 2D OR 3D SENSOR CONFIGURED TO MEASURE DISTANCE BY USING A CONVOLUTIONAL NEURAL NETWORK
(CNN)
TECHNICAL FIELD
The present disclosure relates to a device for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, wherein the device is configured to employ a Convolutional Neural Network (CNN) in an inference phase. The present disclosure further relates to an ego-vehicle comprising such a device and one or more 2D or 3D sensors configured to measure distance. The present disclosure furthermore relates to a hardware implementation of a Convolutional Neural Network (CNN) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Moreover, the present disclosure relates to a method of employing a Convolutional Neural Network (CNN) in an inference phase for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Additionally, the present disclosure relates to a computer program comprising program code for performing such a method, to a non-transitory storage medium storing executable program code which, when executed by a processor, causes such a method to be performed, and to a computer comprising a memory and a processor, which are configured to store and execute program code to perform such a method.
BACKGROUND
A data stream of scans containing spatial information provided by a 2D or 3D distance sensor can be processed by means of a neural network. The spatial information is spatial information about an environment or vicinity of the 2D or 3D sensor. Objects in the environment of the 2D or 3D sensor can be detected on the basis of a processing result of the processing of the data stream. For example, the 2D or 3D sensor may be installed on an ego-vehicle. In this case, the spatial information provided by the 2D or 3D sensor corresponds to spatial information about the environment of the ego-vehicle and, thus, on the basis of the processing result of the processing of the data stream, objects such as cars, pedestrians, bicyclists, motorcyclists etc. in the environment of the ego-vehicle may be detected.
The term “ego vehicle ” refers to a vehicle that is equipped with one or more sensors (e.g. one or more 2D or 3D distance sensors) for sensing an environment of the vehicle and which operates based on data from those sensors and not necessarily based on any other data about its environment. In other words, an ego vehicle operates based on its own “view” of its environment.
SUMMARY
Embodiments of the invention are also based on the following considerations made by the inventors:
In order to process a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, a Convolutional Neural Network (CNN) may be used.
Figure 1 shows an example of an operation of a Convolutional Neural Network 3 (CNN) at a current time point ti.
The term “present” may be used as a synonym for the term “current”. That is, e.g. the term “present time point” may be used as a synonym for the term “current time point”.
The CNN 3 according to Figure 1 comprises four layers L1, L2a, L2b and L2c, wherein at each layer one or more convolutional operations are performed.
At a current time point ti, data pc(ti) of a current scan are provided by the 2D or 3D sensor and one or more convolutional operations are performed at each layer of the CNN 3 on the basis of a current tensor originating from the data pc(ti) of the current scan (scan provided at time point ti) and a previous tensor originating from data pc(ti-1) of a previous scan (scan provided at time point ti-1) inputted to the CNN 3 directly before the data of the current scan, wherein the current tensor and the previous tensor are provided to the respective layer.
The one or more convolutional operations of a layer are indicated in Figure 1 by two arrows originating from the respective two tensors on the basis of which the one or more convolutional operations are performed.
For example, as shown in Figure 1, at the current time point ti the data pc(ti) of the scan of the time point ti is provided as the data of the current scan by the 2D or 3D sensor and the tensor epc(ti) originating from the data pc(ti) of the current scan is input to the CNN 3 and, thus, is provided to the first layer L1. At the same time the previous tensor epc(ti-1) originating from the data pc(ti-1) of the previous scan provided by the 2D or 3D sensor at the previous time point ti-1 directly before the current scan of the time point ti, is provided to the first layer L1, wherein at the first layer L1 one or more convolutional operations of the first layer L1 are performed on the basis of the current tensor epc(ti) and the previous tensor epc(ti-1). In order to be able to use at the current time point ti the previous tensor epc(ti-1) for the one or more convolutional operations of the first layer L1, the previous tensor epc(ti-1) has to be generated again respectively has to be re-computed at the current time point ti. In the case of the first layer L1 this requires performing a voxelization VX on the basis of the data pc(ti-1) of the previous scan. The voxelization VX is indicated in Figure 1 by a single arrow originating from the respective data on the basis of which the voxelization is performed.
However, in case of performing the one or more convolutional operations of a layer deeper in the CNN 3 respectively further away from the first layer L1, e.g. of the last layer L2c of the CNN 3, besides performing one or more convolutional operations at each preceding layer L1, L2a and L2b on the basis of the respective current tensor provided to the respective layer at the current time point ti (current tensor epc(ti) provided to the first layer L1, current tensor a1(ti) being an output tensor of the first layer L1 and provided to the second layer L2a, current tensor a2(ti) being an output tensor of the second layer L2a and provided to the third layer L2b), the following re-computation has to be done: a voxelization VX of data pc(ti-1), pc(ti-2), pc(ti-3) and pc(ti-4) of four directly subsequent previous scans that were provided by the 2D or 3D sensor directly before the current scan of the time point ti has to be re-computed, and a plurality of one or more convolutional operations of respective layers L1, L2a, L2b on the basis of previous tensors originating from the previous data pc(ti-1), pc(ti-2), pc(ti-3) and pc(ti-4) has to be re-computed for processing the tensors epc(ti), a1(ti), a2(ti) and a3(ti) originating from the data pc(ti) of the current scan in order to generate the output tensor a4(ti) of the CNN 3 shown in Figure 1.
The term “consecutive” may be used as a synonym for the term “directly subsequent”.
According to Figure 1, the CNN 3 has four layers L1, L2a, L2b and L2c. This is only an example. According to an embodiment of the invention, the CNN 3 may have more than four layers. The greater the number of layers of the CNN, the greater the number of the above-mentioned re-processing respectively re-computation steps, which have to be performed on the basis of previous data of previous scans when processing the data of a current scan at a current time point ti using the CNN 3.
As a result, using a CNN for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance has the disadvantage of requiring a high amount of computational resources. This is especially the case when the 2D or 3D sensor provides scans at a frame rate of at least 10 to 20 frames per second. The terms “frame” and “scan” may be used as synonyms.
In view of the above-mentioned problems and disadvantages, embodiments of the present invention aim to address the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. An objective is to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
The objective is achieved by the embodiments of the invention as described in the enclosed independent claims. Advantageous implementations of the embodiments of the invention are further defined in the dependent claims.
A first aspect of the present disclosure provides a device for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance, wherein the device is configured to employ a Convolutional Neural Network (CNN) in an inference phase. The CNN comprises a first layer and one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer. The device is configured to input data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN. That is, the device is configured to input the data of the current scan in the form of the current tensor into the CNN. Further, the device is configured to perform, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer. That is, the device is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer and the previous output tensor of the preceding layer. The previous output tensor of the preceding layer is the newest tensor stored in the respective first buffer of the one or more first buffers and originates from data of a previous scan inputted to the CNN directly before the data of the current scan. Furthermore, the device is configured to store the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
The device according to the first aspect makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN comprises one or more first buffers for storing the output tensor of a respective preceding layer, the need for re-computation of one or more previous tensors, in particular output tensors, of respective preceding layers for performing one or more convolutional operations at each further layer at a current time point is overcome. Namely, the device is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer originating from the data of the current scan and the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer. As a result of using one or more first buffers the inference time of the CNN in the inference phase is reduced. The inference time corresponds to the time required by the CNN for providing output data (the output tensor of the last layer of the CNN) starting from the current tensor originating from the data of the current scan and being input into the CNN at the current time. In particular, the computational costs may be reduced from K²/2 (i.e. O(K²)), in case of no buffers being used (as it is the case in the CNN of Figure 1), to K (i.e. O(K)), in case of the CNN of the device according to the first aspect comprising buffers. K is the number of aggregated scans respectively sweeps provided by the 2D or 3D sensor. In case the 2D or 3D sensor is a LIDAR sensor, K is the number of aggregated LIDAR sweeps. Therefore, the device according to the first aspect allows a real-time inference for high values of K, e.g. between 10 and 100 scans.
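As a small worked example of the cost estimate above (on the order of K²/2 layer-level evaluations per output without buffers versus K with buffers, K being the number of aggregated scans), the counts for a few values of K can be printed; the counting itself is only illustrative.

```python
for K in (10, 20, 100):                          # number of aggregated scans / sweeps
    print(f"K={K}: ~{K * K // 2} evaluations without buffers vs {K} with buffers")
```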
The passage “a tensor input into the CNN” may be understood as “a tensor that is input into the CNN”. The passage “a tensor input to the CNN” may be used as a synonym for the passage “a tensor input into the CNN”.
In particular, the first layer and the one or more further layers of the CNN are the layers of the CNN. Descriptions made herein with respect to a layer referred to by the general term “layer” are valid for the first layer as well as the one or more further layers.
In an implementation form of the first aspect, the CNN comprises one or more optional additional layers, wherein the device is configured to perform, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan. In particular, the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN. Descriptions made herein with respect to a layer referred to by the general term “layer” may also be valid for the one or more optional additional layers.
The 2D or 3D sensor configured to measure distance may comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the 2D or 3D sensor may comprise or correspond to one or more visual-depth-capable sensors. The 2D or 3D sensor may provide scans at a frame rate of at least 10 to 20 frames per second.
The term “2D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with two spatial dimensions (2-dimensional scans respectively frames) and to measure distance. The term “3D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with three spatial dimensions (3-dimensional scans respectively frames) and to measure distance.
In particular, the CNN is a feed-forward neural network that may be described with a directed acyclic graph (DAG). An advantage of a DAG neural network is that it has a finite impulse response operator and relates to finite impulse response filters (FIR filters).
In particular, a buffer is configured to buffer respectively store data, such as tensors. A buffer may be a data structure for buffering data, in particular tensors. The terms “buffer storage”, “rolling buffer” and “rolling buffer storage” may be used to refer to a buffer.
In an implementation form, at one or more of the one or more further layers of the CNN, besides the one or more convolutional operations, one or more optional further operations, such as one or more normalization operations and/or one or more activation operations, may be performed.
The terms “activation of a layer” or “layer output” may be used as synonyms for the output tensor of a layer.
In an implementation form of the first aspect, the CNN comprises a second buffer for storing a tensor input to the CNN, and the device is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan. The previous tensor is the newest tensor stored in the second buffer and corresponds to the data of the previous scan. Further, the device is configured to store the current tensor as the newest tensor in the second buffer.
Therefore, the device according to the first aspect makes it possible to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN comprises a second buffer for storing the tensor input to the CNN, the need for re-computation of one or more previous tensors input to the CNN on the basis of the corresponding data of the corresponding previous scan for performing one or more convolutional operations at the first layer at a current time point is overcome. Namely, the device is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and the previous tensor being the newest tensor stored in the second buffer.
Descriptions made herein with respect to a buffer referred to by the general term “buffer” are valid for the one or more first buffers as well as for the second buffer.
In particular, the number of first buffers of the CNN corresponds to the number of further layers of the CNN. In particular, the number of buffers of the CNN corresponds to the number of one or more first buffers and the second buffer.
In an implementation form of the first aspect, each buffer is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.
In particular, each buffer is a serial-in parallel-out (SIPO) shift register.
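Purely for illustration, such a serial-in parallel-out buffer may be modelled in Python as a fixed-length deque; the class below is a software sketch and not the claimed hardware register:

```python
from collections import deque

class RollingBuffer:
    """Serial-in parallel-out buffer: pushing a new tensor as the newest entry
    simultaneously drops the oldest entry once the buffer is full."""

    def __init__(self, temporal_size):
        self.temporal_size = temporal_size
        self._slots = deque(maxlen=temporal_size)

    def push(self, tensor):
        # a deque with maxlen discards its oldest element automatically when full
        self._slots.append(tensor)

    def read_all(self):
        # parallel-out: all stored tensors, oldest first
        return list(self._slots)

    @property
    def newest(self):
        return self._slots[-1] if self._slots else None
```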
In an implementation form of the first aspect, the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
In particular, one or more buffers have a different temporal size.
The term “convolutional matrix” may be used to refer to a convolutional kernel. The convolutional kernel is used at a layer of the CNN for performing the one or more convolutional operations of the layer.
The temporal size of a first buffer corresponds to the number of consecutive respectively directly subsequent time points for which the first buffer is configured to store the output tensor of the respective layer (preceding layer). Therefore, the temporal size may also be defined in terms of the number of directly subsequent scans of the 2D or 3D sensor for which the first buffer may store the output tensor of the respective layer (preceding layer). For example, if the temporal size of a first buffer is one (“1”) then the first buffer may only store one output tensor of a respective layer originating from data of one scan of one time point. This reduces the storage consumption, such as RAM consumption, to a minimum. Therefore, in this case, when storing the current output tensor (originating from data of a current scan of a current time point) of a layer (preceding layer) as the newest tensor in the respective first buffer, the previous output tensor of the layer (originating from data of the previous scan inputted to the CNN directly before the data of the current scan) already stored in the respective first buffer is dropped respectively deleted, because the first buffer is already full. For example, if the temporal size of a first buffer is three (“3”) then the first buffer may store three output tensors of a respective layer (preceding layer) originating from data of three consecutive scans of three consecutive time points.
Therefore, the temporal size of a buffer corresponds to the number of directly subsequent time points for which the buffer may store a tensor. Accordingly, the temporal size of a buffer corresponds to the number of directly subsequent scans of the 2D or 3D sensor for which the first buffer may store a corresponding tensor.
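The relation between the temporal size of a convolutional kernel and the temporal size of the corresponding buffer may be expressed, purely for illustration, as in the following Python snippet; the buffer names and kernel sizes are assumptions of the example:

```python
# Hypothetical temporal kernel sizes of the convolutions that consume the buffered
# tensors; names and values are chosen purely for illustration.
temporal_kernel_sizes = {"input_buffer": 2, "buffer_after_L1": 2, "buffer_after_L2a": 3}

# Each buffer keeps one tensor fewer than the temporal kernel size, because the
# newest tensor of the current time point is computed (or received) on the fly.
buffer_temporal_sizes = {name: k - 1 for name, k in temporal_kernel_sizes.items()}

assert buffer_temporal_sizes["buffer_after_L2a"] == 2   # holds the tensors of ti-2 and ti-1
```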
The term “consecutive” may be used as a synonym for the term “directly subsequent”.
In an implementation form of the first aspect, each tensor is a tensor with two or more dimensions.
In particular, each tensor is a tensor with one or more spatial dimensions and one channel dimension. The number of spatial dimensions of a tensor may be equal to the number of spatial dimensions of the data of the corresponding scan provided by the 2D or 3D sensor configured to measure distance from which the tensor originates.
The channel dimension corresponds to the number of channels and is greater than or equal to one (channel dimension ≥ 1).
In particular, the data of a scan (e.g. current scan) corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions. In particular, the number N of spatial dimensions may correspond to one spatial dimension, two spatial dimensions or three spatial dimensions. For example, in case the data of a scan (e.g. current scan) corresponds to a point cloud with two spatial dimensions, then the data with the two spatial dimensions may correspond to a bird’s eye view (BEV) representation of a point cloud. In case the data of a scan (e.g. current scan) corresponds to a point cloud with three spatial dimensions, then the data with the three spatial dimensions may correspond to a volumetric representation of a point cloud, e.g. for a flying capable vehicle, such as a flying capable robot, drones, aircraft etc.
For example, a point cloud with three spatial dimensions may be produced by a 2D or 3D sensor configured to measure distance, such as a LIDAR sensor, wherein a wave or a beam, such as a light beam, bounces back from an obstacle in the environment of the 2D or 3D sensor to produce a point with three spatial dimensions with its location in meters and a scalar reflection brightness attribute. Such a point cloud comprising a plurality, e.g. thousands, of points with three spatial dimensions and a scalar reflection brightness attribute may correspond to a tensor with four dimensions. The four dimensions correspond to the three spatial dimensions and one channel dimension for the scalar reflection brightness attribute.
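As a purely illustrative example, such a four-dimensional tensor (three spatial dimensions and one channel dimension for the reflection brightness) may be rasterized from a point cloud as in the following Python sketch; the grid extent and the voxel size are assumed values:

```python
import numpy as np

def points_to_voxel_tensor(points,
                           extent=((-40.0, 40.0), (-40.0, 40.0), (-3.0, 3.0)),
                           voxel_size=0.5):
    """points: (N, 4) array of x, y, z in meters plus a reflection brightness value.
    Returns a tensor of shape (X, Y, Z, 1): three spatial dimensions, one channel."""
    shape = tuple(int((hi - lo) / voxel_size) for lo, hi in extent)
    tensor = np.zeros(shape + (1,), dtype=np.float32)
    for x, y, z, brightness in points:
        indices = []
        for value, (lo, hi), dim in zip((x, y, z), extent, shape):
            if not (lo <= value < hi):
                break                      # point lies outside the grid, skip it
            indices.append(min(int((value - lo) / voxel_size), dim - 1))
        else:
            ix, iy, iz = indices
            # keep the strongest reflection per voxel in the single channel
            tensor[ix, iy, iz, 0] = max(tensor[ix, iy, iz, 0], brightness)
    return tensor
```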
In an implementation form of the first aspect, each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension. Alternatively, each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension. Alternatively, each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
In an implementation form of the first aspect, the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
In an implementation form of the first aspect, the device is configured to generate output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, wherein the device is configured to input, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.

The term “ego-vehicle” may be understood as a mobile platform, in particular a mobile robotic platform, bearing one or more sensors, such as one or more 2D or 3D sensors configured to measure distance, that computes respectively operates, and from the perspective of which the world respectively the environment is perceived.
The ego-vehicle may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
The ego-vehicle may comprise a localization unit configured to determine the current location of the ego-vehicle and, thus, the current location data of the ego-vehicle. The localization unit may be configured for a short-term localization, e.g. at the scope of 1 s to 10 s. The localization unit may comprise or correspond to one or more inertial measurement units (IMU). In particular, the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc. An odometry process is a process of understanding an ego location, i.e. the location of the ego-vehicle, based on sensory (e.g. wheel, inertial) information.
The term “local navigational frame” may be understood as a coordinate system tied to the ground. In particular, the local navigational frame may correspond to a 2-dimensional coordinate system with top-down view on the ground surface (as shown in Figure 7).

In particular, the grid of the local navigational frame is a regular grid composed of a plurality of cells. The term “local navigational coordinate frame” may be used as a synonym for the local navigational frame.
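As a small illustration (the cell size and the origin of the grid are assumed values), a location in the local navigational frame may be mapped to a cell of such a regular grid as follows:

```python
def location_to_cell(x_m, y_m, cell_size_m=0.5, origin=(0.0, 0.0)):
    """Map a 2D location in the local navigational frame (meters, top-down view)
    to the integer indices of the grid cell that contains it."""
    col = int((x_m - origin[0]) // cell_size_m)
    row = int((y_m - origin[1]) // cell_size_m)
    return row, col

# Two ego locations that fall into the same cell yield identical indices; this is
# the case in which no padding and cropping of the buffered tensors is performed.
assert location_to_cell(1.10, 2.20) == location_to_cell(1.30, 2.40)
```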
In particular, the device is configured to zero-pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
In particular, the grid of the local navigational frame is a regular grid composed of a plurality of cells and in case the ego-vehicle moves within the same cell of the grid, there is no padding and cropping performed by the device. In other words, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan, but the current location data and the previous location data of the ego-vehicle lie respectively are located within the same cell of the grid, the device does not perform padding and cropping.
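The zero-padding and cropping of a buffered tensor upon a change of the ego cell may be sketched, purely for illustration, as an integer shift of the tensor content with zero fill; the (C, H, W) layout and the sign convention of the shift are assumptions of the example:

```python
import numpy as np

def pad_and_crop(buffered_tensor, shift_rows, shift_cols):
    """Shift a buffered tensor of shape (C, H, W) by a whole number of grid cells so
    that it stays aligned with the new ego cell; content that moves out of the grid
    is cropped away and newly uncovered regions are zero-padded."""
    channels, height, width = buffered_tensor.shape
    shifted = np.zeros_like(buffered_tensor)
    src_rows = slice(max(0, shift_rows), min(height, height + shift_rows))
    dst_rows = slice(max(0, -shift_rows), min(height, height - shift_rows))
    src_cols = slice(max(0, shift_cols), min(width, width + shift_cols))
    dst_cols = slice(max(0, -shift_cols), min(width, width - shift_cols))
    shifted[:, dst_rows, dst_cols] = buffered_tensor[:, src_rows, src_cols]
    return shifted

# Example: the ego cell index increased by one row and one column between two scans.
aligned = pad_and_crop(np.ones((16, 8, 8), dtype=np.float32), shift_rows=1, shift_cols=1)
```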
In an implementation form of the first aspect, the device is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
In an implementation form of the first aspect, the device is configured to store the location data in a location field of one or more of the one or more first buffers and the second buffer, and update the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
In particular, the grid of the local navigational frame is a regular grid composed of a plurality of cells and in case the ego-vehicle moves within the same cell of the grid, there is no updating of the location field of the respective one or more buffers. In other words, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan, but the current location data and the previous location data of the ego-vehicle lie respectively are located within the same cell of the grid, the device does not update the location field of the respective one or more buffers with the current location data.

In an implementation form of the first aspect, the device is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
In an implementation form of the first aspect, the device is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
In an implementation form of the first aspect, the device is configured to generate on the basis of the data of the current scan the current tensor input to the CNN by performing a voxelization.
In particular, the data of the current scan corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions. Thus, the device may be configured to generate on the basis of the point cloud of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the point cloud of the current scan into the local navigational frame.
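For example, assuming the ego pose in the local navigational frame is available as a 2D position and a heading angle (an assumption made only for this sketch), the change of coordinates of the points of the current scan may be written as:

```python
import numpy as np

def sensor_points_to_local_frame(points_xy, ego_x, ego_y, ego_yaw_rad):
    """points_xy: (N, 2) point coordinates in the sensor/ego frame (meters).
    Returns the same points expressed in the local navigational frame."""
    c, s = np.cos(ego_yaw_rad), np.sin(ego_yaw_rad)
    rotation = np.array([[c, -s], [s, c]])
    # rotate each point by the ego heading and translate by the ego position
    return points_xy @ rotation.T + np.array([ego_x, ego_y])
```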
The current tensor input to the CNN may correspond to an encoded point cloud. An encoded point cloud may be understood as the result of the transformation of a raw respectively unordered point cloud into a voxelized respectively ordered format.
Voxelization is a transformation of an unordered set of points, such as the unordered points of a point cloud, into a regular grid. In particular, voxelization is a transformation of an unordered set of points with N spatial dimensions, such as the unordered points of a point cloud with N spatial dimensions, into a regular N-dimensional grid. Voxelization may be performed by pillar encoding or voxel feature encoding. In particular, the number N of spatial dimensions may correspond to one spatial dimension, two spatial dimensions or three spatial dimensions.

In an implementation form of the first aspect, the device is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the device is configured to perform a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the device is configured to perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
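Purely as an illustration of a pillar-style encoding as mentioned above, a strongly simplified variant with two spatial dimensions and three hand-chosen per-cell channels may look as follows; the grid extent, the cell size and the channel definitions are assumptions of the example and do not represent any particular embodiment:

```python
import numpy as np

def pillar_encode(points, extent=((-40.0, 40.0), (-40.0, 40.0)), cell_size=0.5):
    """points: (N, 4) array of x, y, z, reflection brightness. Returns a BEV tensor
    of shape (H, W, 3): point count, highest point, strongest reflection per cell."""
    h = int((extent[0][1] - extent[0][0]) / cell_size)
    w = int((extent[1][1] - extent[1][0]) / cell_size)
    bev = np.zeros((h, w, 3), dtype=np.float32)
    for x, y, z, brightness in points:
        if not (extent[0][0] <= x < extent[0][1] and extent[1][0] <= y < extent[1][1]):
            continue                                   # point outside the grid
        i = int((x - extent[0][0]) / cell_size)
        j = int((y - extent[1][0]) / cell_size)
        bev[i, j, 0] += 1.0                            # number of points in the pillar
        if bev[i, j, 0] == 1.0:
            bev[i, j, 1] = z                           # first point defines the height
        else:
            bev[i, j, 1] = max(bev[i, j, 1], z)        # keep the highest point
        bev[i, j, 2] = max(bev[i, j, 2], brightness)   # keep the strongest reflection
    return bev
```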
In particular, the device is configured to detect weak target objects in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN. A weak target object is an object with low spatial information but significant temporal information. In particular, a weak target object is a moving object, such as a car, pedestrian including a child, bicyclist including a child, motorcyclist, etc., that has few points, in particular between 1 and 5 points, more particularly between 1 and 10 points, detectable by the 2D or 3D sensor. For example, in case the 2D or 3D sensor is a LIDAR sensor, a weak target object is an object that has few LIDAR echo points, in particular between 1 and 5 LIDAR echo points, more particularly between 1 and 10 LIDAR echo points. In fog, rain and snow as well as in a case of heavy occlusions even normally well detectable or visible (by the 2D or 3D sensor such as a LIDAR sensor) objects may become weak target objects.
In an implementation form of the first aspect, in case the 2D or 3D sensor is a LIDAR sensor a weak target object may be defined as a moving object, e.g. from a list of known object categories (classes), that has between 1 and 5, in particular between 1 and 10, LIDAR echo points falling on it.
In order to achieve the device according to the first aspect of the present disclosure, some or all of the implementation forms and optional features of the first aspect, as described above, may be combined with each other.
A second aspect of the present disclosure provides a hardware implementation of a Convolutional Neural Network (CNN) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. The CNN comprises a first layer and at least one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer. The hardware implementation of the CNN is configured to input data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN. Further, the hardware implementation of the CNN is configured to perform, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer. The previous output tensor of the preceding layer is the newest tensor stored in the respective first buffer of the one or more first buffers and originates from data of a previous scan inputted to the CNN directly before the data of the current scan. The hardware implementation of the CNN is configured to store the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
In an implementation form of the second aspect, the CNN comprises one or more optional additional layers, wherein the hardware implementation of the CNN is configured to perform, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan. In particular, the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN.
In an implementation form of the second aspect, the CNN comprises a second buffer for storing a tensor input to the CNN, and the hardware implementation of the CNN is configured to perform, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan. The previous tensor is the newest tensor stored in the second buffer and corresponds to the data of the previous scan. Further, the hardware implementation of the CNN is configured to store the current tensor as the newest tensor in the second buffer.
In an implementation form of the second aspect, each buffer is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.

In an implementation form of the second aspect, the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
In particular, one or more buffers have a different temporal size.
In an implementation form of the second aspect, each tensor is a tensor with two or more dimensions.
In particular, each tensor is a tensor with one or more spatial dimensions and one channel dimension.
In an implementation form of the second aspect, each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension. Alternatively, each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension. Alternatively, each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
In an implementation form of the second aspect, the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to generate output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, wherein the hardware implementation of the CNN is configured to input, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and pad and crop in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.

In an implementation form of the second aspect, the hardware implementation of the CNN is configured to store the location data in a location field of one or more of the one or more first buffers and the second buffer, and update the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to generate on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
In an implementation form of the second aspect, the hardware implementation of the CNN is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the hardware implementation of the CNN is configured to perform a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the hardware implementation of the CNN is configured to perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.
The hardware implementation of the CNN of the second aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features. The implementation forms and optional features of the device according to the first aspect are correspondingly valid for the hardware implementation of the CNN according to the second aspect.
In order to achieve the hardware implementation of the CNN according to the second aspect of the present disclosure, some or all of the implementation forms and optional features of the second aspect, as described above, may be combined with each other.
A third aspect of the present disclosure provides an ego-vehicle comprising one or more 2D or 3D sensors configured to measure distance, and a device according to the first aspect or any implementation form thereof. The one or more 2D or 3D sensors are configured to provide scans containing spatial information of the vicinity of the ego-vehicle in the form of a data stream to the device and the device is configured to process the data stream.
The term “ego-vehicle” may be understood as a mobile platform, in particular a mobile robotic platform, bearing one or more sensors, such as one or more 2D or 3D sensors configured to measure distance, that computes respectively operates, and from the perspective of which the world respectively the environment is perceived.
The ego-vehicle may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
The ego-vehicle may comprise a localization unit configured to determine the current location of the ego-vehicle and, thus, the current location data of the ego-vehicle. The localization unit may be configured for a short-term localization, e.g. at the scope of 1 s to 10 s. The localization unit may comprise or correspond to one or more inertial measurement units (IMU). In particular, the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
The one or more 2D or 3D sensors configured to measure distance may each comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the one or more 2D or 3D sensors may each comprise or correspond to one or more visual- depth-capable sensors. The 2D or 3D sensor may provide scans at a frame rate of at least 10 to 20 frames per second.
The term “2D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with two spatial dimensions (2-dimensional scans respectively frames) and to measure distance. The term “3D sensor configured to measure distance” may be understood to correspond to a sensor that is configured to detect scans with three spatial dimensions (3-dimensional scans respectively frames) and to measure distance.
In an implementation form of the third aspect, the device is configured to control autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
The ego-vehicle of the third aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features.
In order to achieve the ego-vehicle according to the third aspect of the present disclosure, some or all of the implementation forms and optional features of the third aspect, as described above, may be combined with each other.
A fourth aspect of the present disclosure provides a method of employing a Convolutional Neural Network, CNN, in an inference phase, for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. The CNN comprises a first layer and at least one or more further layers following the first layer and one or more first buffers for storing an output tensor of a respective preceding layer. The method comprises the steps of inputting data of a current scan provided by the 2D or 3D sensor in the form of a current tensor into the CNN, performing, at each further layer, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer, the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer of the one or more first buffers and originating from data of a previous scan inputted to the CNN directly before the data of the current scan, and storing the current output tensor of the preceding layer as the newest tensor in the respective first buffer.
In an implementation form of the fourth aspect, the CNN comprises one or more optional additional layers, wherein the method comprises the step of performing, at each optional additional layer, one or more convolutional operations on the basis of a current output tensor of a preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer originating from data of a previous scan inputted to the CNN directly before the data of the current scan.
In particular, the first layer, the one or more further layers and the one or more optional additional layers of the CNN are the layers of the CNN.
In an implementation form of the fourth aspect, the CNN comprises a second buffer for storing a tensor input to the CNN, and the method comprises the steps of performing, at the first layer, one or more convolutional operations on the basis of the current tensor and a previous tensor generating a current output tensor of the first layer originating from the data of the current scan, the previous tensor being the newest tensor stored in the second buffer and corresponding to the data of the previous scan, and storing the current tensor as the newest tensor in the second buffer.

In an implementation form of the fourth aspect, each buffer is a serial-in parallel-out buffer and the method comprises the steps of storing a new tensor as newest tensor, and in case the buffer is full, simultaneously dropping the oldest tensor stored in the buffer.
In an implementation form of the fourth aspect, the temporal size of each of the one or more first buffers for storing the output tensor of a respective preceding layer is one less than the temporal size of a convolutional kernel of the respective preceding layer.
In particular, one or more buffers have a different temporal size.
In an implementation form of the fourth aspect, each tensor is a tensor with two or more dimensions.
In particular, each tensor is a tensor with one or more spatial dimensions and one channel dimension.
In an implementation form of the fourth aspect, each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension. Alternatively, each tensor is a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension. Alternatively, each tensor is a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
In an implementation form of the fourth aspect, the channel dimension of each first buffer for storing the output tensor of a respective preceding layer corresponds to the channel dimension of the respective preceding layer.
In an implementation form of the fourth aspect, output data for a navigation process of an ego-vehicle, on which the 2D or 3D sensor is arranged, are generated, wherein the method comprises the steps of inputting, together with the data of the current scan, current location data of the ego-vehicle in a grid of a local navigational frame to the CNN, and padding and cropping in each buffer the newest tensor being stored, in case the current location data of the ego-vehicle do not match previous location data inputted together with the data of the previous scan.
In an implementation form of the fourth aspect, the method comprises the step of controlling autonomous movement of the ego-vehicle on the basis of the processing of the data stream.
In an implementation form of the fourth aspect, the method comprises the steps of storing the location data in a location field of one or more of the one or more first buffers and the second buffer, and updating the location field of the respective one or more buffers with the current location data in case the current location data of the ego-vehicle do not match the previous location data.
In an implementation form of the fourth aspect, the method comprises the step of generating on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame.
In an implementation form of the fourth aspect, the method comprises the step of generating on the basis of the data of the current scan and the current location data of the ego-vehicle in the grid of the local navigational frame the current tensor input to the CNN by additionally performing a voxelization.
In an implementation form of the fourth aspect, the method comprises the step of detecting targets, in particular moving targets, in the vicinity of the 2D or 3D sensor on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the method comprises the step of performing a point cloud semantic segmentation on the basis of the output tensor of the last layer of the one or more further layers of the CNN. Alternatively or additionally, the method comprises the step of performing a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN.

The method of the fourth aspect and its implementation forms and optional features achieve the same advantages as the device of the first aspect and its respective implementation forms and respective optional features.
The implementation forms and optional features of the device according to the first aspect are correspondingly valid for the method according to the fourth aspect.
In order to achieve the method according to the fourth aspect of the present disclosure, some or all of the implementation forms and optional features of the fourth aspect, as described above, may be combined with each other.
A fifth aspect of the present disclosure provides a computer program comprising program code for performing the method according to the fourth aspect or any of its implementation forms.
In particular, the fifth aspect of the present disclosure provides a computer program comprising program code for performing, when implemented on a processor, the method according to the fourth aspect or any of its implementation forms.
A sixth aspect of the present disclosure provides a computer program product comprising program code for performing, when implemented on a processor, a method according to the fourth aspect or any implementation form thereof.
A seventh aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the fourth aspect or any of its implementation forms to be performed.
An eighth aspect of the present disclosure provides a computer comprising a memory and a processor, which are configured to store and execute program code to perform a method according to the fourth aspect or any implementation form thereof.
The memory may be distributed over a plurality of physical devices. A plurality of processors that co-operate in executing the program code may be referred to as a processor.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
Figure 1 shows an example of an operation of a Convolutional Neural Network (CNN) at a current time point ti.
Figure 2 shows a device according to an embodiment of the invention and an ego- vehicle according to an embodiment of the invention.
Figure 3A shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
Figure 3B shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
Figure 4 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.

Figure 5 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
Figure 6 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
Figure 7 shows a grid of a local navigational frame with a location of an ego-vehicle according to an embodiment of the invention for two directly subsequent time points.
Figure 8 shows a diagram of a method according to an embodiment of the invention with respect to storing tensors originating from data of a current scan of a current time point ti in buffers of a Convolutional Neural Network (CNN) according to an embodiment of the invention.
Figure 9 shows a diagram of a method of employing a Convolutional Neural Network (CNN) in an inference phase according to an embodiment of the invention for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance.
DETAILED DESCRIPTION OF EMBODIMENTS
In the Figures 1 to 10 corresponding elements are labelled by the same reference signs.
Figure 2 shows on the left side a device 1 according to an embodiment of the invention and on the right side an ego-vehicle 5 according to an embodiment of the invention.
The above description with respect to the device according to the first aspect and its implementation forms is correspondingly valid for the device 1 of Figure 2. The above description with respect to the ego-vehicle according to the third aspect of the invention and its implementation forms is correspondingly valid for the ego-vehicle 5 of Figure 2.
As shown on the left side of Figure 2, the device 1 comprises a Convolutional Neural Network 3 (CNN) and is configured to employ the CNN in inference phase. The device 1 is configured to receive and process a data stream of scans containing spatial information provided by a 2D or 3D sensor 2 configured to measure distance.
The CNN 3 of the device 1 comprises a first layer and one or more further layers following the first layer (not shown in Figure 2). The CNN 3 of the device 1 further comprises one or more first buffers 4a for storing an output tensor of a respective preceding layer.
The device 1 is configured to input data of a current scan, which is provided by the 2D or 3D sensor 2, in the form of a current tensor into the CNN 3, perform, at each further layer of the CNN 3, one or more convolutional operations on the basis of a current output tensor of the preceding layer originating from the data of the current scan and a previous output tensor of the preceding layer, the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer 4a of the one or more first buffers 4a and originating from data of a previous scan inputted to the CNN 3 directly before the data of the current scan, and store the current output tensor of the preceding layer as the newest tensor in the respective first buffer 4a.
Embodiments of the CNN 3, in particular an operation of embodiments of the CNN 3, are shown in the Figures 3A, 3B, 4, 5 and 6.
The device 1 of Figure 2 allows to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by the 2D or 3D sensor 2. Since the CNN 3 comprises one or more first buffers 4a for storing the output tensor of a respective preceding layer, the need of re-computation of one or more previous tensors, in particular output tensors, of respective preceding layers for performing one or more convolutional operations at each further layer at a current time point is overcome. Namely, the device 1 is configured to perform, at each further layer, one or more convolutional operations on the basis of the current output tensor of the preceding layer originating from the data of the current scan and the previous output tensor of the preceding layer being the newest tensor stored in the respective first buffer 4a. As a result of using one or more first buffers 4a the inference time of the CNN 3 in the inference phase is reduced.
As shown on the right side of Figure 2 the device 1 and the 2D or 3D sensor 2 may be part of the ego-vehicle 5. That is, the ego-vehicle 5 comprises the 2D or 3D sensor 2 and the device 1. The ego-vehicle 5 may also comprise more than one 2D or 3D sensor 2. Therefore, the ego-vehicle 5 comprises one or more 2D or 3D sensors 2 configured to measure distance and the device 1, wherein the one or more 2D or 3D sensors 2 are configured to provide scans containing spatial information of the vicinity respectively environment of the ego-vehicle 5 in the form of a data stream to the device 1 and the device 1 is configured to process the data stream.
The one or more 2D or 3D sensors 2 configured to measure distance may each comprise or correspond to one or more LIDAR sensors (light detection and ranging sensors), TOF cameras (time of flight cameras), stereo cameras and/or beamforming radars. That is, the one or more 2D or 3D sensors 2 may each comprise or correspond to one or more visual- depth-capable sensors. The one or more 2D or 3D sensors 2 may provide scans at a frame rate of at least 10 to 20 frames per second.
The ego-vehicle 5 may correspond to a vehicle, such as a car, truck, motorcycle etc., an autonomous vehicle, such as an autonomous car, autonomous truck etc., a robot, such as a delivery robot, an autonomous robot, such as an autonomous delivery robot, a flying capable vehicle, such as a flying capable drone, flying capable robot, aircraft etc., or an autonomous flying capable vehicle, such as an autonomous flying capable drone, an autonomous flying capable robot, autonomous aircraft etc.
The ego-vehicle 5 may comprise a localization unit (not shown in Figure 2) configured to determine the current location of the ego-vehicle 5 and, thus, the current location data of the ego-vehicle 5. The localization unit may be configured for a short-term localization, e.g. at the scope of 1 s to 10 s. The localization unit may comprise or correspond to one or more inertial measurement units (IMU). In particular, the localization unit is configured to perform an odometry process, such as an inertial odometry process, a wheel odometry process and/or an optical respectively visual odometry process etc.
Figure 3A shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 3A. The CNN 3 of Figure 3A may be an embodiment of the CNN 3 of the device 1 of Figure 2.
According to the embodiment of Figure 3A, the CNN 3 comprises a first layer L1, one further layer L2a and one first buffer 4a, wherein the further layer L2a is a consecutive layer following the first layer L1. That is, according to the embodiment of Figure 3A, the first layer L1 is the preceding layer of the further layer L2a. As described already above, the CNN 3 may comprise more than one further layer. In addition or alternatively, the CNN 3 may comprise one or more optional additional layers (not shown in Figure 3A).
As shown in Figure 3A, at a current time point ti (the time T equals to the time point ti) data of a current scan provided by a 2D or 3D sensor 2 configured to measure distance (not shown in Figure 3A) are input in the form of a current tensor epc(ti) of the current time point ti into the CNN 3. At the further layer L2a one or more convolutional operations are performed on the basis of a current output tensor a1(ti) of the preceding layer L1 (which is the first layer L1), wherein the current output tensor a1(ti) originates from the data of the current scan of the current time point ti, and a previous output tensor a1(ti-1) of the preceding layer L1, wherein the previous output tensor a1(ti-1) of the preceding layer L1 is the newest tensor stored in the first buffer 4a and originates from data of a previous scan (not shown in Figure 3A) inputted to the CNN directly before the data of the current scan. The previous scan is inputted at a previous time point ti-1 directly before the current time point ti and, thus, the previous scan may be referred to as the scan of the previous time point ti-1.
The current output tensor a1(ti) of the preceding layer L1 may be stored as the newest tensor in the first buffer 4a (not shown in Figure 3A). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent output tensor a1(ti+1) of the preceding layer L1 originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current output tensor a1(ti) stored as the newest tensor in the first buffer 4a at the directly subsequent time point ti+1.
The CNN 3 of Figure 3A allows to reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by the 2D or 3D sensor 2. Since the CNN 3 comprises the first buffer 4a for storing the output tensor of the first layer L1, the need of re-computation of a previous output tensor a1(ti-1) of the first layer L1 for performing one or more convolutional operations at the further layer L2a at a current time point ti is overcome. Namely, at the further layer L2a one or more convolutional operations are performed on the basis of the current output tensor a1(ti) of the first layer L1 originating from the data of the current scan and the previous output tensor a1(ti-1) of the first layer L1 being the newest tensor stored in the first buffer 4a. As a result of using the first buffer 4a the inference time of the CNN 3 in the inference phase is reduced.
According to Figure 3A the temporal size of the first buffer 4a equals to one (“1”), because the first buffer 4a may store the output tensor of the first layer L1 for only one time point. As a result, when the current output tensor a1(ti) of the preceding layer L1 is stored at the current time point ti as the newest tensor in the first buffer 4a, the previous output tensor a1(ti-1) of the preceding layer L1, which was stored as the newest tensor in the first buffer 4a at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti. In particular, the temporal size of the first buffer 4a is one less than the temporal size of a convolutional kernel of the first layer L1. Thus, according to the embodiment of Figure 3A, the temporal size of the convolutional kernel of the first layer L1 may correspond to two (“2”). The temporal size of the first buffer may alternatively correspond to a number greater than one (“1”) and, thus, the first buffer may be configured to store the output tensor of the preceding layer L1 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
Figure 3B shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The CNN 3 of Figure 3B may be an embodiment of the CNN 3 of the device 1 of Figure 2. The CNN 3 of Figure 3B differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 3B comprises a second buffer 4b for storing a tensor input to the CNN 3. Therefore, the description of the CNN 3 of Figure 3A is also valid for the CNN 3 of Figure 3B. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 3B. In the following mainly the additional feature(s) of the CNN 3 of Figure 3B respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 3B are described.
As shown in Figure 3B, the CNN 3 comprises, besides the first layer L1, the one further layer L2a and the first buffer 4a, a second buffer 4b for storing a tensor input to the CNN 3. At the first layer L1, one or more convolutional operations are performed on the basis of the current tensor epc(ti) originating from the data of the current scan of the current time point ti and a previous tensor epc(ti-1) generating a current output tensor a1(ti) of the first layer L1 originating from the data of the current scan, wherein the previous tensor epc(ti-1) is the newest tensor stored in the second buffer 4b at the time point ti-1 and corresponds to the data of the previous scan inputted to the CNN 3 at the time point ti-1 directly before the data of the current scan input to the CNN 3 at the current time point ti. The current tensor epc(ti) may be stored as the newest tensor in the second buffer 4b (not shown in Figure 3B). Therefore, at the directly subsequent time point ti+1 directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent tensor epc(ti+1) originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+1 and the current tensor epc(ti) stored as the newest tensor in the second buffer 4b at the directly subsequent time point ti+1.
Therefore, compared to the CNN 3 of Figure 3A, the CNN 3 of Figure 3B allows to further reduce the amount of computational resources required for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor configured to measure distance. Since the CNN 3 comprises a second buffer 4b for storing the tensor input to the CNN 3, the need of re-computation of a previous tensor epc(ti-1) input to the CNN 3 on the basis of the corresponding data of the corresponding previous scan for performing one or more convolutional operations at the first layer L1 at the current time point ti is overcome. Namely, at the first layer L1, one or more convolutional operations are performed on the basis of the current tensor epc(ti) and the previous tensor epc(ti-1) being the newest tensor stored in the second buffer 4b.
According to Figure 3B the temporal size of the second buffer 4b equals to one (“1”), because the second buffer 4b may store the tensor input to the CNN 3 for only one time point respectively for one scan. As a result, when the current tensor epc(ti) is stored at the current time point ti as the newest tensor in the second buffer 4b, the previous tensor epc(ti-1), which was stored as the newest tensor in the second buffer 4b at the previous time point ti-1 directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti.
The temporal size of the second buffer 4b may alternatively correspond to a number greater than one (“1”) and, thus, the second buffer 4b may be configured to store the tensor of the input to the CNN 3 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans. In an embodiment of the present invention, the first buffer 4a and the second buffer 4b have a different temporal size. Alternatively, they may have the same temporal size.
Figure 4 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The CNN 3 of Figure 4 may be an embodiment of the CNN 3 of the device 1 of Figure 2. The CNN 3 of Figure 4 differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 4 comprises a first buffer 4a with a temporal size of two (“2”). Therefore, the description of the CNN 3 of Figure 3A is also valid for the CNN 3 of Figure 4. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 4. In the following mainly the additional feature(s) of the CNN 3 of Figure 4 respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 4 are described.
According to Figure 4 the temporal size of the first buffer 4a equals to two (“2”), because the first buffer 4a may store the output tensor of the first layer L1 (which is the preceding layer of the further layer L2a) for two directly subsequent time points respectively for two directly subsequent scans. As shown in Figure 4, at the current time point ti the first buffer 4a stores as the newest tensor the previous output tensor a1(ti-1) of the first layer L1 originating from data of a previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data of the current scan, which are input at the current time point ti to the CNN 3 in the form of the current tensor epc(ti), and the second previous output tensor a1(ti-2) of the first layer L1 originating from data of a second previous scan inputted to the CNN 3 at the second previous time point ti-2 directly before the previous time point ti-1.
As a result, when the current output tensor a1(ti) of the first layer L1 is stored at the current time point ti as the newest tensor in the first buffer 4a, the second previous output tensor a1(ti-2) of the first layer L1 is simultaneously dropped respectively deleted at the current time point ti (this is indicated in the dashed box on the left side of Figure 4). Namely, the temporal size of the first buffer 4a is only two and, thus, the first buffer 4a may store the output tensor of the first layer L1 for only two directly subsequent time points, such as for the two directly subsequent time points ti-2 and ti-1 or ti-1 and ti.
The temporal size of the first buffer may alternatively correspond to a number greater than two (“2”) and, thus, the first buffer may be configured to store the output tensor of the first layer L1 for more than two time points, i.e. for three or more directly subsequent time points, respectively for more than two scans, i.e. for three or more directly subsequent scans.
Figure 5 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The CNN 3 of Figure 5 may be an embodiment of the CNN 3 of the device 1 of Figure 2. The CNN 3 of Figure 5 differs from the CNN 3 of Figure 4 in that the CNN 3 of Figure 5 comprises a second buffer 4b for storing a tensor input to the CNN 3. Therefore, the description of the CNN 3 of Figure 3B and 4 is also valid for the CNN 3 of Figure 5. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 5. In the following mainly the additional feature(s) of the CNN 3 of Figure 5 respectively the differences between the CNN 3 of Figure 4 and the CNN 3 of Figure 5 are described.
As shown in Figure 5, the CNN 3 comprises, besides the first layer L1, the one further layer L2a and the first buffer 4a, a second buffer 4b for storing a tensor input to the CNN 3.
According to Figure 5 the temporal size of the second buffer 4b equals to two (“2”), because the second buffer 4b may store a tensor that is input to the CNN 3 for two directly subsequent time points. As shown in Figure 5, at the current time point ti the second buffer 4b stores as the newest tensor the previous tensor epc(ti-1) corresponding to data of a previous scan inputted to the CNN 3 at the previous time point ti-1 directly before the data of the current scan, which are input at the current time point ti to the CNN 3 in the form of the current tensor epc(ti), and the second previous tensor epc(ti-2) corresponding to data of a second previous scan inputted to the CNN 3 at the second previous time point ti-2 directly before the previous time point ti-1.
As a result, when the current tensor epc(ti) is stored at the current time point ti as the newest tensor in the second buffer 4b, the second previous tensor epc(ti-2) is simultaneously dropped respectively deleted at the current time point ti (this is indicated in the dashed box on the left side of Figure 5). Namely, the temporal size of the second buffer 4b is only two and, thus, the second buffer 4b may store the tensor that is input to the CNN 3 for only two directly subsequent time points, such as for the two directly subsequent time points ti-2 and ti-1 or ti-1 and ti.
The temporal size of the second buffer 4b may alternatively correspond to a number greater than two (“2”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than two time points, i.e. for three or more directly subsequent time points.
Figure 6 shows an operation of a Convolutional Neural Network (CNN) according to an embodiment of the invention at a current time point ti.
The CNN 3 of Figure 6 may be an embodiment of the CNN 3 of the device 1 of Figure 2. The CNN 3 of Figure 6 differs from the CNN 3 of Figure 3A in that the CNN 3 of Figure 6 comprises three further layers L2a, L2b and L2c and an optional second buffer 4b for storing a tensor input to the CNN 3 and in that an optional voxelization VX is performed. Therefore, the description of the CNN 3 of Figures 3A and 3B is also valid for the CNN 3 of Figure 6. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 6. In the following mainly the additional feature(s) of the CNN 3 of Figure 6 respectively the differences between the CNN 3 of Figure 3A and the CNN 3 of Figure 6 are described.
According to the embodiment of Figure 6, the CNN 3 comprises a first layer LI, three further layers L2a, L2b and L2c, three first buffers 4a and one optional second buffer 4b. The first further layer L2a is a consecutive layer following the first layer LI, the second further layer L2b is a consecutive layer following the first further layer L2a and the third further layer L2c is a consecutive layer following the second further layer L2b. That is, according to the embodiment of Figure 6, the first layer LI is the preceding layer of the first further layer L2a, the first further layer L2a is the preceding layer of the second further layer L2b and the second further layer L2b is the preceding layer of the third further layer L2c.
As described already above, the CNN 3 may comprise only one further layer or only two further layers or more than three further layers. In addition or alternatively, the CNN 3 may comprise one or more optional additional layers (not shown in Figure 6). In an embodiment one or more of the one or more optional additional layers may be arranged between further layers and/or between the first layer and the first further layer.
As shown in Figure 6, at the current time point ti data pc(ti) of a current scan provided by a 2D or 3D sensor configured to measure distance (not shown in Figure 6), such as the 2D or 3D sensor 2 of Figure 2, are input in the form of a current tensor epc(ti) of the current time point ti into the CNN 3. The current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan by optionally performing a voxelization VX.
In an embodiment of the invention, the current tensor epc(ti) may be generated on the basis of the data pc(ti) of the current scan and the current location data of an ego-vehicle, such as the ego-vehicle 5 of Figure 2, in the grid of a local navigational frame by performing a voxelization VX, in case the 2D or 3D sensor providing the data pc(ti) of the current scan is arranged on the ego-vehicle.
In particular, the data of the current scan pc(ti) corresponds to a point cloud, wherein a point cloud is a set of points in an N-dimensional space along with their attributes, wherein N corresponds to the number of spatial dimensions. At each further layer Lj (j may be 2a, 2b or 2c) one or more convolutional operations are performed on the basis of a current output tensor aw(ti) of the preceding layer Lv, wherein the current output tensor aw(ti) originates from the data pc(ti) of the current scan of the current time point ti, and a previous output tensor aw(ti-i) of the preceding layer Lv, wherein the previous output tensor aw(ti-i) of the preceding layer Lv is the newest tensor stored in the respective first buffer 4a and originates from data of a previous scan (not shown in Figure 6) inputted to the CNN directly before the data pc(ti) of the current scan (w = 1 and v = 1, in case j = 2a; w = 2 and v = 2a, in case j = 2b; and w = 3 and v = 2b, in case j = 2c).
The previous scan is inputted at a previous time point ti-i directly before the current time point ti and, thus, the previous scan may be referred to as the scan of the previous time point ti-i.
The current output tensor aw(ti) of the preceding layer Lv may be stored as the newest tensor in the corresponding first buffer 4a (shown in the dashed box on the left side of Figure 6). Therefore, at the directly subsequent time point ti+i directly after the current time point ti, at each further layer Lj (j may be 2a, 2b or 2c) one or more convolutional operations are performed on the basis of the directly subsequent output tensor aw(ti+i) of the preceding layer Lv originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+i and the current output tensor aw(ti) stored as the newest tensor in the corresponding first buffer 4a at the directly subsequent time point ti+i.
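As an illustration of how a further layer may combine the buffered previous output tensor with the current one, the following sketch uses PyTorch and a 3D convolution whose temporal kernel size is two; the channel counts and spatial sizes are placeholder assumptions of this sketch and are not taken from the embodiments:

import torch
import torch.nn as nn

C_in, C_out, H, W = 32, 64, 280, 280  # assumed sizes, for illustration only

# A further layer Lj with a convolutional kernel of temporal size 2.
layer_j = nn.Conv3d(C_in, C_out, kernel_size=(2, 3, 3), padding=(0, 1, 1))

a_prev = torch.zeros(1, C_in, H, W)  # aw(ti-1), the newest tensor in the first buffer 4a
a_curr = torch.zeros(1, C_in, H, W)  # aw(ti), originating from the data of the current scan

# Stack both time points along a temporal dimension of size 2; the temporal kernel
# consumes both, so the result corresponds to the current time point only.
x = torch.stack([a_prev, a_curr], dim=2)  # shape [1, C_in, 2, H, W]
a_out = layer_j(x).squeeze(2)             # shape [1, C_out, H, W]

At the directly subsequent time point, a_curr would be taken from the first buffer and a newly computed tensor would take its place, so that no recomputation over older scans is needed.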
According to Figure 6 the temporal size of the first buffers 4a equals one (“1”), because the first buffers 4a may store the output tensor of the corresponding layer (preceding layer) for only one time point respectively one scan. As a result, when at the current time point ti the current output tensor aw(ti) of the corresponding layer Lv (preceding layer) is stored as the newest tensor in the corresponding first buffer 4a, the previous output tensor aw(ti-i) of the corresponding layer Lv, which was stored as the newest tensor in the corresponding first buffer 4a at the previous time point ti-i directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti (this is indicated in the dashed box on the left side; w = 1, in case v = 1; w = 2, in case v = 2a; and w = 3, in case v = 2b). In particular, the temporal size of a first buffer 4a is one less than the temporal size of a convolutional kernel of the corresponding layer (preceding layer). Thus, according to the embodiment of Figure 6, the temporal size of the convolutional kernel of the layers LI, L2a and L2b may correspond to two (“2”).
The temporal size of one or more first buffers 4a may alternatively correspond to a number greater than one (“1”) and, thus, the one or more first buffers 4a may be configured to store the output tensor of a corresponding preceding layer for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
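The relation between the temporal size of a first buffer and the temporal size of the convolutional kernel of the corresponding preceding layer can be written down directly; the function name below is illustrative only:

def first_buffer_temporal_size(kernel_temporal_size):
    # The current output tensor arrives fresh at every time point, so only the
    # remaining kernel_temporal_size - 1 older tensors need to be buffered.
    return kernel_temporal_size - 1

assert first_buffer_temporal_size(2) == 1  # Figure 6: kernel temporal size 2, buffer size 1
assert first_buffer_temporal_size(3) == 2  # by the same relation, a buffer of temporal size 2 (as in Figure 4) would pair with a temporal kernel size of 3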
According to the embodiment of Figure 6, at the first layer LI one or more convolutional operations are performed on the basis of the current tensor epc(ti) originating from the data pc(ti) of the current scan of the current time point ti and a previous tensor epc(ti-i), generating a current output tensor al(ti) of the first layer LI originating from the data pc(ti) of the current scan. The previous tensor epc(ti-i) is the newest tensor stored in the optional second buffer 4b and corresponds to the data of the previous scan inputted to the CNN 3 at the previous time point ti-i directly before the data pc(ti) of the current scan.
The current tensor epc(ti) may be stored as the newest tensor in the optional second buffer 4b (shown in the dashed box on the left of Figure 6). Therefore, at the directly subsequent time point ti+i directly after the current time point ti, one or more convolutional operations are performed on the basis of the directly subsequent tensor epc(ti+i) originating from the data of the scan provided by the 2D or 3D sensor at the directly subsequent time point ti+i and the current tensor epc(ti) stored as the newest tensor in the optional second buffer 4b at the directly subsequent time point ti+i.
According to Figure 6 the temporal size of the optional second buffer 4b equals one (“1”), because the optional second buffer 4b may store the tensor input to the CNN 3 for only one time point respectively one scan. As a result, when the current tensor epc(ti) is stored at the current time point ti as the newest tensor in the second buffer 4b, the previous tensor epc(ti-i), which was stored as the newest tensor in the second buffer 4b at the previous time point ti-i directly before the current time point ti, is simultaneously dropped respectively deleted at the current time point ti (shown in the dashed box on the left of Figure 6).
The temporal size of the optional second buffer 4b may alternatively correspond to a number greater than one (“1”) and, thus, the second buffer 4b may be configured to store the tensor input to the CNN 3 for more than one time point, i.e. for two or more directly subsequent time points, respectively for more than one scan, i.e. for two or more directly subsequent scans.
In an embodiment of the present invention, one or more of the first buffers 4a and the optional second buffer 4b have a different temporal size. Alternatively, the first buffers 4a and the optional second buffer 4b may have the same temporal size.
Figure 7 shows a grid G of a local navigational frame LNF with a location of an ego-vehicle according to an embodiment of the invention for two directly subsequent time points.
The ego-vehicle (not shown in Figure 7) may correspond to the ego-vehicle 5 of Figure 2. The above description with respect to the device according to the first aspect of the invention and its implementation forms and the above description of the ego-vehicle according to the third aspect of the invention and its implementation forms are correspondingly valid for the description of Figure 7, in particular for the description of the ego-vehicle 5 of Figure 7.
In the following reference is made to the ego-vehicle 5 shown in Figure 2. The ego-vehicle 5 may be configured for autonomous movement. In particular, the device 1 of the ego-vehicle 5 is configured to control autonomous movement of the ego-vehicle 5 on the basis of the processing of the data stream of scans provided by the one or more 2D or 3D sensors 2 configured to measure distance and arranged on the ego-vehicle 5.
During the autonomous movement of the ego-vehicle 5, detection in the local navigational frame LNF is performed by the one or more 2D or 3D sensors 2. The local navigational frame LNF shown in Figure 7 is a coordinate system of two spatial dimensions x, y with top-down view on the ground surface, which is relative to a position of the ego-vehicle 5. That is, the location respectively position of the ego-vehicle 5 is within the local navigational frame LNF. The terms “location” and “position” may be used as synonyms.
In the local navigational frame LNF, the data (e.g. a point cloud) of a current scan (of the data stream of scans provided by the one or more 2D or 3D sensors 2) is registered according to the current location of the ego-vehicle 5. The region of interest (ROI) for processing the data stream of scans provided by the one or more 2D or 3D sensors 2 moves with the location of the ego-vehicle 5 in the local navigational frame LNF, when the ego-vehicle 5 is moving. In particular, the ROI corresponds to a predefined regular grid respectively area of the local navigational frame LNF with the ego-vehicle 5 in the center, wherein the regular grid is composed of a plurality of cells. Thus, Figure 7 shows the grid Gi (area) of the current ROI Ri at a current time point ti and the grid Gi-i (area) of the previous ROI Ri-i of a previous time point ti-i that is directly before the current time point ti. As an example, the ROI may be chosen to correspond to a grid in the local navigational frame LNF of size 140 m x 140 m. The cells may each be of size 0.5 m x 0.5 m. In particular, the ROI may be chosen to correspond to a grid in the local navigational frame LNF of size between 70 m x 70 m and 250 m x 250 m. In particular, the cells may each be of size between 0.05 m x 0.05 m and 0.5 m x 0.5 m.
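For the example values above, the spatial resolution of the ROI grid follows directly from the ROI extent and the cell size (a minimal arithmetic sketch; the function name is an assumption of this illustration):

def roi_cells_per_side(roi_size_m, cell_size_m):
    # e.g. a 140 m x 140 m ROI with 0.5 m x 0.5 m cells gives a 280 x 280 grid
    return int(round(roi_size_m / cell_size_m))

assert roi_cells_per_side(140.0, 0.5) == 280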
According to Figure 7, the data of a current scan (e.g. all points of a point cloud, such as a LIDAR point cloud in case of the 2D or 3D sensor being a LIDAR sensor) are mapped into the regular grid G of the local navigational frame LNF. That is, the device 1 is configured to generate on the basis of the data of the current scan and the current location data pxi and pyi of the ego-vehicle in the grid G of the local navigational frame LNF the current tensor input to the CNN by performing a change of coordinates of the data of the current scan into the local navigational frame LNF. The current location data pxi and pyi describe the current location pi respectively position of the ego-vehicle 5 within the grid G of the local navigational frame LNF at a current time point ti. Thus, the tensors stored in the buffers of the CNN 3 correspond to the cells of the regular grid G of the local navigational frame LNF.
The grid Gi (area) of Figure 7 corresponds to a current tensor that is stored in a corresponding buffer of the CNN 3 as the newest tensor at the current time point ti. The grid Gi corresponds to the predefined grid (area) of the current ROI Ri at the current time point ti. In Figure 7, the grid Gi is made of the plain cells without any pattern and the cells with a dotted pattern within the corresponding bold frame. The grid Gi-i of Figure 7 corresponds to a previous tensor that is stored in the same buffer of the CNN 3 as the newest tensor at a previous time point ti-i directly before the current time point ti. The grid Gi-i corresponds to the predefined grid (area) of the previous ROI Ri-i at the previous time point ti-i. In Figure 7, the grid Gi-i is made of the cells with a diagonally striped pattern and the plain cells without any pattern within the corresponding bold frame.
In (the grid G of) the local navigational frame LNF the grid Gi of the current time point ti is not congruent to the grid Gi-i of the previous time point ti-i, because the ego-vehicle 5 has moved from the previous location pi-i of the previous time point ti-i to the current location pi of the current time point ti. That is, at the previous time point ti-i the ego-vehicle 5 was at the previous location pi-i and at the current time point ti the ego-vehicle 5 is at the current position pi. As a result of this movement, the ROI changes from the previous ROI Ri-i to the current ROI Ri. The previous location pi-i of the ego-vehicle 5 is described by the previous location data pxi-i, pyi-i and the current location pi of the ego-vehicle 5 is described by the current location data pxi, pyi.
In case the current location data pxi, pyi of the ego-vehicle 5 do not match the previous location data pxi-i, pyi-i inputted together with the data of the previous scan, in each buffer of the CNN the newest tensor being stored is padded and cropped. That is, the device 1 of the ego-vehicle 5 may be configured to input, together with the data of the current scan, current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF to the CNN 3 (the CNN is not shown in Figure 7), and pad and crop (in particular zero-pad and crop) in each buffer of the CNN the newest tensor being stored, in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-i, pyi-i inputted together with the data of the previous scan.
Thus, according to Figure 7, the plain cells within the bold frame correspond to the data of the newest tensor being stored in a buffer that may be re-used despite the location change of the ego-vehicle 5 and, thus, despite the change of the ROI from the previous ROI Ri-i to the current ROI Ri. The cells with the diagonally striped pattern correspond to the data of the newest tensor stored in the buffer that are cropped. The terms “dropped” or “deleted” may be used as synonyms for the term “cropped”. And the cells with the dotted pattern correspond to the data of the newest tensor stored in the buffer that are padded, in particular that are zero-padded. Padding data may be understood as overwriting the values of the data by a predefined value. Thus, zero-padding data may be understood as overwriting the values of the data with zeros.
In case the ego-vehicle 5 moves within the same cell of the grid G (not shown in Figure 7), there is no padding and cropping performed. In other words, in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-i, pyi-i inputted together with the data of the previous scan, but the current location data pxi, pyi and the previous location data pxi-i, pyi-i of the ego-vehicle 5 lie respectively are located within the same cell of the grid G of the local navigational frame LNF, padding and cropping is not performed.
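One possible realisation of the padding and cropping of the newest buffered tensor is to shift its contents by the ego-vehicle's cell offset and zero-fill the cells that newly enter the ROI. The following sketch assumes a [C, H, W] layout with H along y and W along x, and a particular sign convention for the offset; both are assumptions of this illustration and not a literal description of the device 1:

import numpy as np

def pad_and_crop(buffered, dx_cells, dy_cells):
    # Shift the newest buffered tensor of shape [C, H, W] by the cell offset of the
    # ego-vehicle. Cells leaving the ROI are cropped; cells entering it are zero-padded.
    c, h, w = buffered.shape
    shifted = np.zeros_like(buffered)
    src_x = slice(max(0, dx_cells), min(w, w + dx_cells))
    dst_x = slice(max(0, -dx_cells), min(w, w - dx_cells))
    src_y = slice(max(0, dy_cells), min(h, h + dy_cells))
    dst_y = slice(max(0, -dy_cells), min(h, h - dy_cells))
    shifted[:, dst_y, dst_x] = buffered[:, src_y, src_x]
    return shifted

Here dx_cells and dy_cells would be the integer number of grid cells by which the ego-vehicle has moved between the previous and the current time point; if both are zero (movement within the same cell), the returned tensor has the same content as the input, matching the case in which no padding and cropping is performed.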
The device 1 of the ego-vehicle 5 may be configured to generate on the basis of the data of the current scan and the current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF the current tensor that is input to the CNN 3 by performing a change of coordinates of the data of the current scan into the local navigational frame LNF.
Further, the device 1 of the ego-vehicle 5 may be configured to generate on the basis of the data of the current scan and the current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF the current tensor that is input to the CNN 3 by additionally performing a voxelization (not shown in Figure 7).
The location data may be stored in a location field of one or more of the one or more first buffers 4a and the optional second buffer of the CNN, and the location field of the respective one or more buffers is updated (not shown in Figure 7) with the current location data pxi, pyi, in case the current location data pxi, pyi of the ego-vehicle 5 do not match the previous location data pxi-i, pyi-i.
In case the ego-vehicle 5 moves within the same cell of the grid G, there is no updating of the location field of the respective one or more buffers. In other words, in case the current location data pxi, pyi of the ego-vehicle 5 do not match previous location data pxi-i, pyi-i inputted together with the data of the previous scan, but the current location data pxi, pyi and the previous location data pxi-i, pyi-i of the ego-vehicle 5 lie respectively are located within the same cell of the grid G, the location field of the respective one or more buffers is not updated with the current location data pxi, pyi.
Figure 8 shows a diagram of a method according to an embodiment of the invention with respect to storing tensors originating from data of a current scan of a current time point ti in buffers of a CNN according to an embodiment of the invention.
In the following reference is made to the ego-vehicle 5 of Figure 2 and the local navigational frame of Figure 7.
In the first step S81 of the method of Figure 8 the device 1 of the ego-vehicle 5 obtains data of a current scan provided by the one or more 2D or 3D sensors 2 in the form of a current tensor of shape [C, H, W] together with current location data pxi, pyi of the ego-vehicle 5 in the grid G of the local navigational frame LNF at a current time point ti. “C” is the number of channels. The current tensor is generated by performing a change of coordinates of the data of the current scan into the local navigational frame LNF. Thus, “H” is the size of the current tensor along the y coordinate and “W” is the size of the current tensor along the x coordinate.
In the second step S82 following the first step S81, the device 1 determines whether the current location data pxi, pyi do not match the previous location data pxi-i, pyi-i of the ego-vehicle 5. That is, the device 1 determines whether the location of the ego-vehicle 5 has changed. In particular, the device 1 determines whether the current location data pxi, pyi do not match the previous location data pxi-i, pyi-i of the ego-vehicle 5 and whether the current location data pxi, pyi and the previous location data pxi-i, pyi-i do not lie within the same cell of the grid G of the local navigational frame LNF. That is, the device 1 in particular determines whether the location of the ego-vehicle 5 has changed such that the current location of the ego-vehicle is in a different cell compared to the cell of the previous location. In case the determination of the second step S82 yields a “YES” (the location of the ego-vehicle has changed), the method proceeds to the third step S83. In case the determination of the second step S82 yields a “NO” (the location of the ego-vehicle has not changed or the location of the ego-vehicle is still in the same cell), the method proceeds to the fifth step S85.
In the third step S83, the device 1 pads and crops, in particular zero-pads and crops, in each buffer of the CNN 3 the newest tensor being stored to match the current location data pxi, pyi and, thus, the current ROI Ri. That is, the tensor that is stored at the previous time point ti-i directly before the current time point ti in a respective buffer as the newest tensor and that is still stored at the third step S83 as the newest tensor of the respective buffer is padded and cropped.
In the fourth step S84 following the third step S83, the device 1 updates the location field with the current location data pxi, pyi of the ego-vehicle 5 in the buffers comprising a location field.
In the fifth step S85, the device 1 stores in each buffer the corresponding tensor originating from the data of the current scan as the newest tensor and simultaneously drops respectively deletes the oldest tensor in the buffers that are full. In particular, the device 1 stores in each first buffer 4a the output tensor of the corresponding layer (preceding layer) originating from the data of the current scan as the newest tensor. Moreover, in case the CNN 3 of the device 1 comprises a second buffer, the device 1 stores in the second buffer the current tensor originating from the data of the current scan as the newest tensor.
In particular, the method of Figure 8 is repeated by the device 1 of the ego-vehicle 5 for each new time point respectively new scan.
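Put together, the per-scan handling of Figure 8 could be sketched as follows; the buffer interface (push, slots, location) and the helper pad_and_crop follow the earlier sketches and are assumptions of this illustration, not the literal implementation of the device 1:

def process_scan(buffers, tensors_for_buffers, loc, prev_loc, cell_size):
    # S82: determine whether the ego-vehicle has moved into a different cell of the grid G.
    cell = (int(loc[0] // cell_size), int(loc[1] // cell_size))
    prev_cell = (int(prev_loc[0] // cell_size), int(prev_loc[1] // cell_size))
    if cell != prev_cell:
        dx, dy = cell[0] - prev_cell[0], cell[1] - prev_cell[1]
        for buf in buffers:
            # S83: pad and crop the newest stored tensor to match the current ROI.
            if len(buf.slots) > 0:
                buf.slots[-1] = pad_and_crop(buf.slots[-1], dx, dy)
            # S84: update the location field of buffers that carry one.
            buf.location = loc
    # S85: store the tensor originating from the current scan as the newest tensor;
    # a full buffer drops its oldest tensor automatically (cf. the TemporalBuffer sketch above).
    for buf, tensor in zip(buffers, tensors_for_buffers):
        buf.push(tensor)

Here tensors_for_buffers would contain, per first buffer 4a, the output tensor of its preceding layer for the current scan and, for an optional second buffer 4b, the current input tensor epc(ti).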
Figure 9 shows a diagram of a method of employing a Convolutional Neural Network (CNN) in an inference phase according to an embodiment of the invention for processing a data stream of scans containing spatial information provided by one or more 2D or 3D sensors configured to measure distance. The method of Figure 9 may be performed by the device 1 of Figure 2. The above description of the device according to the first aspect and its implementation forms, the above description of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the method of Figure 9.
According to Figure 9, the method step S101 and the optional method step S102 may be performed on the basis of data pc(ti) of a current scan provided by a 2D or 3D sensor configured to measure distance, such as a LIDAR sensor, at a current time point ti. The 2D or 3D sensor is arranged on an ego-vehicle, such as the ego-vehicle 5 of Figure 2. In the step S101, the current tensor epc(ti), that is input to the Convolutional Neural Network 3 (CNN), is generated on the basis of the data pc(ti) of the current scan and the current location data pxi, pyi of the ego-vehicle in a grid of a local navigational frame, such as the local navigational frame of Figure 7, by performing a change of coordinates of the data pc(ti) into the local navigational frame. On the basis of the data pc(ti) of the current scan and the current location data pxi, pyi of the ego-vehicle in the grid of the local navigational frame, the current tensor epc(ti), that is input to the Convolutional Neural Network 3 (CNN), may be generated by additionally performing a voxelization in the optional method step S102.
The CNN 3 is configured to generate on the basis of the current tensor epc(ti) the current output data OUT(ti), on the basis of which a navigation process of the ego-vehicle may be performed.
The CNN 3 shown in Figure 9 may correspond to any CNN 3 of Figures 2, 3A, 3B, 4, 5 and 6. The above description of the CNN of the device according to the first aspect and its implementation forms, the above description of the CNN of the hardware implementation of a CNN according to the second aspect and its implementation forms as well as the above description of the method according to the fourth aspect and its implementation forms are correspondingly valid for the CNN 3 of Figure 9.
Optionally, in a method step S103, the current output data OUT(ti) may be decoded and a non-maximum suppression may be performed thereon. On the basis of the processing result of the method step S103 detected object boxes may be provided. The size of the data pc(ti) being a point cloud is “M * 4”, wherein “M” is the number of points in the point cloud and the “4” indicates that each point in the point cloud comprises three spatial dimensions (point cloud with three spatial dimensions) and one attribute, such as a scalar reflection brightness attribute.
The size of the current location data pxi, pyi is two (“2”) because the location of the ego-vehicle in the local navigational frame, being a 2-dimensional coordinate system according to the embodiment of Figure 9, is described by two coordinates (x and y coordinate).
The size of the current tensor epc(ti) being an encoded point cloud is “C*H*W”. “C” is the number of channels of the current tensor epc(ti) (number of channels in the encoded point cloud), wherein C is greater than or equal to one (C ≥ 1). “H” is the size of the current tensor epc(ti) along the y coordinate of the local navigational frame and “W” is the size of the current tensor along the x coordinate of the local navigational frame. “H” and “W” define the spatial resolution of the area of the region of interest (ROI) of the grid of the local navigational frame. For example, “H” and “W” may each be 280 cells of the grid of the local navigational frame, such that the ROI corresponds to an area of 140 m x 140 m, in case the cells of the regular grid of the local navigational frame each are of size 0.5 m x 0.5 m.
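A minimal sketch of how a point cloud of size “M * 4” may be voxelized into an encoded tensor of size “C*H*W” over the ROI grid is given below; the two channels chosen here (cell occupancy and mean reflectivity) are assumptions of this illustration and only one of many possible encodings:

import numpy as np

def voxelize(points, roi_origin_xy, cell_size, h, w):
    # points: array of shape [M, 4] with columns x, y, z, reflectivity in the
    # local navigational frame; returns an encoded tensor of shape [2, H, W].
    epc = np.zeros((2, h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    ix = ((points[:, 0] - roi_origin_xy[0]) / cell_size).astype(int)
    iy = ((points[:, 1] - roi_origin_xy[1]) / cell_size).astype(int)
    inside = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)
    for x, y, r in zip(ix[inside], iy[inside], points[inside, 3]):
        epc[0, y, x] = 1.0   # channel 0: cell occupancy
        epc[1, y, x] += r    # channel 1: accumulated reflectivity
        counts[y, x] += 1.0
    # Average the reflectivity per occupied cell.
    epc[1] = np.divide(epc[1], counts, out=np.zeros_like(epc[1]), where=counts > 0)
    return epc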
The size of the output data OUT(ti) is “C2*H*W”, wherein “H” and “W” are described as above. “C2” is the number of channels of the output data OUT(ti), which may be the same as or different from the number of channels (“C”) of the current tensor epc(ti).
“B” is the number of detected objects and “S” is the size of the metadata related to one object.
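The decoding and non-maximum suppression of the optional step S103 is not detailed further above; as a purely illustrative sketch, a standard axis-aligned non-maximum suppression over B candidate boxes could look as follows (the box format and the overlap threshold are assumptions of this illustration):

def iou(a, b):
    # Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring boxes, discarding candidates that overlap a kept box too much.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices of the detected object boxes that are retained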
The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A device (1) for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor (2) configured to measure distance, wherein the device (1) is configured to employ a Convolutional Neural Network (3), CNN, in an inference phase, the CNN comprising a first layer (LI) and one or more further layers (L2a, L2b, L2c) following the first layer (LI) and one or more first buffers (4a) for storing an output tensor of a respective preceding layer (LI; L2a; L2b), wherein the device (1) is configured to input data (pc(ti)) of a current scan provided by the 2D or 3D sensor (2) in the form of a current tensor (epc(ti)) into the CNN (3), perform, at each further layer (L2a; L2b; L2c), one or more convolutional operations on the basis of a current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) originating from the data (pc(ti)) of the current scan and a previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b), the previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b) being a newest tensor stored in the respective first buffer (4a) of the one or more first buffers (4a) and originating from data of a previous scan that was input to the CNN directly before the data (pc(ti)) of the current scan, and store the current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) as the newest tensor in the respective first buffer (4a).
2. The device (1) according to claim 1, wherein the CNN (3) comprises a second buffer (4b) for storing a tensor (epc(ti-i), epc(ti)) input to the CNN (3), and the device (1) is configured to perform, at the first layer (LI), one or more convolutional operations on the basis of the current tensor (epc(ti)) and a previous tensor (epc(ti-i)), generating a current output tensor (al(ti)) of the first layer (LI) originating from the data (pc(ti)) of the current scan, the previous tensor (epc(ti-i)) being the newest tensor stored in the second buffer (4b) and corresponding to the data of the previous scan, and store the current tensor (epc(ti)) as the newest tensor in the second buffer (4b).
3. The device (1) according to claim 1 or 2, wherein each buffer (4a, 4b) is a serial-in parallel-out buffer configured to store a new tensor as newest tensor and, in case the buffer is full, to simultaneously drop the oldest tensor stored in the buffer.
4. The device (1) according to any one of the preceding claims, wherein the temporal size of each of the one or more first buffers (4a) for storing the output tensor of a respective preceding layer (LI, L2a, L2b) is one less than the temporal size of a convolutional kernel of the respective preceding layer (LI, L2a, L2b), and in particular one or more buffers (4a, 4b) have a different temporal size.
5. The device (1) according to any one of the preceding claims, wherein each tensor is a tensor with two or more dimensions, in particular with one or more spatial dimensions and one channel dimension.
6. The device (1) according to any one of the preceding claims, wherein each tensor is a tensor with two dimensions, in particular with one spatial dimension and one channel dimension; a tensor with three dimensions, in particular with two spatial dimensions and one channel dimension; or a tensor with four dimensions, in particular with three spatial dimensions and one channel dimension.
7. The device (1) according to any one of the preceding claims, wherein the channel dimension of each first buffer (4a) for storing the output tensor of a respective preceding layer (LI, L2a, L2b) corresponds to the channel dimension of the respective preceding layer (LI, L2a, L2b).
8. The device (1) according to any one of the preceding claims configured to generate output data for a navigation process of an ego-vehicle (5), on which the 2D or 3D sensor (2) is arranged, wherein the device (1) is configured to input, together with the data (pc(ti)) of the current scan, current location data (pxi, pyi) of the ego-vehicle in a grid (G) of a local navigational frame (LNF) to the CNN (3), and pad and crop (S83) in each buffer (4a, 4b) the newest tensor being stored, in case the current location data (pxi, pyi) of the ego-vehicle do not match previous location data (pxi-i, pyi-i) inputted together with the data of the previous scan.
9. The device (1) according to claim 8, wherein the device is configured to store the location data in a location field of one or more of the one or more first buffers (4a) and the second buffer (4b), and update (S84) the location field of the respective one or more buffers with the current location data in case the current location data (pxi, pyi) of the ego-vehicle do not match the previous location data (pxi-i, pyi-i).
10. The device (1) according to claim 8 or 9, wherein the device (1) is configured to generate (S101) on the basis of the data (pc(ti)) of the current scan and the current location data (pxi, pyi) of the ego-vehicle (5) in the grid (G) of the local navigational frame (LNF) the current tensor input (epc(ti)) to the CNN (3) by performing a change of coordinates of the data (pc(ti)) of the current scan into the local navigational frame (LNF).
11. The device (1) according to claim 10, wherein the device (1) is configured to generate on the basis of the data (pc(ti)) of the current scan and the current location data (pxi, pyi) of the ego-vehicle (5) in the grid (G) of the local navigational frame (LNF) the current tensor input (epc(ti)) to the CNN (3) by additionally performing a voxelization (S102).
12. The device (1) according to any one of the preceding claims, wherein the device (1) is configured to detect targets, in particular moving targets, in the vicinity of the 2D or 3D sensor (2), perform a point cloud semantic segmentation, and/or perform a free space estimation on the basis of the output tensor of the last layer of the one or more further layers of the CNN (3).
13. A device comprising a Convolutional Neural Network (3), CNN, for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor (2) configured to measure distance, the CNN (3) comprising a first layer (LI) and at least one or more further layers (L2a, L2b, L2c) following the first layer (LI) and one or more first buffers (4a) for storing an output tensor of a respective preceding layer (LI; L2a; L2b), wherein the hardware implementation of the CNN is configured to input data (pc(ti)) of a current scan provided by the 2D or 3D sensor (2) in the form of a current tensor (epc(ti)) into the CNN (3), perform, at each further layer (L2a; L2b; L2c), one or more convolutional operations on the basis of a current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) originating from the data (pc(ti)) of the current scan and a previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b), the previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b) being the newest tensor stored in the respective first buffer (4a) of the one or more first buffers (4a) and originating from data of a previous scan that was input to the CNN (3) directly before the data (pc(ti)) of the current scan, and store the current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) as the newest tensor in the respective first buffer (4a).
14. An ego-vehicle (5) comprising one or more 2D or 3D sensors (2) configured to measure distance, and a device (1) according to any one of claims 1 to 12, wherein the one or more 2D or 3D sensors (2) are configured to provide scans containing spatial information of the vicinity of the ego-vehicle (5) in the form of a data stream to the device (1) and the device (1) is configured to process the data stream.
15. The ego-vehicle (5) according to claim 14, wherein the device (1) is configured to control autonomous movement of the ego-vehicle (5) on the basis of the processing of the data stream.
16. A method of employing a Convolutional Neural Network (3), CNN, in an inference phase, for processing a data stream of scans containing spatial information provided by a 2D or 3D sensor (2) configured to measure distance, the CNN (3) comprising a first layer (LI) and at least one or more further layers (L2a, L2b, L2c) following the first layer (LI) and one or more first buffers (4a) for storing an output tensor of a respective preceding layer (LI; L2a; L2b), wherein the method comprises the steps of inputting data (pc(ti)) of a current scan provided by the 2D or 3D sensor (2) in the form of a current tensor (epc(ti)) into the CNN (3), performing, at each further layer (L2a; L2b; L2c), one or more convolutional operations on the basis of a current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) originating from the data (pc(ti)) of the current scan and a previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b), the previous output tensor (al(ti-i); a2(ti-i); a3(ti-i)) of the preceding layer (LI; L2a; L2b) being the newest tensor stored in the respective first buffer (4a) of the one or more first buffers (4a) and originating from data of a previous scan that was input to the CNN (3) directly before the data of the current scan, and storing the current output tensor (al(ti); a2(ti); a3(ti)) of the preceding layer (LI; L2a; L2b) as the newest tensor in the respective first buffer (4a).
17. A computer program comprising program code for performing, when implemented on a processor, a method according to claim 16.
18. A computer comprising a memory and a processor, which are configured to store and execute program code to perform the method according to claim 16.