WO2019137915A1 - Generating input data for a convolutional neuronal network - Google Patents
Generating input data for a convolutional neuronal network
- Publication number
- WO2019137915A1 (PCT/EP2019/050343)
- Authority
- WO
- WIPO (PCT)
Classifications
- G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06T7/10 — Segmentation; Edge detection
- G06V20/58 — Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
- G06V20/64 — Three-dimensional objects
- G06T2207/10016 — Video; Image sequence
- G06T2207/10024 — Color image
- G06T2207/10028 — Range image; Depth image; 3D point clouds
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30248 — Vehicle exterior or interior
- G06T2207/30252 — Vehicle exterior; Vicinity of vehicle
- G06V10/454 — Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
Abstract
The present invention relates to a method for generating input data for a convolutional neuronal network, using at least one camera (3) and at least one range sensor (5, 6), the camera (3) and the range sensor (5, 6) being arranged on an automotive vehicle (1) in such a way that the field of view of the camera (3) at least partially overlaps with the field of view of the range sensor (5, 6), the method comprising the following method steps: - acquiring an image frame with the camera (3), the image frame being comprised of image data for directions relative to the position of the camera (3) and within the solid angle seen by the camera (3), the directions being expressed by coordinates in a camera coordinate system, - simultaneously acquiring depth information with the range sensor (5, 6), the depth information being comprised of depth data for directions relative to the position of the range sensor (5, 6) and within the solid angle seen by the range sensor (5, 6), the directions being expressed by coordinates in a range sensor coordinate system, - providing an automotive vehicle coordinate system which is related to the camera coordinate system and the range sensor coordinate system by respective sets of translations and rotations given by the position of the camera (3) and the position of the range sensor (5, 6) relative to the origin of the automotive vehicle coordinate system, respectively, - transforming the coordinates in the camera coordinate system and the coordinates in the range sensor coordinate system into coordinates in the automotive vehicle coordinate system on the basis of the sets of translations and rotations, yielding the input data for the convolutional neural network. In this way, semantic segmentation of objects in an image in automotive computer vision can be enhanced.
Description
Generating input data for a convolutional neuronal network
The invention relates to a method for generating input data for a convolutional neuronal network using at least one camera and at least one range sensor.
One of the most fundamental problems in automotive computer vision is the semantic segmentation of objects in an image. The segmentation approach refers to the problem of associating every pixel with its corresponding object class. In recent times, there has been a surge of convolutional neural network (CNN) research and design, aided by increases in computational power in computer architectures and the availability of large annotated datasets.
CNNs are highly successful at classification and categorization tasks, but much of the research concerns standard photometric RGB images and is not focused on embedded automotive devices. Automotive hardware devices must meet low power consumption requirements and thus offer limited computational power.
In machine learning, a convolutional neural network is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. Convolutional networks were inspired by biological processes in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage. CNNs have applications in image and video recognition, recommender systems and natural language processing.
The article "Multimodal Deep Learning for Robust RGB-D Object Recognition" by Andreas Eitel, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller, and Wolfram Burgard (IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 2015) proposes an RGB-D architecture for object recognition. This architecture is composed of two separate CNN processing streams - one for each modality - which are consecutively combined with a late fusion network. The focus is on learning with imperfect sensor data, a typical problem in real-world robotics tasks. For accurate learning, a multi-stage training methodology and two crucial ingredients for handling depth data with CNNs are introduced: first, an effective encoding of depth information for CNNs that enables learning without the need for large depth datasets; second, a data augmentation scheme for robust learning with depth images by corrupting them with realistic noise patterns.
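The core idea of such a depth encoding can be sketched in a few lines. The sketch below renders a single-channel depth map as a three-channel image so that a CNN designed for RGB input can process it; the linear blue-to-red ramp and the normalisation range `d_max` are illustrative assumptions, not the cited paper's exact colormap.

```python
import numpy as np

def colorize_depth(depth, d_max=10.0):
    """Map a metric depth map (H, W) to a three-channel uint8 image (H, W, 3).

    The ramp is a simple illustrative choice: near points become blue,
    mid-range points green, far points red.
    """
    norm = np.clip(depth / d_max, 0.0, 1.0)
    r = norm
    g = 1.0 - np.abs(2.0 * norm - 1.0)
    b = 1.0 - norm
    return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)

# A tiny 2x2 depth map with distances in metres.
img = colorize_depth(np.array([[0.0, 5.0],
                               [10.0, 2.5]]))
print(img.shape)   # (2, 2, 3)
```

The point of the encoding is only that the resulting three channels have the same shape and value range as a photometric image, so the same convolutional layers can consume either modality.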
From US 2017/0099200 A1 it is known that data is received characterizing a request for agent computation of sensor data. The request includes a required confidence and required latency for completion of the agent computation. Agents to query are determined based on the required confidence. Data is transmitted to query the determined agents to provide analysis of the sensor data.
It is an objective of the present invention to provide a possibility for enhancing semantic segmentation of objects in an image in automotive computer vision.
This object is addressed by the subject matter of the independent claims. Preferred embodiments are described in the sub claims.
Therefore, the invention provides a method for generating input data for a convolutional neuronal network, using at least one camera and at least one range sensor, the camera and the range sensor being arranged on an automotive vehicle in such a way that the field of view of the camera at least partially overlaps with the field of view of the range sensor, the method comprising the following method steps:
acquiring an image frame with the camera, the image frame being comprised of image data for directions relative to the position of the camera and within the solid angle seen by the camera, the directions being expressed by coordinates in a camera coordinate system,
simultaneously acquiring depth information with the range sensor, the depth information being comprised of depth data for directions relative to the position of the range sensor and within the solid angle seen by the range sensor, the directions being expressed by coordinates in a range sensor coordinate system,
providing an automotive vehicle coordinate system which is related to the camera coordinate system and the range sensor coordinate system by respective sets of translations and rotations given by the position of the camera and the position of the range sensor relative to the origin of the automotive vehicle coordinate system, respectively,
transforming the coordinates in the camera coordinate system and the coordinates in the range sensor coordinate system into coordinates in the automotive vehicle coordinate system on the basis of the sets of translations and rotations, yielding the input data for the convolutional neural network.
Hence, it is an essential idea of the invention that the input data for the convolutional neuronal network comprises both image data and depth data for common viewing directions relative to the origin of the automotive vehicle coordinate system, the directions being expressed with coordinates of the common automotive vehicle coordinate system which serves as a common frame. In other words: The input data for the convolutional neural network is comprised of image data and depth data for directions expressed in the automotive vehicle coordinate system though such data was originally captured and expressed as data in the coordinate system of the camera or the range sensor, respectively. The transformation of this data into the common automotive coordinate system provides for the possibility of using data from different
sensors/cameras in a common data set which is input into the convolutional neural network. Preferably, the camera consecutively acquires image frames and the range sensor consecutively acquires depth information. Preferably, as a last step of the method described before, the generated data set comprised by the depth data and the image data is input into the CNN.
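The transformation step described above can be sketched as follows. The extrinsic pose of each sensor is assumed to be given as a 3x3 rotation matrix plus a translation vector relative to the vehicle origin; the mounting values below are hypothetical.

```python
import numpy as np

def make_extrinsic(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def to_vehicle_frame(points_sensor, sensor_to_vehicle):
    """Transform an (N, 3) array of points from a sensor coordinate system
    into the common automotive vehicle coordinate system."""
    homogeneous = np.hstack([points_sensor, np.ones((len(points_sensor), 1))])
    return (sensor_to_vehicle @ homogeneous.T).T[:, :3]

# Hypothetical camera mounted 1.2 m above the vehicle origin, unrotated.
camera_T = make_extrinsic(np.eye(3), np.array([0.0, 0.0, 1.2]))
points = np.array([[1.0, 0.5, 0.0]])
print(to_vehicle_frame(points, camera_T))   # [[1.  0.5 1.2]]
```

Applying the same function with each sensor's own extrinsic matrix puts all camera and range sensor data into one frame of reference, which is what allows a single data set to be assembled for the CNN.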
According to a preferred embodiment of the invention, the method further comprises the following steps:
expressing the coordinates in the camera coordinate system by a direction cosine matrix, and
expressing the coordinates in the range sensor coordinate system by a direction cosine matrix.
As known to the person skilled in the art, the direction cosines of a vector are the cosines of the angles between the vector and the three coordinate axes. Equivalently, they are the contributions of each component of the basis to a unit vector in that direction. Direction cosines are an analogous extension of the usual notion of slope to higher dimensions; more generally, a direction cosine refers to the cosine of the angle between any two vectors. They are inter alia used for forming direction cosine matrices that express one set of orthonormal basis vectors in terms of another set, or for expressing a known vector in a different basis.
Preferably, the method further comprises the following steps:
expressing the image data by a color value, preferably by an RGB value, for each coordinate triple of the cosine matrix, and
expressing the depth data by a distance value for each coordinate triple of the cosine matrix.
In this way, a data set comprising a color value (as a part of the image frame) and a respective distance value (as a part of a depth map) for multiple directions relative to the origin of the automotive vehicle coordinate system may be input into the CNN and processed therein together.
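One possible layout of such a fused data set is sketched below; the number of directions, the channel ordering, and the sample values are hypothetical.

```python
import numpy as np

# For each common viewing direction (a unit vector in the vehicle coordinate
# system) store an RGB colour value from the camera and a distance value from
# a range sensor, then concatenate them into one multi-channel input row.
directions = np.array([[1.0, 0.0, 0.0],      # straight ahead
                       [0.0, 1.0, 0.0]])     # to the side
rgb = np.array([[255.0, 0.0, 0.0],
                [0.0, 255.0, 0.0]])          # colour values from the camera
distance = np.array([[4.2],
                     [7.5]])                 # distances in metres

cnn_input = np.hstack([directions, rgb, distance])
print(cnn_input.shape)   # (2, 7): x, y, z, R, G, B, distance per direction
```

Because colour and distance are keyed to the same direction triple, the network receives both modalities as channels of one sample rather than as two unrelated inputs.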
In general, different types of cameras may be used. However, according to a preferred embodiment of the invention, the camera is a fish eye camera with a field of view of at least 180°. Further, in general, a single camera may be sufficient for the method according to the invention. However, according to a preferred embodiment of the invention, multiple cameras are used for generating the input data for the convolutional neuronal network. Preferably, these cameras have different fields of view. Even more preferably, these cameras cover the complete surrounding of the automotive vehicle.
Furthermore, preferably multiple range sensors are used for generating the input data for the convolutional neuronal network. In general, these range sensors may be of the same type. However, according to a preferred embodiment of the invention, the range sensors comprise at least two different types of range sensors, preferably at least a LIDAR sensor and at least an ultrasonic sensor. Preferably, these range sensors have
different fields of view. Even more preferably, these range sensors cover the complete surrounding of the automotive vehicle.
The invention also relates to the use of a method as described above in an automotive vehicle, to a sensor arrangement for an automotive vehicle configured for performing such a method, and to a non-transitory computer-readable medium, comprising instructions stored thereon, that when executed on a processor, induce a sensor arrangement of an automotive vehicle to perform such a method.
In the drawings:
Fig. 1 schematically depicts an automotive vehicle with a sensor arrangement capturing an object according to a preferred embodiment of the invention,
Fig. 2 schematically depicts the camera coordinate system and the range
sensor coordinate system according to the preferred embodiment of the invention, and
Fig. 3 schematically depicts the automotive vehicle coordinate system
according to the preferred embodiment of the invention.
As schematically depicted in Fig. 1, according to a preferred embodiment of the invention, in an automotive vehicle 1 a sensor arrangement 2 comprising a camera 3, an evaluation unit 4, an ultrasonic sensor 5, and a LIDAR sensor 6 is provided. As depicted by dashed lines, the camera 3, the ultrasonic sensor 5, and the LIDAR sensor 6 have respective fields of view which overlap with each other. This allows scenes to be captured with image data and depth data, respectively, which may be input into a convolutional neural network incorporated in the evaluation unit 4 for classification of objects like the person 7 in front of the automotive vehicle 1.
By using different types of range sensors 5, 6, i.e. an ultrasonic sensor 5 and a LIDAR sensor 6, it is possible to create multiple input depth maps together with RGB image data for use in a CNN that can detect and classify objects. The application here is to use automotive sensors like the camera 3, the ultrasonic sensor 5 and the LIDAR sensor 6
to create depth information around a vehicle and combine this data with surround view image data. Therefore, such automotive sensors are preferably arranged on all sides of the automotive vehicle 1 in such a way that the complete surrounding of the automotive vehicle can be monitored. For the sake of clarity, the present preferred embodiment of the invention only refers to the three automotive sensors mentioned above as an example.
It is an important aspect of the present preferred embodiment of the invention to encode the range sensors 5, 6, i.e. the ultrasonic sensor 5 and the LIDAR sensor 6, in the same coordinate system as the camera data to create CNN input data that use RGB and multi depths maps together. This input data can then be input to a convolution neural network for classification.
As schematically depicted in Fig. 2, each sensor has its own mechanical coordinate system. Here, due to the two-dimensionality of the figure, only the x-axes and the z-axes are depicted, i.e. xC and zC for the camera 3, xU and zU for the ultrasonic sensor 5, and xL and zL for the LIDAR sensor 6. Further, as schematically depicted in Fig. 3, an automotive vehicle coordinate system is defined as a common reference coordinate system for all automotive sensors 3, 5, 6. The automotive vehicle coordinate system has its origin (0, 0, 0) in the middle of the front portion of the automotive vehicle 1 at street level. With respect to the respective positions of the automotive sensors 3, 5, 6 (and to all other automotive sensors which may be arranged on the automotive vehicle) there exists a set of rotations and translations to define the relationship between each sensor and the automotive vehicle coordinate system. All sensor data can then be translated into the automotive vehicle coordinate system as a common frame of reference and passed into the CNN.
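The per-sensor sets of rotations and translations can be sketched as a small table of mounting poses. All yaw angles and offsets below are hypothetical values chosen for illustration; a real calibration would supply them.

```python
import numpy as np

def rot_z(yaw):
    """Rotation matrix for a yaw angle about the vehicle's z-axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical mounting poses relative to the vehicle origin
# (front middle, street level): (yaw, translation in metres).
mounts = {
    "camera":     (0.0,        np.array([0.0,  0.0, 1.2])),
    "ultrasonic": (np.pi / 6,  np.array([0.3,  0.5, 0.4])),
    "lidar":      (-np.pi / 6, np.array([0.3, -0.5, 0.4])),
}

def sensor_point_to_vehicle(sensor, p_sensor):
    """Rotate, then translate a point from a sensor frame into the vehicle frame."""
    yaw, t = mounts[sensor]
    return rot_z(yaw) @ p_sensor + t

p = sensor_point_to_vehicle("camera", np.array([2.0, 0.0, 0.0]))
print(p)   # [2.  0.  1.2]
```

Extending `mounts` with further entries is all that is needed to bring additional cameras or range sensors into the same common frame.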
In detail, this method according to the present preferred embodiment of the invention is as follows:
With the camera 3, image frames are consecutively acquired, the image frames being comprised of image data for directions relative to the position of the camera 3 and within the solid angle seen by the camera 3, the directions being expressed by coordinates in the camera coordinate system described above. Simultaneously, depth information is acquired with the range sensors 5, 6, i.e. the ultrasonic sensor 5 and the LIDAR sensor 6, the depth information being comprised of depth data for directions relative to the positions of the range sensors 5, 6 and within the solid angles seen by the range sensors 5, 6, the directions being expressed by coordinates in the range sensors' coordinate systems.
As described before, an automotive vehicle coordinate system is provided which is related to the camera coordinate system and the range sensors' coordinate systems by respective sets of translations and rotations given by the position of the camera 3 and the positions of the range sensors 5, 6 relative to the origin of the automotive vehicle coordinate system, respectively. Then, the coordinates in the camera coordinate system and the coordinates in the range sensors' coordinate systems are transformed into coordinates in the automotive vehicle coordinate system on the basis of the sets of translations and rotations. In this way the input data for the convolutional neural network is yielded and input into the CNN for object classification.
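One detail of this step is worth sketching: viewing directions, unlike 3-D points, are affected only by the rotation part of a sensor's extrinsics, since the translation merely moves the ray's origin. The rotation below (90 degrees about z) is an illustrative example.

```python
import numpy as np

def directions_to_vehicle(dirs_sensor, rotation):
    """Rotate (N, 3) unit direction vectors from a sensor frame into the
    vehicle frame and re-normalise against rounding error."""
    d = dirs_sensor @ rotation.T
    return d / np.linalg.norm(d, axis=1, keepdims=True)

# Hypothetical sensor rotated 90 degrees about z relative to the vehicle.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
d = directions_to_vehicle(np.array([[1.0, 0.0, 0.0]]), R)
print(d)   # [[0. 1. 0.]]
```

Once all sensors' directions are expressed this way, directions from the camera and from the range sensors can be matched against each other in the common frame before the colour and distance values are paired.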
According to the preferred embodiment of the invention described here, the coordinates in the camera coordinate system and the coordinates in the range sensor coordinate system are both expressed by a respective direction cosine matrix. Further, the image data is expressed by a color value, i.e. by an RGB value, for each coordinate triple of the cosine matrix, and the depth data is expressed by a distance value for each coordinate triple of the cosine matrix.
In this way, by using image information from the camera 3 together with depth information from the different range sensors 5, 6, semantic segmentation of objects in an image in automotive computer vision can be greatly enhanced.
Reference signs list
1 automotive vehicle
2 sensor arrangement
3 camera
4 evaluation unit
5 ultrasonic sensor
6 LIDAR sensor
7 person
Claims
1. Method for generating input data for a convolutional neuronal network, using at least one camera (3) and at least one range sensor (5, 6), the camera (3) and the range sensor (5, 6) being arranged on an automotive vehicle (1) in such a way that the field of view of the camera (3) at least partially overlaps with the field of view of the range sensor (5, 6), the method comprising the following method steps:
acquiring an image frame with the camera (3), the image frame being comprised of image data for directions relative to the position of the camera (3) and within the solid angle seen by the camera (3), the directions being expressed by coordinates in a camera coordinate system,
simultaneously acquiring depth information with the range sensor (5, 6), the depth information being comprised of depth data for directions relative to the position of the range sensor (5, 6) and within the solid angle seen by the range sensor (5, 6), the directions being expressed by coordinates in a range sensor coordinate system,
providing an automotive vehicle coordinate system which is related to the camera coordinate system and the range sensor coordinate system by respective sets of translations and rotations given by the position of the camera (3) and the position of the range sensor (5, 6) relative to the origin of the automotive vehicle coordinate system, respectively,
transforming the coordinates in the camera coordinate system and the coordinates in the range sensor coordinate system into coordinates in the automotive vehicle coordinate system on the basis of the sets of translations and rotations, yielding the input data for the convolutional neural network.
2. Method according to claim 1, the method further comprising the following steps:
expressing the coordinates in the camera coordinate system by a direction cosine matrix, and
expressing the coordinates in the range sensor coordinate system by a direction cosine matrix.
3. Method according to claim 1 or 2, the method further comprising the following steps:
expressing the image data by a color value, preferably by an RGB value, for each coordinate triple of the cosine matrix, and
expressing the depth data by a distance value for each coordinate triple of the cosine matrix.
4. Method according to any of the preceding claims, wherein the camera (3) is a fish-eye camera with a field of view which is at least 180°.
5. Method according to any of the preceding claims, wherein multiple cameras (3) are used for generating the input data for the convolutional neuronal network.
6. Method according to any of the preceding claims, wherein multiple range sensors (5, 6) are used for generating the input data for the convolutional neuronal network.
7. Method according to claim 6, wherein the range sensors (5, 6) comprise at least two different types of range sensors, preferably at least one LIDAR sensor (6) and at least one ultrasonic sensor (5).
8. Use of the method according to any of the previous claims in an automotive vehicle (1).
9. Sensor arrangement (2) for an automotive vehicle (1) configured for performing the method according to any of claims 1 to 8.
10. Non-transitory computer-readable medium, comprising instructions stored thereon, that when executed on a processor, induce a sensor arrangement (2) of an automotive vehicle (1) to perform the method of any of claims 1 to 8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102018100315.3A DE102018100315A1 (en) | 2018-01-09 | 2018-01-09 | Generating input data for a convolutional neural network |
DE102018100315.3 | 2018-01-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019137915A1 true WO2019137915A1 (en) | 2019-07-18 |
Family
ID=65013693
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2019/050343 WO2019137915A1 (en) | 2018-01-09 | 2019-01-08 | Generating input data for a convolutional neuronal network |
Country Status (2)
Country | Link |
---|---|
DE (1) | DE102018100315A1 (en) |
WO (1) | WO2019137915A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114419323A (en) * | 2022-03-31 | 2022-04-29 | 华东交通大学 | Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method |
CN114882727A (en) * | 2022-03-15 | 2022-08-09 | 深圳市德驰微视技术有限公司 | Parking space detection method based on domain controller, electronic device and storage medium |
WO2023077432A1 (en) * | 2021-11-05 | 2023-05-11 | 深圳市大疆创新科技有限公司 | Movable platform control method and apparatus, and movable platform and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170060254A1 (en) * | 2015-03-03 | 2017-03-02 | Nvidia Corporation | Multi-sensor based user interface |
US20170099200A1 (en) | 2015-10-06 | 2017-04-06 | Evolv Technologies, Inc. | Platform for Gathering Real-Time Analysis |
WO2018000039A1 (en) * | 2016-06-29 | 2018-01-04 | Seeing Machines Limited | Camera registration in a multi-camera system |
EP3438776A1 (en) * | 2017-08-04 | 2019-02-06 | Bayerische Motoren Werke Aktiengesellschaft | Method, apparatus and computer program for a vehicle |
EP3438777A1 (en) * | 2017-08-04 | 2019-02-06 | Bayerische Motoren Werke Aktiengesellschaft | Method, apparatus and computer program for a vehicle |
EP3438872A1 (en) * | 2017-08-04 | 2019-02-06 | Bayerische Motoren Werke Aktiengesellschaft | Method, apparatus and computer program for a vehicle |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5475768A (en) * | 1993-04-29 | 1995-12-12 | Canon Inc. | High accuracy optical character recognition using neural networks with centroid dithering |
US5642431A (en) * | 1995-06-07 | 1997-06-24 | Massachusetts Institute Of Technology | Network-based system and method for detection of faces and the like |
WO2016145379A1 (en) * | 2015-03-12 | 2016-09-15 | William Marsh Rice University | Automated Compilation of Probabilistic Task Description into Executable Neural Network Specification |
US9633282B2 (en) * | 2015-07-30 | 2017-04-25 | Xerox Corporation | Cross-trained convolutional neural networks using multimodal images |
AU2016374520C1 (en) * | 2015-12-14 | 2020-10-15 | Motion Metrics International Corp. | Method and apparatus for identifying fragmented material portions within an image |
WO2017156243A1 (en) * | 2016-03-11 | 2017-09-14 | Siemens Aktiengesellschaft | Deep-learning based feature mining for 2.5d sensing image search |
- 2018-01-09: DE DE102018100315.3A patent DE102018100315A1 (active, Pending)
- 2019-01-08: WO PCT/EP2019/050343 patent WO2019137915A1 (active, Application Filing)
Non-Patent Citations (2)
Title |
---|
ANDREAS EITEL; JOST TOBIAS SPRINGENBERG; LUCIANO SPINELLO; MARTIN RIEDMILLER; WOLFRAM BURGARD: "Multimodal Deep Learning for Robust RGB-D Object Recognition", IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015 |
ANONYMOUS: "PlanetPhysics/Direction Cosines - Wikiversity", 1 July 2015 (2015-07-01), XP055558585, Retrieved from the Internet <URL:https://en.wikiversity.org/w/index.php?title=PlanetPhysics/Direction_Cosines&oldid=1402343> [retrieved on 20190219] * |
Also Published As
Publication number | Publication date |
---|---|
DE102018100315A1 (en) | 2019-07-11 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19700354; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19700354; Country of ref document: EP; Kind code of ref document: A1 |