GB2606339A - Motion prediction with ego motion compensation and consideration of occluded objects - Google Patents

Motion prediction with ego motion compensation and consideration of occluded objects

Info

Publication number
GB2606339A
GB2606339A, GB2105220.4A, GB202105220A
Authority
GB
United Kingdom
Prior art keywords
vehicle
motion
data
ego
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB2105220.4A
Other versions
GB202105220D0 (en)
Inventor
Meng Yan
Lippe Phillip
Dao David
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mercedes Benz Group AG
Original Assignee
Mercedes Benz Group AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mercedes Benz Group AG filed Critical Mercedes Benz Group AG
Priority to GB2105220.4A priority Critical patent/GB2606339A/en
Publication of GB202105220D0 publication Critical patent/GB202105220D0/en
Publication of GB2606339A publication Critical patent/GB2606339A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00 - Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units, or advanced driver assistance systems for ensuring comfort, stability and safety or drive control systems for propelling or retarding the vehicle
    • B60W30/08 - Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/095 - Predicting travel path or likelihood of collision
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30 - Map- or contour-matching
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G06F18/24137 - Distances to cluster centroïds
    • G06F18/2414 - Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/0895 - Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 - Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001 - Planning or execution of driving tasks
    • B60W60/0027 - Planning or execution of driving tasks using trajectory prediction for other traffic participants
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks

Abstract

The invention relates to a method for prediction of a motion of an object (7) in the environment of a vehicle (1), the method comprising: collecting sensor data of the object (7) in the environment of the vehicle (1) by at least one vehicle sensor and predicting the motion of the object (7) on the basis of the sensor data by a self-learning system (6), characterized by inputting occlusion data (8, 13) into the self-learning system (6), wherein the occlusion data (8, 13) refer to occluded objects of the environment, which are hidden for the at least one vehicle sensor and/or inputting ego motion data (16) into the self-learning system (6), wherein the ego motion data (16) refer to an ego motion of the vehicle (1), so that the motion of the object (7) is predicted taking into account the occluded objects and/or the ego motion of the vehicle (1). Also provided is a method of controlling an autonomous vehicle based on the prediction.

Description

MOTION PREDICTION WITH EGO MOTION COMPENSATION AND CONSIDERATION OF OCCLUDED OBJECTS
FIELD OF THE INVENTION
[0001] The invention relates to a method for prediction of a motion of an object in the environment of a vehicle, the method comprising collecting sensor data of the object in the environment of the vehicle by at least one vehicle sensor and predicting the motion of the object on the basis of the sensor data by a self-learning system. Furthermore, the present invention relates to a method of controlling an autonomous driving vehicle on the basis of such prediction. Additionally, the present invention relates to a device for prediction of a motion of an object in the environment of a vehicle, wherein the device comprises at least one vehicle sensor for collecting sensor data of the object in the environment of the vehicle and a self-learning system for predicting the motion of the object on the basis of the sensor data.
BACKGROUND INFORMATION
[0002] An autonomous driving vehicle has to predict the future trajectories of other traffic participants to plan its own safe and comfortable route. For this purpose a deep learning-based system with sensor inputs can be used to tackle this motion prediction challenge. More specifically, the present invention focuses on solving two challenging problems in motion prediction. First, on-board sensors on autonomous driving vehicles, such as cameras, lidars and radars, can only partially observe the surrounding environment due to the limitation of the sensor ranges and occlusions from other objects. Therefore, a first problem to solve is how to predict the motion of those fully or partially occluded objects. Since motion prediction has to be conducted while the ego vehicle is moving, the second problem is how to compensate for the ego motion while predicting the motion of surrounding objects.
[0003] Document US 2019/0049970 A1 discloses object motion prediction and autonomous vehicle control. A computer-implemented method includes obtaining state data indicative of at least a current or a past state of an object that is within a surrounding environment of an autonomous vehicle. The method includes obtaining data associated with a geographic area in which the object is located. The method includes generating a combined data set associated with the object based at least in part on a fusion of the state data and the data associated with the geographic area in which the object is located. The method includes obtaining data indicative of a machine-learned model. Furthermore, the method includes inputting the combined data set into the machine-learned model. An output from the machine-learned model is received, wherein the output can be indicative of a plurality of predicted trajectories of the object.
[0004] Moreover, document US 2019/0025841 A1 discloses systems and methods for predicting the future locations of objects that are perceived by autonomous vehicles. An autonomous vehicle can include a prediction system that, for each object perceived by the autonomous vehicle, generates one or more potential goals, selects one or more of the potential goals, and develops one or more trajectories by which the object can achieve the one or more selected goals. The prediction systems and methods described can include or leverage one or more machine-learned models that assist in predicting the future locations of the objects. The prediction system may include a machine-learned static object classifier, a machine-learned goal-scoring model, a machine-learned trajectory development model, a machine-learned ballistic quality classifier, and/or other machine-learned models. The use of machine-learned models can improve the speed, quality, and/or accuracy of the generated predictions.
[0005] The above disclosures just mention that in general machine-learning systems and methods can be applied for predicting the future locations of objects that are perceived by autonomous vehicles. Neither of them includes any detailed machine-learning design models or methods that can solve the specific multi-agent multi-modal problems in a highly dynamic environment.
SUMMARY OF THE INVENTION
[0006] The object of the present invention is to provide a method and a device for motion prediction that deal with the movement of the ego vehicle and with the full or partial occlusion of other objects from the view of the current sensors on the autonomous driving vehicle.
[0007] This object is solved by a method according to claim 1 and a device according to claim 7. Further favorable developments are defined in the subclaims.
[0008] Accordingly, there is provided a method for prediction of a motion of an object in the environment of a vehicle, the method comprising: collecting sensor data of the object in the environment of the vehicle by at least one vehicle sensor and predicting the motion of the object on the basis of the sensor data by a self-learning system, characterized by inputting occlusion data into the self-learning system, wherein the occlusion data refer to occluded objects of the environment, which are hidden for the at least one vehicle sensor, and/or inputting ego motion data into the self-learning system, wherein the ego motion data refer to an ego motion of the vehicle, so that the motion of the object is predicted taking into account the occluded objects and/or the ego motion of the vehicle.
[0009] In other words, in a first step sensor data of the object are collected by the sensors of the vehicle, which is preferably an autonomous driving vehicle. In a very simple embodiment only one vehicle sensor is used for collecting the sensor data. However, usually a plurality of vehicle sensors is used, based for example on video, radar and ultrasonic techniques. The entire sensor data may be processed by a specific processor of the vehicle. In one embodiment the collected sensor data of the different techniques are correlated in order to improve the quality of the entirety of the data.
[0010] In a following step the motion of the object is predicted by using the sensor data. Thus, the sensor data may be used in a raw form or in a preprocessed form in order to predict the motion of the object. Specifically, the prediction can be calculated by a self-learning system. This means that the results of the system improve with each learned data set. Specifically, the prediction of the motion can be optimized individually by the learned data sets.
[0011] Occlusion data are input into the self-learning system. Occlusion data describe objects in an area in the environment of the vehicle, wherefrom the sensors of the vehicle do not receive direct signals. For instance, a car in the foreground can obscure a sidewalk in the background. In this case occlusion data of the sidewalk should be gathered. For instance, a top view of the environment is generated by the prediction system. This top view may show a part of a sidewalk being hidden by the car. Therefore, if a pedestrian walks on the sidewalk and disappears behind the car, there is a high probability that he will follow the sidewalk and appear again after passing the car. In this case the occlusion data refer to the sidewalk behind the car which is not observable by the vehicle sensors. Thus, the occlusion data refer to occluded objects of the environment which are hidden for at least one of the sensors.
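By way of illustration only, the following sketch shows one possible way such an occlusion map could be derived from an ego-centred occupancy grid by casting rays from the ego position; the function name, grid layout and the Python/numpy formulation are assumptions made for this example and are not prescribed by the embodiment.

```python
import numpy as np

def occlusion_map(occupancy: np.ndarray, ego_rc: tuple, n_rays: int = 720) -> np.ndarray:
    """Mark grid cells hidden from the ego position by occupied cells.

    occupancy : 2D array, 1 where a detected object occupies the cell.
    ego_rc    : (row, col) of the ego vehicle in the grid.
    Returns a 2D array, 1 where the cell lies in the 'shadow' of an object.
    """
    h, w = occupancy.shape
    occluded = np.zeros_like(occupancy)
    r0, c0 = ego_rc
    max_range = int(np.hypot(h, w))
    for angle in np.linspace(0.0, 2 * np.pi, n_rays, endpoint=False):
        blocked = False
        for step in range(1, max_range):
            r = int(round(r0 + step * np.sin(angle)))
            c = int(round(c0 + step * np.cos(angle)))
            if not (0 <= r < h and 0 <= c < w):
                break
            if blocked:
                occluded[r, c] = 1          # cell lies behind an obstacle
            elif occupancy[r, c]:
                blocked = True              # first hit starts the shadow
    return occluded
```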
[0012] Alternatively or additionally, ego motion data may be input into the self-learning system. This means that motion data of the own vehicle are provided and input into the self-learning system. Such ego motion data may be obtained from the control systems of the vehicle. The ego motion data may be updated continuously, so that current motion data can be provided in a specific memory, for instance.
[0013] As a result, the motion of the object is predicted, wherein the output of the prediction may consist of tracked and recovered objects including respective proposed trajectories. The prediction takes into account occluded objects in one alternative. According to another alternative the ego motion of the vehicle is considered for the prediction. In a preferred embodiment both alternatives are combined.
[0014] In a preferred embodiment the system may comprise a deep learning neural network. However, the self-learning system may also include other learning algorithms.
[0015] In another preferred embodiment the occlusion data is input into a first layer of the neural network and the ego motion data is input into a second layer of the neural network, wherein the second layer is scaled lower than the first layer. Specifically, the first layer may be the very first layer of the neural network, and the second layer may be a deeper layer, for example the third layer of the neural network. Thus, the prediction may be conditioned on an input action. Specifically, the ego motion (state and action) can be added to a predefined layer or feature map of the neural network.
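As a non-binding sketch, the following PyTorch-style fragment illustrates how occlusion/occupancy channels could enter the very first layer while the ego state and action are broadcast as extra channels onto a deeper, lower-scale feature map; the layer sizes, module names and the choice of framework are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderWithEgoInjection(nn.Module):
    """Occlusion/occupancy maps enter at the first layer; the ego state and
    action are broadcast as extra channels onto a deeper, lower-scale feature map."""

    def __init__(self, in_ch: int = 2, ego_dim: int = 3):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        # the third block receives the feature map plus the ego channels
        self.block3 = nn.Sequential(nn.Conv2d(32 + ego_dim, 64, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, grid: torch.Tensor, ego: torch.Tensor) -> torch.Tensor:
        # grid: (B, in_ch, H, W) occupancy + occlusion channels
        # ego:  (B, ego_dim), e.g. velocity, yaw rate, planned acceleration
        f = self.block2(self.block1(grid))
        b, _, h, w = f.shape
        ego_maps = ego.view(b, -1, 1, 1).expand(b, ego.shape[1], h, w)
        return self.block3(torch.cat([f, ego_maps], dim=1))
```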
[0016] In a further embodiment a long-term prediction is performed, wherein a keep-alive function is used to reduce the input noise.
[0017] According to a still further embodiment the ego motion is compensated by transforming output images with regard to an ego vehicle velocity and yaw rate.
[0018] The above object is also solved by a method of controlling an autonomous driving vehicle on the basis of predicting a motion of an object in the environment of the vehicle according to the above-described methods. Specifically, the motion prediction and the prediction of future trajectories of other traffic participants can be used to plan a safe and comfortable route of the autonomous vehicle.
[0019] Additionally, the above object is also solved by a device for prediction of a motion of an object in the environment of a vehicle, the device comprising: at least one vehicle sensor for collecting sensor data of the object in the environment of the vehicle and a self-learning system for predicting the motion of the object on the basis of the sensor data, characterized by input means for inputting occlusion data into the self-learning system, wherein the occlusion data refer to occluded objects of the environment, which are hidden for at least one vehicle sensor and/or for inputting ego motion data into the self-learning system, wherein the ego motion data refer to an ego motion of the vehicle, so that the motion of the object is predictable taking into account the occluded objects and/or the ego motion of the vehicle.
[0020] The above passages describe advantages and modifications of the inventive method. These advantages and modifications may also apply to the inventive device.
[0021] Further advantages, features, and details of the invention derive from the following description of preferred embodiments as well as from the drawings. The features and feature combinations previously mentioned in the description as well as the features and feature combinations mentioned in the following description of the figures and/or shown in the figures alone can be employed not only in the respectively indicated combination but also in any other combination or taken alone without leaving the scope of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] The novel features and characteristic of the disclosure are set forth in the appended claims. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and together with the description, serve to explain the disclosed principles. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of system and/or methods in accordance with embodiments of the present subject matter are now described below, by way of example only, and with reference to the accompanying figures.
[0023] The drawings show in: [0024] Fig. 1 a mapping of a 3D-view to a 2D-top-view grid map-based image.
[0025] Fig. 2 a velocity map; [0026] Fig. 3 a semantic map; [0027] Fig. 4 a horizon map; [0028] Fig. 5 an overall prediction system architecture; [0029] Fig. 6 a proposed deep neural network architecture; [0030] Fig. 7 a structure of a dense block with four delated convolutions increasing their rate exponentially over depth.
[0031] Fig. 8 a training sequence; and [0032] Fig. 9 a generator predicting the next frame yt+i based on the previous frame xt.
[0033] In the figures the same elements or elements having the same function are indicated by the same reference signs.
DETAILED DESCRIPTION
[0034] In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration". Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
[0035] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawing and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0036] The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion so that a setup, device or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or device or method. In other words, one or more elements in a system or apparatus preceded by "comprises" or "comprise" does not or do not, without more constraints, preclude the existence of other elements or additional elements in the system or method.
[0037] In the following detailed description of the embodiment of the disclosure, reference is made to the accompanying drawing that forms part hereof, and in which is shown by way of illustration a specific embodiment in which the disclosure may be practiced. This embodiment is described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.
[0038] In a specific embodiment of the present invention a deep learning-based prediction system may be used for an autonomous driving vehicle. In order to plan a safe, comfortable and efficient future behavior, the ego autonomous driving vehicle needs to predict the future behaviors of other traffic participants around it. The prediction system may also be used for other, non-autonomous driving vehicle systems, such as a driver assistance system. The traffic participants are also called objects in the following description.
[0039] In a preprocessing step, for better handling of the overall environment around the ego vehicle, all raw sensor inputs may be mapped from the 3D world to 2D top-view images as shown in Fig. 1. The left drawing in Fig. 1 shows the 3D world and the right drawing is the corresponding 2D top-view image. The autonomous car, i.e., the ego vehicle 1, is mapped to a rectangle in the 2D view. The oncoming truck 2 and the car 3 in front of the ego vehicle 1 are also mapped to rectangles in the 2D view. The rectangles are positioned at the respective locations on the road 4.
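A minimal sketch of such a mapping is given below, assuming objects are available as 2D outline points in a world frame and the grid is centred on the ego vehicle; the function name, resolution and grid size are illustrative assumptions.

```python
import numpy as np

def world_to_grid(points_xy: np.ndarray, ego_pose: tuple,
                  resolution: float = 0.1, grid_size: int = 400) -> np.ndarray:
    """Rasterise 2D object outline points (world frame, metres) into an
    ego-centred top-view grid. ego_pose = (x, y, yaw) of the ego vehicle."""
    x0, y0, yaw = ego_pose
    c, s = np.cos(-yaw), np.sin(-yaw)
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    for x, y in points_xy:
        # express the point in the ego frame, then in grid cells (ego at the centre)
        dx, dy = x - x0, y - y0
        ex, ey = c * dx - s * dy, s * dx + c * dy
        row = int(grid_size / 2 - ex / resolution)
        col = int(grid_size / 2 + ey / resolution)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            grid[row, col] = 1
    return grid
```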
[0040] The autonomous driving vehicle 1 may capture input images with on-board sensors such as cameras, lidars and radars. The on-board sensors deliver raw data in multiple channels as shown in Figures 2 to 4. Specifically, Fig. 2 shows a velocity map, Fig. 3 a semantic map and Fig. 4 a horizon map. Each map shows the ego vehicle 1 in the center. The raw images from the sensors may be preprocessed and fused together into a top-view grid-map-based image with multiple channels.
[0041] As shown in Fig. 2, the dynamic objects may be encoded by their velocity and orientation using a color wheel. In this figure the static objects are represented by black color, so that all detected objects are summarized. The semantic map of Fig. 3 classifies the pixels into a plurality of object classes, for example using different colors. The horizon map of Fig. 4 defines the structure of the road to give an overview of where the vehicle can drive.
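As an illustrative sketch (not part of the original disclosure), such a colour-wheel encoding of per-cell velocity might be realised as follows, assuming the heading drives the hue and the speed the brightness, so that static detections remain black:

```python
import colorsys
import numpy as np

def velocity_map(speed: np.ndarray, heading: np.ndarray, occupied: np.ndarray,
                 v_max: float = 20.0) -> np.ndarray:
    """Encode per-cell velocity as RGB: hue = heading, brightness = speed.
    speed, heading, occupied are HxW grids; static detections stay black."""
    h, w = speed.shape
    img = np.zeros((h, w, 3), dtype=np.float32)
    for r in range(h):
        for c in range(w):
            if occupied[r, c]:
                hue = (heading[r, c] % (2 * np.pi)) / (2 * np.pi)  # orientation on the colour wheel
                val = min(speed[r, c] / v_max, 1.0)                # speed ~ 0 stays black
                img[r, c] = colorsys.hsv_to_rgb(hue, 1.0, val)
    return img
```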
[0042] Fig. 5 shows the overall network-based prediction system architecture. The network system proposed here is built to perform both tracking and prediction of surrounding agents or objects. An input frame 5 for the prediction system 6 shows the ego vehicle 1. Furthermore, it shows a plurality of objects and their occlusions in the form of shadows from the view of the ego vehicle. Specifically, there is an object 7 in the center of the frame 5, which may be a car. The area behind the car from the view of the ego vehicle 1 is an occlusion area 8. Sensors of the ego vehicle 1 cannot detect objects in this occlusion area 8, since they are obscured by the object 7.
[0043] To predict more than one future frame, the output frame 9 may be fed back into the network to obtain further predictions. Past sequences of the grid-map-based image frames 5 are given to the prediction system 6 as inputs. These frames contain occlusion areas 8 as black areas, occupancies (e.g. objects like cars 7) as white areas and the horizon map (compare Fig. 4) as background, for instance. The output of the prediction system 6 may consist of the tracked and recovered objects 7 including the proposed trajectories 10 the objects may take. In the complex intersection of the output frame 9 in Fig. 5 the pedestrian (trajectories 11) will cross the street while the cars 7 (trajectories 10) have to stop. However, car 12 at the top will continue driving.
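A minimal sketch of this closed-loop rollout is given below, assuming the network is wrapped as a callable mapping (frame, ego action, recurrent state) to (next frame, new state); the names and the PyTorch framing are assumptions for this example.

```python
import torch

@torch.no_grad()
def rollout(model, past_frames, past_actions, future_actions):
    """model: callable (frame, action, state) -> (next_frame, state).

    past_frames / past_actions  : observed grid-map frames and the ego actions taken.
    future_actions              : planned ego actions for the prediction horizon.
    """
    state, frame = None, None
    for obs, act in zip(past_frames, past_actions):   # warm up the recurrent state on observations
        frame, state = model(obs, act, state)
    predictions = []
    for act in future_actions:                        # feed each prediction back as the next input
        frame, state = model(frame, act, state)
        predictions.append(frame)
    return predictions
```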
[0044] Fig. 6 shows an example of a proposed network architecture. The input to the network consists of an occupancy and occlusion map 13. In addition, all objects in the traffic environment are constrained to a road structure including lanes. In the specific example the network consists of an encoder-decoder structure with four ConvLSTM (convolutional long short-term memory) layers. The feature maps are represented by rectangles in Fig. 6, on which different layers are applied. A dense block 15 with dilated convolutions as shown in Fig. 7 is used on the lowest scale to maximize the receptive field and to get better recognition of interactions over long distances. Since the predictions are conditioned on a certain input action, the ego vehicle's state and action are added as an additional channel 16 to the third feature map 14. In addition, residual connections and deconvolutions are used in the decoder, which ends up in a one-channel prediction 17 (e.g. related to occupancy) of the input resolution.
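The ConvLSTM layers are not given in code form in the text; a minimal ConvLSTM cell as commonly described in the literature might look like the following sketch, with illustrative channel counts and kernel size.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: the LSTM gates are computed with
    convolutions so that the hidden state keeps its spatial layout."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        b, _, h, w = x.shape
        if state is None:
            state = (x.new_zeros(b, self.hid_ch, h, w),
                     x.new_zeros(b, self.hid_ch, h, w))
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c)
        return h_new, (h_new, c)
```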
[0045] The dense block 15 shown in Fig. 7 may comprise the following convolutional steps:
- Step 151: dilated convolution 3x3, rate 1
- Step 152: convolution 1x1, stride 1
- Step 153: dilated convolution 3x3, rate 2
- Step 154: convolution 1x1, stride 1
- Step 155: dilated convolution 3x3, rate 4
- Step 156: convolution 1x1, stride 1
- Step 157: dilated convolution 3x3, rate 8
- Step 158: convolution 1x1, stride 1
[0046] In the present example the dense block 15 has four dilated convolutions increasing their rate exponentially over depth. All layers are connected with each other in a feed-forward manner, while 1x1 convolutions are used for reducing the channel size.
[0047] The convolutional steps of Fig. 7 should be understood as an example only. The number and kind of convolutional steps may vary.
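Under the assumption of the step sequence listed above, one possible realisation of such a dilated dense block is sketched below; the growth and channel sizes are illustrative and not taken from the embodiment.

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Four 3x3 dilated convolutions with rates 1, 2, 4, 8; each stage sees the
    concatenation of all previous outputs, and 1x1 convolutions keep channels small."""

    def __init__(self, channels: int, growth: int = 32):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = channels
        for rate in (1, 2, 4, 8):
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=rate, dilation=rate),
                nn.ReLU(),
                nn.Conv2d(growth, growth, 1),   # 1x1 convolution reduces/limits channel size
                nn.ReLU(),
            ))
            in_ch += growth                      # dense (feed-forward) connectivity
        self.out = nn.Conv2d(in_ch, channels, 1)

    def forward(self, x):
        feats = [x]
        for stage in self.stages:
            feats.append(stage(torch.cat(feats, dim=1)))
        return self.out(torch.cat(feats, dim=1))
```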
[0048] In the following, the functions of ego motion compensation and long-term prediction will be explained. When the ego vehicle moves, the predicted object motions have to account for the motion of the ego vehicle, because the ego vehicle is centered in every frame. Even though the actions of the ego vehicle are an additional input to the network (as shown in Fig. 6), it is still hard to predict this correlation. For example, assume two cars driving next to the ego vehicle. The ego vehicle slows down and stops. The output would then look as if the two cars were speeding up strongly, although they are not changing their velocities. In addition, all static objects around the ego vehicle would appear to move inversely to its motion, so that the model would have to track many more objects.
[0049] Another important feature of the prediction system is the ability to conduct long-term prediction (compare Fig. 8). The training sequence alternates between giving n=2 ground-truth frames xt+l:t+l+2 and m=2 prediction frames yt+l+2:t+l+4 as input. While predicting further time steps, a keep-alive function 18 is used to reduce the input noise. The output and all hidden states of the network are transformed by a Spatial Transformer Module (STM) based on the ego vehicle motion. A strategy for compensating the ego motion is to transform the output images with regard to the ego vehicle velocity and yaw rate. Standard 2D affine transformation functions are not differentiable, so that a Spatial Transformer Module (STM) is the better solution. The STM applies a point-wise transformation on the image: the sampling position (x, y) in the source image is obtained from the output-grid coordinates (x', y') via (x, y)T = Aθ · (x', y', 1)T. Usually the transformation matrix Aθ is learned by a localization network, but in this case the matrix is already defined. Given Δt as the period duration of a frame, ω as the yaw rate and v as the velocity of the ego vehicle, the transformation matrix is approximately determined by a rotation by ω·Δt combined with a translation of v·Δt (expressed in grid units) along the driving direction, i.e. Aθ ≈ [cos(ω·Δt), -sin(ω·Δt), 0; sin(ω·Δt), cos(ω·Δt), v·Δt]. Not only the output depends on spatial information; the hidden states also have to be transformed with regard to the ego motion so that the model can still track objects. In summary, the STM is applied to the output and to all hidden states of the network.
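A sketch of this ego-motion compensation using a differentiable spatial transformer (torch affine_grid/grid_sample) is given below; the sign conventions, the assumption that the ego vehicle drives "upwards" in the image, and the metre-per-pixel scaling are illustrative assumptions rather than part of the disclosure.

```python
import math
import torch
import torch.nn.functional as F

def ego_motion_warp(frame: torch.Tensor, v: float, yaw_rate: float,
                    dt: float, m_per_px: float) -> torch.Tensor:
    """Differentiably warp an ego-centred frame (B, C, H, W) by the ego motion
    over one frame period dt, using an affine spatial transformer."""
    b, _, h, w = frame.shape
    ang = yaw_rate * dt                        # rotation of the scene by the ego yaw over dt
    # translation in normalised image coordinates (ego assumed to drive 'up' in the image)
    ty = (v * dt) / (m_per_px * h / 2)
    theta = torch.tensor([[math.cos(ang), -math.sin(ang), 0.0],
                          [math.sin(ang),  math.cos(ang), ty]],
                         dtype=frame.dtype, device=frame.device)
    theta = theta.unsqueeze(0).expand(b, 2, 3)
    grid = F.affine_grid(theta, frame.shape, align_corners=False)
    return F.grid_sample(frame, grid, align_corners=False)
```

In the same spirit, the hidden states of the recurrent layers can be passed through the same warp so that the model keeps tracking objects across ego motion.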
[0050] To summarize the loss calculation, Fig. 9 gives an overview of the combination and application of all losses. The input frame xt is given to the generator, which predicts the next frame. To compensate the ego vehicle motion, a Spatial Transformer Module (STM) is applied to it to obtain the final generated frame yt+1. The prediction is masked by the ground-truth occlusion map 19 from xt+1 and provided as input (occupancy map 20) to the binary cross-entropy (BCE) loss and to the recurrent discriminator D. The sharpening loss SHARP needs neither the masking nor the ground truth and therefore only takes the predicted frame. All of them are weighted and summed up to obtain the overall loss.
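By way of example only, the weighted combination of the masked BCE loss, the adversarial term of the recurrent discriminator D and a sharpening term could be sketched as follows; the concrete form of the sharpening loss is not specified in the text, so a simple stand-in that pushes predictions towards 0 or 1 is used, and all weights are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, occl_mask, disc, w_bce=1.0, w_adv=0.1, w_sharp=0.05):
    """pred, target: (B, 1, H, W) occupancy values in [0, 1];
    occl_mask: 1 where ground truth is observable (0 in occluded areas);
    disc: discriminator mapping a frame to a realism score in (0, 1)."""
    # occupancy loss only where ground truth exists (occluded cells are masked out)
    bce = F.binary_cross_entropy(pred, target, weight=occl_mask)
    # adversarial term: the generator tries to make the discriminator output 1
    d_out = disc(pred)
    adv = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    # stand-in sharpening term: push predictions towards 0 or 1 (uses only the prediction)
    sharp = (pred * (1.0 - pred)).mean()
    return w_bce * bce + w_adv * adv + w_sharp * sharp
```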
Reference Signs
1 vehicle
2 object
3 car
4 road
5 input frame
6 self-learning system / prediction system
7 objects
8 occlusion areas
9 output frame
10 trajectories
11 trajectories
12 car
13 occlusion data
14 feature map
15 dense block
16 ego motion data
17 one-channel prediction
18 keep-alive function
19 occlusion map
20 occupancy map

Claims (7)

  1. A method for prediction of a motion of an object (7) in the environment of a vehicle (1), the method comprising: - collecting sensor data of the object (7) in the environment of the vehicle (1) by at least one vehicle sensor and - predicting the motion of the object (7) on the basis of the sensor data by a self-learning system (6), characterized by - inputting occlusion data (8, 13) into the self-learning system (6), wherein the occlusion data (8, 13) refer to occluded objects of the environment, which are hidden for the at least one vehicle sensor and/or - inputting ego motion data (16) into the self-learning system (6), wherein the ego motion data (16) refer to an ego motion of the vehicle (1), - so that the motion of the object (7) is predicted taking into account the occluded objects and/or the ego motion of the vehicle (1).
  2. The method according to claim 1, characterized in that the self-learning system (6) comprises a deep learning neural network (14, 15).
  3. The method according to claim 2, characterized in that the occlusion data (8, 13) is input into a first layer of the neural network (14, 15) and the ego motion data is input into a second layer of the neural network (14, 15), wherein the second layer is scaled lower than the first layer.
  4. The method according to any one of claims 1 to 3, characterized in that a long-term prediction is performed, wherein a keep-alive function (18) is used to reduce the input noise.
  5. The method according to any one of claims 1 to 4, characterized in that the ego motion is compensated by transforming output images with regard to an ego vehicle velocity and yaw rate.
  6. A method of controlling an autonomous driving vehicle (1) on the basis of predicting a motion of an object (7) in the environment of a vehicle (1) according to any one of claims 1 to 4.
  7. A device for prediction of a motion of an object (7) in the environment of a vehicle (1), the device comprising: - at least one vehicle sensor for collecting sensor data of the object (7) in the environment of the vehicle (1) and - a self-learning system (6) for predicting the motion of the object (7) on the basis of the sensor data, characterized by - input means for inputting occlusion data into the self-learning system (6), wherein the occlusion data refer to occluded objects of the environment, which are hidden for the at least one vehicle sensor and/or for inputting ego motion data into the self-learning system (6), wherein the ego motion data refer to an ego motion of the vehicle (1), - so that the motion of the object (7) is predictable taking into account the occluded objects and/or the ego motion of the vehicle (1).
GB2105220.4A 2021-04-13 2021-04-13 Motion prediction with ego motion compensation and consideration of occluded objects Withdrawn GB2606339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2105220.4A GB2606339A (en) 2021-04-13 2021-04-13 Motion prediction with ego motion compensation and consideration of occluded objects

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2105220.4A GB2606339A (en) 2021-04-13 2021-04-13 Motion prediction with ego motion compensation and consideration of occluded objects

Publications (2)

Publication Number Publication Date
GB202105220D0 GB202105220D0 (en) 2021-05-26
GB2606339A true GB2606339A (en) 2022-11-09

Family

ID=75949414

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2105220.4A Withdrawn GB2606339A (en) 2021-04-13 2021-04-13 Motion prediction with ego motion compensation and consideration of occluded objects

Country Status (1)

Country Link
GB (1) GB2606339A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190025841A1 (en) 2017-07-21 2019-01-24 Uber Technologies, Inc. Machine Learning for Predicting Locations of Objects Perceived by Autonomous Vehicles
US20190049970A1 (en) 2017-08-08 2019-02-14 Uber Technologies, Inc. Object Motion Prediction and Autonomous Vehicle Control
US20210073997A1 (en) * 2019-09-06 2021-03-11 Google Llc Future semantic segmentation prediction using 3d structure

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JULIE DEQUAIRE ET AL: "Deep tracking in the wild: End-to-end tracking using recurrent neural networks", INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH., vol. 37, no. 4-5, 22 June 2017 (2017-06-22), US, pages 492 - 512, XP055512751, ISSN: 0278-3649, DOI: 10.1177/0278364917710543 *
NIMA MOHAJERIN ET AL: "Multi-Step Prediction of Occupancy Grid Maps with Recurrent Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 December 2018 (2018-12-21), XP081018852 *
SCHREIBER MARCEL ET AL: "Long-Term Occupancy Grid Prediction Using Recurrent Neural Networks", 2019 INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE, 20 May 2019 (2019-05-20), pages 9299 - 9305, XP033593545, DOI: 10.1109/ICRA.2019.8793582 *

Also Published As

Publication number Publication date
GB202105220D0 (en) 2021-05-26

Similar Documents

Publication Publication Date Title
CN111137292B (en) Method and system for learning lane change strategies via actuator-evaluation network architecture
Gupta et al. Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues
US10678253B2 (en) Control systems, control methods and controllers for an autonomous vehicle
US10955842B2 (en) Control systems, control methods and controllers for an autonomous vehicle
US11016495B2 (en) Method and system for end-to-end learning of control commands for autonomous vehicle
US11501525B2 (en) Systems and methods for panoptic image segmentation
US20190361454A1 (en) Control systems, control methods and controllers for an autonomous vehicle
US20220343138A1 (en) Analysis of objects of interest in sensor data using deep neural networks
CN110007675B (en) Vehicle automatic driving decision-making system based on driving situation map and training set preparation method based on unmanned aerial vehicle
US11587329B2 (en) Method and apparatus for predicting intent of vulnerable road users
WO2019177562A1 (en) Vehicle system and method for detecting objects and object distance
Fernández-Llorca et al. Two-stream networks for lane-change prediction of surrounding vehicles
Paravarzar et al. Motion prediction on self-driving cars: A review
US11970175B2 (en) System for obtaining a prediction of an action of a vehicle and corresponding method
GB2606339A (en) Motion prediction with ego motion compensation and consideration of occluded objects
US20230154198A1 (en) Computer-implemented method for multimodal egocentric future prediction
Siddiqui et al. Object/Obstacles detection system for self-driving cars
US20230048926A1 (en) Methods and Systems for Predicting Properties of a Plurality of Objects in a Vicinity of a Vehicle
EP4361961A1 (en) Method of determining information related to road user
Abirami et al. Secured Public Transportation System using Ambient Intelligence
Reddy Artificial Superintelligence: AI Creates Another AI Using A Minion Approach
CN117121060A (en) Computer-implemented method and system for training machine learning methods
Chen et al. Deep Anticipation: Light Weight Intelligent Mobile Sensing in IoT by Recurrent Architecture
KR20220049983A (en) Apparatus and method for classifying three-dimensional point cloud using semantic segmentation
Naik et al. Autonomous Car

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)