WO2020239632A1

WO2020239632A1 - Method and apparatus for safely predicting a trajectory

Info

Publication number: WO2020239632A1
Application number: PCT/EP2020/064311
Authority: WO
Inventors: Konrad Groh
Original assignee: Robert Bosch Gmbh
Priority date: 2019-05-29
Filing date: 2020-05-22
Publication date: 2020-12-03
Also published as: DE102019207947A1

Abstract

A computer-implemented method for classifying future trajectory profiles for objects within an image (x) captured by a sensor (30), having the steps of: 1) ascertaining a textureless representation (SEM), in particular a semantic segmentation, of the image (x); 2) identifying objects within the image (x); 3) ascertaining kinetic variables, that is to say variables that characterize their kinetic state, such as in particular present kinetic variables, for the identified objects; and 4) assigning the ascertained kinetic variables of the identified objects to one class from a predefinable plurality of classes.

Description

Method and device for the reliable prediction of a trajectory

The invention relates to a method for classifying future trajectories of objects within an image captured by a sensor, an intention estimator for carrying out this method, a method for training the intention estimator, a computer program and a machine-readable storage medium.

State of the art

DE 10 2017 223 264.1, which was not previously published, discloses a method for detecting an object in a provided input signal, an object being detected as a function of the provided input signal and an actuator being controlled as a function of the detection of the object.

Advantages of the invention

A challenge in reliably classifying video recordings of a scene is to make reliable predictions about the future time course of the trajectories of the objects in the scene, that is to say, to determine an intention of these objects.

This requires a large amount of training data which, as real recorded data, is often not available with the variability needed to ensure that all possible constellations are covered. In contrast, the method with the features of independent claim 1 has the advantage that it can be trained with artificially generated data, that is to say data generated on a computer, so that a large amount of training data can easily be provided.

Further aspects of the invention are the subject of the independent claims. Advantageous further developments are the subject of the dependent claims.

Disclosure of the invention

In a first aspect, the invention relates to a computer-implemented method for classifying future trajectory courses of objects within an image captured by a sensor, with the steps:

1) determining a textureless representation, in particular a semantic segmentation, of the image;

2) identifying objects within the image, in particular by means of the textureless representation;

3) determining kinetic variables of the identified objects, that is to say variables that characterize their kinetic state, in particular current kinetic variables; and

4) assigning the determined kinetic variables of the identified objects to one class of a predeterminable plurality of classes.

This is because it was recognized that the intermediate step of determining a textureless representation of the image allows such a classifier (also referred to below as the intention estimator) to be trained with artificially generated data. The invention makes use of the insight that deriving a textureless representation from an image is a sub-problem for which sufficient amounts of training data can very easily be made available, and that textureless representations can be generated realistically by a computer in a way that is comparatively easy to verify.

Kinetic variables of an object can be understood to mean positions and speeds that allow a movement of the object, in particular as a rigid body, to be described completely; they include both translational and rotational variables. In a further development of this aspect, the kinetic variables of the identified objects are determined from a plurality of, in particular immediately successive, images of a captured sequence of images. The sequence can, for example, be given by successive frames of a video recording. It is then advantageous if a sequence of textureless representations of the respective images is determined for the sequence of images, and the kinetic variables of the identified objects are determined from the sequence of textureless representations.

If a sequence of images is available, it is expedient to determine a sequence of respective kinetic variables and to make the assignment to the class as a function of this sequence of kinetic variables. This is because it was recognized that conclusions about a future course can be drawn from the previous time course, that is to say from the previous sequence of kinetic variables.
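As a minimal sketch of how kinetic variables might be read off two successive textureless representations, the following assumes that each segmentation is a small grid of integer labels (0 for background), that an object's position is the centroid of its label, and that its speed follows from a finite difference. The grid layout, the label ids and the function names are illustrative assumptions and do not come from the application; only translation in the image plane is estimated.

```python
def centroid(seg, label):
    """Return the (row, col) centroid of all cells carrying `label`, or None."""
    cells = [(r, c) for r, row in enumerate(seg) for c, v in enumerate(row) if v == label]
    if not cells:
        return None
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


def kinetic_variables(seg_prev, seg_curr, label, dt=1.0):
    """Estimate position and velocity of one labelled object from two frames."""
    p0, p1 = centroid(seg_prev, label), centroid(seg_curr, label)
    if p0 is None or p1 is None:
        return None  # object is not visible in both frames
    velocity = ((p1[0] - p0[0]) / dt, (p1[1] - p0[1]) / dt)
    return {"position": p1, "velocity": velocity}


# Two hypothetical 3x4 segmentations: object 1 moves one column to the right.
SEG_K = [[0, 0, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 0]]
SEG_K1 = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]

kin = kinetic_variables(SEG_K, SEG_K1, label=1)
```

Repeating this over every pair of neighbouring frames would yield the sequence of kinetic variables on which the class assignment is based.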

In a further development, the assignment to the class can be made by means of a plurality of variables characterizing clusters, in particular cluster centers and cluster radii, which were determined by means of a cluster algorithm on a cluster training data set. That is, for a provided time course of kinetic variables, the cluster centers and cluster radii can be used to select, from the clusters determined in the cluster training, the one to which the provided time course most likely belongs. A cluster assigned in this way can be referred to as the estimated intention of the detected objects.
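The selection of a cluster from centers and radii could look roughly as follows. The application does not state how "most likely" is scored; dividing the Euclidean distance to each center by that cluster's radius is an assumption made here for illustration, as are the function name and the example values.

```python
import math


def assign_cluster(course, centers, radii):
    """Return the index of the cluster with the smallest radius-scaled distance."""
    best_index, best_score = None, math.inf
    for index, (center, radius) in enumerate(zip(centers, radii)):
        score = math.dist(course, center) / radius  # assumed membership score
        if score < best_score:
            best_index, best_score = index, score
    return best_index


# Hypothetical clusters over a flattened time course of speeds (three time steps).
CENTERS = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (5.0, 3.0, 1.0)]
RADII = [0.5, 1.0, 1.5]

intention = assign_cluster((5.0, 3.5, 1.5), CENTERS, RADII)
```

In this toy setup the third cluster (index 2, a decelerating course) is selected and would serve as the estimated intention.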

In a further aspect, the invention relates to an intention estimator for classifying future trajectory courses of objects within the image captured by the sensor, which is set up to carry out one of the aforementioned methods, comprising:

- a segmenter, in particular a machine learning method such as a neural network, which is set up to determine the texture-free representation of the image;

- an object detector, in particular a machine learning method such as a neural network, which is set up to identify objects within the image and to determine kinetic variables of the identified objects; and

- an estimator that is set up to assign the kinetic variables of the identified objects to a class of a predeterminable plurality of classes.

In other words, the object detector can be set up to provide a list of the objects identified therein and the associated kinetic variables for an image to be fed to it.
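The interplay of the three components can be sketched as a simple chain. The class below treats segmenter, object detector and estimator as plain callables, which is an assumption for illustration; the toy stand-ins, the data layout and the two classes are hypothetical and only show the data flow, not a trained system.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class IntentionEstimator:
    """Chains the three components; each is a plain callable here (an assumption)."""
    segmenter: Callable        # image -> textureless representation
    object_detector: Callable  # sequence of representations -> list of (obj, kin)
    estimator: Callable        # kinetic variables -> class index

    def classify(self, images: Sequence) -> List[tuple]:
        segmentations = [self.segmenter(image) for image in images]
        detections = self.object_detector(segmentations)
        return [(obj, self.estimator(kin)) for obj, kin in detections]


# Toy stand-ins: an "image" is a list of (name, distance) pairs.
def toy_segmenter(image):
    return dict(image)


def toy_detector(segmentations):
    first, last = segmentations[0], segmentations[-1]
    steps = max(len(segmentations) - 1, 1)
    return [(name, (last[name] - first[name]) / steps) for name in last if name in first]


def toy_estimator(speed):
    return 0 if speed < 0 else 1  # 0: approaching, 1: moving away (assumed classes)


pipeline = IntentionEstimator(toy_segmenter, toy_detector, toy_estimator)
frames = [[("obj1", 10.0), ("obj2", 4.0)], [("obj1", 8.0), ("obj2", 5.0)]]
result = pipeline.classify(frames)
```

Keeping the components separate in this way mirrors the point that the object detector only ever sees textureless representations and can therefore be trained on them directly.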

In a further aspect, the invention relates to a computer-implemented method for training this intention estimator, comprising the steps:

- generating a plurality of scenes;

- generating texture-free representations (such as semantic segmentations) corresponding to images of the scenes recorded from a predeterminable camera position;

- providing a training data set for training the intention estimator, comprising the generated texture-free representations and

a) target values of the kinetic variables generated from the respective scenes, and/or

b) target values, generated from the respective scenes, of the objects visible in the texture-free representation from the predeterminable camera position.

The object detector can then be trained by means of the training data set.

In this case, a scene includes, in particular, a description of a predeterminable position of a video camera (as can be mounted, for example, in a motor vehicle) and an (abstract) description of the motor vehicle. In particular, it can include a topography around the video camera, a course of a street, positions and orientations of movable or immovable objects, types of objects, a time course of the movements of the movable objects, and a position and orientation as well as an ego-motion of the video camera.

From this predeterminable position of the video camera, a rendering method can then, for example, generate a texture-free representation of the scene from the point of view of the video camera for a predeterminable sequence of points in time.

With such a data set it is possible to train the object detector and thus also the intention estimator with synthetically generated (i.e. computer-generated) training data.
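How such a synthetic training sample might be assembled is sketched below under strong simplifications: the scene is one-dimensional, the "renderer" writes object labels into a row of cells, and the target values are the speeds stored in the scene description. All names, the scene layout and the value ranges are hypothetical.

```python
import random


def generate_scene(rng, width=8, n_objects=2):
    """Draw a toy 1-D scene: each object has a label, a start cell and a speed."""
    return [{"label": i + 1,
             "position": rng.randrange(width),
             "speed": rng.choice([-1, 0, 1])}
            for i in range(n_objects)]


def render(scene, width, t):
    """Rasterize the scene at time step t into a row of labels (0 = background)."""
    row = [0] * width
    for obj in scene:
        cell = obj["position"] + obj["speed"] * t
        if 0 <= cell < width:
            row[cell] = obj["label"]  # later objects occlude earlier ones
    return row


def make_sample(rng, width=8, n_steps=3):
    """Return (sequence of textureless rows, target kinetic values) for one scene."""
    scene = generate_scene(rng, width)
    frames = [render(scene, width, t) for t in range(n_steps)]
    targets = [{"label": o["label"], "speed": o["speed"]} for o in scene]
    return frames, targets


frames, targets = make_sample(random.Random(0))
```

Because the scene description is known exactly, the target values come for free with every rendered sequence, which is what makes this kind of data cheap to produce in large quantities.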

It is particularly advantageous here if the time courses of the kinetic variables are given by time courses of a corresponding jerk, that is to say the time derivative of an acceleration. By using the jerk, realistic time courses of the objects' movements can be obtained through integration over time, without the time course of the jerk itself having to meet complex requirements.
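The integration over time can be illustrated with a short sketch that turns a discretized jerk profile into acceleration, speed and position. Explicit Euler steps, a one-dimensional motion and the example profile are assumptions chosen for brevity; the application does not prescribe an integration scheme.

```python
def integrate_jerk(jerk, dt, a0=0.0, v0=0.0, x0=0.0):
    """Integrate a piecewise-constant jerk sequence with explicit Euler steps."""
    a, v, x = a0, v0, x0
    trajectory = [(x, v, a)]
    for r in jerk:
        x += v * dt  # position from the current speed
        v += a * dt  # speed from the current acceleration
        a += r * dt  # acceleration from the jerk
        trajectory.append((x, v, a))
    return trajectory


# Hypothetical jerk profile: accelerate for two steps, then ease off.
traj = integrate_jerk([1.0, 1.0, -1.0, -1.0], dt=1.0)
```

Even this crude step-shaped jerk produces a continuous acceleration and a smooth speed curve, which illustrates why the jerk itself need not satisfy complex requirements.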

In further aspects, the invention relates to a computer program which is set up to carry out the above methods and a machine-readable storage medium on which this computer program is stored.

Embodiments of the invention are explained in more detail below with reference to the accompanying drawings. In the drawings:

FIG. 1 schematically shows a structure of an embodiment of the invention;

FIG. 2 schematically shows an exemplary embodiment for controlling an at least partially autonomous robot;

FIG. 3 schematically shows an exemplary embodiment for controlling a manufacturing system;

FIG. 4 schematically shows an exemplary embodiment for controlling an access system;

FIG. 5 schematically shows an exemplary embodiment for controlling a monitoring system;

FIG. 6 shows an exemplary segmentation of a scene;

FIG. 7 shows, by way of example, a time sequence of illustrated semantic segmentations of the scene;

FIG. 8 shows an exemplary course of a jerk;

FIG. 9 shows an exemplary structure of the intention estimator;

FIG. 10 shows an exemplary structure of the classifier;

FIG. 11 shows a possible structure of a training device.

Description of the exemplary embodiments

FIG. 1 shows an actuator 10 in its environment 20 in interaction with a control system 40. The environment 20 is detected with a sensor 30, which can also be provided by a plurality of sensors, at preferably regular time intervals. The sensor signal S - or, in the case of a plurality of sensors, one sensor signal S each - from the sensor 30 is transmitted to the control system 40. The control system 40 thus receives a sequence of sensor signals S. The control system 40 uses this to determine control signals A, which are transmitted to the actuator 10.

The control system 40 receives the sequence of sensor signals S from the sensor 30 in an optional receiving unit 50, which converts the sequence of sensor signals S into a sequence of input images x (alternatively, the sensor signal S can also be adopted directly as the input image x). The input image x can be, for example, a section or a further processing of the sensor signal S. The input image x can comprise, for example, image data or images, or individual frames of a video recording. In other words, the input image x is determined as a function of the sensor signal S. The sequence of input images x is fed to the intention estimator 60.

The intention estimator 60 is preferably parameterized by parameters f, which are stored in a parameter memory P and are provided by it.

The intention estimator 60 determines output variables y from the input images x. The output variables y are fed to an optional conversion unit 80, which uses them to determine control signals A, which are fed to the actuator 10 in order to control the actuator 10 accordingly.

The actuator 10 receives the control signals A, is controlled accordingly and carries out a corresponding action. The actuator 10 can include control logic (not necessarily structurally integrated), which determines a second control signal from the control signal A, with which the actuator 10 is then controlled.
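The signal flow from sensor signal S via input image x and output variable y to control signal A can be summarized in a few lines. The function below is a sketch only: passing y through unchanged when no conversion unit is present is an assumption, and the three toy units are hypothetical placeholders rather than components described in the application.

```python
def control_step(sensor_signal, intention_estimator, receiving_unit=None, conversion_unit=None):
    """One pass through the control system: S -> x -> y -> A."""
    # The receiving unit is optional; without it, S is adopted directly as x.
    x = receiving_unit(sensor_signal) if receiving_unit else sensor_signal
    y = intention_estimator(x)
    # Assumption: without a conversion unit, y is handed on unchanged.
    return conversion_unit(y) if conversion_unit else y


def crop(signal):
    """Hypothetical receiving unit: keep only the last two distance readings."""
    return signal[-2:]


def estimate(x):
    """Hypothetical estimator: is the measured distance shrinking?"""
    return "approaching" if x[-1] < x[0] else "receding"


def to_control(y):
    """Hypothetical conversion unit mapping the estimate to a control signal."""
    return "brake" if y == "approaching" else "hold"


A = control_step([12.0, 11.0, 9.5], estimate, crop, to_control)
```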

In further embodiments, the control system 40 includes the sensor 30. In still further embodiments, the control system 40 alternatively or additionally also includes the actuator 10.

In further preferred embodiments, the control system 40 comprises one or a plurality of processors 45 and at least one machine-readable storage medium 46 on which instructions are stored which, when executed on the processors 45, cause the control system 40 to execute the method according to the invention.

In alternative embodiments, a display unit 10a is provided as an alternative or in addition to the actuator 10.

FIG. 2 shows how the control system 40 can be used to control an at least partially autonomous robot, here an at least partially autonomous motor vehicle 100. The sensor 30 can be, for example, one or more imaging sensors, such as one or more video sensors, preferably arranged in the motor vehicle 100.

The intention estimator 60 is set up to determine from the input images x an analysis of the scene y comprising a prognosis of safe areas that is dependent on the determined intentions of detected objects.

The actuator 10, which is preferably arranged in the motor vehicle 100, can be, for example, a brake, a drive or a steering system of the motor vehicle 100. The control signal A can then be determined in such a way that the actuator or actuators 10 are controlled so that the motor vehicle 100 prevents, for example, a collision with the objects identified by the intention estimator 60, in particular when these are objects of certain classes, e.g. pedestrians. In other words, the control signal A can be determined as a function of the determined class and/or the determined intention of the object.

Alternatively, the at least partially autonomous robot can also be another mobile robot (not shown), for example one that moves by flying, swimming, diving or striding. The mobile robot can also be, for example, an at least partially autonomous lawnmower or an at least partially autonomous cleaning robot. In these cases, too, the control signal A can be determined in such a way that the drive and / or steering of the mobile robot is controlled in such a way that the at least partially autonomous robot prevents, for example, a collision with objects identified by the intention estimator 60.

The control signal A can then be determined in such a way that the actuator or actuators 10 are controlled in such a way that the motor vehicle 100 does not leave the safe areas characterized on the basis of output variable y.

As an alternative or in addition, the control signal A can be used to control the display unit 10a, for example to display the determined safe areas. It is also possible, for example in a motor vehicle 100 with non-automated steering, for the display unit 10a to be controlled with the control signal A in such a way that it outputs an optical or acoustic warning signal when it is determined that the motor vehicle 100 is in danger of leaving the safe areas.

FIG. 3 shows an exemplary embodiment in which the control system 40 is used to control a manufacturing machine 11 of a manufacturing system 200 by controlling an actuator 10 that controls this manufacturing machine 11. The manufacturing machine 11 can be, for example, a machine for punching, sawing, drilling and/or cutting.

The sensor 30 can then be, for example, an optical sensor that captures, for example, properties of manufactured products 12. It is possible for these manufactured products 12 to be movable. The actuator 10 controlling the manufacturing machine 11 can then be controlled as a function of the determined predicted movement, i.e. the intention, of the manufactured product 12, so that the manufacturing machine 11 carries out a subsequent processing step on this manufactured product 12 accordingly.

FIG. 4 shows an exemplary embodiment in which the control system 40 is used to control an access system 300. The access system 300 may include a physical access control, for example a door 401.

The sensor 30 can be, for example, an optical sensor (for example for capturing image or video data) that is set up to capture a person. The captured image can be interpreted by means of the intention estimator 60. For example, the identity of this person and the intention of the person can be determined. The actuator 10 can be a lock that, depending on the control signal A, releases the access control or not, for example opens the door 401 or not. To this end, the control signal A can be selected as a function of the interpretation of the intention estimator 60, for example as a function of the determined identity and/or intention of the person. Instead of the physical access control, a logical access control can also be provided.

FIG. 5 shows an exemplary embodiment in which the control system 40 is used to control a monitoring system 400. This exemplary embodiment differs from the exemplary embodiment illustrated in FIG. 4 in that, instead of the actuator 10, the display unit 10a is provided, which is controlled by the control system 40. For example, the intention estimator 60 can determine whether an object captured by the optical sensor is suspicious, and the control signal A is then selected such that this object is highlighted in color by the display unit 10a.

FIG. 6 shows an exemplary semantic segmentation of a scene sz. A street st is shown, on which a first object obj1 is located. Such a scene sz can for example be generated by a renderer.

FIG. 7 shows, by way of example, a time sequence of illustrated semantic segmentations of the scene sz. The street st is shown, on which the first object obj1 and a second object obj2 are located. As can be seen, the first object obj1 moves towards the camera in the chronological sequence of the images in FIGS. 7a), 7b) and 7c), while the second object obj2 moves away from it.

FIG. 8 shows an exemplary course of a jerk r of one of the objects obj1, obj2 from the scene shown in FIG. 7 over time t. Time t and jerk r are advantageously each discretized at fixed intervals. By specifying such time courses of the jerk r, the time course of the movements of the objects in the scene shown in FIG. 7 can be described.

FIG. 9 shows an exemplary structure of the intention estimator 60. The sequence of input images x is fed to it and is first processed by a classifier 64. From the sequence of input images x, the classifier 64 determines a sequence of semantic segmentations SEM of the input images x, a sequence of classifications i and a list of objects obj that were detected in the input images x. These are fed to an integrator 65, which determines the output variable y from them. The classification i can, for example, have been determined by means of a cluster algorithm and thus characterize a possible prototypical future behavior of the associated object obj. This prototypical behavior is then coded in the output variable y.

FIG. 10 shows an exemplary structure of the classifier 64. It is supplied with a sequence of input images x at successive points in time k, k+1, k+2, i.e. a first input image x_k, a second input image x_{k+1} and a third input image x_{k+2}. The classifier 64 comprises a segmenter 61, to which the corresponding input image x_k, x_{k+1}, x_{k+2} is fed at the respective point in time and which determines from it the associated semantic segmentation SEM_k, SEM_{k+1}, SEM_{k+2}. These are fed to the object detector 62, which, from two successive semantic segmentations in each case, identifies the visible objects and determines the kinetic variables kin associated with the identified objects (i.e. position, orientation and speeds, as well as type), in this case the first kinetic variable kin_{k+1} and the second kinetic variable kin_{k+2}. This sequence of kinetic variables kin is fed to the estimator 63, which determines the associated classification i from it.

The estimator 63 can be trained with a large number of provided training time courses of kinetic variables kin. The time courses are clustered by means of a cluster algorithm (e.g. k-means), and variables characterizing the identified clusters are stored in the estimator 63. For a provided time course of the kinetic variables kin, the estimator 63 can then select, for example, the characterizing variable that has the smallest distance to this time course. The classification i can then be chosen as a number that identifies this variable. The characterizing variables determined by the cluster algorithm are preferably also stored in the integrator 65 so that they can be taken into account when determining the output variable y.
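A compact sketch of this training and selection step is given below, using plain Lloyd iterations as one common form of k-means. Seeding the centers with the first k training courses, the fixed iteration count and the example data are assumptions made so that the sketch stays deterministic and short.

```python
import math


def k_means(courses, k, n_iter=20):
    """Plain Lloyd iterations; the first k courses seed the centers (an assumption)."""
    centers = [tuple(c) for c in courses[:k]]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for course in courses:
            nearest = min(range(k), key=lambda j: math.dist(course, centers[j]))
            groups[nearest].append(course)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def classify(course, centers):
    """Classification i: index of the center closest to the given time course."""
    return min(range(len(centers)), key=lambda j: math.dist(course, centers[j]))


# Hypothetical training courses of a speed over three time steps.
TRAINING = [(0.0, 0.0, 0.0), (4.0, 5.0, 6.0), (0.2, 0.0, 0.1), (4.2, 5.0, 5.8)]
centers = k_means(TRAINING, k=2)
i = classify((3.9, 5.1, 6.1), centers)
```

Here the stored cluster centers play the role of the characterizing variables, and the returned index i is the number that identifies the selected one.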

FIG. 11 shows a possible construction of a training device 140 for training the intention estimator 60. This is parameterized with parameters f which are provided by a parameter memory P.

The training device 140 includes a generator 71 that generates a plurality of scenes sz. These are fed to a renderer 72, which determines from them a sequence of semantic segmentations SEM. These are fed directly to the object detector 62 of the intention estimator 60. The intention estimator 60 then determines, from the sequence of semantic segmentations SEM, the sequence of kinetic variables kin, a list of detected objects obj and an associated classification i. These are fed to a comparator 74.

With the generated scene sz, the generator 71 also supplies the associated list of objects as a target object list objs and the associated list of kinetic variables as target values of the kinetic variables kins to the comparator 74.

Depending on a correspondence between the objects obj and the target object list objs, and on a correspondence between the kinetic variables kin and the corresponding target values kins, new parameters f' are determined, which are transmitted to the parameter memory P and replace the parameters f there.

If the object detector 62 is, for example, a neural network, this can be done by determining gradients by backpropagation in order to minimize a predeterminable cost function.

In this case, to determine the correspondence between the objects obj and the target object list objs, provision can be made for the objects in the target object list objs and the objects obj (including probabilities) to be assigned to one another, i.e. for an association problem to be solved. A regression error of the parameters of the objects is then added to the cost function, through whose optimization the new parameters f' are determined. This regression error can be given, for example, by a sum of squares of differences in speeds, accelerations and positions.
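One way to picture the association problem and the resulting regression error is the brute-force sketch below, which tries every one-to-one assignment of detections to targets and keeps the one with the smallest sum of squared differences. The exhaustive search, the dictionary layout and the example values are assumptions for illustration; the application does not specify how the association is solved.

```python
from itertools import permutations


def squared_error(detected, target):
    """Sum of squared differences over the kinetic entries of the target."""
    return sum((detected[key] - target[key]) ** 2 for key in target)


def associate(detections, targets):
    """Brute-force the one-to-one assignment with the lowest total error.

    Only sensible for a handful of objects and len(detections) >= len(targets);
    a dedicated assignment solver would be used at scale.
    """
    best_cost, best_match = float("inf"), None
    for perm in permutations(range(len(detections)), len(targets)):
        cost = sum(squared_error(detections[d], targets[t]) for t, d in enumerate(perm))
        if cost < best_cost:
            best_cost, best_match = cost, perm
    return best_match, best_cost


# Hypothetical detections (obj, kin) and target values (objs, kins).
DETECTIONS = [{"position": 9.0, "speed": -1.0}, {"position": 2.0, "speed": 0.5}]
TARGETS = [{"position": 2.0, "speed": 1.0}, {"position": 10.0, "speed": -1.0}]

match, regression_error = associate(DETECTIONS, TARGETS)
```

The tuple `match` lists, for each target in order, the index of the detection assigned to it; `regression_error` is the term that would be added to the cost function.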

Finally, the object probabilities are used to evaluate false positives and false negatives, that is, cases in which an object was recognized although none is present, and cases in which an object was overlooked.

In the exemplary embodiment, a maximum number of visible objects can be assumed. This means that probabilities are calculated for a predeterminable number, for example 100, of possible objects in the scene. If such a probability exceeds a predeterminable threshold value, the corresponding candidate is treated as an identified object, which ensures that no object is overlooked. The determined regression error can then be divided by the object probability for all correctly recognized objects; for all other, incorrectly recognized objects, the regression error is multiplied by the probability.

The methods executed by the training device 140 can be implemented as a computer program stored on a machine-readable storage medium 146 and executed by a processor 145.
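The candidate selection and the probability weighting of the regression error described above can be sketched in two small functions. The function names, the threshold of 0.5 and the example probabilities are hypothetical, and how a candidate is judged to be correctly recognized is left to the caller, since the application does not spell this out.

```python
def select_candidates(probabilities, threshold):
    """Indices of candidate slots whose object probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]


def weighted_error(error, probability, correctly_recognized):
    """Divide by p for correct detections, multiply by p otherwise.

    Selected candidates have p above the threshold, so p is nonzero here.
    """
    return error / probability if correctly_recognized else error * probability


# Four hypothetical candidate slots instead of, for example, 100.
PROBABILITIES = [0.9, 0.1, 0.6, 0.05]
identified = select_candidates(PROBABILITIES, threshold=0.5)

cost_correct = weighted_error(2.0, 0.5, correctly_recognized=True)
cost_wrong = weighted_error(2.0, 0.5, correctly_recognized=False)
```

With this weighting, a correct detection reported with low confidence costs more than the same detection reported with high confidence, while a spurious detection costs less the lower its probability.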

The term “computer” encompasses any device for processing specifiable arithmetic rules. These calculation rules can be in the form of software, or in the form of hardware, or also in a mixed form of software and hardware.

List of reference symbols

A control signal

P parameter memory

S sensor signals

SEM semantic segmentation

SEM_k first semantic segmentation

SEM_{k+1} second semantic segmentation

SEM_{k+2} third semantic segmentation

SEMs target segmentation

i classification

kin kinetic variable

kins target value of the kinetic variable

kin_{k+1} first kinetic variable

kin_{k+2} second kinetic variable

obj object

objs target object list

r jerk

st street

sz scene

t time

x input image

x_k first input image

x_{k+1} second input image

x_{k+2} third input image

y output variable

f parameters

f' new parameters

10 actuator

10a display unit

11 manufacturing machine

12 manufactured product

20 environment

30 sensor

40 control system

45 processor

46 machine-readable storage medium

50 receiving unit

60 intention estimator

61 segmenter

62 object detector

63 estimator

64 classifier

65 integrator

71 generator

72 renderer

74 comparator

80 conversion unit

100 motor vehicle

140 training device

145 processor

146 machine-readable storage medium

200 manufacturing system

249 users

250 personal assistant

300 access system

400 monitoring system

401 door

Claims

1. Computer-implemented method for classifying future trajectories of objects (obj) within an image (x) captured by a sensor (30), with the following steps:

1) determining a textureless representation (SEM), in particular a semantic segmentation, of the image (x);

2) identifying objects within the image (x);

3) determining kinetic variables (kin) of the identified objects, that is to say variables that characterize their kinetic state, in particular current kinetic variables; and

4) assigning the determined kinetic variables (kin) of the identified objects (obj) to a class (i) of a predeterminable plurality of classes.

2. The method according to claim 1, wherein the determination of kinetic variables of the identified objects takes place from a plurality of images, in particular immediately following one another, of a captured sequence of images.

3. The method according to claim 2, wherein a sequence of textureless representations (SEM) of the respective images (x) is determined for the sequence of images (x), and wherein the kinetic variables (kin) of the identified objects (obj) are determined from the sequence of textureless representations (SEM).

4. The method according to claim 2 or 3, wherein a sequence of respective kinetic variables (kin) is determined for the sequence of images (x), and the assignment to the class (i) takes place as a function of the sequence of kinetic variables (kin).

5. The method according to claim 4, wherein the assignment to the class (i) takes place by means of a plurality of variables characterizing clusters, in particular cluster centers and cluster radii, which were determined by means of a cluster algorithm on a cluster training data set.

6. Intention estimator (60) for classifying future trajectory courses of objects within the image (x) captured by the sensor (30), which is set up to carry out the method according to one of claims 1 to 5, comprising:

- A segmenter (61) which is set up to determine the textureless representation (SEM) of the image (x);

- An object detector (62) which is set up to identify objects (obj) within the image (x) and to determine kinetic variables (kin) of the identified objects (obj); and

- An estimator (63) which is set up to assign the kinetic variables (kin) of the identified objects (obj) to a class (i) of a specifiable plurality of classes.

7. Computer-implemented method for training the intention estimator (60) according to claim 6, comprising the steps:

- generating a plurality of scenes (sz);

- generating texture-free representations (SEM) corresponding to images of the scenes (sz) recorded from a predeterminable camera position;

- providing a training data set for training the intention estimator (60), comprising the generated texture-free representations (SEM) and

a) target values (kins) of the kinetic variables generated from the respective scenes, and/or

b) target values (objs), generated from the respective scenes, of the objects visible in the texture-free representation (SEM) from the predeterminable camera position.

8. The method according to claim 7, wherein the object detector (62) is trained by means of the training data set.

9. The method according to claim 7 or 8, wherein time courses of the kinetic variables (kin) contained in the respective scenes are represented by time courses of a corresponding jerk (r).

10. Training device (140) which is set up to carry out the method according to one of claims 7 to 9.

11. Computer program which is set up to carry out the method according to one of claims 1 to 5 or 7 to 9.

12. Machine-readable storage medium (46, 146) on which the computer program according to claim 11 is stored.