DE102021204797A1

DE102021204797A1 - Apparatus and method for learning a guideline for off-road vehicles for construction sites

Info

Publication number: DE102021204797A1
Application number: DE102021204797.1A
Authority: DE
Inventors: Chana Ross; Dotan Di Castro; Yakov Miron
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2021-05-11
Filing date: 2021-05-11
Publication date: 2022-11-17
Also published as: JP2022174734A

Abstract

A computer-implemented method for learning a policy (60) using reinforcement learning, preferably with a digital twin, the learned policy (60) being configured to control an off-road vehicle, and the off-road vehicle being configured to interact with granular material.

Description

The invention relates to (1) a method for learning a policy for off-road vehicles on a construction site by means of reinforcement learning applied to a digital twin, and (2) a method for operating an actuator of the off-road vehicles, a computer program, a machine-readable data storage medium, and a training system.

State of the art

US10481603B2 discloses a trajectory planning algorithm for off-road vehicles.

One approach to learning this trajectory is reinforcement learning, in which a policy is created that, given a state of the vehicle and/or a state of an environment, chooses the next action for the vehicle.

When reinforcement learning is used for off-road trajectory planning, real-world data must be collected. Once the data has been collected, two approaches can be taken: either training on the given data ("model-free"), or creating a simulation that emulates the interaction between the vehicle and the environment ("model-based") and using the data for validation.

These simulations model the vehicle and its interaction with the environment based on the collected data and, preferably, simplified physics, and they enable model-based training of reinforcement learning algorithms.

Advantages of the invention

Most simulations model the environment with very complex physical equations and require computationally expensive numerical simulations. This type of simulation is not helpful for reinforcement learning training, where the simulation must on the one hand imitate the interaction, but must also run quickly and compute many trajectories efficiently.

The present invention provides a full pipeline for training a reinforcement learning algorithm when the interaction between the agent (off-road vehicle) and the environment is not known (model-free) and there are only limited opportunities to collect real-world data. An automated learning pipeline that requires no manual interaction is therefore proposed.

A further advantage is the reduction of interactions with the real world, both for collecting data and for training the reinforcement learning algorithm. This is due to the ability to train on a simulated environment that may not replicate the real world exactly but contains the main interactions between the vehicle and, for example, the ground that affect the policy. The present invention therefore needs a significantly smaller amount of data to converge, making the disclosed approach more data-efficient and more computationally efficient. Furthermore, fast convergence can be achieved.

Disclosure of the invention

According to a first aspect, a computer-implemented method is proposed for learning a policy that is configured to control an off-road vehicle. The off-road vehicle is configured to interact with granular material, e.g. on a construction site. The interaction can be understood to mean that the off-road vehicle can distribute the granular material. For example, it can pick up, transport and unload the granular material, or otherwise move the granular material to another location.

The method comprises the following steps:

The method begins with the initialization of a model of an environment of the off-road vehicle, parameterized by ϕ. The model is suitable for determining an output as a function of at least one input action of the off-road vehicle from a set of possible actions, the output characterizing at least the environment (S_{t+1}) after the input action has been performed and a reward (R_t).

Next, obtained real trajectories of an off-road vehicle, together with states of the environment and annotated actions within the trajectory, are converted into tuples ([S_t; A_t; R_t; S_{t+1}]). This step can be regarded as collecting data and creating a data set for setting up the model of the environment.
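The conversion of a recorded trajectory into tuples can be sketched as follows. This is only an illustration under assumptions: the flat tuple encoding of a state, the integer action encoding, and the helper name `trajectory_to_tuples` are hypothetical and are not taken from the application.

```python
from typing import List, Sequence, Tuple

# Hypothetical encodings: a state is a flat tuple of floats, an action is an int.
State = Tuple[float, ...]
Transition = Tuple[State, int, float, State]  # (S_t, A_t, R_t, S_{t+1})


def trajectory_to_tuples(
    states: Sequence[State],
    actions: Sequence[int],
    rewards: Sequence[float],
) -> List[Transition]:
    """Pair every recorded step with the state that followed it."""
    if not (len(states) == len(actions) + 1 == len(rewards) + 1):
        raise ValueError("need one more state than actions and rewards")
    return [
        (states[t], actions[t], rewards[t], states[t + 1])
        for t in range(len(actions))
    ]


# Toy recording: three states, two annotated actions, two rewards.
recorded_states = [(0.0, 0.5), (0.1, 0.4), (0.2, 0.3)]
recorded_actions = [1, 0]
recorded_rewards = [-1.0, 1.0]
tuples = trajectory_to_tuples(recorded_states, recorded_actions, recorded_rewards)
```

A trajectory with N actions yields N tuples, because each tuple needs the state before and the state after the action.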

Next, the parameters ϕ of the model are optimized so that, given the tuples, the model emulates the interaction between the vehicle and the environment as closely as possible. In other words, the model is trained to reproduce the tuples that were collected in the real world under the experienced driver's policy.

The policy is then learned by reinforcement learning based on interaction with the model, so that a cumulative reward is optimized. Preferably, the interaction takes place only with the model.

The off-road vehicles can be bulldozers, compactors, dump trucks, or any other type of vehicle that has a number of tasks involving interaction with the environment. Examples of such tasks are grading, dumping sand, compacting an area, removing the granular material, etc.

It is proposed that the state of the environment S_t additionally comprises a vehicle state. The vehicle state can be modeled, for example, by 6 DOF: (x, y, z) as a location in a Euclidean space and (ψ, θ, ϕ) as an Euler-angle representation of the attitude of the vehicle with respect to this Cartesian space. Preferably, for an off-road vehicle with a steerable implement, e.g. the blade of a bulldozer, a further 2 DOF relative to the vehicle can be added, i.e. the blade height and the angle with respect to the suspension. To achieve a good compromise between accuracy and minimal computational load, the DOF are defined only by (x, y, z, ψ). Preferably, the environment is characterized by a matrix whose rows and columns represent the coordinates (x, y) of a position in the environment, and the respective entry of the matrix, S(x; y) = h, characterizes granular material with a height (h) at that position.
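One possible in-memory layout for such a state is sketched below, using the reduced (x, y, z, ψ) vehicle state and a height matrix indexed by grid position. The class and field names are hypothetical, and a nested list stands in for the matrix to keep the sketch dependency-free.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VehicleState:
    """Reduced 4-DOF vehicle state (x, y, z, psi)."""
    x: float
    y: float
    z: float
    psi: float  # the single Euler angle kept in the reduced form


@dataclass
class EnvironmentState:
    """State S_t: height matrix plus vehicle state (names are assumptions)."""
    height_map: List[List[float]]  # height_map[x][y] = h
    vehicle: VehicleState

    def height_at(self, x: int, y: int) -> float:
        """Return S(x; y) = h, the material height at grid cell (x, y)."""
        return self.height_map[x][y]


state = EnvironmentState(
    height_map=[[0.0, 0.2], [0.5, 0.1]],
    vehicle=VehicleState(x=1.0, y=0.0, z=0.5, psi=0.0),
)
```

Keeping the height map and the vehicle state in one object mirrors the proposal that S_t carries both.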

It is further proposed that the real trajectories, the states of the environment and the annotated actions are collected by recording actions of the off-road vehicle while it is driven by a human operator; while the actions are being recorded, an environment height map is also recorded. For example, the state of the environment is then determined as a function of the recorded environment height map.

It is further proposed that the environment height map specifically characterizes the area in which the vehicle interacts with the granular material. If the off-road vehicle is a bulldozer, for example, the environment height map covers the area in which a blade of the bulldozer comes into contact with the granular material.

The environment height map can be recorded using LiDARs, cameras, or any other sensor from which the environment height map can be derived over time. Vehicle sensors, such as those for velocity, position, Euler angles, angular velocity, acceleration and any other important information about the vehicle position, can be used for the vehicle states.

It is further proposed that the granular material is earth or sand. This is a preferred embodiment because simulating granular material is very complex. Since the environment changes as a result of the vehicle's actions, the granular material is difficult to simulate with regard to dynamic interactions and the resulting dynamic changes in its shape after those interactions. Consequently, when the environment is simulated by a model rather than by physical simulations, the simulation speed is significantly higher, and so is the learning speed of the policy.

It is further proposed that the model of the environment is a neural network. The neural network is optimized by minimizing the distance between the next states (S_{t+1}) of the tuples and the next states Ŝ_{t+1} output by the neural network.

The loss of the supervised learning model would be L(S_{t+1}; Ŝ_{t+1}), where S_{t+1} is the real state and Ŝ_{t+1} is the state simulated by the model. The neural network can be optimized by a machine learning algorithm, for example gradient descent.
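A minimal sketch of this fitting step is given below. It makes two simplifying assumptions that are not in the application: the distance L is taken to be the mean squared error, and a one-parameter linear model Ŝ_{t+1} = S_t + ϕ·A_t stands in for the neural network so that the gradient can be written by hand. The toy data are invented for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (S_t, A_t, S_{t+1}) with scalar state


def predict(phi: float, s: float, a: float) -> float:
    """Stand-in model: simulated next state S_hat_{t+1} = S_t + phi * A_t."""
    return s + phi * a


def mse_loss(phi: float, data: List[Sample]) -> float:
    """Assumed loss L(S_{t+1}; S_hat_{t+1}): mean squared error."""
    return sum((s_next - predict(phi, s, a)) ** 2 for s, a, s_next in data) / len(data)


def loss_gradient(phi: float, data: List[Sample]) -> float:
    """Derivative of the mean squared error with respect to phi."""
    return sum(-2.0 * a * (s_next - predict(phi, s, a)) for s, a, s_next in data) / len(data)


# Invented tuples in which each unit of action lowers the height by 0.1.
data = [(1.0, 1.0, 0.9), (0.9, 2.0, 0.7), (0.7, 1.0, 0.6)]

phi = 0.0
for _ in range(200):
    phi -= 0.1 * loss_gradient(phi, data)  # plain gradient descent step
```

With these data the parameter settles near ϕ = -0.1, the value that reproduces the recorded next states.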

Embodiments of the invention are discussed in more detail with reference to the following figures. The figures show:

FIG. 1 a schematic representation of an off-road vehicle, in particular a bulldozer;
FIG. 2 a flow chart for training a policy for controlling the off-road vehicle;
FIG. 3 a training system for training the policy.

Description of the embodiments

FIG. 1 shows an embodiment of an off-road vehicle, in particular a bulldozer 100. The bulldozer 100 comprises an actuator 10 that interacts with a control system 40. At preferably evenly spaced intervals, a sensor 30 detects a condition of the actuator system and/or a state of an environment surrounding the bulldozer 100. The sensor 30 may comprise several sensors. Preferably, the sensor 30 is an optical sensor that captures images of the environment 20. An output signal S of the sensor 30 (or, if the sensor 30 comprises several sensors, an output signal S for each of the sensors), which encodes the detected condition, is transmitted to the control system 40.

Possible sensors include, but are not limited to: gyroscopes, accelerometers, force sensors, cameras, radar, LiDAR, angle encoders, etc. It should be noted that sensors often do not measure the state of the system directly, but rather observe an effect of the state; for example, a camera captures an image rather than directly measuring the position of the vehicle relative to another object. However, it is possible to filter the state from high-dimensional observations such as images or LiDAR measurements.

Furthermore, the system must provide a reward signal r that indicates the quality of the system state and of the action taken. Typically, this reward signal is designed to steer the behavior of the learning algorithm. In general, the reward signal should assign large values to states/actions that are desirable and small (or negative) values to states/actions that the system should avoid.

Possible reward signals include, but are not limited to: a negative tracking error for some reference state signal, an indicator function for the success of a particular task, negative quadratic cost terms (similar to methods from optimal control), etc. It is also possible to construct a further reward signal as a weighted combination of other reward signals if the learning algorithm is to pursue several goals simultaneously. An example of a positive reward could be "+1" if the agent completed the task (a) with satisfactory performance and (b) quickly. An example of a negative reward could be "-1" if the agent completed the task slowly. A further example of a large negative reward could be "-100" if the vehicle has left the trusted permitted region.
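The example values above can be turned into a small reward function, sketched here under assumptions: the boolean inputs, the order in which the cases are checked, and the default of 0.0 for all other situations are illustrative choices that the application does not specify. The helper `weighted_reward` shows one way to combine several reward signals.

```python
from typing import Sequence


def reward(task_done: bool, performance_ok: bool, fast: bool, in_allowed_region: bool) -> float:
    """Map an outcome to the example values +1, -1 and -100."""
    if not in_allowed_region:
        return -100.0  # vehicle left the trusted permitted region
    if task_done and performance_ok and fast:
        return 1.0     # satisfactory and quick
    if task_done and not fast:
        return -1.0    # task completed slowly
    return 0.0         # assumed neutral default for all other cases


def weighted_reward(signals: Sequence[float], weights: Sequence[float]) -> float:
    """Combine several reward signals into one by a weighted sum."""
    return sum(w * r for w, r in zip(weights, signals))


combined = weighted_reward([1.0, -1.0], [0.5, 0.25])
```

Checking the region violation first makes the large penalty dominate every other outcome.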

The control system 40 thereby receives a stream of sensor signals. It then computes a series of actuator control commands A as a function of the stream of sensor signals, and these commands are then transmitted to the actuator 10.

The control system 40 receives the stream of sensor signals S of the sensor 30 in an optional receiving unit. The receiving unit converts the sensor signals S into a current state signal S_t. Alternatively, if there is no receiving unit, each sensor signal can be taken directly as a current state signal S_t.

The state signal S_t is then passed to an optimized policy 60, which is provided, for example, by an artificial neural network.

The optimized policy 60 is parameterized by parameters ϕ, which are stored in and provided by a parameter data store.

The optimized policy 60 determines the action signals A_t to be output from the current state signal S_t. The action signal A_t is transmitted to an optional conversion unit, which converts the action signal A_t into the control commands A. The actuator control commands A are then transmitted to the actuator 10 in order to control the actuator 10 accordingly. Alternatively, the output signals y can be taken directly as control commands A.
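The signal path from sensor signals S through the state signal S_t and the action signal A_t to the control commands A can be sketched as a single function. All three stages below (averaging as the receiving unit, a threshold rule as the policy, string commands from the conversion unit) are hypothetical stand-ins chosen only to make the flow concrete.

```python
from typing import Callable, Sequence


def control_step(
    sensor_signals: Sequence[float],
    receive: Callable[[Sequence[float]], float],
    policy: Callable[[float], int],
    convert: Callable[[int], str],
) -> str:
    """One pass: sensor signals S -> state S_t -> action A_t -> command A."""
    state = receive(sensor_signals)  # optional receiving unit
    action = policy(state)           # optimized policy 60
    return convert(action)           # optional conversion unit


def mean_height(signals: Sequence[float]) -> float:
    """Hypothetical receiving unit: average the height readings."""
    return sum(signals) / len(signals)


def threshold_policy(state: float) -> int:
    """Hypothetical policy: action 1 when the measured pile exceeds 0.3."""
    return 1 if state > 0.3 else 0


def to_command(action: int) -> str:
    """Hypothetical conversion unit: map the action to a command string."""
    return "LOWER_BLADE" if action == 1 else "HOLD_BLADE"


command = control_step([0.4, 0.6], mean_height, threshold_policy, to_command)
```

Passing the stages in as functions reflects that the receiving and conversion units are optional and can be replaced by identity mappings.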

The actuator 10 receives the actuator control commands A, is controlled accordingly, and carries out an action corresponding to the actuator control commands A. The actuator 10 may comprise control logic that converts the actuator control command A into a further control command, which is then used to control the actuator 10.

Furthermore, the control system 40 may comprise a processor 45 (or several processors) and at least one machine-readable data storage medium 46 on which instructions are stored that, when executed, cause the control system 40 to carry out a method for controlling the bulldozer 100 as a function of the optimized policy.

Preferably, the off-road vehicle is an at least partially autonomous vehicle that is partially controlled by the policy.

FIG. 2 shows an embodiment of a method 20 for obtaining the optimized policy 60 for controlling the off-road vehicle.

The method 20 begins with initializing (S21) a model of an environment of the off-road vehicle, parameterized by the parameters ϕ. The model itself is suitable for determining an output as a function of at least one input action of the off-road vehicle from a set of possible actions, the output characterizing at least the environment (S_{t+1}) after the input action has been performed and a reward (R_t).

Step S21 is followed by converting (S22) obtained real trajectories of an off-road vehicle, together with assigned states of the environment and annotated actions within the trajectory, into tuples ([S_t; A_t; R_t; S_{t+1}]). Note that step S22 can alternatively be skipped if the tuples have already been provided.

This is followed by optimizing (S23) parameters of the model of the environment (digital twin) so that, given the tuples, the model emulates the interaction between the vehicle and the environment as closely as possible.

This is followed by learning (S24) the optimal policy 60 by reinforcement learning, based only on the interaction with the model, so that a reward is optimized.

The last step is outputting (S25) the optimal policy (60).

In an optional step after step S25, the optimal policy is used to control the off-road vehicle, in particular the bulldozer 100.

FIG. 3 shows an embodiment of a training system for training the policy 60.

The database 300 contains the recorded tuples [S_t; A_t; R_t; S_{t+1}] of the recorded trajectories. These tuples are used by a supervised learning algorithm 301 to create the digital twin 302 of the environment and, preferably, of the off-road vehicle.

The digital twin 302 returns certain tuples ([S_t; A_t; R_t; S_{t+1}]), which are analyzed by the supervised learning algorithm 301. The supervised learning algorithm 301 then supplies improved parameters ϕ to the digital twin 302, which improves the performance of the digital twin. These two interactions between the supervised learning algorithm 301 and the digital twin 302 are repeated several times.

A reinforcement learning algorithm 303 is then used to optimize the policy 60 based on interactions with the digital twin 302. An action A_t is determined by the reinforcement learning algorithm 303 and the policy 60 and is passed to the digital twin 302. Depending on the action A_t, the digital twin returns a reward R_t to the reinforcement learning algorithm 303, which adapts the policy so that the reward is maximized. These two interactions between the reinforcement learning algorithm 303 and the digital twin 302 are repeated several times.
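The interaction loop between a reinforcement learning algorithm and a digital twin can be illustrated with a toy example. The application does not name a specific algorithm; the sketch below assumes tabular Q-learning updates, applied to transitions queried from a hand-written twin in deterministic sweeps rather than collected by an exploring agent. The twin, the state and action encodings, and the constants are all hypothetical.

```python
from typing import Dict, List, Tuple

GAMMA = 0.9         # discount factor (assumed)
ALPHA = 0.5         # learning rate (assumed)
ACTIONS = (0, 1)    # 0 = idle, 1 = push material
STATES = (1, 2, 3)  # remaining pile height; 0 is the terminal goal


def twin_step(state: int, action: int) -> Tuple[int, float]:
    """Toy digital twin: returns (S_{t+1}, R_t) for a given (S_t, A_t)."""
    next_state = max(state - 1, 0) if action == 1 else state
    reward = 1.0 if next_state == 0 else -1.0
    return next_state, reward


def learn_policy(sweeps: int = 200) -> Tuple[Dict[int, int], Dict[int, List[float]]]:
    """Apply Q-learning updates to every (state, action) pair of the twin."""
    q = {s: [0.0, 0.0] for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                s_next, r = twin_step(s, a)
                future = 0.0 if s_next == 0 else max(q[s_next])
                q[s][a] += ALPHA * (r + GAMMA * future - q[s][a])
    # Greedy policy: pick the action with the highest learned value.
    policy = {s: max(ACTIONS, key=lambda a: q[s][a]) for s in STATES}
    return policy, q


policy, q_values = learn_policy()
```

In this toy setting the learned policy pushes material in every non-terminal state, and no interaction with a real vehicle is needed during training.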

CITATIONS INCLUDED IN THE DESCRIPTION

This list of the documents cited by the applicant was generated automatically and is included solely for the reader's information. The list is not part of the German patent or utility model application. The DPMA assumes no liability for any errors or omissions.

Patent literature cited

US 10481603 B2 [0002]

Claims

A computer-implemented method (20) for learning a policy (60) configured to control an off-road vehicle, the off-road vehicle being configured to interact with granular material, comprising the steps of:
i. initializing (S21) a parameterized model of an environment of the off-road vehicle, the model being suitable for determining an output as a function of at least one input action of the off-road vehicle from a set of possible actions, the output including at least the state of the environment (S_{t+1}) after the input action has been performed and a reward (R_t);
ii. converting (S22) received real trajectories of an off-road vehicle, states of the environment and annotated actions within the trajectories into tuples ([S_t; A_t; R_t; S_{t+1}]) comprising, for each time step, a state of the environment (S_t), an action (A_t) performed by the off-road vehicle, a reward (R_t) and the next state of the environment (S_{t+1}), the next state of the environment being based on the environment after the annotated action has been performed;
iii. optimizing (S23) parameters of the model of the environment (digital twin) as a function of the tuples so that the model emulates the interaction between the vehicle and the environment as closely as possible;
iv. learning (S24) an optimal policy (π(S_t)) by reinforcement learning based on interaction with the model so that a reward is optimized; and
v. outputting (S25) the optimal policy (60).
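The conversion of step (S22) can be pictured with a small sketch. It assumes that a recorded trajectory is available as a list of states and a list of annotated action indices, and that the reward is the decrease in total material height. Both the data layout and the reward function `reward_fn` are hypothetical choices made for illustration, not definitions from the claims.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, ...]                      # flattened environment state (assumed)
Transition = Tuple[State, int, float, State]   # [S_t, A_t, R_t, S_{t+1}]


def reward_fn(state: State, next_state: State) -> float:
    """Hypothetical reward: amount by which the total height decreased."""
    return sum(state) - sum(next_state)


def to_tuples(states: Sequence[State], actions: Sequence[int]) -> List[Transition]:
    """Pairs each annotated action A_t with S_t and the following state S_{t+1}."""
    if len(states) != len(actions) + 1:
        raise ValueError("need exactly one more state than actions")
    return [
        (states[t], actions[t], reward_fn(states[t], states[t + 1]), states[t + 1])
        for t in range(len(actions))
    ]


# Toy recording: three states and the two annotated actions between them.
states = [(2.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
actions = [0, 1]
transitions = to_tuples(states, actions)
```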

Method according to claim 1, wherein the state of the environment (S_t) additionally comprises a vehicle state, the state of the environment being characterized by a matrix whose rows and columns represent coordinates (x, y) of a position in the environment and whose entries (S(x; y) = h) denote granular material of a given height (h) at the respective position.
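One possible in-memory form of such a state is sketched below. The class name `EnvironmentState`, the helper `make_state` and the choice of a grid cell as the vehicle state are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EnvironmentState:
    height_map: List[List[float]]   # height_map[x][y] = h
    vehicle_pose: Tuple[int, int]   # assumed vehicle state: occupied grid cell

    def height_at(self, x: int, y: int) -> float:
        """Height h of the granular material at position (x, y)."""
        return self.height_map[x][y]


def make_state(rows: int, cols: int,
               piles: Dict[Tuple[int, int], float],
               vehicle_pose: Tuple[int, int]) -> EnvironmentState:
    """Builds a flat terrain and places piles of the given heights on it."""
    grid = [[0.0] * cols for _ in range(rows)]
    for (x, y), h in piles.items():
        grid[x][y] = h
    return EnvironmentState(grid, vehicle_pose)


state = make_state(3, 4, {(1, 2): 0.5, (2, 3): 1.25}, vehicle_pose=(0, 0))
```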

Method according to claim 1 or 2, wherein the real trajectories, the states of the environment and the annotated actions are acquired by recording actions of the off-road vehicle while it is driven by a human operator, wherein a height map of the environment is recorded while the actions are being recorded, and the state of the environment is determined as a function of the recorded height map.

Method according to claim 3, wherein the granular material is earth or sand.

Method according to any one of claims 1 to 4, wherein the model of the environment is a neural network, the neural network being optimized by minimizing the distance between the next states (S_{t+1}) of the tuples and the next states output by the neural network.
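The quantity being minimized can be sketched as follows. The claim does not fix the distance measure, so the squared Euclidean distance between height maps is assumed here. The callable `model` stands in for the neural network, and the placeholder `identity_model` is hypothetical.

```python
def squared_distance(predicted, target):
    """Sum of squared differences between two height-map matrices."""
    return sum(
        (p - t) ** 2
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    )


def mean_next_state_loss(model, tuples):
    """Average distance between predicted and recorded next states S_{t+1}."""
    total = sum(squared_distance(model(s, a), s_next) for s, a, _r, s_next in tuples)
    return total / len(tuples)


def identity_model(state, action):
    """Placeholder model (assumption): predicts that nothing changes."""
    return state


# One toy tuple [S_t, A_t, R_t, S_{t+1}]: material moves from one cell to the next.
tuples = [([[1.0, 0.0]], 0, 0.0, [[0.0, 1.0]])]
loss = mean_next_state_loss(identity_model, tuples)
```

A training procedure would adjust the network parameters to reduce this value; a model that reproduces every recorded next state exactly has a loss of zero.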

Method according to any one of claims 1 to 4, wherein the model of the environment is a digital twin of the environment and the off-road vehicle.

Method for controlling the off-road vehicle as a function of a state of the environment and based on the optimized policy (π(S_t)), the optimized policy having been obtained according to any one of the preceding claims.

Apparatus adapted to carry out the method according to any one of the preceding claims.

Computer program designed to cause a computer to carry out the method according to any one of claims 1 to 6 with all associated steps when the computer program is executed by a processor (45).

Machine-readable data storage medium (46) on which the computer program according to claim 9 is stored.