DE102018207015B4

DE102018207015B4 - Method for training self-learning control algorithms for autonomously moveable devices and autonomously moveable device

Info

Publication number: DE102018207015B4
Application number: DE102018207015.6A
Authority: DE
Inventors: Karsten Berger
Original assignee: Audi AG
Current assignee: Audi AG
Priority date: 2018-05-07
Filing date: 2018-05-07
Publication date: 2023-08-24
Anticipated expiration: 2038-05-08
Also published as: DE102018207015A1; WO2019215003A1

Abstract

A method for training self-learning control algorithms (2) for autonomously movable devices (4), wherein according to the method: a virtual training space (6) with a number of boundary conditions is provided; in a training run, at least two independently implemented, self-learning control algorithms (2) each move a virtual agent (18) in the training space (6) at the same time in order to meet a target specification; and at least one of the control algorithms (2) receives a positive incentive when it meets the target specification and/or a negative incentive when it fails to meet the target specification and/or violates a boundary condition, wherein at least the negative incentive of one control algorithm (2) is fed to the at least one other control algorithm (2), and wherein, in order to train the behavior of the control algorithms in interaction with human behavior, an agent (18) is moved through the training space (6) by operating personnel in a training run.

Description

The invention relates to a method for training self-learning control algorithms for autonomously movable devices. The invention also relates to an autonomously movable device that is moved in an autonomous operating mode, in particular by means of a control algorithm of the aforementioned type.

Autonomously operable devices are already in use and are to be used increasingly in the future, for example to replace an operator of the device or at least to relieve the operator temporarily. Such an autonomously operable device can be, for example, a vehicle such as a passenger car, an aircraft, or a truck, or also a production robot. In order to operate such devices autonomously and as flexibly as possible, they are usually controlled by self-learning algorithms ("machine learning algorithms"). These algorithms are optionally trained during a learning or training phase so that they arrive at the desired results in as many different situations as possible, i.e., make decisions that the user judges to be correct. In this case, the "learning" of the algorithms is usually completed before the device is put into real operation. Optionally, however, algorithms are also used that (continue to) learn during real operation.

A particular disadvantage is that, depending on the task to be solved by the self-learning algorithms, the training, and in particular the compilation of training data, can become comparatively complex and can therefore require an enormous amount of work.

US 2015/0106308 A1 describes distributed learning for a plurality of self-learning algorithms for autonomously operable devices, in which the individual self-learning algorithms are trained in shared learning environments and can also access shared experience databases. In particular, this can reduce error rates during learning. Something comparable is also known from US 2014/0089232 A1.

The object of the invention is to improve the training of self-learning control algorithms. A further object of the invention is to enable autonomous operation of an autonomously movable device in a simple manner.

With regard to the training of self-learning control algorithms, this object is achieved according to the invention by a method having the features of claim 1. With regard to autonomous operation, the object is achieved according to the invention by a device having the features of claim 9. Further advantageous embodiments and developments of the invention, some of which are inventive in their own right, are set out in the dependent claims and the description below.

The method according to the invention serves to train self-learning control algorithms for autonomously movable devices. According to the method, a virtual training space with a number of boundary conditions is first provided. Within a training run (i.e., during such a training run), at least two independently implemented, self-learning control algorithms each move a virtual agent in this training space at the same time in order to meet a target specification; the agent preferably represents or models an autonomously movable device. At least one of the control algorithms receives a positive incentive when it meets the target specification and/or a negative incentive when it fails to meet the target specification and/or violates a boundary condition. This means that a reward (the positive incentive) or a kind of "punishment" (the negative incentive) is fed to the corresponding control algorithm. At least the negative incentive of one control algorithm is (in particular also) fed to the at least one other control algorithm.

An algorithm designed for reinforcement learning is thus preferably used as each control algorithm.

Here and in the following, the term "training space" is understood to mean in particular a virtual environment in which the respective agent can move. One or, preferably, several boundary conditions can be modeled in this training space or specified in some other way. The training space is at least a two-dimensional surface along which the respective agent can move. Preferably, however, the training space is a three-dimensional space, so that an environment (also: scenario) that is as realistic as possible can be modeled for the training run.

Here and in the following, the term "boundary condition" is understood to mean, on the one hand, a geometric feature of the environment represented in the training space and of the scenario modeled in it, for example an obstacle (hard or soft, and possibly moving on its own), a drivable path (in the case where the respective agent represents a motor vehicle), traffic signs, and the like. Depending on the modeled scenario, some boundary conditions can optionally also change; for example, an obstacle can appear temporarily, a person can move through the environment, or the like. On the other hand, the term "boundary condition" is also understood here and in the following to mean a behavioral rule for the respective agent. For example, a behavioral rule specifies that the agent must not leave the drivable path for a pathless area of the environment, that a certain type of obstacle must not be touched, that another type of obstacle may only be approached up to a specified distance, and the like.

Preferably, the training space is also designed such that the boundary conditions are provided, at least in part, to the respective agent and thus also to the respective control algorithm in the form of (virtual) sensor data, for example from a (virtual) camera, a (virtual) distance sensor, or the like. In the training run, the respective control algorithm thus preferably perceives the environment modeled in the training space by way of sensors. This means that the respective control algorithm has interfaces for the input (in particular the reception) of corresponding sensor signals, and that during the training run it receives, via these interfaces, input signals for capturing and analyzing the training space.

Because at least two fundamentally independent control algorithms each move an agent in the training space in order to solve the task set by the target specification, and because at least the negative incentive is communicated to the other control algorithm (or, in the case of more than two control algorithms, to all other control algorithms), all control algorithms can advantageously learn at least from the "mistakes" of one control algorithm. This advantageously prevents the other control algorithm from making the same mistake in a further training run, which is preferably carried out. This in turn reduces the effective learning time of all control algorithms that are trained together in the training space. Because at least two autonomous agents move in the same training space, interactions between the agents also occur that require reactions from the respective control algorithms. In particular where the agents represent motor vehicles and the training space essentially represents a road network, each control algorithm thus also learns how to behave toward other road users. Because the training takes place in a virtual environment, the duration of the entire training is largely determined by the computing and storage capacities of the computer system used to provide the training environment (in particular the training space) and to run the learning processes of the control algorithms.

In an expedient variant of the method, the positive incentive of one of the control algorithms is also fed to the at least one other control algorithm. This advantageously allows the other control algorithm to learn from the goal-directed behavior of the positively incentivized control algorithm as well.

In a further expedient variant of the method, the respective other (in particular every other) control algorithm is supplied with information about the behavior of the incentivized control algorithm that led to the positive or negative incentive. This advantageously allows the control algorithm that is informed of the (positive or negative) incentive of another control algorithm to assign that incentive to a specific behavior particularly easily. Where the agents are motor vehicles, and the control algorithms therefore serve to move a motor vehicle in their intended real operation, a control algorithm that meets the target specification is informed, for example, that ignoring a traffic sign, a right-of-way rule, a traffic light, or the like leads to a negative incentive. Conversely, a control algorithm is also informed that, for example, a particular route that avoids multiple turns leads more quickly to a particular target position and thus to a positive incentive.
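The incentive sharing described so far can be illustrated with a minimal sketch. It assumes a simple tabular value estimate per (state, action) pair, which the description does not prescribe; the names `Learner`, `distribute_incentive`, and `share_positive` are hypothetical. A negative incentive is always forwarded to the other learners together with the behavior that caused it, while a positive incentive is forwarded only when the optional variant is enabled.

```python
from collections import defaultdict


class Learner:
    """Hypothetical tabular learner: one value estimate per (state, action)."""

    def __init__(self, name, learning_rate=0.5):
        self.name = name
        self.learning_rate = learning_rate
        self.values = defaultdict(float)

    def update(self, state, action, incentive):
        # Move the estimate for this behavior toward the received incentive.
        key = (state, action)
        self.values[key] += self.learning_rate * (incentive - self.values[key])


def distribute_incentive(learners, source, state, action, incentive,
                         share_positive=False):
    """Apply an incentive to its source and forward it to the other learners.

    Negative incentives are always forwarded; positive incentives are
    forwarded only if share_positive is set (the optional variant).
    """
    source.update(state, action, incentive)
    if incentive < 0 or share_positive:
        for other in learners:
            if other is not source:
                other.update(state, action, incentive)


a, b = Learner("a"), Learner("b")
# Learner a ignores a red light and is penalized; b learns from the mistake.
distribute_incentive([a, b], a, "junction", "ignore_red_light", -1.0)
# Learner a waits and is rewarded; without share_positive, b is not informed.
distribute_incentive([a, b], a, "junction", "wait", 1.0)
```

With `share_positive=True`, the second call would also update learner `b`, which corresponds to the variant in which positive incentives are shared as well.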

In one variant of the method, the target specification is set such that each control algorithm is to move its agent to a specified target position as quickly as possible. In this case, the control algorithm that moves its agent to the target position in the shortest time, while complying with the boundary conditions (for example the behavioral rules) and without colliding with an obstacle or with the at least one other agent, receives the most positive incentive (i.e., the highest or strongest reward). A control algorithm whose agent causes an accident and therefore cannot meet the target specification accordingly receives a negative incentive. If the target specification is met at a later point in time, the corresponding control algorithm is assigned a positive incentive of "lesser" magnitude, i.e., for example, a weaker or smaller reward. Here, too, all control algorithms are preferably informed of the positive or negative incentives of the other control algorithms.
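One possible way to grade incentives by arrival order is sketched below. The concrete values (`base_reward`, `penalty`) and the rule of dividing the reward by the arrival rank are illustrative assumptions; the description only requires that the fastest compliant algorithm receives the largest reward, later arrivals a smaller one, and a failing algorithm a negative incentive.

```python
def individual_incentives(results, base_reward=10.0, penalty=-10.0):
    """Grade incentives by arrival order.

    results maps an agent name to its arrival time in seconds, or to None
    if the agent violated a boundary condition (for example, a collision)
    and therefore did not meet the target specification.
    """
    # Agents that failed receive the negative incentive.
    incentives = {name: penalty for name, t in results.items() if t is None}
    # Agents that arrived are ranked by time; earlier arrival, larger reward.
    finished = sorted((t, name) for name, t in results.items() if t is not None)
    for rank, (_, name) in enumerate(finished):
        incentives[name] = base_reward / (rank + 1)
    return incentives


run_result = individual_incentives({"a": 42.0, "b": 55.0, "c": None})
```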

In a particularly expedient variant of the method, the target specification is directed at the collective success of all control algorithms. In particular, the target specification includes (as the concrete goal) that all agents reach a target position as quickly as possible. Advantageously, what is reduced in the training run, preferably over several successive training runs, is then not the time a single agent needs to reach its target (which does not necessarily lead to a reduction for all control algorithms), but the average time of all autonomously moved agents trained according to this variant. In this case, the control algorithm that meets the target specification first, possibly well ahead of the other control algorithm or algorithms, and that in doing so may even (at least temporarily) hinder the agents of other control algorithms from meeting the target specification, does not receive the greatest reward but rather a lesser one, possibly even a negative one.
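A collective target specification of this kind could be scored as in the following sketch. The scoring formula, the `time_budget` parameter, and the notion of measuring how long an agent held up others (`blocked_seconds`) are assumptions made for illustration only; the description does not define how hindrance is quantified.

```python
def collective_incentives(arrival_times, blocked_seconds, time_budget=100.0):
    """Score all agents by the average arrival time of the group.

    arrival_times maps an agent name to the seconds it needed to reach its
    target. blocked_seconds maps an agent name to the seconds it held up
    other agents (hypothetical measure; missing entries count as zero).
    """
    mean_time = sum(arrival_times.values()) / len(arrival_times)
    # Every agent shares the same base incentive: the lower the average
    # time of the whole group, the higher the incentive.
    shared = time_budget - mean_time
    # An agent that hindered others has its incentive reduced accordingly.
    return {name: shared - blocked_seconds.get(name, 0.0)
            for name in arrival_times}


scores = collective_incentives({"a": 20.0, "b": 80.0}, {"a": 30.0})
```

In this example, agent `a` arrives first but blocked agent `b` along the way, so `a` ends up with the smaller incentive even though its own arrival time was shorter.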

In principle, because the respective control algorithms are trained in the virtual training space as described above, the movement behavior learned in this way is independent of human behavior patterns. In other words, the respective control algorithms are not shaped in advance by human behavior, at least given a correspondingly "unbiased" "basic programming", so that they can be expected to develop previously unknown or unexpected behavior patterns (in particular new strategies for meeting the target specification). In particular where the target specification is directed at reaching the target position jointly or collectively as quickly as possible, the respective control algorithm can thus potentially surpass human behavior and, particularly in the field of autonomously movable motor vehicles, enable improved traffic flow. In certain situations, for example, it can make sense for one agent to move somewhat more slowly and give another agent priority if this minimizes the total time all machines (agents) need to reach the target position. A human operator, by contrast, can often hardly foresee what effect his or her own movement has on the total time of all agents.

In a further variant of the method, the target specification, which is optionally directed at collective success, is additionally directed at an individual target position for each control algorithm. In other words, each control algorithm is given its own target position to reach. In particular where the target specification is directed at collective success, the agents therefore do not move toward the same target in a convoy or as a fleet (at least in the case of motor vehicles). In this case, it preferably happens more often that the individual control algorithms have to decide whether an agent that, for example, crosses their own movement route should be given priority, even though their own agent may have the right of way (for example when changing lanes on a motorway).

In one variant of the method, the target specification described above, which contains the individual target position, is supplied to each control algorithm as an individual target specification. Each control algorithm therefore knows only its own target position, not that of the other control algorithm or algorithms. This corresponds to a typical situation, particularly in road traffic, so that realistic training is possible.

In an expedient variant of the method, the control algorithms are designed to communicate at least their individual target positions to one another. For this purpose, the respective control algorithms instruct their agents, for example, to use a (in particular virtual) vehicle-to-vehicle communication system to transmit their own target position. For example, the communication takes place only if the corresponding movement routes of the agents intersect, since it is in this case in particular that mutual knowledge of the target position, or at least of the movement route, is advantageous. In particular, this communication can support the decision of an individual control algorithm to give priority to another control algorithm in the interest of collective success.
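The conditional exchange of target positions can be sketched as follows. The sketch assumes that routes are given as lists of grid cells and that two routes "intersect" when they share at least one cell; both the representation and the function names are hypothetical.

```python
def routes_intersect(route_a, route_b):
    """Treat two routes as intersecting if they share at least one cell."""
    return not set(route_a).isdisjoint(route_b)


def exchange_targets(agents):
    """Exchange target positions only between agents whose routes intersect.

    agents maps a name to {"route": [cells], "target": cell}. The result
    maps each name to the targets of other agents it has been told about.
    """
    known = {name: {} for name in agents}
    names = list(agents)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if routes_intersect(agents[first]["route"],
                                agents[second]["route"]):
                known[first][second] = agents[second]["target"]
                known[second][first] = agents[first]["target"]
    return known


known_targets = exchange_targets({
    "a": {"route": [(0, 0), (1, 0), (2, 0)], "target": (2, 0)},
    "b": {"route": [(1, 1), (1, 0), (1, -1)], "target": (1, -1)},
    "c": {"route": [(5, 5), (5, 6)], "target": (5, 6)},
})
```

Here agents `a` and `b` cross at cell `(1, 0)` and learn each other's targets, while agent `c` receives nothing because its route touches no other.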

In a preferred embodiment, the number of boundary conditions is increased, preferably automatically, when a predetermined criterion is met. Increasing the number of boundary conditions, for example by increasing the number and type of obstacles, extending obstacles into the third dimension (in particular starting from obstacles that are mapped only in two dimensions), or adding route crossings, alternative routes and the like, preferably increases the complexity of the environment (scenario) represented in the training room, preferably until that complexity corresponds to reality. The criterion used is, for example, whether the time that the control algorithms need to meet the target specification (individually or collectively, and in particular without errors) remains at a preferably sufficiently low level over at least two consecutive training runs. If the complexity of the virtual environment approaches or corresponds to that of the real environment, it can be assumed that, in particular, an error rate achieved in the training room can also be achieved in the real environment. The increase in the boundary conditions is automated, for example, by means of an algorithm that checks whether the criterion is met and then increases the boundary conditions by a specified number, for example by loading a more complex scenario for the next training run.
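
One possible form of the automated check described above is sketched below. The function names, the numeric `threshold` for a "sufficiently low" completion time, and the integer scenario levels are assumptions made for illustration; the patent leaves these details open.

```python
from typing import Sequence


def criterion_met(run_times: Sequence[float], threshold: float,
                  window: int = 2) -> bool:
    """True if the last `window` consecutive runs all finished within `threshold`."""
    if len(run_times) < window:
        return False
    return all(t <= threshold for t in run_times[-window:])


def next_scenario_level(level: int, run_times: Sequence[float],
                        threshold: float, step: int = 1,
                        max_level: int = 5) -> int:
    """Return the scenario level to load for the next training run.

    The level is raised by `step` when the criterion is met, and is capped at
    `max_level`, which stands in for the complexity of the real environment.
    """
    if criterion_met(run_times, threshold):
        return min(level + step, max_level)
    return level
```

The default `window` of 2 corresponds to the "at least two consecutive training runs" named in the text.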

Expediently, this error rate, in particular that of all control algorithms, is also determined in order to monitor the training. The error rate is, for example, the number of boundary-condition violations by one and the same control algorithm over several training runs or, for example where the target specification is directed at the collective result, the number of errors of all control algorithms per training run. The training is preferably ended when a predetermined termination criterion, in particular a predetermined target error rate, is reached.
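
The two error-rate definitions and the termination criterion can be illustrated with a short sketch. The log format (one list of algorithm names per training run, one entry per violation) and all function names are assumptions chosen for the example.

```python
from collections import Counter
from typing import Sequence

# Hypothetical log format: one list per training run, containing the name of
# the responsible control algorithm once for every boundary-condition violation.
RunLog = Sequence[Sequence[str]]


def violations_per_algorithm(runs: RunLog) -> Counter:
    """Number of violations by each individual algorithm over all runs."""
    counts: Counter = Counter()
    for run in runs:
        counts.update(run)
    return counts


def collective_errors_per_run(runs: RunLog) -> float:
    """Average number of errors of all algorithms per training run."""
    if not runs:
        raise ValueError("at least one training run is required")
    return sum(len(run) for run in runs) / len(runs)


def should_stop(runs: RunLog, target_error_rate: float) -> bool:
    """Termination criterion: the collective rate has reached the target."""
    return collective_errors_per_run(runs) <= target_error_rate
```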

In a further optional variant, a further boundary condition specifies that priority is to be given to emergency services (e.g. ambulance, police, fire brigade, etc.), or at least to an agent that communicates an urgency (e.g. an emergency) to the respective other agents. Optionally, an individual target specification provides that the individual control algorithm represents an emergency responder and/or must reach its target position under urgency (e.g. an injured person must be taken to a hospital as quickly as possible).
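
As a rough illustration, the priority rule could be expressed as a predicate over two agents. The `AgentState` fields and the `must_yield` function are hypothetical names. The case in which both agents are privileged is not addressed in the text; the sketch simply assumes that neither has to yield in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    name: str
    is_emergency_service: bool = False  # e.g. ambulance, police, fire brigade
    urgent: bool = False                # urgency communicated to other agents


def is_privileged(agent: AgentState) -> bool:
    return agent.is_emergency_service or agent.urgent


def must_yield(me: AgentState, other: AgentState) -> bool:
    """True if `me` has to grant priority to `other` under the boundary condition.

    Assumption: when both agents are privileged, neither is forced to yield.
    """
    return is_privileged(other) and not is_privileged(me)
```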

In order to be able to verify (in particular test for freedom from errors) and/or train the behavior of the control algorithms in interaction with human behavior, in at least one training run an agent is moved through the training room by operating personnel, i.e. in particular by a human operator. This variant of the method is advantageous because human operators, as described above, potentially behave differently from an autonomous control algorithm.

A training system is also specified as an independent invention. It has a computer system that is set up to provide the virtual training room and to provide computing capacity for the respective control algorithm. The training system is thus preferably set up, by means of the computer system, to carry out the method described above, which is implemented in particular in the form of a training program on the computer system, in order to train the control algorithms, in particular automatically and where appropriate with interaction with operating personnel.

As yet another independent invention, a control algorithm that has been trained by means of the method described above, in particular by means of the training system, is used for autonomously moving an autonomously movable device, preferably a vehicle, for example a motor vehicle or an aircraft, or optionally an (industrial) robot.

The autonomously movable device according to the invention, in particular the (motor) vehicle, has a control device (e.g. a controller with an assigned microprocessor and a data memory) on which a control algorithm trained according to the method described above is installed, in particular in executable form, for carrying out an autonomous operating mode.

Here and in the following, the conjunction "and/or" is to be understood in particular in such a way that the features linked by this conjunction can be implemented both together and as alternatives to one another.

Exemplary embodiments of the invention are explained in more detail below with reference to a drawing, in which:

Fig. 1 shows, in a schematic block diagram, a method for training self-learning control algorithms,
Fig. 2 shows, in a schematic view, a virtual training room for the self-learning control algorithms, and
Fig. 3 shows a motor vehicle on which one of the control algorithms is installed.

Corresponding parts and quantities are always provided with the same reference signs in all figures.

Fig. 1 schematically shows a method for training self-learning control algorithms 2 for autonomously movable devices, each of which is formed in the present exemplary embodiment by a passenger car, vehicle 4 for short (see Fig. 3). The method is carried out by a training system, not shown in detail, which is specifically formed by a computer system (e.g. a workstation or the like).

According to the method, a virtual training room 6 (see Fig. 2) is provided in a first method step S1. In this exemplary embodiment, drivable paths (here: roads 8), an immovable obstacle 10 that cannot be driven over, a pedestrian crossing 12, a pedestrian 14 and a traffic sign 16 are mapped in this training room 6 as boundary conditions. As further boundary conditions, the training room 6 (specifically the scenario mapped in it) contains rules of conduct for the control algorithms 2.

In a second method step S2, each control algorithm 2 is supplied with a target specification, which the respective control algorithm 2 must meet by moving a virtual agent 18, here a virtual vehicle, through the training room 6 to a target position 20 in the course of a training run (i.e. during such a training run). In the present exemplary embodiment, two control algorithms 2 each receive their own individual target specification, which contains the (possibly individually different) target position 20 and is additionally directed at the agents 18 moved by these control algorithms 2 jointly reaching the target position 20 as quickly as possible (i.e. collective success of all control algorithms 2). A behavior is thus to be trained that reduces the time jointly required to reach the target position 20. The two agents 18 are moved through the training room 6 at the same time.

In a subsequent method step S3, it is checked whether each control algorithm 2 has met the target specification or whether one of the control algorithms 2 has failed to meet it, specifically by violating a boundary condition. The corresponding control algorithm 2 receives a positive incentive if the target specification is met and/or a negative incentive if it is not. The positive or negative incentive, together with the behavior of that control algorithm 2 that led to it, is supplied to the respective other control algorithm 2, so that the latter can include this behavior in its learning process.
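
The sharing of incentives in step S3 can be pictured with the following sketch. The `Experience` and `Learner` classes, the string encoding of a behavior, and the incentive values of +1.0 and -1.0 are illustrative assumptions. How a real self-learning algorithm would consume the shared experience is outside the scope of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    source: str       # name of the algorithm whose behavior was evaluated
    behavior: str     # hypothetical description of that behavior
    incentive: float  # +1.0 for a met target, -1.0 for a violation


@dataclass
class Learner:
    name: str
    memory: list[Experience] = field(default_factory=list)


def incentivize(learners: list[Learner], actor: Learner, behavior: str,
                target_met: bool, share_positive: bool = True) -> Experience:
    """Give `actor` its incentive and forward it to the other learners.

    Negative incentives are always forwarded; positive ones only when
    `share_positive` is set.
    """
    exp = Experience(actor.name, behavior, 1.0 if target_met else -1.0)
    actor.memory.append(exp)
    if not target_met or share_positive:
        for other in learners:
            if other is not actor:
                other.memory.append(exp)
    return exp
```

Forwarding the behavior together with the incentive lets the receiving learner associate the outcome with the action that caused it, rather than seeing only a bare reward value.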

In a subsequent method step S4, an error rate of the control algorithms 2 is determined. As long as a limit value of the error rate (i.e. a target error rate) has not yet been reached, specifically undershot, the training run is repeated, specifically by jumping back to method step S2. If the error rate falls below the limit value, the complexity of the training room 6 is increased in an iteration in method step S1, specifically by increasing the number of boundary conditions.

This iteration is continued until the complexity of the environment provided by the training room 6 equals the complexity of the real environment and the error rate falls below the limit value. In this exemplary embodiment, the training is then ended and the respective control algorithm 2 is ready for use.
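
The overall flow of steps S1 to S4 can be summarized in a short loop. This is a sketch under several assumptions: `run_episode` is a hypothetical callback that performs one training run at a given complexity level (S2 and S3) and returns the number of errors, complexity is modelled as an integer level, and the error rate is averaged over `runs_per_check` runs.

```python
from typing import Callable


def train(run_episode: Callable[[int], int], target_error_rate: float,
          real_level: int, runs_per_check: int = 10,
          max_checks: int = 1000) -> int:
    """Run the S1-S4 loop and return the complexity level reached.

    `run_episode(level)` is assumed to execute one training run and to
    return the number of errors made by all control algorithms in that run.
    """
    level = 0  # S1: initial training room with few boundary conditions
    for _ in range(max_checks):
        # S2 and S3: perform training runs, including incentives
        errors = sum(run_episode(level) for _ in range(runs_per_check))
        rate = errors / runs_per_check  # S4: determine the error rate
        if rate >= target_error_rate:
            continue  # limit not yet undershot: repeat from S2
        if level >= real_level:
            return level  # real-world complexity reached: training ends
        level += 1  # back to S1 with additional boundary conditions
    raise RuntimeError("target error rate not reached within max_checks")
```

The `max_checks` bound is not part of the described method; it only keeps the sketch from looping forever when the callback never improves.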

In an exemplary embodiment that is not shown in detail, at least one further iteration takes place in this case, in which at least one agent 18 moved by a (human) operator is additionally located in the training room 6. In this way, the interaction of the agents 18 moved by the control algorithms 2 with human-influenced behavior is checked and, if necessary, trained.

In a further exemplary embodiment that is not shown in detail, the control algorithms 2 are set up to communicate their individual target positions 20 to one another in method step S2, i.e. while they move their assigned agents 18 through the training room 6. For this purpose, the control algorithms 2 use a communication system implemented in the respective virtual agent 18.

Fig. 3 shows the vehicle 4, which can be moved autonomously by means of the control algorithm 2 trained in the manner described above. For this purpose, the vehicle 4 has a control device, here a vehicle controller 22. The vehicle controller 22 is specifically formed by a microprocessor with an assigned memory. The control algorithm 2 is installed in executable form in the memory, so that, when executed by the microprocessor, it automatically carries out a control method for moving the vehicle 4. The vehicle 4 also has a sensor system for generating sensor signals that are representative of the surroundings of the vehicle 4. The sensor system is indicated here by way of example by a camera system 24. The sensor data are fed via a corresponding interface 26 to the control algorithm 2, which evaluates the sensor data in autonomous operation and makes its control decisions on that basis (and in accordance with its trained behavior). The control algorithm 2 passes the control decisions on, in the form of corresponding instructions A, to various subcontrollers 28, e.g. a steering controller, brake controller, motor controller and the like. The subcontrollers 28 then convert these instructions A into corresponding control signals.
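
The data flow from sensor data through the control algorithm to the subcontrollers can be sketched as follows. All class names, the dictionary-based sensor data, the string form of the control signals, and the trivial braking policy are assumptions made for illustration; a trained control algorithm would take the place of `simple_policy`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Instruction:
    """Stands in for an instruction A sent to one subcontroller."""
    target: str   # e.g. "steering", "brake", "motor"
    value: float


class SubController:
    def __init__(self, name: str) -> None:
        self.name = name
        self.signals: list[str] = []  # control signals emitted so far

    def apply(self, instruction: Instruction) -> None:
        # Convert the instruction into a (here purely textual) control signal.
        self.signals.append(f"{self.name}:{instruction.value:.2f}")


class VehicleController:
    def __init__(self, policy: Callable[[dict], list[Instruction]],
                 subcontrollers: dict[str, SubController]) -> None:
        self.policy = policy
        self.subcontrollers = subcontrollers

    def step(self, sensor_data: dict) -> None:
        """Evaluate the sensor data and dispatch the resulting instructions."""
        for instruction in self.policy(sensor_data):
            self.subcontrollers[instruction.target].apply(instruction)


def simple_policy(sensor_data: dict) -> list[Instruction]:
    """Placeholder policy: full braking when an obstacle is closer than 10 m."""
    if sensor_data.get("obstacle_distance_m", float("inf")) < 10.0:
        return [Instruction("brake", 1.0)]
    return [Instruction("motor", 0.3)]
```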

The subject matter of the invention is not limited to the exemplary embodiments described above. Rather, further embodiments of the invention can be derived by the person skilled in the art from the above description. In particular, the individual features of the invention described with reference to the various exemplary embodiments, and their design variants, can also be combined with one another in other ways.

Claims

Method for training self-learning control algorithms (2) for autonomously movable devices (4), wherein according to the method - a virtual training room (6) with a number of boundary conditions is provided, - in a training run, at least two independently implemented and self-learning control algorithms (2) each move a virtual agent (18) in the training room (6) at the same time in order to meet a target specification, - a positive incentive and/or a negative incentive is given to at least one of the control algorithms (2) when the target specification is met or, respectively, when it is not met and/or a boundary condition is violated by the respective control algorithm (2), wherein at least the negative incentive given to one control algorithm (2) is supplied to the at least one other control algorithm (2), and wherein, in order to train the behavior of the control algorithms in interaction with human behavior, an agent (18) is moved through the training room (6) by operating personnel in a training run.

Method according to claim 1, wherein the positive incentive given to one control algorithm (2) is supplied to the at least one other control algorithm (2).

Method according to claim 1 or 2, wherein each control algorithm (2) is supplied with information about the behavior of the respective other control algorithm (2) that led to the positive or negative incentive.

Method according to one of claims 1 to 3, wherein the target specification is directed at a collective success of all control algorithms (2).

Method according to one of claims 1 to 4, wherein the target specification for each control algorithm (2) is additionally directed at an individual target position.

Method according to one of claims 1 to 5, wherein the target specification is supplied to each control algorithm (2) as an individual target specification.

Method according to claims 5 and 6, wherein the control algorithms (2) communicate their individual target positions to one another.

Method according to one of claims 1 to 7, wherein the number of boundary conditions is increased when a predetermined criterion is met.

Autonomously movable device (4), in particular a motor vehicle, having a control device (22) on which a control algorithm (2) trained by the method according to one of claims 1 to 8 is installed for carrying out an autonomous operating mode.