DE10021929A1

DE10021929A1 - Computerized determination of control strategy for technical system involves using reinforcement learning to determine control strategy for each state and learn optimal actions

Info

Publication number: DE10021929A1
Application number: DE2000121929
Authority: DE
Inventors: Martin Appl
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 2000-05-05
Filing date: 2000-05-05
Publication date: 2001-11-15
Also published as: WO2001086359A3; WO2001086359A2

Abstract

The method involves describing the technical system using a continuous state space and an action space. The state space contains the states that the technical system can adopt, and the action space contains the actions that are carried out to produce a state change from a previous state to a subsequent state in the state space. The state change is assessed, and a model of the technical system is determined from training data describing the system by forming fuzzy membership functions with which at least the state space is described. The fuzzy membership functions are fed to a reinforcement learning method, with which a control strategy is determined for each state in the state space and the optimal actions in the action space are learned. Independent claims are also included for a fuzzy controller for determining a control strategy for a technical system and for a computer-readable storage medium.

Description

The invention relates to a method and a fuzzy control device for determining a control strategy for a technical system, as well as to a computer-readable storage medium and a computer program element.

Such a method and such a fuzzy control device are known from [1] and [3].

In the method known from [3], a technical system that is to be described and controlled, and that is originally described by a continuous state space and a continuous action space, is discretized.

On the basis of the discretized state space and the discretized action space, the reinforcement learning method is carried out according to the principle of so-called "prioritized sweeping".

This known approach has the particular disadvantage that either a very fine partitioning of the continuous space is required, which makes the discrete problem to be solved very complex and leads to a very large computing-time requirement and a considerable memory requirement when controlling a technical system.

If, however, the partitioning is coarser, the approximation of the technical system to be controlled becomes very inaccurate. This leads to a suboptimal, i.e. relatively poor, control strategy being determined by the reinforcement learning.
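By way of illustration only, the trade-off described above can be made concrete with a short calculation. The sketch assumes a uniform hard grid with the same number of intervals in every state dimension; the function names and the example figures are hypothetical and are not taken from [3].

```python
def grid_cells(bins_per_dimension: int, dimensions: int) -> int:
    """Number of cells in a uniform hard partition of a continuous space."""
    return bins_per_dimension ** dimensions


def q_table_entries(bins_per_dimension: int, state_dims: int, num_actions: int) -> int:
    """One Q-value per (cell, action) pair of the discretized problem."""
    return grid_cells(bins_per_dimension, state_dims) * num_actions


if __name__ == "__main__":
    # Hypothetical setting: four state variables, three selectable actions.
    for bins in (5, 20, 100):
        entries = q_table_entries(bins, state_dims=4, num_actions=3)
        print(f"{bins:>3} intervals per dimension -> {entries} Q-values")
```

The table size grows with the power of the number of state dimensions, which is why a fine hard partition quickly becomes expensive in both computing time and memory, while a coarse one loses accuracy.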

To improve the achievable approximation accuracy, it is known from [4] to use an interpolation strategy, which essentially corresponds to the use of a so-called Takagi-Sugeno system, described in [1], with constant consequents in the rules.

In the method known from [4], however, a hard partitioning of the state space and the action space is used to train the values at the centers of the interpolation scheme, so the disadvantages described above arise again.

Furthermore, it is known from [2] to form fuzzy partitions by means of a fuzzy C-means clustering method.

The invention is therefore based on the problem of specifying a control strategy for a technical system using a reinforcement learning method, in which a control strategy is determined that is improved compared with the method known from [3].

The problem is solved by the method, by the fuzzy control device for the computer-aided determination of a control strategy for a technical system, by the computer-readable storage medium, and by a computer program element having the features of the independent claims.

In a method for the computer-aided determination of a control strategy for a technical system, the technical system is described by a continuous state space and a continuous or discretized action space. The state space contains the states that the technical system can in principle adopt. The action space contains the actions that are carried out to produce a state transition from a predecessor state of the state space to a successor state of the state space. Using training data that describe the technical system, a model of the technical system is determined and grouped into fuzzy partitions by forming fuzzy membership functions for the fuzzy partitions, with which at least the state space is described.
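By way of illustration only, the following sketch shows one common way of describing a one-dimensional continuous state variable by fuzzy membership functions. Triangular functions centered at given partition centers are an assumption made here for simplicity; the description above does not prescribe this shape, and the function name and example values are hypothetical.

```python
from bisect import bisect_right


def triangular_memberships(x, centers):
    """Membership degrees of x in fuzzy partitions with sorted `centers`.

    Triangular functions are assumed: each state belongs to at most two
    neighboring partitions, and the degrees always sum to 1.
    """
    mu = [0.0] * len(centers)
    if x <= centers[0]:
        mu[0] = 1.0
        return mu
    if x >= centers[-1]:
        mu[-1] = 1.0
        return mu
    right = bisect_right(centers, x)  # index of the first center greater than x
    left = right - 1
    weight = (x - centers[left]) / (centers[right] - centers[left])
    mu[left] = 1.0 - weight
    mu[right] = weight
    return mu


if __name__ == "__main__":
    # Hypothetical state variable in [0, 1] with three partitions.
    print(triangular_memberships(0.25, [0.0, 0.5, 1.0]))
```

In contrast to a hard partition, a state lying between two centers contributes to both neighboring partitions in proportion to its membership degrees.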

The state transition is evaluated in such a way that optimizing with respect to the evaluations leads to an optimal control strategy.

Using the fuzzy membership functions, a reinforcement learning method is carried out at least for the state space, whereby a control strategy, i.e. one action per state, is determined for each state of the state space, and all state-action pairs are evaluated. The technical system is controlled, taking the control strategy into account, by means of control variables that are, for example, selected or formed depending on the control strategy.

The individual partitions are also referred to below as clusters.
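By way of illustration only, the following sketch evaluates all cluster-action pairs and derives one action per cluster. It assumes that transition probabilities `P[i][a][j]` and evaluations `R[i][a][j]` between clusters are already available, and it uses plain Q-value iteration with a discount factor as a stand-in; the description above does not fix the particular reinforcement learning algorithm, and all names and numbers are hypothetical.

```python
def q_iteration(P, R, gamma=0.9, sweeps=200):
    """Q-value iteration on a cluster-level model.

    P[i][a][j]: probability of moving from cluster i to cluster j under action a.
    R[i][a][j]: evaluation (reward) of that transition.
    """
    n_clusters, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_clusters)]
    for _ in range(sweeps):
        V = [max(row) for row in Q]  # best value per cluster
        Q = [[sum(P[i][a][j] * (R[i][a][j] + gamma * V[j])
                  for j in range(n_clusters))
              for a in range(n_actions)]
             for i in range(n_clusters)]
    return Q


def greedy_actions(Q):
    """Control strategy: the best-rated action for each cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


def q_for_state(memberships, Q, action):
    """Q-value of a continuous state, interpolated by its membership degrees."""
    return sum(mu * Q[i][action] for i, mu in enumerate(memberships))


if __name__ == "__main__":
    # Hypothetical model: action 0 keeps the cluster, action 1 switches it;
    # arriving in cluster 1 is evaluated with 1, everything else with 0.
    P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
    R = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
    Q = q_iteration(P, R)
    print(greedy_actions(Q), q_for_state([0.5, 0.5], Q, action=0))
```

The interpolation in `q_for_state` is what lets a small number of clusters cover a continuous state space: a state between two cluster centers receives a weighted mixture of their Q-values.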

In particular, the invention improves the approximation accuracy and thereby considerably accelerates the determination of the control variables, i.e. the determination can be carried out with a reduced computing-time requirement.

Furthermore, the determined control strategy is considerably improved over the prior art with respect to the gain used as the optimization variable within the reinforcement learning method.

The number of partitions required to approximate the technical system, and in particular the number of partition centers used to describe the partitions, is also considerably reduced.

Because fewer partition centers are needed, the Q-function can be computed faster and with higher accuracy within the reinforcement learning method.

Preferred developments of the invention emerge from the dependent claims.

For each state of the state space and the corresponding actions of the action space, a Q-value can be determined as the control strategy, i.e. as the evaluation of executing the action in that state.

Furthermore, linear terms can be used in the conclusions of the fuzzy rules of the fuzzy system that is formed according to the reinforcement learning method and approximates the Q-function.

This makes it possible to determine the control strategy quickly and yet exactly, in a way that reproduces the model very accurately.
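By way of illustration only, the following sketch shows the output of a Takagi-Sugeno fuzzy system whose rule conclusions are linear terms, as it might be used to approximate the Q-function for one fixed action. The normalized weighted sum is the standard Takagi-Sugeno inference; the parameter names and example values are hypothetical. Setting all slopes to zero yields the special case of constant consequents mentioned for [4].

```python
def takagi_sugeno(x, memberships, slopes, offsets):
    """Output of a Takagi-Sugeno system with linear rule conclusions.

    Rule i contributes (slopes[i] . x + offsets[i]), weighted by the
    membership degree memberships[i] of the state x in fuzzy partition i.
    """
    weighted = 0.0
    for mu, slope, offset in zip(memberships, slopes, offsets):
        conclusion = sum(w * xi for w, xi in zip(slope, x)) + offset
        weighted += mu * conclusion
    return weighted / sum(memberships)


if __name__ == "__main__":
    # Hypothetical one-dimensional state with two rules.
    print(takagi_sugeno([2.0], [0.5, 0.5], [[1.0], [3.0]], [0.0, 1.0]))
```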

According to a further embodiment of the invention, the reinforcement learning method is carried out by executing in particular those actions, i.e. those experiments, that satisfy a predefined criterion.

This enables an optimized selection of actions, i.e. experiments, which minimizes the number of experiments needed and thus accelerates learning.

The criterion can be an expected information gain about the conditional state transition probabilities within the reinforcement learning method.

A future gain can be estimated from each information gain. In particular, only, or essentially only, those actions can be selected and carried out whose directly or indirectly expected information gain is better than a predefinable minimum information gain.
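By way of illustration only, the selection rule can be sketched as a simple filter. The sketch assumes that an estimate of the expected information gain per action is already available as a number; how that estimate is computed is not specified here, and the function name, action labels and values are hypothetical.

```python
def select_experiment(expected_gain, min_gain):
    """Pick the action with the largest expected information gain.

    expected_gain: mapping from action to its estimated information gain
                   (assumed to be given).
    min_gain:      predefinable minimum information gain.
    Returns None if no action exceeds the minimum, i.e. no experiment is
    considered worthwhile.
    """
    candidates = {a: g for a, g in expected_gain.items() if g > min_gain}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    gains = {"plan_1": 0.02, "plan_2": 0.30, "plan_3": 0.11}
    print(select_experiment(gains, min_gain=0.1))
```

Returning `None` when nothing exceeds the threshold leaves the caller free to fall back on the action that is currently rated best.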

The invention can be used advantageously in general for controlling a traffic system as the technical system, in particular for controlling traffic lights within a traffic network by selecting a frame signal plan. For example, a frame signal plan can be selected on the basis of the control strategy, and on the basis of the selected frame signal plan corresponding control signals can be transmitted to the traffic lights of a traffic network, which drive the traffic lights according to the selected frame signal plan.

For each fuzzy partition in the state space and in the action space, an information gain can be determined that resulted from earlier executions of actions belonging to the corresponding fuzzy partition in the corresponding states.

According to a further embodiment of the invention, counters are provided that indicate, up to the current iteration, the number of executions of actions in a state of the technical system and the number of state transitions from an initial state, i.e. a predecessor state, to a successor state as a result of the action. When a new state transition is determined, the values assigned to the counters are updated depending on the degree of membership of the states or state transitions in the respective fuzzy clusters.

The state transition probabilities can be determined from the counters within the reinforcement learning method.
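By way of illustration only, the counters and the probability estimate derived from them can be sketched as follows. The specific update rule is an assumption: each observed transition increments the execution counter of cluster i by its membership degree and the transition counter from cluster i to cluster j by the product of the two membership degrees. The class and method names are hypothetical.

```python
class FuzzyTransitionCounters:
    """Membership-weighted counters for actions and cluster transitions."""

    def __init__(self, n_clusters, n_actions):
        # executions[i][a]: weighted number of executions of action a in cluster i
        self.executions = [[0.0] * n_actions for _ in range(n_clusters)]
        # transitions[i][a][j]: weighted number of transitions i -> j under a
        self.transitions = [[[0.0] * n_clusters for _ in range(n_actions)]
                            for _ in range(n_clusters)]

    def update(self, mu_prev, action, mu_next):
        """Record one observed transition, given the membership degrees of
        the predecessor state (mu_prev) and the successor state (mu_next)."""
        for i, m_i in enumerate(mu_prev):
            self.executions[i][action] += m_i
            for j, m_j in enumerate(mu_next):
                self.transitions[i][action][j] += m_i * m_j  # assumed product rule

    def probability(self, i, action, j):
        """Estimated probability of reaching cluster j from cluster i."""
        n = self.executions[i][action]
        return self.transitions[i][action][j] / n if n > 0 else 0.0


if __name__ == "__main__":
    counters = FuzzyTransitionCounters(n_clusters=2, n_actions=1)
    counters.update([1.0, 0.0], 0, [0.25, 0.75])
    counters.update([0.5, 0.5], 0, [1.0, 0.0])
    print(counters.probability(0, 0, 1))
```

If the membership degrees of each state sum to 1, the estimated probabilities for a given cluster and action also sum to 1 under this update rule.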

Furthermore, according to a further development of the invention, fuzzy partitions are formed at the beginning of the method in an iterative procedure: starting from a predefined set of initial partition subsets, these are either split into several fuzzy partition subsets, or several fuzzy partition subsets are merged into one fuzzy partition subset, depending on the training data obtained.

Alternatively, the fuzzy partitions can be formed at the beginning of the method according to the fuzzy C-means clustering method.
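By way of illustration only, one iteration of the standard fuzzy C-means scheme can be sketched as follows: membership degrees are computed from the distances to the current centers, and the centers are then moved to the membership-weighted mean of the training data. The fuzzifier `m = 2` and the example data are assumptions; the function names are hypothetical and the sketch is not the specific variant of [2].

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy C-means membership degrees of one point."""
    dists = [math.dist(point, c) for c in centers]
    for i, d in enumerate(dists):
        if d == 0.0:  # point coincides with a center: full membership there
            return [1.0 if k == i else 0.0 for k in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]


def fcm_update_centers(points, centers, m=2.0):
    """Move each center to the mean of the points weighted by membership**m."""
    U = [fcm_memberships(p, centers, m) for p in points]
    dims = len(points[0])
    new_centers = []
    for i in range(len(centers)):
        weights = [u[i] ** m for u in U]
        total = sum(weights)  # assumed non-zero for the data used here
        new_centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                            for d in range(dims)])
    return new_centers


if __name__ == "__main__":
    data = [[0.0], [1.0], [9.0], [10.0]]
    centers = [[0.0], [10.0]]
    for _ in range(10):  # a few iterations are enough for this toy data
        centers = fcm_update_centers(data, centers)
    print(centers)
```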

Put plainly, the invention can be seen in the fact that, to control a technical system, the system description is discretized by means of fuzzy partitions and corresponding fuzzy membership functions, and a control strategy for controlling the technical system is determined in the discretized model using reinforcement learning.

A fuzzy control device has a processor that is set up such that the method steps described above can be carried out.

A computer-readable storage medium stores a program that, when executed, carries out the method steps of the method described above. Likewise, a computer program element, when executed by a processor, carries out the method steps of the method described above.

The invention can be implemented both as a computer program, i.e. in software, and by means of a special electronic circuit, i.e. in hardware.

Exemplary embodiments of the invention are shown in the figures and are explained in more detail below.

The figures show:

Fig. 1 a flow diagram showing the individual method steps of the method according to an exemplary embodiment of the invention;

Fig. 2 a sketch of a traffic network, on the basis of which an exemplary embodiment of the invention is explained;

Fig. 3 a sketch of a central control computer that is coupled to individual sensors in the traffic network;

Figs. 4a to 4d a plurality of signal images according to different frame signal plans for different intersections of the traffic network of Fig. 2;

Fig. 5 a sketch of a frame signal;

Fig. 6 an illustration of fuzzy partitions and their membership functions;

Figs. 7a and 7b illustrations of different clusters.

Fig. 2 shows a traffic network 200, on the basis of which the training and the traffic-dependent selection of a frame signal plan from a plurality of stored frame signal plans are explained below.

The traffic network 200 has a first road 201 that leads from a residential area 202 to a commercial area 203. The residential area 202 is located in the west of a city 204, and the commercial area 203 is located in the east of the city 204.

A second road 205 leads from a first shopping area 206, located in the north of the city 204, to a second shopping area 207 with a leisure center, located in the south of the city 204.

The first road 201 and the second road 205 cross each other at a first intersection 208.

Furthermore, the traffic network 200 has a third road 209 that extends from a second intersection 210 on the first road 201 to a third intersection 211 on the second road 205. The third road 209 thus forms a diagonal connection from the first road 201 to the second road 205, the second intersection 210 lying west of the first intersection 208, i.e. closer to the residential area 202 than to the commercial area 203.

Furthermore, a fourth road 212 leads from the third intersection 211 to a fourth intersection 213, the fourth intersection 213 lying on the first road 201 east of the first intersection 208, i.e. closer to the commercial area 203 than to the residential area 202.

At each intersection, traffic lights are provided for every direction in which a vehicle can travel on the road; they control the traffic flow at the respective intersection 208, 210, 211, 213.

The traffic lights are controlled by a central control unit described below.

Furthermore, sensors 215 are provided on the roads, with which the number of vehicles driving past or over the sensor can be recorded.

Such a sensor 215 can be, for example, a conductor loop embedded in the respective road, or a light barrier or an ultrasonic sensor, each of which detects a vehicle passing the respective sensor in the predetermined direction for which the sensor 215 is provided.

Each time a vehicle passes the sensor 215, the sensor 215 transmits a detection signal to a central computer 301 described below.

Alternatively, a counter can be provided in the sensor 215 that is incremented for each vehicle passing the sensor 215 during a predetermined period of time; after the predetermined period has elapsed, the counter reading is transmitted to the central control computer 301, and the counter is then reset to a predetermined counter reading.
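By way of illustration only, the counter variant can be sketched as a small class. The class and method names are hypothetical, and the transmission to the control computer is reduced to a return value.

```python
class IntervalCounter:
    """Counts passing vehicles and is read out and reset once per interval."""

    def __init__(self, reset_value=0):
        self.reset_value = reset_value  # predetermined counter reading
        self.count = reset_value

    def vehicle_passed(self):
        """Called for each vehicle that passes the sensor."""
        self.count += 1

    def report_and_reset(self):
        """Called when the predetermined period has elapsed: return the
        reading (to be transmitted) and reset the counter."""
        reading = self.count
        self.count = self.reset_value
        return reading


if __name__ == "__main__":
    counter = IntervalCounter()
    for _ in range(3):
        counter.vehicle_passed()
    print(counter.report_and_reset(), counter.count)
```

Compared with sending one detection signal per vehicle, this variant needs only one transmission per interval.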

In the city 204, the switching, i.e. the control, of the traffic lights 214 has to meet different requirements at different times of day, because different kinds of traffic flows and different main loads occur within the traffic network 200 at different times of day.

In the morning, i.e. essentially from 6:00 a.m. to 9:30 a.m., there is predominantly commuter traffic leading from the residential area 202 to the commercial area 203, the first shopping area 206 and the second shopping area 207.

Later in the morning, i.e. essentially from 9:30 a.m. to 12:00 noon, the main traffic direction is from the residential area 202 to the first shopping area 206 and the second shopping area 207; this traffic flow corresponds to shopping traffic of the residents of the city 204.

In the afternoon, i.e. essentially from 12:00 noon to 4:00 p.m., there is commuter traffic again in addition to the shopping traffic, mainly directed from the commercial area 203 to the residential area 202.

In the evening, i.e. essentially from 4:00 p.m. to 9:00 p.m., the main traffic is between the residential area 202 and the leisure center in the second shopping area 207.

According to this exemplary embodiment, the sensors 215 record the sensor occupancy B, defined as the time during which the sensor 215 is occupied in relation to the period during which the occupancy is recorded. The sensor occupancy B can be determined, for example, using an induction loop as the sensor 215. Alternatively, for example when a traffic parameter is recorded by means of a visual sensor, the traffic density ρ can be measured. The occupancy B, which is usually similar to the traffic density ρ, is thus obtained at each sensor 215 according to the following rule:

$$B = \frac{t_b}{t} \tag{1}$$

where

- $t_b$ is the time during which the sensor is occupied, i.e. during which a vehicle is located above the sensor, and
- $t$ is the period of time during which the number m of vehicles is determined.

According to this exemplary embodiment, the mean occupancy B of the sensor 215 is determined at each sensor 215 for a period t of 15 minutes, and the mean occupancy B determined according to rule (1) is then transmitted to the central control computer 301 described below.
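By way of illustration only, the mean occupancy over one 15-minute period can be computed as the occupied time divided by the length of the period. The sketch assumes that the sensor delivers non-overlapping (start, end) intervals in seconds during which a vehicle was above it; the function name and the example intervals are hypothetical.

```python
def mean_occupancy(occupied_intervals, window_start, window_end):
    """Occupancy B: occupied time t_b divided by the observation period t.

    occupied_intervals: non-overlapping (start, end) times in seconds
                        during which the sensor was occupied (assumed input).
    Intervals are clipped to the observation window.
    """
    t = window_end - window_start
    t_b = sum(max(0.0, min(end, window_end) - max(start, window_start))
              for start, end in occupied_intervals)
    return t_b / t


if __name__ == "__main__":
    # Hypothetical readings within a 15-minute (900 s) window.
    intervals = [(0.0, 30.0), (100.0, 145.0), (890.0, 920.0)]
    print(mean_occupancy(intervals, 0.0, 900.0))
```

The last interval in the example extends past the end of the window, so only its first 10 seconds count towards the occupied time.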

Fig. 3 shows the central control computer 301, which is coupled to the sensors 215, for example via a radio link or a wired connection 302.

The control computer 301 has an input/output interface 303, a central processor unit 304 and a memory 305, which are coupled to one another via a computer bus 306.

Furthermore, a computer mouse 308 is coupled to the control computer 301 through the input/output interface 303 via a first connection 307, e.g. a cable or an infrared radio link.

A screen 310 is coupled to the input/output interface 303 via a second connection 309.

Furthermore, a keyboard 312 is coupled to the input/output interface 303 via a third connection 311.

According to this exemplary embodiment, a plurality of frame signal plans 313 is stored in the memory 305 of the control computer 301.

The plurality of frame signal plans 313 is shown in the following table, where A1, A2, B1, B2, B3, C1, C2, D1, D2, D3 each denote signal images for the first intersection 208 (B1, B2, B3), the second intersection 210 (A1, A2), the third intersection 211 (D1, D2, D3) and the fourth intersection 213 (C1, C2), as shown in Fig. 4.

According to the exemplary embodiment, three frame signal plans RSP1, RSP2, RSP3 are stored in the memory 305, as shown in the following table:

[table not reproduced]

A frame signal plan comprises a set of so-called frame signals, each of which specifies, for one traffic flow, within which time limits which states of the light signals acting on that traffic flow are permitted at the traffic lights 214.

An example frame signal is shown in Fig. 5. A period of a light signal 501 of the frame signal 500 has a request area 502 and an extension area 503.

Within this time frame, a local optimization with respect to the goals mentioned below, in particular an optimization of the traffic flow, can be carried out, for example by extending green phases or by giving priority to public transport.

Within the request area 502, green phases of the traffic light 214 can be initiated, in particular when traffic is pending, that is, when vehicles are standing at the traffic light 214 or approaching a respective traffic light 214; these green phases must be ended within the extension area 504.
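The constraint that a frame signal places on a locally chosen green phase can be sketched as follows. This is only an illustration under assumptions: the class `FrameSignal`, its field names and the numeric window limits are hypothetical, and the source gives no concrete times for the request or extension areas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameSignal:
    """One frame signal; all times are seconds from the start of the period."""
    request_start: float    # start of the request area (assumed value)
    request_end: float      # end of the request area (assumed value)
    extension_start: float  # start of the extension area (assumed value)
    extension_end: float    # end of the extension area (assumed value)

    def allows_green(self, green_start: float, green_end: float) -> bool:
        """A green phase must begin in the request area and end in the extension area."""
        starts_in_request = self.request_start <= green_start <= self.request_end
        ends_in_extension = self.extension_start <= green_end <= self.extension_end
        return starts_in_request and ends_in_extension and green_start < green_end


frame = FrameSignal(request_start=10.0, request_end=25.0,
                    extension_start=40.0, extension_end=60.0)
```

A local controller could call `frame.allows_green(start, end)` before extending a green phase, so that any local optimization stays inside the time frame set by the selected frame signal plan.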

In Fig. 4a to Fig. 4d, the arrows show the directions of travel permitted for vehicles at the respective intersection while the respective signal image is valid.

The numbers given in the table above for a respective signal image, as shown in Fig. 4a to Fig. 4d, correspond to the duration of validity of that signal image per period of the respective frame signal plan.

For example, the first frame signal plan RSP1 specifies that a first signal image 401 shown in Fig. 4a (assigned value 60) has a validity period twice as long as that of the second signal image 402 (assigned value 30).

According to the second frame signal plan RSP2 and the third frame signal plan RSP3, the first signal image 401 and the second signal image 402 each have the same validity period (the same value 45 is assigned to both signal images 401, 402).

Put simply, this means that the traffic lights 214 at the second intersection 205 are switched such that the traffic flows shown in the first signal image 401 and in the second signal image 402 are each possible with equal weighting.

For the first intersection 208, with the third signal image 403, fourth signal image 404 and fifth signal image 405 shown in Fig. 4b, the first frame signal plan RSP1 specifies that the third signal image 403 is valid twice as long per period as the fourth signal image 404 and that the fifth signal image 405 is not formed at all by the switching of the traffic light 214 at the first intersection 208 (value of third signal image 403: 60, value of fourth signal image 404: 30, value of fifth signal image 405: 0).

According to the second frame signal plan RSP2, the third signal image 403 and the fourth signal image 404 are weighted equally, and the fifth signal image 405 is not formed by the traffic light control (value of third signal image 403: 45, value of fourth signal image 404: 45, value of fifth signal image 405: 0).

According to the third frame signal plan RSP3, the switching of the traffic lights 214 at the first intersection 208 weights the fifth signal image 405 considerably more strongly than the third signal image 403 and the fourth signal image 404 (value of third signal image 403: 20, value of fourth signal image 404: 20, value of fifth signal image 405: 50).

At the third intersection 211, according to the first frame signal plan RSP1, the traffic lights 214 are switched such that the sixth signal image 406 shown in Fig. 4c is weighted half as strongly as the eighth signal image 408, that is, it has only half the validity period. The seventh signal image 407 is not generated at all according to the first frame signal plan RSP1 (value of sixth signal image 406: 30, value of seventh signal image 407: 0, value of eighth signal image 408: 60).

According to the second frame signal plan RSP2, the sixth signal image 406 and the eighth signal image 408 are weighted equally (value of sixth signal image 406: 45, value of seventh signal image 407: 0, value of eighth signal image 408: 45), and according to the third frame signal plan RSP3 the seventh signal image 407 is weighted considerably more strongly than the sixth signal image 406 and the eighth signal image 408 (value of sixth signal image 406: 15, value of seventh signal image 407: 65, value of eighth signal image 408: 10).

At the fourth intersection 213, according to the first frame signal plan RSP1, the ninth signal image 409 shown in Fig. 4d is weighted twice as strongly as the tenth signal image 410, that is, it has a validity period twice as long (value of ninth signal image 409: 60, value of tenth signal image 410: 30).

According to the second frame signal plan RSP2 and the third frame signal plan RSP3, the two signal images 409, 410 each have the same validity period per period (value of ninth signal image 409: 45, value of tenth signal image 410: 45).
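The values stated in the preceding paragraphs can be collected in a small data structure. The following sketch keys each validity duration by the reference numeral of its signal image and groups the images by the figure in which they are shown; the names `FRAME_SIGNAL_PLANS`, `SIGNAL_GROUPS` and `period_lengths` are hypothetical, and the unit of the values is not stated in the text.

```python
# Validity duration per period for each signal image (401..410),
# transcribed from the values given in the description.
FRAME_SIGNAL_PLANS = {
    "RSP1": {401: 60, 402: 30, 403: 60, 404: 30, 405: 0,
             406: 30, 407: 0, 408: 60, 409: 60, 410: 30},
    "RSP2": {401: 45, 402: 45, 403: 45, 404: 45, 405: 0,
             406: 45, 407: 0, 408: 45, 409: 45, 410: 45},
    "RSP3": {401: 45, 402: 45, 403: 20, 404: 20, 405: 50,
             406: 15, 407: 65, 408: 10, 409: 45, 410: 45},
}

# Signal images grouped by the figure that shows them (one group per intersection).
SIGNAL_GROUPS = {
    "Fig. 4a": (401, 402),
    "Fig. 4b": (403, 404, 405),
    "Fig. 4c": (406, 407, 408),
    "Fig. 4d": (409, 410),
}


def period_lengths(plan):
    """Sum the validity durations of all signal images of each group."""
    return {group: sum(plan[image] for image in images)
            for group, images in SIGNAL_GROUPS.items()}


for name, plan in FRAME_SIGNAL_PLANS.items():
    print(name, period_lengths(plan))
```

With these values, the durations within each group add up to the same total in all three plans, which is consistent with every plan distributing one common period among the signal images of an intersection.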

As can be seen from the table above, the first frame signal plan RSP1 represents a switching of the traffic lights 214 in the traffic network 200 that is optimized for rush-hour traffic.

The second frame signal plan RSP2 weights all connections in the traffic network largely evenly, so that a good connection, that is, a good traffic flow with respect to the respective requirements, is also possible between the first shopping area and the second shopping area 207.

The third frame signal plan RSP3 is optimized for the traffic between the residential area 202 and the second shopping area 207 located to the south, that is, it favors the traffic flow between the residential area 202 and the second shopping area 207.

Using the reinforcement learning method described below, with fuzzy membership functions and fuzzy partitions, the central control computer 301 determines an optimized selection of the frame signal plans so as to ensure a maximum gain. According to this exemplary embodiment, the gain used is the sum of the squared mean relative traffic densities per route l, for example ahead of an intersection. The optimized control strategy consists of selecting the frame signal plan RSP1, RSP2, RSP3 that is optimized for the determined traffic densities ρ, which are approximated by the mean occupancies B. The gain g of the reinforcement learning method used to determine this strategy is formed according to the following rule:

$$g = \sum_{l} \left( \frac{\rho_l}{\rho_{l,\max}} \right)^2 \qquad (2)$$

where

- ρ_{l,max} denotes the maximum possible traffic density and
- ρ_l denotes the mean traffic density on route l at the end of a period of 15 minutes.

Put simply, the control computer 301 thus has to learn a strategy that minimizes the sum of the gains g.

The basic idea of rule (2) is that the selection of the frame signal plans is meant to minimize the mean traffic density in the traffic network 200. Because the terms for the individual routes l are squared, a homogeneous network state with medium traffic densities on all routes l is rated better than a state with very low traffic densities on some routes l and simultaneous congestion on other routes l.
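The effect of squaring can be checked with a short numerical sketch. It assumes, as the text describes, that the gain is the sum over all routes of the squared ratio of mean density to maximum density; the function name `gain` and the density values are made up for illustration.

```python
def gain(densities, max_densities):
    """Sum of squared relative traffic densities over all routes l."""
    if len(densities) != len(max_densities):
        raise ValueError("one maximum density is needed per route")
    return sum((rho / rho_max) ** 2
               for rho, rho_max in zip(densities, max_densities))


rho_max = [100.0, 100.0]                    # assumed maximum densities of two routes
homogeneous = gain([50.0, 50.0], rho_max)   # medium density on both routes
congested = gain([10.0, 90.0], rho_max)     # one nearly empty, one congested route
```

Both network states have the same mean density, yet the homogeneous state yields about 0.5 and the congested state about 0.82. A strategy that minimizes the summed gain therefore prefers the homogeneous state.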

In the exemplary embodiments described below, for all learning methods, the relative vehicle densities averaged over a period of 90 seconds each are determined at the points in the traffic network where sensors 215 are present. They are formed according to the following rule:

[formula not reproduced]

In Fig. 2 this is represented symbolically by the individual traffic density profiles 216, 217, 218.

The relative traffic densities are distorted non-linearly according to the following rule:

[formula not reproduced]

so that a higher resolution generally results in the range of low traffic densities than in the range of high traffic densities.

In the following, a model description of the traffic network 200 and its control as a technical system is given in general form as a finite state machine with a set of continuous states and continuous actions, by which a state transition from a predecessor state to a successor state is triggered.

The action space can be either continuous or discrete.

In general, the technical system to be controlled is described according to the invention using the following components:

The technical system has a continuous state space ℵ of dimension d^ℵ.

Furthermore, the technical system has a continuous action space A of dimension d^A, or a discrete space U.

Conditional probability density functions p(y, x, a) describe the probability of a transition from a state x to a state y when the action a is carried out.

A gain g(x, a, y) in the sense of reinforcement learning describes the gain obtained when an action a is carried out in the predecessor state x and the technical system, as a result of the control, passes into a successor state y due to the action a.

The state space ℵ is grouped into fuzzy partitions with fuzzy membership functions

[formula not reproduced]

for which the following applies:

[formula not reproduced]

The fuzzy partitions are denoted by

[formula not reproduced]

and each have a fuzzy center, which is denoted by

[formula not reproduced]

Furthermore, the action space A is also grouped into fuzzy partitions with membership functions

[formula not reproduced]

for which the following applies:

[formula not reproduced]

The fuzzy partitions of the action space A are denoted by

[formula not reproduced]

and each have fuzzy centers

[formula not reproduced]

According to the invention, different options are provided for forming the fuzzy partitions of the state space ℵ.

Fuzzy partitions

[formula not reproduced]

are thus formed.

According to one alternative, fuzzy C-means clustering, as described in [2], can be carried out to form the fuzzy partitions of the state space ℵ.

According to a further alternative, the fuzzy partitions are formed in the manner shown in Fig. 6.

The relative traffic density is grouped over the interval from "0" to "1" into four partitions 601, 602, 603, 604, each of which is assigned a membership function 605, 606, 607, 608 over a predetermined interval.

A first fuzzy membership function 605 describes a very low traffic density ("very small"), a second fuzzy membership function 606 a low traffic density ("small"), a third fuzzy membership function 607 a high traffic density ("high") and a fourth fuzzy membership function 608 a very high traffic density ("very high").
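A fuzzy partition of the relative traffic density into four linguistic values can be sketched as follows. The shape of the membership functions is an assumption: the sketch uses triangular functions with evenly spaced centers that sum to one at every point, whereas the actual shapes, centers and boundaries are those of Fig. 6, which are not reproduced here. The names `LABELS`, `CENTERS` and `memberships` are hypothetical.

```python
LABELS = ("very small", "small", "high", "very high")
# Assumed fuzzy centers on [0, 1]; NOT taken from Fig. 6.
CENTERS = (0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0)


def memberships(rho_rel):
    """Return the membership degree of a relative density in each partition.

    Triangular membership functions between neighbouring centers are assumed,
    so at most two degrees are non-zero and all degrees sum to one.
    """
    if not 0.0 <= rho_rel <= 1.0:
        raise ValueError("relative density must lie in [0, 1]")
    degrees = [0.0] * len(CENTERS)
    for i in range(len(CENTERS) - 1):
        left, right = CENTERS[i], CENTERS[i + 1]
        if left <= rho_rel <= right:
            weight = (rho_rel - left) / (right - left)
            degrees[i] = 1.0 - weight      # degree of the left partition
            degrees[i + 1] = weight        # degree of the right partition
            break
    return dict(zip(LABELS, degrees))


example = memberships(0.5)
```

Under these assumptions, a relative density of 0.5 belongs in equal parts to "small" and "high", while the end points 0 and 1 belong entirely to "very small" and "very high".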

The fuzzy centers and boundaries of the individual fuzzy membership functions and fuzzy partitions shown in Fig. 6 can alternatively be determined according to the following procedure.

State transitions (x_k, u_k, x_{k+1}, g_k) of the technical system described above can be described by vectors (x_k, x_{k+1}, g_k) in a state transition space T := ℵ' × ℵ'' × ℝ, where ℵ' and ℵ'' denote the same state space ℵ.

In the following, fuzzy clusters are formed in the state transition space T on the basis of the state transitions observed during a learning phase, using training data determined from a technical system, for example by measurement or by simulation of the technical system. In this exemplary embodiment, the determined traffic densities serve as training data.

For each action u ∈ U, separate clusters, that is, fuzzy partitions, are used.
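How observed transitions might be turned into vectors and kept separate per action can be sketched as follows. The sketch assumes that a transition vector is the concatenation of predecessor state, successor state and gain, and it replaces the actual fuzzy clustering with a single running mean per action purely as a stand-in. All names (`transition_vector`, `RunningMean`, `observe`) and the sample values are hypothetical.

```python
def transition_vector(x, x_next, g):
    """Concatenate predecessor state, successor state and gain into one vector."""
    return tuple(x) + tuple(x_next) + (float(g),)


class RunningMean:
    """Incremental summary of vectors; no individual vector is stored."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim

    def update(self, v):
        self.count += 1
        # Standard incremental mean: move the mean towards v by 1/count.
        self.mean = [m + (vi - m) / self.count for m, vi in zip(self.mean, v)]


summaries = {}  # one summary per discrete action u


def observe(x, u, x_next, g):
    """Route one observed transition to the summary of its action."""
    v = transition_vector(x, x_next, g)
    summary = summaries.setdefault(u, RunningMean(len(v)))
    summary.update(v)


observe([0.2], "RSP1", [0.4], 1.0)
observe([0.4], "RSP1", [0.6], 3.0)
observe([0.1], "RSP2", [0.1], 0.5)
```

The dictionary keyed by action mirrors the use of separate clusters per action; a real implementation would hold several clusters per action instead of one mean.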

Furthermore, clustering is carried out in the state space ℵ using the states observed during the learning phase described above.

It should be noted that, in the method described below, the clustering of the states and of the state transitions is carried out incrementally, so that no state transitions have to be stored explicitly. Explicit storage would be required for fuzzy C-means clustering, which can nevertheless readily be carried out as a further alternative.

The result of the fuzzy clustering, that is, of forming the fuzzy partitions with the associated fuzzy membership functions, is directly the fuzzy partitions of the state space ℵ that are used in the reinforcement learning method described below and in the resulting control strategy.

The clusters in the state transition space serve as a compact description of the observed state transitions, from which the model, that is, the conditional state transition probabilities described above, and the gains g, as described below, can be determined.

In addition, the clusters in the state transition space are used to decide on the optional splitting and merging of clusters, described below, while the fuzzy partitions are formed in the incremental method.

The splitting and merging of a fuzzy cluster is described with reference to Fig. 7a and Fig. 7b.

In the situation shown in Fig. 7a, it is assumed that a state transition from a state

[formula not reproduced]

to a state

[formula not reproduced]

and furthermore a state transition from a state

[formula not reproduced]

to a state

[formula not reproduced]

with an identical gain of

[formula not reproduced]

is observed.

Splitting the middle cluster 701 of the three clusters 701, 702, 703 shown in Fig. 7a would make it possible, during learning, to distinguish between these two classes of state transitions in the discretized model.

In the example shown in Fig. 7b, all state transitions begin in a region of the middle cluster 701 and end in a similar final state

[formula not reproduced]

but two different classes of gains

[formula not reproduced]

and

[formula not reproduced]

are observed in the training phase. Here too, splitting the middle cluster 701 would allow these classes of state transitions to be distinguished better.

It can thus be seen that, in both cases shown in Fig. 7a and Fig. 7b, splitting the middle cluster 701 would improve the learning method and the fuzzy set of fuzzy partitions formed by it.

According to an optional extension, a corresponding procedure can be applied to merge individual fuzzy partitions, that is, clusters; merging proceeds in principle analogously to splitting the partitions.

Im weiteren werden die einzelnen Abschnitte des Verfahrens zum Bilden der Fuzzy-Partitionen, das heißt das Clustering des Zustandsraums ℵ und in dem Zustandsübergangs-Raum T, das Erhöhen der Genauigkeit der Cluster in dem Zustandsraum ℵ aufgrund der Cluster in T und schließlich das Ableiten des diskretisierten Modells aus den geclusterten Zustandsübergängen beschrieben.The following sections of the procedure to form the fuzzy partitions, that is, clustering of the state space ℵ and in the state transition space T that Increase the accuracy of the clusters in the state space ℵ because of the clusters in T and finally deriving the discretized model from the clustered State transitions described.

Das Clustern des Zustandsraums ℵ in Fuzzy-Partitionen wird verwendet zum Beschreiben einer im weiteren beschriebenen Q- Funktion im Zusammenhang mit einem Reinforcement- Lernverfahren.The clustering of state space ℵ in fuzzy partitions becomes used to describe a Q- described below Function in connection with a reinforcement Learning process.

Die Cluster werden auf inkrementelle Weise erzeugt.The clusters are created in an incremental manner.

At iteration k, each cluster c_i^ℵ is characterized by its cluster center, by a counter value M_{i,k}^ℵ that counts the states assigned to the cluster c_i^ℵ in the preceding method steps, that is, iterations, and by a diagonal matrix A_{i,k}^ℵ, referred to below as the scaling matrix, which determines the size of the cluster.

The set of all clusters in the state space ℵ at iteration k is denoted C_k^ℵ below.

Because the scaling matrix A_{i,k}^ℵ is diagonal in this exemplary embodiment (a form that is not required in general), all clusters are symmetric in every dimension. The scaling of the individual dimensions can nevertheless be varied.

At the start of the method, all clusters are initialized with the same scaling matrix Â^ℵ.

As explained in more detail below, splitting a cluster into two clusters along a dimension d reduces the size of the cluster in that dimension.

If a new state x_k is determined during the learning phase, the distance of x_k to all existing clusters is determined.

If no cluster c_i^ℵ exists for which the distance dist_k^ℵ(x_k, c_i^ℵ) of the new state x_k is smaller than a predetermined maximum distance d_max^ℵ, a new cluster is created with a new center

and a new counter initialized to the value 0

and a new scaling matrix.

The maximum distance d_max^ℵ can be specified by the user and usually depends on the initialization diagonal matrix A_{i,k}^ℵ and on the desired size of the initialized clusters.

In a further step, the cluster c_{i₀}^ℵ ∈ C_k^ℵ that is closest to the new state x_k is shifted within the state space ℵ toward x_k.

The step size of each shift is determined by the fuzzy membership function according to the following rule:

of the new state x_k in the cluster c_{i₀}^ℵ and by the number of states previously assigned to the cluster c_{i₀}^ℵ, denoted M_{i₀,k}^ℵ. This yields a new counter value M_{i₀,k+1}^ℵ and a new, updated center of the selected cluster c_{i₀}^ℵ according to the following rules:

This alternative procedure can be viewed as an incremental variant of the fuzzy c-means clustering method described in [2].

In this exemplary embodiment, a fuzzification value m = 2 is used in rule (24).
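For orientation, the membership degree used in the standard fuzzy c-means method from the general literature can be sketched as follows. This is an assumption made for illustration: it is not taken from rule (24), whose exact form may differ, and the function name `fcm_memberships` is hypothetical.

```python
def fcm_memberships(distances, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster.

    distances: distance of the point to every cluster center.
    m: fuzzifier (m = 2 in the exemplary embodiment).
    This is the textbook formula, assumed here for illustration only.
    """
    # A point that coincides with one or more centers belongs fully to them.
    zero = [d == 0.0 for d in distances]
    if any(zero):
        share = 1.0 / sum(zero)
        return [share if z else 0.0 for z in zero]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in distances]
    total = sum(inverse)
    # Memberships are normalized so that they sum to 1 over all clusters.
    return [v / total for v in inverse]


mu = fcm_memberships([1.0, 2.0])
```

With m = 2, the memberships are proportional to the inverse squared distances, so a cluster twice as far away receives a quarter of the weight.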

Alternatively, instead of shifting only the selected center, the centers of all clusters can be shifted toward the newly determined state x_k.
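The incremental clustering of the state space described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's exact rules: the distance is assumed to be a weighted Euclidean distance using the diagonal of the scaling matrix, the counter is assumed to grow by the membership degree, and the step size is assumed to be the membership divided by the new counter value. The names `Cluster`, `scaled_distance` and `observe_state` are hypothetical, and the membership is passed in as a positive number rather than computed.

```python
import math


class Cluster:
    def __init__(self, center, scale):
        self.center = list(center)  # cluster center in the state space
        self.count = 0.0            # counter M, initialized to 0 as in the text
        self.scale = list(scale)    # diagonal entries of the scaling matrix A


def scaled_distance(x, cluster):
    # Assumed form: sqrt((x - c)^T A (x - c)) with a diagonal matrix A.
    return math.sqrt(sum(a * (xi - ci) ** 2
                         for a, xi, ci in zip(cluster.scale, x, cluster.center)))


def observe_state(clusters, x, d_max, init_scale, membership=1.0):
    """Process one new state x and return the cluster that was shifted."""
    # Create a new cluster if no existing cluster is closer than d_max.
    if not any(scaled_distance(x, c) < d_max for c in clusters):
        clusters.append(Cluster(x, init_scale))
    # Shift the nearest cluster toward x.
    nearest = min(clusters, key=lambda c: scaled_distance(x, c))
    nearest.count += membership            # assumed counter update
    step = membership / nearest.count      # assumed step size, needs membership > 0
    nearest.center = [ci + step * (xi - ci)
                      for ci, xi in zip(nearest.center, x)]
    return nearest


clusters = []
observe_state(clusters, [0.0, 0.0], d_max=1.0, init_scale=[1.0, 1.0])
observe_state(clusters, [0.5, 0.0], d_max=1.0, init_scale=[1.0, 1.0])
observe_state(clusters, [5.0, 5.0], d_max=1.0, init_scale=[1.0, 1.0])
```

Under these assumptions each center is a running weighted mean of the states assigned to it, so clusters with a large counter move more slowly than freshly created ones.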

The aim of the clustering of the state transition space T described below is to produce a compact description of the state transitions observed during the learning phase.

As described below, this description is used to find sensible splits of clusters in the state space ℵ, to estimate the average state transition probabilities described above, and to estimate the gains g.

A cluster c_j^{T,u} in the state transition space T is characterized by its cluster center, which results from the following rule:

M_{j,k}^{T,u} denotes a counter that gives the number of state transitions assigned to this cluster.

The cluster is further characterized by a scaling matrix A_{j,k}^{T,u} and by an index u for the action that produced the state transitions assigned to the cluster.

The set of clusters of state transitions for an action u ∈ U is denoted C_k^{T,u}.

The scaling matrix A_{j,k}^{T,u} comprises three mutually independent diagonal matrices B_{j,k}^{T,u}, C_{j,k}^{T,u} and b_{j,k}^{T,u}. The first diagonal matrix B_{j,k}^{T,u} describes the predecessor state, the second diagonal matrix C_{j,k}^{T,u} the successor state, and the third diagonal matrix b_{j,k}^{T,u} the gain produced by the state transition.

The scaling matrix A_{j,k}^{T,u} is therefore given by the following rule:

To be able to determine whether splitting a cluster into two clusters along a dimension d in the state space ℵ is worthwhile, the resolution of the clustering in the state transition space T should be chosen as a function of the resolution of the clustering in the state space ℵ.

It is assumed that c_{i'₀}^ℵ is the cluster in the state space ℵ that is closest to the ℵ′ component of the center of the cluster c_j^{T,u}, and that c_{i''₀}^ℵ denotes the cluster that is closest to the ℵ″ component of that center.

According to the heuristic used in this exemplary embodiment, it has proven advantageous to make the size of the cluster c_j^{T,u} in the direction ℵ′ half the size of the cluster c_{i'₀}^ℵ, and to choose the size of the cluster c_j^{T,u} in the direction ℵ″ equal to the size of the cluster c_{i'₀}^ℵ.

In this way, the first diagonal matrix B_{j,k}^{T,u} and the second diagonal matrix C_{j,k}^{T,u} of the cluster c_j^{T,u} result from the following rules:

The third diagonal matrix b_{j,k}^{T,u} is chosen to be constant, for example according to the following rule:

if gains that differ by a given distance are to be distinguished.

On the basis of the scaling matrices A_{j,k}^{T,u} described above, a distance measure dist_k^T(z, c_j^{T,u}) is determined according to the following rule:

If a new state transition (x_k, u_k, x_{k+1}, g_k) is determined, it is checked whether at least one cluster

exists to which the vector

has a distance smaller than a predetermined maximum state transition distance d_max^T.
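The check described above can be sketched as follows. This is a sketch under assumptions: the transition vector is taken to be the concatenation of predecessor state, successor state and gain, the scaling matrix is taken to be the block-diagonal combination of B, C and b, and the distance is assumed to be a weighted Euclidean distance. All function names are hypothetical.

```python
import math


def transition_vector(x, x_next, gain):
    # Assumed layout: z = (predecessor state, successor state, gain).
    return list(x) + list(x_next) + [gain]


def block_diagonal(B, C, b):
    # B, C and b are given as lists of diagonal entries; the result is the
    # diagonal of the assumed block-diagonal scaling matrix A = diag(B, C, b).
    return list(B) + list(C) + list(b)


def transition_distance(z, center, A_diag):
    # Assumed form: sqrt((z - c)^T A (z - c)) with a diagonal matrix A.
    return math.sqrt(sum(a * (zi - ci) ** 2
                         for a, zi, ci in zip(A_diag, z, center)))


def needs_new_cluster(z, clusters, d_max):
    # clusters: list of (center, A_diag) pairs for one action u.
    return not any(transition_distance(z, c, A) < d_max for c, A in clusters)


z = transition_vector([0.0, 0.0], [1.0, 0.0], 0.5)
A = block_diagonal([1.0, 1.0], [1.0, 1.0], [4.0])
center = [0.0, 0.0, 1.0, 0.0, 0.0]
```

In this example only the gain differs from the cluster center, and the larger weight on the gain entry makes that difference count more strongly in the distance.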

If this is not the case, a new cluster c with a cluster center

a new counter initialized to the value 0

and a new scaling matrix A is added to the set of all clusters.

The maximum state transition distance d_max^T may, but need not, have the same value as the maximum distance d_max^ℵ for the state space ℵ.

The smaller the maximum state transition distance d_max^T is chosen, the more finely the state transition space T is clustered.

For

every state transition in the state transition space T is stored explicitly in the memory of the control computer 301.

In a further step, all clusters c ∈ C are shifted toward the vector z_k according to their respective membership, which is given by the following rule:

The counter of each cluster is also increased, so that updated values of the counter M and of the respective cluster center result from the following rules:

Intuitively, splitting a cluster c_i^ℵ ∈ C_k^ℵ along dimension d in the state space ℵ is worthwhile if it allows the state transition probabilities or the gains to be modeled in more detail.

This is the case if two clusters c_j^{T,u} and c_l^{T,u} exist in the state transition space T that both have a high membership value v_{d,j,l,k}^u(c_i^ℵ) in the cluster to be split and whose centers are clearly separated in the direction ℵ″ × ℝ.

The cluster is split if this value exceeds a predeterminable threshold value v^min for at least one pair of clusters c_j^{T,u}, c_l^{T,u} ∈ C^{T,u} and an action u ∈ U, that is, if the following holds:

In rule (41), the sigmoid function

indicates whether component d of the center of the cluster c_l^{T,u} is greater than component d of the center of the cluster c_i^ℵ.

Correspondingly, the rule

indicates whether component d of the center of the cluster c_j^{T,u} is smaller than component d of the center of the cluster c_i^ℵ.

The function

indicates whether the clusters c_j^{T,u} and c_l^{T,u} are clearly separated from one another in the direction ℵ″ × ℝ, where the distance dist(c_j^{T,u}, c_l^{T,u}) is given by the following rule:

in the state space, and the distance dist(c_j^{T,u}, c_l^{T,u})

in the space of the gains generated by the state transitions.

In the exemplary embodiment, it has proven advantageous to choose the individual parameters according to the following rules:

The components in dimension d₀ of the cluster centers of the clusters c_{i'}^ℵ and c_{i''}^ℵ are shifted in opposite directions along dimension d₀, each by half the radius.

This gives the following update rules for the new clusters c_{i'}^ℵ and c_{i''}^ℵ:

The size of the new clusters c_{i'}^ℵ and c_{i''}^ℵ in the direction of dimension d₀ is halved, which gives the following update rules for the size, that is, the scaling matrix, of the new clusters c_{i'}^ℵ and c_{i''}^ℵ:

This gives the following update rules for the counters of the new clusters c_{i'}^ℵ and c_{i''}^ℵ:

so that the new clusters adapt to newly determined states x_k at the same speed as the original cluster c_{i₀}^ℵ would have done.
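The split step can be sketched as follows. This is an illustrative sketch under assumptions: the cluster size is represented by a hypothetical per-dimension `radius` list instead of the scaling matrix, and both new clusters are assumed to inherit the counter of the original cluster, which is one way to keep the adaptation speed unchanged. The names `StateCluster` and `split_cluster` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StateCluster:
    center: list   # cluster center in the state space
    radius: list   # hypothetical per-dimension cluster size
    count: float   # counter of states assigned to the cluster


def split_cluster(cluster, d0):
    """Split a cluster along dimension d0 into two new clusters."""
    offset = cluster.radius[d0] / 2.0   # half the radius in dimension d0
    children = []
    for sign in (-1.0, 1.0):
        center = list(cluster.center)
        center[d0] += sign * offset     # shift in opposite directions
        radius = list(cluster.radius)
        radius[d0] /= 2.0               # halve the size in dimension d0
        # Assumption: each child inherits the parent's counter.
        children.append(StateCluster(center, radius, cluster.count))
    return children


left, right = split_cluster(StateCluster([1.0, 2.0], [0.5, 0.5], 3.0), d0=0)
```

Only dimension d₀ is affected. The centers and sizes in all other dimensions are copied unchanged.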

Because the size of each cluster in the state transition space T is adapted to the size of the neighboring clusters in the state space ℵ, splitting clusters in the state space ℵ also leads to a higher resolution of the clustering in the state transition space T.

This can lead to further splits of clusters.

The fuzzy partitioning of the state space ℵ can thus in principle become arbitrarily fine, provided that every split of a cluster leads to a more accurate internal model description.

However, the creation of clusters can be limited in two ways.

First, a maximum number of splits that may be applied to a cluster can be specified.

Second, the threshold value v^min that controls the splitting of clusters can be increased in line with the number of existing clusters |C^ℵ|.

As explained in detail below, the discretized model can be derived on the basis of the determined clusters c_l^{T,u} ∈ C_k^{T,u} and of the counter M_{l,k}^{T,u} assigned to each cluster c_l^{T,u}, which gives the number of state transitions assigned to that cluster.

With

it can be estimated how often the action u was carried out in the state c_i^ℵ and how often the state transition described by the cluster c_l^{T,u} was observed.

The quotient q_{i,l,k}(u), formed according to the following rule:

thus estimates the probability that carrying out the action u in the state c_i^ℵ results in a state transition described by the cluster c_l^{T,u}.

The average probability p_{i,j,k}(u) of a state transition from a predecessor state c_i^ℵ to a successor state c_j^ℵ can therefore be estimated by an approximate probability, formed according to the following rule:

Correspondingly, the average gain for carrying out the action u in the state c_i^ℵ with a state transition to the state c_j^ℵ can be approximated according to the following rule:

It should be noted in this context that the method described above for forming fuzzy clusters can also be used independently of the reinforcement learning method described below, whether in connection with the selection of frame signal plans or, more generally, with the control of a technical system.

Intuitively, the procedure described above splits a cluster of a state space or of a state transition space into two or more clusters if the clustered state transitions show that the split separates different groups of state transitions, for example groups that produce different successor states and/or different gains, that can be distinguished from one another.

This procedure can thus be regarded as a middle way between explicitly storing all state transitions and merely counting state transitions between given partitions of the state space.

In this way, the procedure described above combines the advantage of explicit storage, namely a very good partitioning of the state space, with the advantage of counting state transitions, namely a very compact representation of a model of the technical system.

It should be noted that the partitioning determined in the manner described above considerably accelerates the reinforcement learning described below, compared with a fixed, that is, manual, partitioning into fuzzy partitions, which is also possible as an alternative.

Using the determined training data and the fuzzy partitions, that is, the fuzzy clusters, determined in the manner described above, a reinforcement learning method described below is carried out.

To aid understanding, a brief overview of the fundamentals of reinforcement learning is given below.

The basic idea of model-based reinforcement learning is to carry out a maximum likelihood estimation of the model of the system to be controlled at the start of the learning method and to train the optimized control strategy, that is, the optimized control through the selection of control variables, (indirectly) on the basis of the previously determined model description.

These two phases can overlap. That is, previously trained strategies can be derived from the model description determined at the outset, which is based on state transitions observed during a learning phase, and the information for a future derivation of the control strategy, that is, the selection of the control variables, can be obtained by means of these control strategies.

In a discrete indirect reinforcement learning method, the maximum likelihood estimation of the model of the technical system is based on discrete counters, which record the number of actions carried out and the resulting state transitions, and on variables for the observed gains.

The counters and variables are explained in more detail below.

N_{i,u,k}^0 and M_{i,u,j,k}^0, with i = 1, ..., N^ℵ, u = 1, ..., N^A, j = 1, ..., N^ℵ, k ∈ ℕ, denote counters for the number of fuzzy actions A_u carried out in a fuzzy state X_i and for the number of state transitions from a state X_i to a successor state X_j caused by the action A_u, in each case up to an iteration k.

If a state transition (x_k, a_k, x_{k+1}, g_k) is observed, with x_k ∈ ℵ, x_{k+1} ∈ ℵ, a_k ∈ A, g_k ∈ ℝ, the counters N_{i,u,k}^0 and M_{i,u,j,k}^0 are increased according to the degree of membership in the corresponding cluster centers, according to the following rules:

The counters N_{i,u,k}^0 and M_{i,u,j,k}^0 are then used to estimate the average conditional probabilities

for a state transition from a state X_i to a successor state X_j caused by the action A_u, according to the following rule:

In the following, r_{iuj}^0 denotes the average gain obtained when, starting from the predecessor state X_i, the successor state X_j in the state space ℵ is reached as a result of carrying out the action A_u.

The gain r_{iuj}^0 is thus given by the following rule:

An estimate of the respective gain r_{iuj}^0, that is, an estimated gain, is determined according to the following update rule:

with

i = 1, ..., N^ℵ, u = 1, ..., N^A, j = 1, ..., N^ℵ (74)

when a state transition (x_k, a_k, x_{k+1}, g_k) is observed, with x_k ∈ ℵ, x_{k+1} ∈ ℵ, a_k ∈ A, g_k ∈ ℝ.
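The counter-based model estimation described above can be sketched as follows. This is a minimal sketch under assumptions: the counters are assumed to grow by the product of the membership degrees, the transition probability is assumed to be the ratio of the two counters, and the gain estimate is assumed to be a membership-weighted running average. The class name `FuzzyModelCounts` and its methods are hypothetical.

```python
from collections import defaultdict


class FuzzyModelCounts:
    def __init__(self):
        self.N = defaultdict(float)      # N[i, u]: action A_u in state X_i
        self.M = defaultdict(float)      # M[i, u, j]: transition X_i -> X_j under A_u
        self.r_hat = defaultdict(float)  # r_hat[i, u, j]: estimated average gain

    def observe(self, mu_x, mu_a, mu_next, gain):
        """mu_x, mu_a, mu_next: dicts mapping an index to a membership degree."""
        for i, wi in mu_x.items():
            for u, wu in mu_a.items():
                self.N[i, u] += wi * wu
                for j, wj in mu_next.items():
                    w = wi * wu * wj
                    if w == 0.0:
                        continue
                    self.M[i, u, j] += w
                    # Assumed update: weighted running average of the gains.
                    step = w / self.M[i, u, j]
                    self.r_hat[i, u, j] += step * (gain - self.r_hat[i, u, j])

    def p_hat(self, i, u, j):
        # Assumed estimate: ratio of transition counter to action counter.
        n = self.N.get((i, u), 0.0)
        return self.M.get((i, u, j), 0.0) / n if n > 0.0 else 0.0


model = FuzzyModelCounts()
model.observe({0: 1.0}, {0: 1.0}, {0: 0.5, 1: 0.5}, gain=2.0)
model.observe({0: 1.0}, {0: 1.0}, {1: 1.0}, gain=4.0)
```

If the memberships of the successor state sum to 1, the estimated probabilities for a given state and action also sum to 1 under these assumptions.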

For this discrete model

an optimal control strategy can be determined using the reinforcement learning method.

Q(x, a) denotes the true, continuous Q-value in the reinforcement learning method, which is formed according to the following rule:

From the true, continuous Q-value Q(x, a), an estimated Q-value of the average Q-values is obtained according to the following rule:

which results from the fixed-point solution of the following system of equations:

According to this exemplary embodiment, the continuous Q-values Q(x, a) are approximated by a so-called Takagi-Sugeno fuzzy system, as described in [3], with linear terms in the consequents of the fuzzy rules, in accordance with the following rule:

if x is X_i and a is A_u

then

with i = 1, ..., N^ℵ, u = 1, ..., N^A, where:

Due to the orthogonality of the fuzzy membership functions, rule (79) can be written as the following rule:

The terms Q^0_{iu} can be determined by computing the fixed-point solution of the system of equations (78), using the estimates of the average conditional state transition probabilities p^0_{ij}(u) according to rule (70) and the estimates of the average profits r^0_{iuj} according to rule (72).
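The approximation of Q(x, a) by a Takagi-Sugeno fuzzy system with linear consequents can be illustrated with a minimal sketch. The rule formulas themselves are not reproduced in this text, so everything specific here is an assumption: one-dimensional state and action, triangular membership functions that sum to one, and a consequent of the form q0 + qx·(x − x_i) + qa·(a − a_u). The function and parameter names are hypothetical.

```python
from bisect import bisect_right


def triangular_memberships(value, centers):
    """Degrees of membership of `value` in triangular fuzzy sets that sum to 1.

    `centers` must be sorted in ascending order (assumed partition layout).
    """
    degrees = [0.0] * len(centers)
    if value <= centers[0]:
        degrees[0] = 1.0
    elif value >= centers[-1]:
        degrees[-1] = 1.0
    else:
        hi = bisect_right(centers, value)
        lo = hi - 1
        w = (value - centers[lo]) / (centers[hi] - centers[lo])
        degrees[lo], degrees[hi] = 1.0 - w, w
    return degrees


def ts_q_value(x, a, x_centers, a_centers, q0, qx, qa):
    """Evaluate an assumed Takagi-Sugeno model with linear consequents.

    q0[i][u] is the constant term of rule (i, u); qx[i][u] and qa[i][u] are
    the assumed slopes with respect to state and action.
    """
    mu_x = triangular_memberships(x, x_centers)
    mu_a = triangular_memberships(a, a_centers)
    total = 0.0
    for i, weight_x in enumerate(mu_x):
        for u, weight_a in enumerate(mu_a):
            weight = weight_x * weight_a
            if weight == 0.0:
                continue
            local = (q0[i][u]
                     + qx[i][u] * (x - x_centers[i])
                     + qa[i][u] * (a - a_centers[u]))
            total += weight * local
    return total
```

Because the assumed memberships sum to one, no normalisation by the total rule activation is needed; with other membership functions the weighted sum would have to be divided by the sum of the weights.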

For the discrete case, [3] describes a special implementation of the procedure described above for the recursive solution of the so-called Bellman equation (78).

The basic idea of the approach known from [3] is to prioritize the recursive updating of the Q-values according to the change in the Q-values that results from the update.

With this procedure, the fixed-point solution converges considerably faster than with updates carried out in a fixed order.
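The prioritization idea can be sketched for a small discrete model. This is an illustration in the spirit of prioritized sweeping, not the implementation from [3] or from this document: the Bellman backup with a discount factor `gamma`, the threshold, and all names are assumptions, since equation (78) is not reproduced here.

```python
import heapq


def backup(q, p, r, gamma, i, u):
    """One Bellman backup for the pair (i, u) on a discrete model."""
    return sum(p[i][u][j] * (r[i][u][j] + gamma * max(q[j]))
               for j in range(len(q)))


def prioritized_sweeping(p, r, gamma=0.9, threshold=1e-9, max_backups=100000):
    """Solve for Q by always backing up the pair with the largest residual.

    p[i][u][j] are transition probabilities, r[i][u][j] the local profits.
    """
    n_states, n_actions = len(p), len(p[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    # predecessors[j]: all pairs (i, u) that can lead to state j
    predecessors = [[(i, u) for i in range(n_states) for u in range(n_actions)
                     if p[i][u][j] > 0.0] for j in range(n_states)]
    heap = []
    for i in range(n_states):
        for u in range(n_actions):
            priority = abs(backup(q, p, r, gamma, i, u) - q[i][u])
            if priority > threshold:
                heapq.heappush(heap, (-priority, i, u))  # max-heap via negation
    backups = 0
    while heap and backups < max_backups:
        _, i, u = heapq.heappop(heap)
        q[i][u] = backup(q, p, r, gamma, i, u)
        backups += 1
        # Re-check every pair whose backup depends on state i.
        for l, w in predecessors[i]:
            priority = abs(backup(q, p, r, gamma, l, w) - q[l][w])
            if priority > threshold:
                heapq.heappush(heap, (-priority, l, w))
    return q
```

The heap may hold stale duplicates of a pair; backing such an entry up again is harmless, which keeps the sketch short at the cost of some redundant work.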

Moreover, since the interpretation of the estimated transition probabilities p^0_{ij}(u) and the estimated profits r^0_{iuj,k+1} in the Bellman equation (78) is the same as in the discrete case, this advantageous update mechanism can also be used for the approach provided according to this exemplary embodiment of the invention, which uses fuzzy partitions within the reinforcement learning method.

The constant terms Q^0_{iu} are determined by solving the Bellman equation (78).

The associated partial derivatives of Q with respect to the state and the action can be determined by forming average values and partial derivatives of the profit function and of the conditional state transition probabilities.

The partial derivatives of Q are formed according to the following rule:

with the abbreviations:

which were used in the previous step.

Replacing the integral by the sum of local integrals according to rules (86) and (87) and the average values (88), (89) is consistent in the sense that the approximation improves as the partitioning of the state space becomes finer.

In an analogous way it can be shown that:

The average local profit r^0_{iuj} and the average local derivatives of the profit function g can be estimated by fitting the parameters of the following linear function to the profits observed in the vicinity of the cluster centers with indices (i, u, j), according to the following rule:

This fitting can be carried out by means of a known gradient descent on an error function E, which is given by the following rule:

when observing a state transition (x_k, a_k, x_{k+1}, g_k).

According to this exemplary embodiment, the following update rules thus result:

where a possible choice for the step size η_{iuj,k} within the update is given by the following rule:

so that the step size η_{iuj,k} is chosen depending on the degree of membership of an observed state transition in a cluster center and is reduced as time progresses.
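A gradient-descent fit of a local linear profit model with a membership-dependent, decreasing step size might look like the following sketch. The update rules and the step-size formula are not reproduced in this text, so the model form (constant term plus slopes around one cluster center, one-dimensional state and action, no dependence on the successor state), the squared error, and the choice η = membership / accumulated membership are all assumptions; the class and attribute names are hypothetical.

```python
class LocalLinearProfit:
    """Assumed local model: g ≈ r0 + rx * (x - x_center) + ra * (a - a_center)."""

    def __init__(self, x_center, a_center):
        self.x_center = x_center
        self.a_center = a_center
        self.r0 = 0.0          # average local profit
        self.rx = 0.0          # slope with respect to the state
        self.ra = 0.0          # slope with respect to the action
        self.weight_sum = 0.0  # accumulated degrees of membership

    def predict(self, x, a):
        return (self.r0
                + self.rx * (x - self.x_center)
                + self.ra * (a - self.a_center))

    def update(self, x, a, g, membership):
        """One gradient step on E = 0.5 * (g - prediction)^2."""
        if membership <= 0.0:
            return
        self.weight_sum += membership
        # Assumed step size: large for strongly matching samples,
        # shrinking as more evidence accumulates.
        eta = membership / self.weight_sum
        error = g - self.predict(x, a)
        self.r0 += eta * error
        self.rx += eta * error * (x - self.x_center)
        self.ra += eta * error * (a - self.a_center)
```

For samples that lie exactly on the cluster center this step size turns the constant term into a membership-weighted running mean of the observed profits; for samples far from the center a smaller step may be needed for stability.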

The average conditional probabilities p^0_{ij}(u) can be estimated according to rule (71).
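Rule (71) is not reproduced in this text. As a hedged sketch of how such an estimate could work, the following code assumes that counters are incremented by the degrees of membership of an observed transition and that the probability estimate is the ratio of the two counters, which is consistent with the counters described in the claims. The function names and the uniform fallback for unvisited pairs are assumptions.

```python
def update_counters(n, m, mu_x, mu_a, mu_next):
    """Add one observed transition to the membership-weighted counters.

    n[i][u] counts executions of fuzzy action u in fuzzy state i,
    m[i][u][j] counts transitions from i to j under u. mu_x, mu_a and
    mu_next are the degrees of membership of x_k, a_k and x_{k+1}.
    """
    for i, weight_x in enumerate(mu_x):
        for u, weight_a in enumerate(mu_a):
            weight = weight_x * weight_a
            if weight == 0.0:
                continue
            n[i][u] += weight
            for j, weight_next in enumerate(mu_next):
                m[i][u][j] += weight * weight_next


def transition_probabilities(n, m, i, u):
    """Estimate p(j | i, u) as M / N; uniform if (i, u) was never visited."""
    n_states = len(m[i][u])
    if n[i][u] == 0.0:
        return [1.0 / n_states] * n_states
    return [m[i][u][j] / n[i][u] for j in range(n_states)]
```

If the memberships of the successor state sum to one, the estimated probabilities for each pair (i, u) also sum to one.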

The average partial derivatives of p can be approximated according to the following rules:

where e^d_l denotes a vector of dimension d with vector components e^d_{l,i} = δ_{il}.

N denotes a counter that counts the number of executions of an action A_u in the fuzzy state that results when the state X_i is shifted along dimension l by a predefinable value ε.

M denotes a further counter that counts the number of state transitions from the state X_i shifted by ε along dimension l to a successor state X_j due to the action A_u.

In addition, N denotes a counter that indicates the number of actions A_u carried out in the state that results from shifting the state X_i along dimension l by the negative value −ε, and M denotes a further counter that indicates the number of state transitions from this state to the state X_j due to the action A_u.
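The counters for the states shifted by +ε and −ε suggest a finite-difference estimate of the derivative of a transition probability. The actual rules are not reproduced in this text; the following sketch assumes a central difference of the two ratio estimates, and the function name and the zero fallback for empty counters are hypothetical.

```python
def central_difference(m_plus, n_plus, m_minus, n_minus, epsilon):
    """Assumed estimate of dp/dx_l from counters of the shifted fuzzy states.

    m_plus / n_plus estimates the transition probability in the state shifted
    by +epsilon, m_minus / n_minus the one in the state shifted by -epsilon.
    """
    if n_plus == 0 or n_minus == 0:
        return 0.0  # no evidence yet for at least one shifted state
    return (m_plus / n_plus - m_minus / n_minus) / (2.0 * epsilon)
```

For example, with transition frequencies of 0.6 and 0.4 in the two shifted states and ε = 0.1, the estimated slope is approximately 1.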

When a state transition (x_k, a_k, x_{k+1}, g_k) is observed, the individual counters N, M, N, M are updated in accordance with the following update rules:

Correspondingly, counters N, M, N and M for the action space are updated in accordance with the following update rules:

The local partial derivatives of p are then determined in accordance with the following rules:

With the estimates of the probabilities p^0_{iuj} and of their partial derivatives, and with the estimates of the profits r^0_{iuj} and of their partial derivatives, the respective local partial derivatives of Q can now be determined according to rules (85) and (90).

In summary, the reinforcement learning method can be described in the form of pseudocode as follows:

1. Initialization

for i = 1, ..., N^ℵ, u = 1, ..., N^A do

(a) N^0_{iu} ← 0
(b) N ← 0, N ← 0, l = 1, ..., d^ℵ
(c) N ← 0, N ← 0, l = 1, ..., d^A
(d) M^0_{iuj} ← 0, j = 1, ..., N^ℵ
(e) M ← 0, M ← 0, j = 1, ..., N^ℵ, l = 1, ..., d^ℵ
(f) M ← 0, M ← 0, j = 1, ..., N^ℵ, l = 1, ..., d^A
(g) r^0_{iuj} ← 0, j = 1, ..., N^ℵ
(h) r ← 0, r ← 0, j = 1, ..., N^ℵ, l = 1, ..., d^ℵ
(i) r ← 0, r ← 0, j = 1, ..., N^ℵ, l = 1, ..., d^A
(j) PQueue ← empty
(k) Observe initial state x_0
od

2. Main program

for k = 0, 1, 2, ... do

a) Select a fuzzy action u_k in the current state x_k according to the exploration strategy (e.g. Boltzmann exploration or F-ISE exploration). Choose a continuous action a_k from the set of actions whose degree of membership in A_{u_k} is non-zero.
b) Perform action a_k and observe the successor state x_{k+1} and the profit g_k = g(x_k, a_k, x_{k+1}).
c) for i = 1, ..., N^ℵ, j = 1, ..., N^ℵ do

   a) Count the state transitions

   b) Estimate the state transition probabilities

   c) Estimate the partial derivatives of the state transition probabilities

   d) Calculate the deviation from the expected local profit

   e) Update the average profit and average deviation estimates
d) for i = 1, ..., N^ℵ do

   a) Calculate the priority of the backup for (i, u_k):

   b) if P > Φ_k then add (i, u_k) to PQueue with priority P fi od
e) while PQueue ≠ empty do

   a) (i, u) ← first(PQueue)

   b) for all predecessors (l, w) of i, i.e. all pairs (l, w) with M^0_{lwi} > 0 do

   c) if P > Φ_k then add (l, w) to PQueue with priority P fi od
f) Estimate the derivatives of the Q-values
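Step a) of the main program names Boltzmann exploration as one possible exploration strategy. The following is a minimal sketch of that standard technique, not of the F-ISE strategy and not necessarily of the variant intended here: it assumes that one Q-value per fuzzy action is already available for the current state (for instance by weighting the Q^0_{iu} with the state memberships), and the temperature parameter and function names are hypothetical.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Softmax over Q-values; a lower temperature makes the choice greedier."""
    best = max(q_values)
    # Subtracting the maximum avoids overflow without changing the result.
    weights = [math.exp((q - best) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_fuzzy_action(q_values, temperature, rng=random):
    """Draw the index u_k of a fuzzy action according to the softmax weights."""
    probabilities = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient when comparing exploration strategies.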

The optimal control strategy, that is, the optimal selection of a frame signal plan based on the measured relative traffic density at the respective sensors 215, formulated in general terms as an optimal control strategy µ: ℵ → A, is obtained by selecting in the respective state x the action a (according to the exemplary embodiment, the frame signal plan) that promises the maximum profit according to rule (79), that is, for which the following applies:

The method described above can be further improved in accordance with the embodiment of the invention described below.

To reduce the number of training steps required by the reinforcement learning method, it is useful to make targeted use of the expected gain in the sense of the information content of the training data about the technical system. In other words, in each state the action is carried out through which either a large immediate gain in information can be expected or a region of the state space is reached in which high gains in information can be expected.

According to this exemplary embodiment, a model-based exploration strategy is provided.

The procedure described below is based on A-values A_{iu}, i = 1, ..., N^ℵ, u = 1, ..., N^A, which denote the "attractiveness" of executing the respective fuzzy action A_u in the state X_i.

Executing an action in a state of the state space ℵ leads with high probability to a high information gain if a large immediate gain in information can be expected from executing the action A_u, or if the action causes the technical system to be controlled to move into states in which a large information gain can be expected.

The relation between the A-values A_{iu} is thus very similar to that between the Q-values in Q-learning.

In the following, â_{iu} denotes the immediate information gain that results from a single execution of the action A_u in the state X_i.

An estimated A-value Â_{iu} is then derived, which denotes the expected immediate information gain that results from future executions of the action A_u in the state X_i.

Finally, a total attractiveness A_{iu} is determined recursively on the basis of the Â_{iu}.

The immediate information gain can be measured by the amount of knowledge about the state transition probabilities between the fuzzy partitions that the learning system obtains from the observation of a state transition.

A maximum change

in the state transition probabilities for a state X_i and an action A_u that results from an observed state transition (x_k, a_k, x_{k+1}, g_k) is given by the membership of (x_k, a_k) in the individual fuzzy partitions, denoted by:

In this way, the change in the probabilities is scaled by an upper bound, formed according to µ^ℵ_i(x_k) µ^A_u(a_k), in order to make the measure of the immediate information gain independent of the position of (x_k, a_k) within the respective fuzzy partition.

The update of the immediate information gain from one iteration k to the next iteration k+1 is thus given by:

From the immediate information gains determined according to rule (115) for executing the action A_u in the state X_i, conclusions can be drawn about the future information gains to be expected.

It has proven advantageous to calculate a weighted sum of all previously determined immediate information gains.

The influence of an information gain on the immediate "attractiveness" of a state X_i and an action A_u should be limited by the membership of the corresponding state transition in the respective fuzzy partitions.

This can be achieved by weighting previous information gains according to the sum of the degrees of membership of subsequent observations:

In the following algorithm, the immediate attractiveness is described as the quotient of the weighted sum of the immediate information gains and the sum of the weights, that is, the immediate attractiveness Â results according to the following rule:

A total attractiveness A_{iu} of a state-action pair (X_i, A_u) is determined recursively according to the following rule:

with the spatial damping factor λ ∈ [0; 1] and the attractiveness A_j of the partition subset X_j, given according to the following rule:

In summary, the exploration strategy can be described by the following procedure, presented as pseudocode:

1. Initialization

a) N^0_{iu} ← 0, i = 1, ..., N^ℵ, u = 1, ..., N^A
b) M^0_{iuj} ← 0, i = 1, ..., N^ℵ, u = 1, ..., N^A, j = 1, ..., N^ℵ
c) Initialize the components of the immediate attractiveness as if the maximum immediate information gain with the maximum degree of membership had been achieved in each previous iteration:
   Each state-action pair (X_i, A_u) is thus initialized with the maximum immediate attractiveness Â_{iu} = 1.
d) Initialize the total attractiveness
e) Determine the initial state x_0

2. Main program

for k = 0, 1, 2, ... do

a) Let A_{u_k} be the partition subset (fuzzy action) of the action space for which the attractiveness A_u(x_k) is maximal in the current state x_k, the attractiveness A_u(x_k) being given by
   if x is X_i then A_u(x) = A_{iu}
   That is, the following applies:
   Randomly select an action a_k from
b) Execute action a_k and observe the successor state x_{k+1} and the profit g(x_k, a_k, x_{k+1})
c) Execute one iteration of any reinforcement learning method, for example the F-PS learning method or the F-Q learning method
d) for i = 1, ..., N^ℵ do

   a) Count the state transitions:

   b) Calculate the immediate information gain resulting from the state transition:

   c) Recalculate the immediate attractiveness for (X_i, A_{u_k}):

   d) Re-estimate the state transition probabilities:
od
e) for i = 1, ..., N^ℵ do

   a) Calculate the priority of the backup for (X_i, A_{u_k}):

   b) if P > Φ_k then add (i, u_k) to PQueue with priority P fi
f) while PQueue ≠ empty do

   a) (i, u) ← first(PQueue)

   b) for all predecessors (l, w) of i, i.e. all (l, w) with M^0_{lwi} > 0 do
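The recursive determination of the total attractiveness can be illustrated with a short sketch. The recursion itself is not reproduced in this text, so the form used here is an assumption modelled on the stated analogy with Q-values: A_iu = Â_iu + λ · Σ_j p_ij(u) · A_j, with A_j taken as the maximum of A_jw over all fuzzy actions w. The plain repeated sweep (instead of the priority queue of the pseudocode) and all names are likewise hypothetical.

```python
def total_attractiveness(immediate, p, lam, sweeps=200):
    """Iterate the assumed recursion for the total attractiveness.

    immediate[i][u] is the immediate attractiveness, p[i][u][j] the estimated
    transition probability, lam the spatial damping factor in [0, 1].
    """
    n_states = len(immediate)
    n_actions = len(immediate[0])
    total = [row[:] for row in immediate]
    for _ in range(sweeps):
        # Assumed attractiveness of a fuzzy state: best action available there.
        state_value = [max(row) for row in total]
        total = [[immediate[i][u]
                  + lam * sum(p[i][u][j] * state_value[j]
                              for j in range(n_states))
                  for u in range(n_actions)]
                 for i in range(n_states)]
    return total
```

With λ = 0 only the immediate attractiveness counts; with λ close to 1 the attractiveness of distant, poorly explored regions propagates further back, and for λ = 1 the iteration need not converge, so a damping factor below one is the safer choice in this sketch.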

In summary, the method described above is explained once more with reference to Fig. 1.

In a first step, data about the technical system are determined (step 101); in the case of a traffic network 200, this is the respective traffic density at a sensor point, measured by means of a sensor.

In a further step, fuzzy partitions of the state space and/or the action space are determined (step 102).

In a further step, a reinforcement learning method is carried out using the determined data about the technical system and the determined fuzzy partitions (step 103).

In a further step (step 104), an optimal control strategy is determined in the manner described above in accordance with the reinforcement learning method; that is, an optimal output value is determined which specifies which frame signal value is to be selected for the respective iteration.

As further shown in Fig. 1, in a further step (step 105) the optimal frame signal plan determined according to the reinforcement learning method is selected and read out, and the traffic lights 214 at the respective intersections, or more generally the technical system to be controlled, are controlled depending on the selected frame signal plan, taking into account the selected optimized control strategy (step 106).

It should be noted that the invention described above is not restricted to the control of traffic lights in a traffic network; rather, the fuzzy partitioning of a continuous state space and/or a continuous action space is suitable for any technical system that is described by a continuous state space and/or a continuous action space and is to be controlled by means of a reinforcement learning method.

The following publications are cited in this document:

[1] H. Takagi and M. Sugeno, Fuzzy Identification of Systems and its Application to Modeling and Control, IEEE Transactions on Systems, Man and Cybernetics, Vol. 15, pp. 116-132, 1985
[2] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, ISBN 0-306-40671-3, 1981
[3] A. Moore and C. Atkeson, Efficient Memory Based Reinforcement-Learning: Efficient Computation with Prioritized Sweeping, Information Processing, Vol. 5, pp. 263-270, 1992
[4] S. Davies, Multi Dimensional Triangulation and Interpolation for Reinforcement-Learning, Advances in Neural Information Processing Systems, NIPS'9, pp. 1005-1011, 1996

Claims

1. Method for computer-aided determination of a control strategy for a technical system,
in which the technical system is described with a continuous state space and an action space,
in which the state space has states that the technical system can assume,
in which the action space has actions which are carried out in order to produce a state transition from a previous state of the state space to a subsequent state of the state space,
in which an assessment of the state transition takes place,
in which a model of the technical system is determined using training data that describe the technical system by forming fuzzy membership functions with which at least the state space is described and
in which a reinforcement learning procedure is carried out using the fuzzy membership functions,
whereby a control strategy is determined for each state of the state space,
whereby the optimal actions of the action space are learned.

2. The method according to claim 1, in which a Q-value is determined as the control strategy for each state of the state space and the corresponding actions of the action space.

3. The method according to claim 1 or 2, in which linear terms are used in the conclusions of the fuzzy rules of the fuzzy system according to the reinforcement learning method.

4. The method according to any one of claims 1 to 3, in which the reinforcement learning method is carried out by selecting, during training, actions that meet a specified criterion.

5. The method according to claim 4, in which the criterion is an information gain about the conditional transition probabilities within the reinforcement learning method.

6. The method according to any one of claims 1 to 5,
in which a frame signal plan is selected on the basis of the control strategy, and
in which control signals are transmitted to traffic lights of a traffic network on the basis of the selected frame signal plan.

7. The method according to any one of claims 1 to 6, in which, for each fuzzy partition in the state space and action space, an information gain is determined that has resulted from earlier executions of fuzzy actions belonging to the partition in the corresponding states.

8. The method according to claim 7, in which a future profit is estimated from an information gain.

9. The method according to any one of claims 1 to 8,
in which counters are provided that indicate the number of executions of actions in a state of the technical system and the number of state transitions from an initial state to a subsequent state due to the action up to the current iteration,
in which the values assigned to the counters are updated when a new state transition is determined, as a function of the degree of membership of the states or of the state transitions in the respective fuzzy clusters.

10. The method according to claim 9, in which the state transition probabilities are determined within the reinforcement learning method as a function of the counters.

11. The method according to any one of claims 1 to 10, in which fuzzy partitions are formed at the beginning of the method by an iterative process in which, starting from a given set of initial partition subsets, these are split into several fuzzy partition subsets or several fuzzy partition subsets are merged into one fuzzy partition, depending on the determined training data.

12. The method according to any one of claims 1 to 10, in which the fuzzy partitions are formed according to the fuzzy C-means clustering method.

13. Fuzzy control device for determining a control strategy for a technical system, with a processor that is set up in such a way that the following steps can be carried out:

- the technical system is described with a continuous state space and an action space,
the state space has states that the technical system can assume,
the action space has actions which are carried out in order to generate a state transition from a previous state of the state space to a subsequent state of the state space,
- there is an assessment of the state transition,
- With training data that describe the technical system, a model of the technical system is determined by forming fuzzy membership functions with which at least the state space is described and
- Using the fuzzy membership functions, a reinforcement learning process is carried out, as a result of which a control strategy is determined for each state of the state space, whereby the respectively optimal actions of the action space are learned.
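
The sequence of steps in claim 13 can be sketched in Python under strong simplifying assumptions: a one-dimensional state space, triangular membership functions, a model estimated from membership-weighted sample averages, and value iteration as the reinforcement learning step. None of these choices, nor the function names used, is prescribed by the claim; they are illustrative only.

```python
def triangular_memberships(x, centers):
    """Degrees of membership of scalar x in triangular sets peaked at sorted centers."""
    mu = [0.0] * len(centers)
    if x <= centers[0]:
        mu[0] = 1.0
    elif x >= centers[-1]:
        mu[-1] = 1.0
    else:
        for i in range(len(centers) - 1):
            lo, hi = centers[i], centers[i + 1]
            if lo <= x <= hi:
                w = (x - lo) / (hi - lo)
                mu[i], mu[i + 1] = 1.0 - w, w
                break
    return mu


def learn_strategy(samples, centers, actions, gamma=0.9, sweeps=200):
    """Fit a fuzzy model from (state, action, assessment, next_state) samples
    and run value iteration on it. Returns Q-values per (partition, action)."""
    n = len(centers)
    count = {(i, a): 0.0 for i in range(n) for a in actions}
    reward = dict(count)
    trans = {(i, a): [0.0] * n for i in range(n) for a in actions}
    for s, a, r, s_next in samples:
        mu = triangular_memberships(s, centers)
        mu_next = triangular_memberships(s_next, centers)
        for i in range(n):
            count[(i, a)] += mu[i]
            reward[(i, a)] += mu[i] * r
            for j in range(n):
                trans[(i, a)][j] += mu[i] * mu_next[j]
    q = {key: 0.0 for key in count}
    for _ in range(sweeps):
        v = [max(q[(i, a)] for a in actions) for i in range(n)]
        for (i, a), c in count.items():
            if c > 0.0:
                # Average assessment plus discounted expected value of the successor.
                expected = sum(t * v[j] for j, t in enumerate(trans[(i, a)]))
                q[(i, a)] = (reward[(i, a)] + gamma * expected) / c
    return q


def best_action(x, q, centers, actions):
    """Action with the highest membership-weighted Q-value in continuous state x."""
    mu = triangular_memberships(x, centers)
    return max(actions, key=lambda a: sum(m * q[(i, a)] for i, m in enumerate(mu)))
```

A control strategy for any continuous state is then obtained by calling `best_action` with the learned Q-values, which interpolates between the partitions the state belongs to.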

14. Computer-readable storage medium in which a computer program for determining a control strategy for a technical system is stored, which, when executed by a processor, has the following method steps:

- the technical system is described with a continuous state space and an action space,
the state space has states that the technical system can assume,
the action space has actions that are carried out in order to generate a state transition from a previous state of the state space to a subsequent state of the state space,
- there is an assessment of the state transition,
- With training data that describe the technical system, a model of the technical system is determined by forming fuzzy membership functions with which at least the state space is described and
- Using the fuzzy membership functions, a reinforcement learning process is carried out, as a result of which a control strategy is ascertained for each state of the state space, whereby the respectively optimal actions of the action space are learned.

15. Computer program element for determining a control strategy for a technical system, which, when executed by a processor, has the following method steps:

- the technical system is described with a continuous state space and an action space,
the state space has states that the technical system can assume,
the action space has actions which are carried out in order to generate a state transition from a previous state of the state space to a subsequent state of the state space,
- there is an assessment of the state transition,
- With training data that describe the technical system, a model of the technical system is determined by forming fuzzy membership functions with which at least the state space is described and
- Using the fuzzy membership functions, a reinforcement learning process is carried out, as a result of which a control strategy is ascertained for each state of the state space, whereby the respectively optimal actions of the action space are learned.