EP1116172A2

EP1116172A2 - Method and configuration for determining a sequence of actions for a system which comprises statuses, whereby a status transition ensues between two statuses as a result of an action

Info

Publication number: EP1116172A2
Application number: EP99953714A
Authority: EP
Inventors: Ralf Neuneier; Oliver Mihatsch
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1998-09-23
Filing date: 1999-09-08
Publication date: 2001-07-18
Also published as: JP2002525763A; US7047224B1; WO2000017811A3; WO2000017811A2

Abstract

The sequence of actions is determined in such a way that a sequence of statuses resulting from the sequence of actions is optimized with regard to a predetermined optimization function. The optimization function includes a variable parameter with which a risk can be set, namely the risk that the resulting sequence of statuses has with regard to a predetermined status of the system.

Description


Method and arrangement for determining a sequence of actions for a system which has states, a state transition between two states taking place on the basis of an action

The invention relates to a method and an arrangement for determining a sequence of actions for a system which has states, a state transition between two states taking place as a result of an action.

Such a method and such an arrangement are known from [1].

A financial market is described in [1] as an example of such a system, which has states.

The system is described as a Markov decision problem (MDP). The structure of a system that can be described as a Markov decision problem is shown in FIG. 2.

At a time $t$, the system 201 is in a state $x_t$. The state $x_t$ can be observed by an observer of the system. Based on an action $a_t$ from the set of actions possible in the state $x_t$, $a_t \in A(x_t)$, the system passes with a certain probability into a subsequent state $x_{t+1}$ at a subsequent time $t+1$.

This is symbolically represented by a loop in FIG. 2. An observer 200 records observable quantities 202 about the state $x_t$ and decides on an action 203 with which he acts on the system 201. The system 201 is usually subject to a disturbance 205.

Furthermore, the observer 200 receives a gain $r_t$ 204

$$r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}, \tag{1}$$

which depends on the action $a_t$ 203, on the original state $x_t$ at the time $t$ and on the subsequent state $x_{t+1}$ of the system at the subsequent time $t+1$.

The gain $r_t$ can assume a positive or negative scalar value, depending on whether the decision leads to a system development that is positive or negative with regard to a predefinable criterion; in [1] this is an increase in capital or a loss.

In a further time step, the observer 200 of the system 201 decides on a new action $a_{t+1}$ based on the observable variables 202, 204 of the subsequent state $x_{t+1}$, and so on.

An episode of

State: $x_t \in X$

Action: $a_t \in A(x_t)$

Subsequent state: $x_{t+1} \in X$

Gain: $r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$

etc. describes a trajectory of the system, which is evaluated by a performance criterion that accumulates the individual gains $r_t$ over the times $t$. In the case of a Markov decision problem, it is assumed for simplicity that the state $x_t$ and the action $a_t$ contain all the information needed to describe a transition probability $p(x_{t+1} \mid \cdot)$ of the system from the state $x_t$ to the subsequent state $x_{t+1}$.

Formally, this means:

$$p(x_{t+1} \mid x_t, \ldots, x_0, a_t, \ldots, a_0) = p(x_{t+1} \mid x_t, a_t). \tag{2}$$

Here $p(x_{t+1} \mid x_t, a_t)$ denotes the transition probability to the subsequent state $x_{t+1}$ given the state $x_t$ and the action $a_t$.

In the case of a Markov decision problem, future states of system 201 do not depend on states and actions that are further than a time step in the past.
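The episode structure described above (state, action, subsequent state, gain) can be illustrated with a minimal sketch. The two-state system, its action names, the transition probabilities and the gain function below are hypothetical and chosen only for illustration; they do not come from [1].

```python
import random

# Hypothetical two-state, two-action system (all names and numbers are illustrative).
# P[(x, a)] lists pairs (subsequent state, probability).
P = {
    ("low", "hold"): [("low", 0.9), ("high", 0.1)],
    ("low", "buy"): [("low", 0.4), ("high", 0.6)],
    ("high", "hold"): [("high", 0.8), ("low", 0.2)],
    ("high", "buy"): [("high", 0.5), ("low", 0.5)],
}

def gain(x, a, x_next):
    """Gain r(x_t, a_t, x_{t+1}): +1 when 'high' is reached, -1 otherwise."""
    return 1.0 if x_next == "high" else -1.0

def step(x, a, rng):
    """Sample x_{t+1} from p(. | x_t, a_t); only the current pair (x, a) is used."""
    next_states, probs = zip(*P[(x, a)])
    return rng.choices(next_states, weights=probs, k=1)[0]

def episode(x0, policy, horizon, rng):
    """Return the trajectory [(x_t, a_t, r_t, x_{t+1}), ...]."""
    x, trajectory = x0, []
    for _ in range(horizon):
        a = policy(x)
        x_next = step(x, a, rng)
        trajectory.append((x, a, gain(x, a, x_next), x_next))
        x = x_next
    return trajectory

trajectory = episode("low", lambda x: "buy", horizon=5, rng=random.Random(0))
```

Because `step` looks only at the current state and action, the sketch satisfies the Markov assumption by construction.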

The characteristics of a Markov decision problem are summarized below:

$X$: set of possible states of the system, e.g. $X = \mathbb{R}^m$,

$A(x_t)$: set of possible actions in the state $x_t$,

$r(x_t, a_t, x_{t+1})$: gain with expected value $R(x_t, a_t)$.

The goal is to determine a strategy on the basis of observable variables, referred to hereinafter as training data, i.e. a sequence of functions

$$\pi = \{\mu_0, \mu_1, \ldots, \mu_T\}, \tag{3}$$

which at every point in time $t$ map each state to an action, i.e.

$$\mu_t(x_t) = a_t. \tag{4}$$

Such a strategy is evaluated by an optimization function. The optimization function specifies the expected value of the gains accumulated over time for a given strategy $\pi$ and a starting state $x_0$.

As an example of a method of approximate dynamic programming, the so-called Q learning method is described in [1].

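An optimal evaluation function $V^*(x)$ is defined by

$$V^*(x) = \max_{\pi} V^{\pi}(x) \quad \forall x \in X \tag{5}$$

with

$$V^{\pi}(x) = E\left[\sum_{t=0}^{\infty} \gamma^t \, r(x_t, \mu_t(x_t), x_{t+1}) \,\middle|\, x_0 = x\right], \tag{6}$$

where $\gamma$ denotes a predefinable reduction factor, which is formed in accordance with the following rule:

$$\gamma = \frac{1}{1+z}, \tag{7}$$

$$z \in \mathbb{R}^{+}. \tag{8}$$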
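As a small worked illustration of the reduction factor and of the accumulated gain inside the expectation of the evaluation function, the following sketch computes $\gamma$ from $z$ and the discounted sum of a finite list of gains. The function names and the finite horizon are assumptions made for the example.

```python
def reduction_factor(z):
    """Reduction factor gamma = 1 / (1 + z) for a positive real z."""
    if z <= 0:
        raise ValueError("z must be a positive real number")
    return 1.0 / (1.0 + z)

def discounted_return(gains, gamma):
    """Sum of gamma**t * r_t over a finite list of gains r_0, r_1, ..."""
    total = 0.0
    for r in reversed(gains):
        # Horner-style accumulation: total = r_t + gamma * (later gains)
        total = r + gamma * total
    return total

gamma = reduction_factor(1.0)                    # z = 1 gives gamma = 0.5
value = discounted_return([1.0, 1.0, 1.0], gamma)  # 1 + 0.5 + 0.25
```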

As part of the Q learning method, a Q evaluation function $Q^*(x_t, a_t)$ is formed for each pair (state $x_t$, action $a_t$) in accordance with the following rule:

$$Q^*(x_t, a_t) = \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \, r_t + \gamma \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \max_{a \in A} Q^*(x_{t+1}, a). \tag{9}$$

Based on the tuple $(x_t, x_{t+1}, a_t, r_t)$, the Q values $Q^*(x, a)$ are adapted in the $(k+1)$-th iteration with a predetermined learning rate $\eta_k$ according to the following learning rule:

$$Q_{k+1}(x_t, a_t) = (1 - \eta_k) \, Q_k(x_t, a_t) + \eta_k \left( r_t + \gamma \max_{a \in A} Q_k(x_{t+1}, a) \right). \tag{10}$$
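The learning rule can be sketched for a table of Q values as follows. This is a minimal illustration of the standard tabular Q learning update; the dictionary-based table and the function name `q_update` are assumptions made for the example.

```python
from collections import defaultdict

def q_update(Q, x, a, r, x_next, actions, eta, gamma):
    """One tabular Q learning step for the tuple (x_t, x_{t+1}, a_t, r_t).

    Q maps (state, action) pairs to values; unseen pairs default to 0.
    """
    best_next = max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] = (1.0 - eta) * Q[(x, a)] + eta * (r + gamma * best_next)
    return Q[(x, a)]

Q = defaultdict(float)
first = q_update(Q, "s", "a", 1.0, "s2", ["a", "b"], eta=0.5, gamma=0.9)
Q[("s2", "b")] = 2.0  # pretend the successor state already has a learned value
second = q_update(Q, "s", "a", 1.0, "s2", ["a", "b"], eta=0.5, gamma=0.9)
```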

Usually, the so-called Q values Q * (x, a) are approximated for different actions a by a function approximator, for example a neural network or also a polynomial classifier, with a weight vector w which contains the weights of the function approximator.

A function approximator is understood to mean, for example, a neural network, a polynomial classifier or also a combination of a neural network with a polynomial classifier.

So the following applies:

$$Q^*(x, a) \approx Q(x; w^a). \tag{11}$$

Changes to the weights in the weight vector $w$ are based on a temporal difference $d_t$, which is formed according to the following rule:

$$d_t := r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_k^{a}) - Q(x_t; w_k^{a_t}). \tag{12}$$

For the Q learning method using a neural network, the following adaptation rule results for the weights of the neural network, which are contained in the weight vector $w$:

$$w_{k+1}^{a_t} = w_k^{a_t} + \eta_k \cdot d_t \cdot \nabla Q(x_t; w_k^{a_t}). \tag{13}$$
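The temporal difference and the weight adaptation can be sketched with a deliberately simple function approximator. As an assumption for the example, a linear approximator $Q(x; w^a) = w^a \cdot x$ stands in for the neural network, so that the gradient with respect to the weights is simply $x$; all function names are hypothetical.

```python
def q_value(x, w):
    """Linear stand-in for the function approximator: Q(x; w) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def td_error(weights, x, a, r, x_next, gamma):
    """Temporal difference d_t for one weight vector per action."""
    best_next = max(q_value(x_next, w) for w in weights.values())
    return r + gamma * best_next - q_value(x, weights[a])

def weight_update(weights, x, a, r, x_next, eta, gamma):
    """Adapt only the weight vector of the executed action a."""
    d = td_error(weights, x, a, r, x_next, gamma)
    # For the linear approximator the gradient of Q with respect to w is x.
    weights[a] = [wi + eta * d * xi for wi, xi in zip(weights[a], x)]
    return d

weights = {"a": [0.0, 0.0], "b": [0.0, 0.0]}
d = weight_update(weights, [1.0, 2.0], "a", 1.0, [0.0, 1.0], eta=0.1, gamma=0.9)
```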

The neural network, which represents the financial market system as described in [1], is trained using the training data, which describe information about previous price developments of a financial market as time series values.

Another approximate dynamic programming method, the so-called TD (λ) learning method, is known from [2] and is explained in more detail in connection with an exemplary embodiment.

It is also known from [3] which risk is associated with a strategy π and an initial state x. A method for risk avoidance is also known from [3].

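In the method known from [3], the following optimization function, which is also referred to as an extended Q function $Q^{\pi}(x_t, a_t)$, is used:

$$Q^{\pi}(x_t, a_t) = \min_{\substack{x_{t+1}, x_{t+2}, \ldots \\ p(x_{t+1}, x_{t+2}, \ldots) > 0}} \left[ r(x_t, a_t, x_{t+1}) + \sum_{k=t+1}^{\infty} \gamma^{k-t} \, r(x_k, \pi(x_k), x_{k+1}) \right]. \tag{14}$$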

The extended Q function $Q^{\pi}(x_t, a_t)$ describes the worst case if the action $a_t$ is carried out in the state $x_t$ and the strategy $\pi$ is then followed.

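The optimization function

$$Q^*(x_t, a_t) := \max_{\pi} Q^{\pi}(x_t, a_t) \tag{15}$$

is given by the following rule:

$$Q^*(x_t, a_t) = \min_{\substack{x_{t+1} \in X \\ p(x_{t+1} \mid x_t, a_t) > 0}} \left[ r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q^*(x_{t+1}, a) \right]. \tag{16}$$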

A major disadvantage of this approach is the fact that only the worst case is considered in the context of strategy finding. However, this only insufficiently reflects the requirements of a wide variety of technical systems.

From [4] it is also known to formulate access control for a communication network and routing within the communication network as a problem of dynamic programming.

The invention is therefore based on the problem of specifying a method and an arrangement for determining a sequence of actions for a system, with which increased flexibility in determining the strategy is achieved.

The problem is solved by the method and by the arrangement according to the features of the independent claims.

In a method for the computer-aided determination of a sequence of actions for a system which has states, a state transition between two states taking place as a result of an action, the sequence of actions is determined in such a way that a sequence of states resulting from the sequence of actions is optimized with regard to a predetermined optimization function. The optimization function contains a variable parameter with which a risk that the resulting sequence of states has with respect to a predetermined state of the system can be set.

An arrangement for determining a sequence of actions for a system which has states, a state transition between two states taking place as a result of an action, has a processor which is set up in such a way that the sequence of actions can be determined such that a sequence of states resulting from the sequence of actions is optimized with regard to a predetermined optimization function, the optimization function containing a variable parameter with which a risk that the resulting sequence of states has with respect to a predetermined state of the system can be set.

The invention makes it possible for the first time to specify a method for determining a sequence of actions with freely definable accuracy as part of finding a strategy for a possible regulation or control of the system or, more generally, for influencing the system.

Preferred developments of the invention result from the dependent claims.

The further developments described below apply to both the method and the arrangement; in each further development of the arrangement, the processor is set up in such a way that the further development can be implemented.

In a preferred embodiment, a method of approximate dynamic programming is used for the determination, for example a method based on Q learning or a method based on TD($\lambda$) learning.

As part of Q learning, the optimization function OFQ is preferably formed in accordance with the following rule:

$$\mathrm{OFQ} = Q(x; w^a),$$

where

• $x$ denotes a state in a state space $X$,

• $a$ denotes an action from an action space $A$,

• $w^a$ denotes the weights of a function approximator associated with the action $a$.

As part of Q learning, the following adaptation step is carried out to determine the optimal weights $w$ of the function approximator:

$$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$$

with the abbreviation

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}),$$

where

• $x_t$, $x_{t+1}$ each denote a state in the state space $X$,

• $a_t$ denotes an action from an action space $A$,

• $\gamma$ denotes a predeterminable reduction factor,

• $w_t^{a_t}$ denotes the weight vector belonging to the action $a_t$ before the adaptation step,

• $w_{t+1}^{a_t}$ denotes the weight vector belonging to the action $a_t$ after the adaptation step,

• $\eta_t$ ($t = 1, \ldots$) denotes a predeterminable step size sequence,

• $\kappa \in [-1; 1]$ denotes a risk control parameter,

• $\aleph^{\kappa}$ denotes a risk control function $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$,

• $\nabla Q(\cdot\,; \cdot)$ denotes the derivative of the function approximator with respect to its weights,

• $r(x_t, a_t, x_{t+1})$ denotes a gain in the state transition from the state $x_t$ to the subsequent state $x_{t+1}$.
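The effect of the risk control function on the Q learning adaptation step can be sketched as follows. The sketch assumes a linear function approximator per action (gradient equal to the state vector) in place of a general approximator; `risk_transform` and `risk_sensitive_q_step` are hypothetical names chosen for the example.

```python
def risk_transform(xi, kappa):
    """Risk control function: (1 - kappa * sign(xi)) * xi, with kappa in [-1, 1]."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    sign = (xi > 0) - (xi < 0)
    return (1.0 - kappa * sign) * xi

def q_value(x, w):
    """Linear stand-in for the function approximator: Q(x; w) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def risk_sensitive_q_step(weights, x, a, r, x_next, eta, gamma, kappa):
    """One adaptation step; the temporal difference passes through risk_transform."""
    best_next = max(q_value(x_next, w) for w in weights.values())
    d = r + gamma * best_next - q_value(x, weights[a])
    scaled = risk_transform(d, kappa)
    # Gradient of the linear approximator with respect to its weights is x.
    weights[a] = [wi + eta * scaled * xi for wi, xi in zip(weights[a], x)]
    return d

weights = {"a": [0.0, 0.0], "b": [0.0, 0.0]}
d = risk_sensitive_q_step(weights, [1.0, 2.0], "a", 1.0, [0.0, 1.0],
                          eta=0.1, gamma=0.9, kappa=0.5)
```

With a positive $\kappa$, positive temporal differences are scaled down and negative ones are scaled up, so unfavorable surprises weigh more heavily in the learned values; $\kappa = 0$ recovers the ordinary update.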

In the context of the TD($\lambda$) learning method, the optimization function is preferably formed in accordance with the following rule:

$$J(x; w),$$

where

• $x$ denotes a state in a state space $X$,

• $a$ denotes an action from an action space $A$,

• $w$ denotes the weights of a function approximator.

In the context of TD($\lambda$) learning, the following adaptation step is carried out to determine the optimal weights $w$ of the function approximator:

$$w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t$$

with the abbreviations

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t),$$

$$z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t),$$

$$z_{-1} = 0,$$

where

• $x_t$, $x_{t+1}$ each denote a state in the state space $X$,

• $a_t$ denotes an action from an action space $A$,

• $\gamma$ denotes a predefinable reduction factor,

• $w_t$ denotes the weight vector before the adaptation step,

• $w_{t+1}$ denotes the weight vector after the adaptation step,

• $\eta_t$ ($t = 1, \ldots$) denotes a predefinable step size sequence,

• $\kappa \in [-1; 1]$ denotes a risk control parameter,

• $\aleph^{\kappa}$ denotes a risk control function $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$,

• $\nabla J(\cdot\,; \cdot)$ denotes the derivative of the function approximator with respect to its weights,

• $r(x_t, a_t, x_{t+1})$ denotes a gain in the state transition from the state $x_t$ to the subsequent state $x_{t+1}$.
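The TD($\lambda$) adaptation step with the trace $z_t$ can be sketched in the same spirit. A linear approximator $J(x; w) = w \cdot x$ is assumed for the example, so that the gradient with respect to the weights is the state vector; the function names are hypothetical.

```python
def risk_transform(xi, kappa):
    """Risk control function: (1 - kappa * sign(xi)) * xi."""
    sign = (xi > 0) - (xi < 0)
    return (1.0 - kappa * sign) * xi

def j_value(x, w):
    """Linear stand-in for the function approximator: J(x; w) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def td_lambda_step(w, z_prev, x, r, x_next, eta, gamma, lam, kappa):
    """One risk-sensitive TD(lambda) step; returns (new weights, new trace, d_t)."""
    d = r + gamma * j_value(x_next, w) - j_value(x, w)
    # Trace update: z_t = lambda * gamma * z_{t-1} + gradient of J at x_t.
    z = [lam * gamma * zp + xi for zp, xi in zip(z_prev, x)]
    w_new = [wi + eta * risk_transform(d, kappa) * zi for wi, zi in zip(w, z)]
    return w_new, z, d

w, z = [0.0, 0.0], [0.0, 0.0]  # z_{-1} = 0
w, z, d = td_lambda_step(w, z, [1.0, 0.0], -1.0, [0.0, 1.0],
                         eta=0.1, gamma=0.9, lam=0.5, kappa=0.5)
```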

The system is preferably a technical system, from which measured variables are measured before the determination, which are used in determining the sequence of actions.

The technical system can be controlled or regulated using the determined sequence of actions.

The system is preferably modeled as a Markov decision problem.

The method or the arrangement is preferably used in a traffic control system or in a communication system, the sequence of actions being used to carry out access control or routing, that is to say path assignment, in a communication network of the communication system.

Furthermore, the system can be a financial market which is modeled as a Markov decision problem. The course of the financial market, for example the course of a stock index or a price trend of a foreign exchange market, can be analyzed using the method or the arrangement, and the market can be intervened in according to the determined sequence of actions.

Embodiments of the invention are shown in the figures and are explained in more detail below.

The figures show:

Figure 1 is a flowchart showing individual process steps of the first embodiment;

FIG. 2 shows a sketch of a system which can be modeled as a Markov decision problem;

FIG. 3 shows a sketch of a communication network in which access control is carried out in a switching unit;

FIG. 4 shows a symbolic sketch of a function approximator with which a method of approximate dynamic programming is implemented;

FIG. 5 shows a further sketch of a number of function approximators, with which an approximate dynamic programming is implemented;

Figure 6 is a sketch of a traffic control system, which is controlled according to an exemplary embodiment.

First embodiment: access control and routing

FIG. 3 shows a communication network 300 which has a multiplicity of switching units 301a, 301b, ..., 301i, ..., 301n which are connected to one another via connections 302a, 302b, ..., 302j, ..., 302m.

Furthermore, a first terminal 303 is connected to a first switching unit 301a. A request message 304 is sent from the first terminal 303 to the first switching unit 301a, with which a reservation of a predetermined bandwidth within the communication network 300 for the transmission of data (video data, textual data) is requested.

In the first switching unit 301a, a strategy described below is used to determine whether the requested bandwidth is available in the communication network 300 on a specified, requested connection (step 305).

If this is not the case, the request is rejected (step 306).

If sufficient bandwidth is available, a further check step (step 307) checks whether the bandwidth can be reserved.

If this is not the case, the request is rejected (step 308).

Otherwise, the first switching unit 301a selects a route from the first switching unit 301a via further switching units 301i to a second terminal 309 with which the first terminal 303 wants to communicate, and a connection is initialized (step 310).

In the following, a communication network 300 is assumed which comprises a set of switching units

$$\mathcal{N} = \{1, \ldots, n, \ldots, N\} \tag{17}$$

and a set of physical connections

$$\mathcal{L} = \{1, \ldots, l, \ldots, L\}, \tag{18}$$

wherein a physical connection $l$ has a capacity of $B(l)$ bandwidth units.

A set

$$\mathcal{M} = \{1, \ldots, m, \ldots, M\} \tag{19}$$

of different service types $m$ is available, a service type $m$ being characterized by

• a bandwidth requirement $b(m)$,

• an average connection duration $\frac{1}{\nu(m)}$, and

• a gain $c(m)$, which is obtained when a connection request of the corresponding service type $m$ is accepted.

The profit c (m) is given by the amount of money that a network operator of the communication network 300 charges a subscriber for a connection of the service type. The profit c (m) clearly reflects different priorities which can be specified by the network operator and which he associates with different services.

A physical connection $l$ can simultaneously provide any combination of communication connections as long as the bandwidth used for the communication connections does not exceed the total available bandwidth of the physical connection. If a new communication connection of type $m$ is requested between a first node $i$ and a second node $j$ (terminals are also referred to as nodes), the requested communication connection can, as shown above, either be accepted or rejected.

If the communication connection is accepted, a route is selected from a set of predefined routes. This selection is called routing. A communication connection of type $m$ uses $b(m)$ bandwidth units on each physical connection along the selected route for the duration of the connection.

A route within the communication network 300 can therefore only be selected as part of the access control (call admission control) if the selected route has sufficient bandwidth available.

The goal of access control and routing is to maximize long-term gain that is obtained by accepting the requested connections.

At a point in time $t$, the technical system communication network 300 is in a state $x_t$, which is described by a list of routes over existing connections; the list shows how many connections of which service type use the respective route at the point in time $t$.

Events $\omega$, through which a state $x_t$ can be converted into a subsequent state $x_{t+1}$, are the arrival of new connection request messages or the termination of a connection existing in the communication network 300.

In this exemplary embodiment, an action $a_t$ at a time $t$, taken in response to a connection request, is the decision whether to accept or reject the connection request and, if the connection is accepted, the selection of the route through the communication network 300.

The aim is to determine a sequence of actions, i.e. illustratively to learn a strategy with actions for a state $x_t$, in such a way that the following rule is maximized:

$$E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \tag{20}$$

where

• $E\{\cdot\}$ denotes an expected value,

• $t_k$ denotes a point in time at which a $k$-th event occurs,

• $g(x_{t_k}, \omega_k, a_{t_k})$ denotes the gain associated with the $k$-th event, and

• $\beta$ denotes a reduction factor that values an immediate gain more highly than a gain at distant future times.

Different implementations of a strategy usually lead to different total gains $G$:

$$G = \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}). \tag{21}$$
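The total gain of one realized sequence of events can be computed directly from the event times and the gains. The sketch below assumes a finite list of `(t_k, g_k)` pairs; the function name and the example numbers are hypothetical.

```python
import math

def total_gain(events, beta):
    """Total gain G: sum over events of exp(-beta * t_k) * g_k.

    events is a finite list of (t_k, g_k) pairs, where g_k is the immediate
    gain of the k-th event occurring at time t_k.
    """
    return sum(math.exp(-beta * t_k) * g_k for t_k, g_k in events)

# Two accepted requests (gains 2 and 5) and one event without gain.
events = [(0.0, 2.0), (1.5, 0.0), (3.0, 5.0)]
undiscounted = total_gain(events, beta=0.0)
discounted = total_gain(events, beta=0.1)
```

For a positive $\beta$, the gain of the later event counts for less than its nominal value, which is how an immediate gain is valued more highly than a distant one.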

The goal is to maximize the expected value of the total gain $G$ according to the following rule:

$$J = E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \tag{22}$$

where a risk that the total gain $G$ of a particular implementation of an access control and a routing strategy falls below the expected value can be set.

The TD (λ) learning method is used to perform access control and routing.

The following target function is used in the context of this exemplary embodiment:

$$J^*(x_t) = E_{\tau}\left\{ e^{-\beta \tau} \right\} \, E_{\omega}\left\{ \max_{a \in A} \left[ g(x_t, \omega_t, a) + J^*(x_{t+1}) \right] \right\}, \tag{23}$$

where

• $A$ denotes an action space with a predetermined number of actions which are available in a state $x_t$,

• $\tau$ denotes a first point in time at which a first event $\omega$ takes place,

• $x_{t+1}$ denotes a subsequent state of the system.

An approximated value of the target value J * (xt) is learned and stored using a function approximator 400 (see FIG. 4) using training data.

Training data are data previously measured in the communication network 300 about the behavior of the communication network 300 when connection requests 304 arrive and when connections are terminated. This chronological sequence of states is stored, and the function approximator 400 is trained using this training data in accordance with the learning method described below. For each input 401, 402, 403 of the function approximator 400, a number of connections of a service type $m$ on a route of the communication network 300 is used as the input variable. These are represented symbolically in FIG. 4 by blocks 404, 405, 406.

The output variable of the function approximator 400 is an approximated target value $\tilde{J}$ of the target value $J^*$.

A more detailed illustration of the function approximator 500, which in this case has several partial function approximators 510, 520, is shown in FIG. 5. An output variable is the approximated target value $\tilde{J}$, which is formed in accordance with the following rule:

The input variables of the partial function approximators 510, 520, which are present at inputs 511, 512, 513 of the first partial function approximator 510 and at inputs 521, 522, 523 of the second partial function approximator 520, are each a number of connections of a service type $m$ on a physical connection, symbolized by blocks 514, 515, 516 for the first partial function approximator 510 and by blocks 524, 525, 526 for the second partial function approximator 520.

Partial output variables 530, 531, 532, 533 are supplied to an adding unit 540 and the approximated target variable J is formed as the output variable of the adding unit.

Assume that the communication network 300 is in the state $x_{t_k}$ and that a request message, with which a connection of service type $m$ between two nodes $i$, $j$ is requested, arrives at the first switching unit 301a.

$R(i, j)$ denotes a list of permitted routes between the nodes $i$ and $j$, and

$$R(i, j, x_{t_k}) \subset R(i, j) \tag{25}$$

denotes a list of all possible routes, i.e. the subset of the routes $R(i, j)$ that could implement a possible connection with regard to the available and the requested bandwidth.
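The restriction of the permitted routes to the possible routes can be sketched as a bandwidth check on every physical connection of a route. The data layout (routes as tuples of link identifiers, dictionaries for capacity and used bandwidth) and the function name are assumptions made for the example.

```python
def possible_routes(routes, capacity, used, demand):
    """Return the routes on which every link can still carry `demand` units.

    routes   : permitted routes, each a tuple of link identifiers
    capacity : link identifier -> capacity B(l) in bandwidth units
    used     : link identifier -> bandwidth currently in use
    demand   : bandwidth requirement b(m) of the requested service type
    """
    return [
        route
        for route in routes
        if all(used.get(link, 0) + demand <= capacity[link] for link in route)
    ]

capacity = {1: 10, 2: 10, 3: 4}
used = {1: 8, 2: 2, 3: 0}
routes = [(1,), (2, 3)]
narrow = possible_routes(routes, capacity, used, demand=3)
none_fit = possible_routes(routes, capacity, used, demand=5)
```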

For each possible route $r$, $r \in R(i, j, x_{t_k})$, a subsequent state $x_{t_k+1}(x_{t_k}, \omega_k, r)$ is determined, which results when the connection request 304 is accepted and the connection is made available to the requesting first terminal 303 on the route $r$.

This is shown in FIG. 1 as a second step (step 102), the state of the system and the respective event being ascertained in a first step (step 101).

In a third step (step 103), a route $r^*$ to be selected is determined in accordance with the following rule:

$$r^* = \arg\max_{r \in R(i, j, x_{t_k})} \tilde{J}\left(x_{t_k+1}(x_{t_k}, \omega_k, r), \Theta_t\right). \tag{26}$$

In a further step (step 104) it is checked whether the following condition is met:

$$c(m) + \tilde{J}\left(x_{t_k+1}(x_{t_k}, \omega_k, r^*), \Theta_t\right) < \tilde{J}(x_{t_k}, \Theta_t). \tag{27}$$

If this is the case, the connection request 304 is rejected (step 105); otherwise the connection is accepted and "switched through" to the node $j$ along the selected route $r^*$ (step 106).
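Steps 103 to 106 can be sketched as a single decision function. The sketch reads the rejection condition as comparing the gain $c(m)$ plus the approximated value of the subsequent state against the approximated value of the current state, which is an interpretation of the condition above. The state encoding (a tuple of per-link loads), the toy value function and all names are hypothetical.

```python
def decide(x, routes, next_state, c_m, j_tilde):
    """Return ("accept", route) or ("reject", None) for one connection request.

    x          : current state
    routes     : possible routes for the request (already bandwidth-checked)
    next_state : function (x, route) -> subsequent state if the route is used
    c_m        : gain c(m) of the requested service type
    j_tilde    : function state -> approximated target value
    """
    if not routes:
        return ("reject", None)
    # Step 103: pick the route whose subsequent state has the highest value.
    best = max(routes, key=lambda r: j_tilde(next_state(x, r)))
    # Step 104: reject if accepting is worth less than leaving the state as is.
    if c_m + j_tilde(next_state(x, best)) < j_tilde(x):
        return ("reject", None)
    return ("accept", best)

# Toy example: state = load per link, value = negative sum of squared loads.
def add_load(x, route):
    return tuple(v + 1 if i in route else v for i, v in enumerate(x))

def value(x):
    return -float(sum(v * v for v in x))

accepted = decide((2, 0), [(0,), (1,)], add_load, c_m=2.0, j_tilde=value)
rejected = decide((2, 0), [(0,), (1,)], add_load, c_m=0.5, j_tilde=value)
```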

In a parameter vector Θ, weights of the functional approximators 400, 500 are stored for a time t, which are adapted to the training data as part of the TD (λ) learning method, so that an optimized access control and an optimized routing is achieved.

During the training phase, the weight parameters are adapted to the training data applied to the function approximator.

A risk control parameter $\kappa$ is defined, by means of which a desired risk that the system has, as a result of a sequence of actions and states, with regard to a predetermined state of the system can be set in accordance with the following rules:

$-1 \le \kappa < 0$: risk-seeking learning,

$\kappa = 0$: risk-neutral learning,

$0 < \kappa < 1$: risk-avoiding learning,

$\kappa = 1$: "worst case" learning.

Furthermore, a specifiable parameter $0 \le \lambda \le 1$ and a step size sequence $\gamma_k$ are specified as part of the learning method.

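The weight values of the weight vector $\Theta$ are adapted to the training data at each event $\omega_{t_k}$ in accordance with the following adaptation rule:

$$\Theta_k = \Theta_{k-1} + \gamma_k \, \aleph^{\kappa}(d_k) \, z_t, \tag{28}$$

where

$$d_k = e^{-\beta (t_k - t_{k-1})} \left( g(x_{t_k}, \omega_k, a_{t_k}) + \tilde{J}(x_{t_k}, \Theta_{k-1}) \right) - \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}), \tag{29}$$

$$z_t = \lambda \, e^{-\beta (t_{k-1} - t_{k-2})} \, z_{t-1} + \nabla_{\Theta} \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}), \tag{30}$$

and

$$\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi. \tag{31}$$

It is assumed that $z_{-1} = 0$.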
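The event-driven adaptation rule can be sketched for a single event as follows. As an assumption for the example, the function approximator is linear in $\Theta$, so that its gradient is the state feature vector; the reduction between events is $e^{-\beta \Delta t}$ with the elapsed time $\Delta t$ passed in explicitly. All names are hypothetical.

```python
import math

def risk_transform(xi, kappa):
    """Risk control function: (1 - kappa * sign(xi)) * xi."""
    sign = (xi > 0) - (xi < 0)
    return (1.0 - kappa * sign) * xi

def j_tilde(x, theta):
    """Linear stand-in for the function approximator."""
    return sum(th * xi for th, xi in zip(theta, x))

def event_td_step(theta, z_prev, x_prev, x_cur, g, dt_cur, dt_prev,
                  step, beta, lam, kappa):
    """One adaptation at the k-th event.

    dt_cur  = t_k - t_{k-1}, dt_prev = t_{k-1} - t_{k-2}
    Returns (new theta, new trace, temporal difference d_k).
    """
    d = (math.exp(-beta * dt_cur) * (g + j_tilde(x_cur, theta))
         - j_tilde(x_prev, theta))
    z = [lam * math.exp(-beta * dt_prev) * zp + xi
         for zp, xi in zip(z_prev, x_prev)]
    theta_new = [th + step * risk_transform(d, kappa) * zi
                 for th, zi in zip(theta, z)]
    return theta_new, z, d

theta, z, d = event_td_step([0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                            g=1.0, dt_cur=2.0, dt_prev=1.0,
                            step=0.1, beta=0.0, lam=0.5, kappa=-0.5)
```

In this call $\kappa$ is negative, so the positive temporal difference is amplified (factor 1.5), which corresponds to risk-seeking learning.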

The function

$$g(x_{t_k}, \omega_k, a_{t_k}) \tag{32}$$

denotes the immediate gain according to the following rule:

$$g(x_{t_k}, \omega_k, a_{t_k}) = \begin{cases} c(m) & \text{if } \omega_{t_k} \text{ is a service request of service type } m \text{ and the connection is accepted,} \\ 0 & \text{otherwise.} \end{cases} \tag{33}$$

As described above, a sequence of actions is thus determined with regard to a connection request, so that a connection request is either rejected or accepted on the basis of an action. The determination takes into account an optimization function in which the risk is variably adjustable by means of a risk control parameter $\kappa \in [-1; 1]$.

Second embodiment: traffic management system

FIG. 6 shows a street 600 which is used by cars 601, 602, 603, 604, 605 and 606.

Conductor loops 610, 611 integrated in the street 600 pick up electrical signals in a known manner and feed the electrical signals 615, 616 to a computer 620 via an input/output interface 621. In an analog/digital converter 622 connected to the input/output interface 621, the electrical signals are digitized into a time series and stored in a memory 623, which is connected via a bus 624 to the analog/digital converter 622 and to a processor 625. Via the input/output interface 621, a traffic control system 650 is supplied with control signals 651, from which a predefined speed limit 652 can be set in the traffic control system 650, as well as further traffic regulation information, which is displayed via the traffic control system 650 to the drivers of the vehicles 601, 602, 603, 604, 605 and 606.

In this case, the following local state variables are used for traffic modeling:

• traffic flow velocity $v$,

• vehicle density $p$ ($p$ = number of vehicles per kilometer, vehicles/km),

• traffic flow $q$ ($q$ = number of vehicles per hour, vehicles/h; $q = v \cdot p$), and

• speed limits 652 displayed by the traffic control system 650 at a given time.

The local state variables are measured as described above using the conductor loops 610, 611. These variables (v (t), p (t), q (t)) thus represent a state of the technical system "traffic" at a specific time t.

In this exemplary embodiment, the system is thus a traffic system which is regulated using the traffic control system 650.

In this second exemplary embodiment, an extended Q learning method is described as a method of approximate dynamic programming.

The state xt is described by a state vector

x (t) = (v (t), p (t), q (t)). (34)

The action $a_t$ denotes the speed limit 652, which is displayed by the traffic control system 650 at time $t$.

The gain $r(x_t, a_t, x_{t+1})$ describes the quality of the traffic flow that was measured by the conductor loops 610 and 611 between the times $t$ and $t+1$. In the context of this second exemplary embodiment, $r(x_t, a_t, x_{t+1})$ denotes

• the average speed of the vehicles in the time interval $[t, t+1]$,

or

• the number of vehicles which have passed the conductor loops 610 and 611 in the time interval $[t, t+1]$,

or

• the variance of the vehicle speeds in the time interval $[t, t+1]$,

or

• a weighted sum of the above quantities.

For every possible action $a_t$, i.e. for each speed limit that can be displayed by the traffic control system 650, a value of the optimization function OFQ is determined, an estimate of the optimization function OFQ being implemented as a neural network.

This results in a set of evaluation variables for the different actions $a_t$ in the system state $x_t$.

In a control phase, that action $a_t$ is selected from the possible actions, i.e. from the set of speed limits that can be displayed by the traffic control system 650, for which the maximum evaluation variable OFQ has been determined in the current system state $x_t$.
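The selection in the control phase amounts to evaluating one value per displayable speed limit and taking the maximum. The sketch below assumes a linear evaluation per speed limit as a stand-in for the neural network; the speed limits, weights and state values are made-up example numbers.

```python
def q_value(x, w):
    """Linear stand-in for the neural network estimate of OFQ."""
    return sum(wi * xi for wi, xi in zip(w, x))

def select_speed_limit(x, weights):
    """Return the speed limit whose evaluation variable is largest in state x.

    x       : state vector (v, p, q)
    weights : speed limit -> weight vector belonging to that speed limit
    """
    return max(weights, key=lambda a: q_value(x, weights[a]))

# Hypothetical weights for three displayable speed limits (km/h).
weights = {
    80: [0.10, 0.0, 0.0],
    100: [0.20, 0.0, 0.0],
    120: [0.05, 0.0, 0.0],
}
state = (90.0, 20.0, 1800.0)  # v in km/h, p in vehicles/km, q = v * p
chosen = select_speed_limit(state, weights)
```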

The adaptation rule known from the Q learning method for calculating the optimization function OFQ is extended according to this exemplary embodiment by a risk control function $\aleph^{\kappa}$, which takes the risk into account.

Again, the risk control parameter $\kappa$ is specified in the interval $-1 \le \kappa \le 1$, as in the first exemplary embodiment, and represents the risk that a user is willing to take in the context of the application with regard to the control strategy to be determined.

According to this exemplary embodiment, the following evaluation function OFQ is used:

$$\mathrm{OFQ} = Q(x; w^a), \tag{35}$$

where

• $x = (v; p; q)$ denotes a state of the traffic system,

• $a$ denotes a speed limit from the action space $A$ of all speed limits that can be displayed by the traffic control system 650,

• $w^a$ denotes the weights of the neural network belonging to the speed limit $a$.

As part of Q learning, the following adaptation step is carried out to determine the optimum weights $w$ of the neural network:

$$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t}) \tag{36}$$

with the abbreviation

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}), \tag{37}$$

where

• $x_t$, $x_{t+1}$ each denote a state of the traffic system according to rule (34),

• $a_t$ denotes an action, i.e. a speed limit that can be displayed by the traffic control system 650,

• $\gamma$ denotes a predefinable reduction factor,

• $w_t^{a_t}$ denotes the weight vector belonging to the action $a_t$ before the adaptation step,

• $w_{t+1}^{a_t}$ denotes the weight vector belonging to the action $a_t$ after the adaptation step,

• $\eta_t$ ($t = 1, \ldots$) denotes a predefinable step size sequence,

• $\kappa \in [-1; 1]$ denotes a risk control parameter,

• $\aleph^{\kappa}$ denotes a risk control function $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$,

• $\nabla Q(\cdot\,; \cdot)$ denotes the derivative of the neural network with respect to its weights,

• $r(x_t, a_t, x_{t+1})$ denotes a gain in the state transition from the state $x_t$ to the subsequent state $x_{t+1}$.

In the course of learning, an action $a_t$ can be chosen at random from the possible actions. It is not necessary to choose the action that has led to the largest evaluation variable.

The weights are to be adapted in such a way that not only is a traffic control achieved that is optimized in the expected value of the optimization function, but a variance of the control results is also taken into account.

This is particularly advantageous since the state vector $x(t)$ models the actual traffic system only inadequately in some respects, so that unexpected disturbances can occur. The dynamics of the traffic, and thus its modeling, depend on further factors such as the weather, the share of trucks on the road, the share of mobile homes, etc., which are not always included in the measured variables of the state vector $x(t)$. In addition, it is not always ensured that the road users immediately follow the new speed information given by the traffic control system.

A regulation phase on the real system according to the traffic control system takes place according to the following steps:

The state $x_t$ at time $t$ is measured at various points in the traffic system and results in a state vector $x(t) := (v(t), p(t), q(t))$. A value of the optimization function is determined for all possible actions $a_t$, and the action $a_t$ with the highest value of the optimization function is selected.

The following publications are cited in this document:

[1] R. Neuneier, Enhancing Q-Learning for Optimal Asset Allocation, Proceedings of the Neural Information Processing Systems, NIPS 1997

[2] R.S. Sutton, Learning to predict by the method of temporal differences, Machine Learning, 3: 9-44, 1988

[3] M. Heger, Risk and Reinforcement Learning: Concepts and Dynamic Programming, ZKW Report No. 8/94, Center for Cognitive Sciences, University of Bremen, ISSN 0947-0204, December 1994

[4] D.P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, MA, 1995

Claims

1. A method for computer-aided determination of a sequence of actions for a system which has states, a state transition between two states taking place as a result of an action, in which the sequence of actions is determined in such a way that a sequence of states resulting from the sequence of actions is optimized with regard to a predetermined optimization function, the optimization function containing a variable parameter with which a risk which the resulting sequence of states has with respect to a predetermined state of the system can be set.

2. The method according to claim 1, in which a method of approximate dynamic programming is used for the determination.

3. The method according to claim 2, wherein the method of approximate dynamic programming is a method based on Q learning.

4. The method as claimed in claim 3, in which the optimization function OFQ is formed as part of Q learning in accordance with the following rule:

$$OFQ = Q(x; w^{a}),$$

where

• $x$ denotes a state in a state space $X$,
• $a$ denotes an action from an action space $A$,
• $w^{a}$ denotes the weights of a function approximator that are associated with the action $a$,

and in which the weights of the function approximator are adapted according to the following rule:

$$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$$

with the abbreviation

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}),$$

where

• $x_t$, $x_{t+1}$ each denote a state in the state space $X$,
• $a_t$ denotes an action from the action space $A$,
• $\gamma$ denotes a predefinable reduction factor,
• $w_t^{a_t}$ denotes the weight vector associated with the action $a_t$ before the adaptation step,
• $w_{t+1}^{a_t}$ denotes the weight vector associated with the action $a_t$ after the adaptation step,
• $\eta_t$ $(t = 1, \ldots)$ denotes a predefinable step size sequence,
• $\kappa \in [-1; 1]$ denotes a risk control parameter,
• $\aleph^{\kappa}$ denotes a risk control function $\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,
• $\nabla Q(\cdot\,;\cdot)$ denotes the derivative of the function approximator with respect to its weights,
• $r(x_t, a_t, x_{t+1})$ denotes a gain on the state transition from the state $x_t$ to the subsequent state $x_{t+1}$.

5. The method of claim 2, wherein the method of approximate dynamic programming is a method based on TD(λ) learning.

6. The method according to claim 5, wherein the optimization function OFTD in the context of the TD(λ) learning is formed according to the following rule:

$$OFTD = J(x; w),$$

where

• $x$ denotes a state in a state space $X$,
• $a$ denotes an action from an action space $A$,
• $w$ denotes the weights of a function approximator,

and in which the weights of the function approximator are adapted according to the following rule:

$$w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t$$

with the abbreviations

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t),$$

$$z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t),$$

$$z_{-1} = 0,$$

where

• $x_t$, $x_{t+1}$ each denote a state in the state space $X$,
• $a_t$ denotes an action from the action space $A$,
• $\gamma$ denotes a predefinable reduction factor,
• $w_t$ denotes the weight vector before the adaptation step,
• $w_{t+1}$ denotes the weight vector after the adaptation step,
• $\eta_t$ $(t = 1, \ldots)$ denotes a predefinable step size sequence,
• $\kappa \in [-1; 1]$ denotes a risk control parameter,
• $\aleph^{\kappa}$ denotes a risk control function $\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,
• $\nabla J(\cdot\,;\cdot)$ denotes the derivative of the function approximator with respect to its weights,
• $r(x_t, a_t, x_{t+1})$ denotes a gain on the state transition from the state $x_t$ to the subsequent state $x_{t+1}$.

7. The method according to any one of claims 1 to 6, wherein the system is a technical system from which measured variables are measured before the determination, which are used in determining the sequence of actions.

8. The method according to claim 7, wherein the technical system is controlled according to the sequence of actions.

9. The method according to claim 7, in which the technical system is regulated according to the sequence of actions.

10. The method according to any one of claims 1 to 9, wherein the system is modeled as a Markov decision problem.

11. The method according to any one of claims 1 to 10, used in a traffic control system.

12. The method according to any one of claims 1 to 10, used in a communication system.

13. The method according to any one of claims 1 to 10, used to carry out an access control in a communication network.

14. The method according to any one of claims 1 to 10, used to perform routing in a communication network.

15. Arrangement for determining a sequence of actions for a system which has states, a state transition between two states taking place as a result of an action, with a processor which is set up in such a way that the sequence of actions can be determined in such a way that a sequence of states resulting from the sequence of actions is optimized with regard to a predetermined optimization function, the optimization function containing a variable parameter with which a risk which the resulting sequence of states has with regard to a predetermined state of the system can be set.

16. The arrangement according to claim 15, used to control a technical system.

17. The arrangement according to claim 15, used to regulate a technical system.

18. The arrangement according to claim 15, used in a traffic control system.

19. The arrangement according to claim 15, used in a communication system.

20. The arrangement according to claim 15, used to carry out an access control in a communication network.

21. The arrangement according to claim 15, used for performing routing in a communication network.
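
For illustration only, and not as part of the claims, the TD(λ) weight adaptation of claim 6 can be sketched as follows. The sketch assumes a linear function approximator J(x; w) = w · x, which the claims do not prescribe; the function names are hypothetical.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def risk_transform(d, kappa):
    """Risk control function (1 - kappa * sign(d)) * d, kappa in [-1, 1]."""
    return (1.0 - kappa * sign(d)) * d

def j_value(w, x):
    """Assumed linear approximator: J(x; w) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def td_lambda_step(w, z_prev, x_t, reward, x_next, eta, gamma, lam, kappa):
    """One TD(lambda) adaptation step.

    z_prev is the previous trace z_{t-1}; pass a zero vector for the
    first step, corresponding to z_{-1} = 0.
    Returns the new weights, the new trace z_t and the difference d_t.
    """
    d_t = reward + gamma * j_value(w, x_next) - j_value(w, x_t)
    # For the assumed linear approximator, the gradient of J(x_t; w)
    # with respect to the weights is the state vector x_t.
    z_t = [lam * gamma * zi + xi for zi, xi in zip(z_prev, x_t)]
    step = eta * risk_transform(d_t, kappa)
    w_next = [wi + step * zi for wi, zi in zip(w, z_t)]
    return w_next, z_t, d_t
```

The trace z_t accumulates the gradients of earlier states, discounted by λ·γ, so that a single temporal difference also adjusts the evaluation of states visited shortly before.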