CN110083063B - Multi-body optimization control method based on non-strategy Q learning - Google Patents


Info

Publication number
CN110083063B
CN110083063B
Authority
CN
China
Prior art keywords
learning
strategy
game
algorithm
zero
Prior art date
Legal status
Active
Application number
CN201910352788.5A
Other languages
Chinese (zh)
Other versions
CN110083063A (en)
Inventor
Li Jinna (李金娜)
Xiao Zhenfei (肖振飞)
Current Assignee
Liaoning Shihua University
Original Assignee
Liaoning Shihua University
Priority date
Filing date
Publication date
Application filed by Liaoning Shihua University filed Critical Liaoning Shihua University
Priority to CN201910352788.5A
Publication of CN110083063A
Application granted
Publication of CN110083063B

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B 13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B 13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a multi-body optimization control method based on non-strategy (off-policy) Q-learning, relates to an optimization control method, and proposes an off-policy Q-learning algorithm for the discrete-time linear non-zero-sum game problem. First, the non-zero-sum game optimization problem is formulated, and it is strictly proven that the value function defined from each individual's performance index is linear quadratic. Then, based on dynamic programming and Q-learning, an off-policy Q-learning algorithm is given that obtains an approximate optimal solution of the non-zero-sum game and achieves global Nash equilibrium of the system. Finally, a numerical simulation verifies the effectiveness of the method. The invention is used to solve the multi-body non-zero-sum game problem of linear discrete-time systems, and the effectiveness of the algorithm is verified by simulation; the invention integrates game theory with off-policy Q-learning and, within the non-zero-sum game framework, proposes an off-policy Q-learning algorithm that learns the optimal control strategies and achieves global Nash equilibrium of the entire system.

Description

Multi-body optimization control method based on non-strategy Q learning
Technical Field
The invention relates to an optimization control method, in particular to a multi-body optimization control method based on non-strategy Q learning.
Background
Adaptive dynamic programming (ADP) is a method for obtaining approximate optimal solutions and is widely used in present-day optimal control. It iteratively approximates the solution of the Hamilton-Jacobi-Bellman equation to obtain a near-optimal solution for the system. A large body of literature studies the optimal control of model-free systems with adaptive dynamic programming, for example: adaptive optimal control of continuous-time linear systems with completely unknown dynamics; H-infinity control of data-driven nonlinear distributed-parameter systems; constrained adaptive dynamic programming algorithms and their stability; data-driven policy-gradient adaptive dynamic programming for optimal control; and adaptive dynamic programming for optimal tracking control of an unknown nonlinear coal-gasification process. Learning optimal control strategies with reinforcement learning is also widely applied to optimize system performance while satisfying control-input constraints and meeting given performance specifications, for example: feedback control under reinforcement learning and approximate dynamic programming; reinforcement-learning-based feedback control that designs adaptive optimal controllers using natural decision methods; reinforcement-learning-based linear quadratic tracking control of partially unknown continuous-time systems; and integral-reinforcement-learning-based optimal tracking control of partially unknown nonlinear systems with input constraints.
An off-policy learning method learns an optimal control strategy without relying on a model, using only collected system data in the iterative updates. Compared with on-policy learning it has three significant advantages: 1) it overcomes insufficient exploration of the system; 2) it does not interfere with the operation of the system during learning, and the disturbance input need not be updated in a prescribed manner; 3) provided the persistent excitation condition is satisfied, the exact solution is obtained without bias even when probing noise is added to the system input. It is noted that much of the literature adopts on-policy learning to study optimal control of a system. Works that adopt off-policy learning include: the H-infinity control problem of continuous-time systems via model-free off-policy reinforcement learning; optimal operational control of dual-time-scale industrial processes via off-policy reinforcement learning; the H-infinity control problem of linear discrete-time systems under off-policy reinforcement learning; H-infinity control design via off-policy reinforcement learning; and optimal control of affine nonlinear discrete-time systems via off-policy interleaved Q-learning. Multi-agent cooperative control systems, that is, dynamic systems with multiple decision makers and multiple control inputs, are widespread in modern production and society. In a non-zero-sum game, each agent must adopt an optimal control strategy to optimize its own performance index. Works that solve game problems with off-policy learning include off-policy reinforcement learning for multi-agent graphical synchronization games, the optimal control of two-player unknown systems with disturbances under off-policy learning, and so on.
Researchers have already applied off-policy Q-learning to the optimal adaptive control of a single system. Whether off-policy Q-learning can be used to study the optimal control problem of games among multiple systems, and how to design an off-policy Q-learning method that achieves Nash equilibrium of multiple systems when the system model is unknown, are the questions addressed by the present invention; they have not been reported in the related literature.
The invention aims to provide a multi-body optimization control method based on non-strategy (off-policy) Q-learning. An off-policy Q-learning method is proposed to solve the multi-body non-zero-sum game problem of linear discrete-time systems, and its effectiveness is verified by simulation. The invention integrates game theory with off-policy Q-learning and, within the non-zero-sum game framework, proposes an off-policy Q-learning algorithm that learns the optimal control strategies and achieves global Nash equilibrium of the entire system.
The purpose of the invention is realized by the following technical scheme:
a multi-body optimization control method based on non-strategy Q learning firstly provides a non-zero and game optimization problem, strictly proves that a value function defined according to individual performance indexes is a linear quadratic form, then provides a non-strategy Q learning algorithm based on a dynamic programming and Q learning method, obtains an approximate optimal solution of the non-zero and game, and realizes the global Nash balance of a system; the algorithm does not require that the parameters of the system model are known, and can completely utilize measurable data to learn Nash equilibrium solution; finally, calculating the effectiveness of the simulation verification method;
the method comprises the following specific steps:
1) describe the discrete-time linear non-zero-sum game problem, and prove that each individual's value function is linear quadratic;
2) solve the non-zero-sum game, and present the off-policy Q-learning algorithm;
3) perform an example simulation, i.e., a program simulation with the newly proposed algorithm, to demonstrate the effectiveness of the algorithm and the convergence of the data.
According to the multi-body optimization control method based on non-strategy Q learning, to prove that the individual value function is linear quadratic, the following linear discrete-time system equation is considered:

[equation (1)]

where the system state evolves under the control inputs of the individuals, with system matrices of appropriate dimensions. A state-feedback controller is designed so that each individual i minimizes its own performance index:

[equation (2)]

where the performance index is quadratic in the state and in the control inputs, with given weighting matrices.
the multi-body optimization control method based on non-strategy Q learning comprises the steps of providing a model-free strategy Q learning algorithm,
Figure 4790DEST_PATH_IMAGE012
the Q function matrix H in equation (19) is learned to obtain optimum control gains for a plurality of individuals.
In the multi-body optimization control method based on non-strategy Q learning, the effectiveness of the off-policy Q-learning algorithm is demonstrated by the example simulation.
The invention has the advantages and effects that:
the invention integrates game theory and non-strategy Q learning method, provides the non-strategy Q learning method under the framework of non-zero and game, learns the optimal control strategy, and realizes the global Nash of the whole system
Figure 972746DEST_PATH_IMAGE013
And (4) equalizing. Firstly, defining controllers of a plurality of intelligent agents through dynamic programming, and then obtaining a game Bellman equation based on a non-policy Q function
Figure 6562DEST_PATH_IMAGE013
Figure 452586DEST_PATH_IMAGE014
Obtaining a non-strategy Q learning method, and finally verifying the effectiveness of the method by an algorithm.
Drawings
FIG. 1 shows the convergence of H under the on-policy Q-learning method;
FIG. 2 shows the convergence of K under the on-policy Q-learning method;
FIG. 3 shows the system state x for the first probing-noise scheme under the on-policy Q-learning method;
FIG. 4 shows the system state x for the second probing-noise scheme under the on-policy Q-learning method;
FIG. 5 shows the system state x for the third probing-noise scheme under the on-policy Q-learning method;
FIG. 6 shows the convergence of H under the off-policy Q-learning method;
FIG. 7 shows the convergence of K under the off-policy Q-learning method;
FIG. 8 shows the system state x under the off-policy Q-learning method.
Detailed Description
The present invention will be described in detail with reference to examples.
1. Problem formulation
The discrete-time linear non-zero-sum game problem is first described, and then the value function of each individual is proven to be linear quadratic.
Consider the following linear discrete-time system equation:

[equation (1)]

where the system state evolves under the control inputs of the n individuals, with system matrices of appropriate dimensions. A state-feedback controller is designed so that each individual i minimizes its own performance index:

[equation (2)]

where the performance index is quadratic in the state and in the control inputs, with given weighting matrices.

Problem 1: minimize the performance indices

[equation (3)]

subject to the constraints

[equation]

According to the performance index (3), the optimal value function and the optimal Q-function of each individual are defined respectively as

[equation]

and

[equation]

so that the relationship between the two is

[equation]

Theorem 1: For game problem 1, if the control inputs are admissible controls, the optimal value function and the optimal Q-function can be expressed as quadratic forms as follows:

[equation]

and

[equation]

where the kernel matrices of the quadratic forms satisfy

[equation]

Proof:

[equation]

where [definition]. Further,

[equation]  (11)

and

[equation]

where [definition]. Furthermore,

[equation]

and

[equation], [equation].

From equations (6), (12) and (13) it follows that

[equation]

where [definition]. This completes the proof.
2. Solving the non-zero-sum game

The invention mainly proposes an off-policy Q-learning method. It is well known that the basis of a game is Nash equilibrium.

Definition 1 (Nash equilibrium): If, for all admissible policies, the following n inequalities are satisfied,

[inequalities]

then this n-tuple of policies

[policy n-tuple]

constitutes a Nash equilibrium of the n-player game in normal form, and the corresponding n-tuple of values

[value n-tuple]

is the Nash equilibrium outcome of this n-player game.
From equations (5) and (6), the following Bellman equation based on the Q-function is obtained according to dynamic programming:

[equation]

Then, taking partial derivatives of the game Bellman equation of the optimal Q-function, the optimal control gain of each individual can be obtained:

[equation]

from which one obtains:

[equation]

Substituting the gain expression from equation (18) into the Riccati equation yields the equation satisfied by the optimal Q-function:

[equation]

It has been demonstrated in the literature that

[condition]

ensures that system (1) attains Nash equilibrium:

[equation]

[equation]  (20)
Note 1: it can be seen from the equations (18) and (20) that neither the bellman equation nor the ricatt equation for the optimal Q function, in which the matrices H are coupled to each other and the K values are also coupled to each other, is well solved. Therefore, the strategy Q learning algorithm is given below.
2.1 On-policy Q-learning algorithm
A model-free on-policy Q-learning algorithm is given below; it learns the Q-function matrix H in equation (19) to obtain the optimal control gains of the multiple individuals.
Algorithm 1: on-policy Q-learning algorithm
1. Initialization: give initial values of the control gains of the multiple individuals, where the superscript denotes the iteration index;
2. Policy evaluation: solve for the Q-function matrices in the Q-function Bellman equation:

[equation (21)]

3. Policy update:

[equation]

where the control gain term of the i-th individual can be expressed as:

[equation]  (23)

4. Stop the iteration when the difference between successive Q-function matrices is smaller than a prescribed small positive threshold.
Note 2: in policy updating, find
Figure 405024DEST_PATH_IMAGE070
And further can find
Figure 561199DEST_PATH_IMAGE071
Thereby can find
Figure 295937DEST_PATH_IMAGE072
. Namely:
Figure 904773DEST_PATH_IMAGE073
when in use
Figure 750369DEST_PATH_IMAGE074
When the time is about to be infinite,
Figure 761050DEST_PATH_IMAGE075
tend to be
Figure 666690DEST_PATH_IMAGE076
Which is
Figure 762821DEST_PATH_IMAGE077
Converge on
Figure 412109DEST_PATH_IMAGE078
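A minimal Python sketch of the on-policy loop of Algorithm 1, under the same standard LQ non-zero-sum form assumed in the earlier sketches: at each iteration, data are generated with the current gains plus probing noise, each player's Q-function matrix H_i is estimated by least squares from the Bellman identity, and the gains are updated from the partitioned H_i. The example data, dimensions and probing noise are illustrative assumptions; as Note 3 points out, the probing noise biases this on-policy estimate, which is what motivates the off-policy algorithm of the next subsection.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-player LQ non-zero-sum game (assumed data, as before).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Q = [np.eye(2), 2.0 * np.eye(2)]
R = [[np.array([[1.0]]), np.array([[0.5]])],
     [np.array([[0.5]]), np.array([[1.0]])]]
nx, nu, n = 2, 1, 2
m = nx + n * nu                            # dimension of z = [x; u1; u2]

def stage_cost(i, x, u):
    return float(x.T @ Q[i] @ x + sum(u[j].T @ R[i][j] @ u[j] for j in range(n)))

K = [np.zeros((nu, nx)) for _ in range(n)] # initial (stabilizing) gains, u_i = -K_i x

for it in range(10):                       # on-policy Q-learning iterations
    # Step 1: generate data WITH the current policy plus probing noise (on-policy).
    xs, us = [], []
    x = rng.standard_normal((nx, 1))
    for k in range(200):
        u = [-K[j] @ x + 0.1 * rng.standard_normal((nu, 1)) for j in range(n)]
        xs.append(x); us.append(u)
        x = A @ x + sum(B[j] @ u[j] for j in range(n))
    xs.append(x)

    # Step 2 (policy evaluation): least squares on z_k' H_i z_k = r_i + z_{k+1}' H_i z_{k+1},
    # where z_{k+1} uses the inputs actually applied; the probing noise therefore biases H_i.
    H = []
    for i in range(n):
        Phi, y = [], []
        for k in range(199):
            zk  = np.vstack([xs[k]] + us[k]).ravel()
            zk1 = np.vstack([xs[k + 1]] + us[k + 1]).ravel()
            Phi.append(np.kron(zk, zk) - np.kron(zk1, zk1))
            y.append(stage_cost(i, xs[k], us[k]))
        Hi = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)[0].reshape(m, m)
        H.append(0.5 * (Hi + Hi.T))        # keep the symmetric part

    # Step 3 (policy update): partition H_i; each player uses the others' current gains.
    K_new = []
    for i in range(n):
        ri = slice(nx + i * nu, nx + (i + 1) * nu)
        cross = sum(H[i][ri, nx + j * nu: nx + (j + 1) * nu] @ K[j]
                    for j in range(n) if j != i)
        K_new.append(np.linalg.solve(H[i][ri, ri], H[i][ri, :nx] - cross))
    K = K_new

print("on-policy K1 =", K[0])
print("on-policy K2 =", K[1])

Because the evaluation uses the noisy applied inputs on both sides of the Bellman identity, the estimated gains drift with the choice of probing noise; this is the bias that the example later exhibits for the on-policy algorithm in Table 1.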
Note 3: because the strategy Q learning algorithm has deviation, but the non-strategy Q learning algorithm has a plurality of advantages compared with the strategy Q learning algorithm, the deviation can be eliminated. Therefore, the following subsection proposes that the non-strategy Q learning algorithm has a plurality of advantages and can eliminate deviation. Therefore, the following subsection presents a non-strategic Q learning algorithm
A Q-function-based off-policy algorithm is provided; it is a model-free, data-driven algorithm used to solve the non-zero-sum game problem of multiple individuals.
From equation (21), it follows that:

[equation]

where [definition]. Adding auxiliary variables to the system equation (1) gives:

[equation (27)]

in which one set of inputs is the behavior control policies, used to generate the data, and the other is the target control policies of the individuals that need to be learned. When the state trajectory of the system is given by equation (27), one derives:

[equation]

Because the corresponding quantities satisfy relations (14) and (15), it can further be deduced that:

[equation]

which simplifies to:

[equation (29)]

where [definition]. Equation (29) can be rewritten as:

[equation]

where the regression and parameter quantities are constructed from the measured data, with the unknowns including the Q-function matrices and the control gains. Based on the above, the unknown parameters can be obtained in the following form:

[equation]  (32)
Algorithm 2: off-policy Q-learning algorithm
1. Data acquisition: apply the behavior control policies to the system, and collect and store the measured data as a sample set;
2. Initialization: give initial values of the control gains of the multiple individuals such that the system (1) is stable, where the superscript denotes the iteration index;
3. Q-learning step: using the data obtained in step 1, iteratively solve equation (31) to update the values of the Q-function matrices and the control gains;
4. If the difference between successive iterates is smaller than a prescribed very small value, stop; otherwise, increment the iteration index and return to step 3.
Note 4: the solution of equation (31) is equivalent to the solution of equation (21), and it is confirmed that it converges to the optimal solution
Figure 893491DEST_PATH_IMAGE117
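A minimal data-driven Python sketch of Algorithm 2 under the same assumed LQ non-zero-sum form: a fixed behavior policy with probing noise generates one data set, which is then reused at every iteration to evaluate the current target gains by least squares and to update them. Because the target policy, rather than the noisy applied input, appears on the right-hand side of the Bellman identity, the probing noise does not bias the estimate. The example data and symbols are illustrative assumptions, not the patent's simulation example.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-player LQ non-zero-sum game (assumed data, as before).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Q = [np.eye(2), 2.0 * np.eye(2)]
R = [[np.array([[1.0]]), np.array([[0.5]])],
     [np.array([[0.5]]), np.array([[1.0]])]]
nx, nu, n = 2, 1, 2
m = nx + n * nu

def stage_cost(i, x, u):
    return float(x.T @ Q[i] @ x + sum(u[j].T @ R[i][j] @ u[j] for j in range(n)))

# Step 1: data acquisition with a fixed behavior policy plus probing noise (done once).
Kb = [np.zeros((nu, nx)) for _ in range(n)]       # behavior gains
data = []                                         # (x_k, u_k, x_{k+1}) samples
x = rng.standard_normal((nx, 1))
for k in range(400):
    u = [-Kb[j] @ x + 0.2 * rng.standard_normal((nu, 1)) for j in range(n)]
    x_next = A @ x + sum(B[j] @ u[j] for j in range(n))
    data.append((x, u, x_next))
    x = x_next

# Steps 2-4: iterate on the SAME stored data, evaluating the current target gains K.
K = [np.zeros((nu, nx)) for _ in range(n)]        # initial (stabilizing) target gains
for it in range(20):
    H = []
    for i in range(n):
        Phi, y = [], []
        for (xk, uk, xk1) in data:
            zk  = np.vstack([xk] + uk).ravel()                                # behavior data
            zk1 = np.vstack([xk1] + [-K[j] @ xk1 for j in range(n)]).ravel()  # target policy
            Phi.append(np.kron(zk, zk) - np.kron(zk1, zk1))
            y.append(stage_cost(i, xk, uk))
        Hi = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)[0].reshape(m, m)
        H.append(0.5 * (Hi + Hi.T))

    K_new = []
    for i in range(n):
        ri = slice(nx + i * nu, nx + (i + 1) * nu)
        cross = sum(H[i][ri, nx + j * nu: nx + (j + 1) * nu] @ K[j]
                    for j in range(n) if j != i)
        K_new.append(np.linalg.solve(H[i][ri, ri], H[i][ri, :nx] - cross))
    if max(float(np.max(np.abs(K_new[i] - K[i]))) for i in range(n)) < 1e-10:
        K = K_new
        break                                      # stopping rule of step 4
    K = K_new

print("off-policy K1 =", K[0])
print("off-policy K2 =", K[1])

Re-running this sketch with a different probing-noise amplitude or waveform leaves the learned gains essentially unchanged, which mirrors the robustness to probing noise that the example attributes to the off-policy algorithm in Table 1; the gains can also be checked against the model-based baseline sketched after Note 1.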
4. Example simulation
In this section, an example is given of a program simulation using the newly proposed algorithm to demonstrate the effectiveness of the algorithm and the convergence of the data.
Consider a linear discrete-time system in which two individuals play a non-zero-sum game.
[system matrices of the example]

with sampling time [value]. The weighting matrices of the two individuals and the remaining design parameters are selected as [values]. First, the optimal value-function equation and the corresponding optimal Q-function equation, the associated kernel matrices, and the optimal control gains of the two agents are obtained; the true solution is computed with an iterative algorithm that depends on the system model:

[true-solution matrices and gains]
It is well known that probing noise must be added to guarantee a sufficient excitation condition so that equation (16) can be solved accurately. The probing noise added by the invention is of the following three types:
the first scheme is as follows:
Figure 411114DEST_PATH_IMAGE131
scheme II:
Figure 974951DEST_PATH_IMAGE132
the third scheme is as follows:
Figure 96491DEST_PATH_IMAGE133
wherein
Figure 403975DEST_PATH_IMAGE134
Giving control gains under three detection noises
Figure 512482DEST_PATH_IMAGE135
And
Figure 309536DEST_PATH_IMAGE136
table of values of (a).
Table 1: three game states under detection noise
Figure 856055DEST_PATH_IMAGE137
It can be seen from Table 1 that the off-policy Q-learning algorithm is not affected by the probing-noise disturbance, whereas the on-policy Q-learning algorithm is affected by it to a relatively large degree. This demonstrates the effectiveness of the off-policy Q-learning algorithm.
FIG. 1 and FIG. 2 show, under the on-policy Q-learning algorithm, the convergence of the Q-function matrices H1 and H2 and of the control gains K1 and K2, respectively. FIG. 3, FIG. 4 and FIG. 5 are convergence plots of the system state x under the three different probing noises for the on-policy Q-learning algorithm. FIG. 6 and FIG. 7 show, under the off-policy Q-learning algorithm, the convergence of the Q-function matrices H1 and H2 and of the control gains K1 and K2, respectively. FIG. 8 is a convergence plot of the system state x under the off-policy Q-learning algorithm.

Claims (2)

1. A multi-body optimization control method based on non-strategy Q learning, characterized in that the method first formulates a non-zero-sum game optimization problem and strictly proves that the value function defined from each individual's performance index is linear quadratic; it then presents a non-strategy (off-policy) Q-learning algorithm based on dynamic programming and Q-learning, obtains an approximate optimal solution of the non-zero-sum game, and achieves global Nash equilibrium of the system; the algorithm does not require the parameters of the system model to be known and can learn the Nash equilibrium solution entirely from measurable data; finally, a numerical simulation verifies the effectiveness of the method;
the method comprises the following specific steps:
1) describe the discrete-time linear non-zero-sum game problem, and prove that each individual's value function is linear quadratic;
2) solve the non-zero-sum game, and present the non-strategy Q-learning algorithm;
3) perform an example simulation, namely provide an example and carry out a program simulation with the newly proposed algorithm, to demonstrate the effectiveness of the algorithm and the convergence of the data;
the non-strategy Q learning algorithm gives out a strategy Q learning algorithm without a model,
Figure DEST_PATH_IMAGE001
(19)
q function matrix in learning equation (19)
Figure 948472DEST_PATH_IMAGE002
Thereby obtaining the optimal control gains of a plurality of individuals;
wherein
Figure 406610DEST_PATH_IMAGE003
to prove that the individual value function is linear quadratic, the following linear discrete-time system equation is considered:

[equation]  (1)

where the control inputs enter the dynamics through matrices of appropriate dimensions; a state-feedback controller is designed so that each individual i minimizes its own performance index:

[equation]  (2)

where [the weighting matrices are as given] and

[gain matrix]

is the controller gain matrix;

an n-tuple of policies

[policy n-tuple]

is sought that constitutes the Nash equilibrium of the n-player game in normal form;
the non-policy Q learning algorithm
1) Data acquisition: collecting and storing data
Figure 968731DEST_PATH_IMAGE015
And
Figure 911279DEST_PATH_IMAGE016
a sample set stored for data collection;
2) initialization: giving initial values of control gains for a plurality of bodies
Figure 224580DEST_PATH_IMAGE017
And must be made systematic
Figure 520563DEST_PATH_IMAGE018
Can be stable; wherein
Figure 930816DEST_PATH_IMAGE019
Figure 979019DEST_PATH_IMAGE020
Is an iteration index;
3) implementing a Q learning algorithm: using the data from the first step, iteratively solving by an algorithm
Figure 904249DEST_PATH_IMAGE021
Update
Figure 66240DEST_PATH_IMAGE022
A value of (d);
4) if it is not
Figure 65420DEST_PATH_IMAGE023
The process is stopped and the process is stopped,
Figure 287454DEST_PATH_IMAGE024
is a very small number; if not, then the mobile terminal can be switched to the normal mode,
Figure 637664DEST_PATH_IMAGE025
and returning to the third step;
wherein the content of the first and second substances,
Figure 665663DEST_PATH_IMAGE026
Figure 784929DEST_PATH_IMAGE027
Figure 912285DEST_PATH_IMAGE028
and is and
Figure 749791DEST_PATH_IMAGE029
Figure 315901DEST_PATH_IMAGE030
and is made of
Figure 552323DEST_PATH_IMAGE031
Figure 850580DEST_PATH_IMAGE032
Figure 237699DEST_PATH_IMAGE033
In order to be a matrix of the Q function,
Figure 545184DEST_PATH_IMAGE034
Figure 373463DEST_PATH_IMAGE035
Figure 108200DEST_PATH_IMAGE036
said
Figure 530086DEST_PATH_IMAGE037
Wherein the content of the first and second substances,
Figure 703578DEST_PATH_IMAGE038
Figure 586696DEST_PATH_IMAGE039
2. the method according to claim 1, wherein the example simulation proves the effectiveness of the non-strategic Q learning algorithm.
CN201910352788.5A 2019-04-29 2019-04-29 Multi-body optimization control method based on non-strategy Q learning Active CN110083063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910352788.5A CN110083063B (en) 2019-04-29 2019-04-29 Multi-body optimization control method based on non-strategy Q learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910352788.5A CN110083063B (en) 2019-04-29 2019-04-29 Multi-body optimization control method based on non-strategy Q learning

Publications (2)

Publication Number Publication Date
CN110083063A CN110083063A (en) 2019-08-02
CN110083063B (en) 2022-08-12

Family

ID=67417405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910352788.5A Active CN110083063B (en) 2019-04-29 2019-04-29 Multi-body optimization control method based on non-strategy Q learning

Country Status (1)

Country Link
CN (1) CN110083063B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782011B (en) * 2019-10-21 2023-11-24 辽宁石油化工大学 Distributed optimization control method of networked multi-agent system based on reinforcement learning
CN111882101A (en) * 2020-05-25 2020-11-03 北京信息科技大学 Control method based on supply chain system consistency problem under switching topology
CN111624882B (en) * 2020-06-01 2023-04-18 北京信息科技大学 Zero and differential game processing method for supply chain system based on reverse-thrust design method
CN112180730B (en) * 2020-10-10 2022-03-01 中国科学技术大学 Hierarchical optimal consistency control method and device for multi-agent system
CN112947084B (en) * 2021-02-08 2022-09-23 重庆大学 Model unknown multi-agent consistency control method based on reinforcement learning
CN113364386B (en) * 2021-05-26 2023-03-21 潍柴动力股份有限公司 H-infinity current control method and system based on reinforcement learning of permanent magnet synchronous motor
CN114200834B (en) * 2021-11-30 2023-06-30 辽宁石油化工大学 Optimal tracking control method for model-free off-track strategy in batch process in packet loss environment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107949025A (en) * 2017-11-02 2018-04-20 南京南瑞集团公司 A kind of network selecting method based on non-cooperative game
CN109121105A (en) * 2018-09-17 2019-01-01 河海大学 Operator's competition slice intensified learning method based on Markov Game

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107949025A (en) * 2017-11-02 2018-04-20 南京南瑞集团公司 A kind of network selecting method based on non-cooperative game
CN109121105A (en) * 2018-09-17 2019-01-01 河海大学 Operator's competition slice intensified learning method based on Markov Game

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
H∞ Control for Discrete-time Linear Systems by Integrating Off-policy Q-learning and Zero-sum Game; Jinna Li et al.; ICCA; 2018-08-23; pp. 817-822 *
Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles; Warren Dixon; IEEE; 2014-05-14; pp. 17-18, 195-235 *
Warren Dixon. Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. IEEE, 2014, pp. 17-18, 195-235. *
Dynamic policy reinforcement learning algorithm in hybrid multi-agent environments (混合多Agent环境下动态策略强化学习算法); Xiao Zheng et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 2009-07-31; Vol. 30, No. 7; full text *

Also Published As

Publication number Publication date
CN110083063A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN110083063B (en) Multi-body optimization control method based on non-strategy Q learning
Fu et al. Online solution of two-player zero-sum games for continuous-time nonlinear systems with completely unknown dynamics
Chen et al. Generalized Hamilton–Jacobi–Bellman formulation-based neural network control of affine nonlinear discrete-time systems
Wu et al. Fuzzy adaptive event-triggered control for a class of uncertain nonaffine nonlinear systems with full state constraints
CN107272403A (en) A kind of PID controller parameter setting algorithm based on improvement particle cluster algorithm
CN110083064B (en) Network optimal tracking control method based on non-strategy Q-learning
Nikdel et al. Improved Takagi–Sugeno fuzzy model-based control of flexible joint robot via Hybrid-Taguchi genetic algorithm
CN101390024A (en) Operation control method, operation control device and operation control system
Zhao et al. Neural network-based fixed-time sliding mode control for a class of nonlinear Euler-Lagrange systems
Hashemi et al. Integrated fault estimation and fault tolerant control for systems with generalized sector input nonlinearity
Mu et al. An ADDHP-based Q-learning algorithm for optimal tracking control of linear discrete-time systems with unknown dynamics
CN113325717B (en) Optimal fault-tolerant control method, system, processing equipment and storage medium based on interconnected large-scale system
CN116661307A (en) Nonlinear system actuator fault PPB-SIADP fault-tolerant control method
CN114839880A (en) Self-adaptive control method based on flexible joint mechanical arm
Zong et al. Input-to-state stability-modular command filtered back-stepping control of strict-feedback systems
CN111624882B (en) Zero and differential game processing method for supply chain system based on reverse-thrust design method
CN116880191A (en) Intelligent control method of process industrial production system based on time sequence prediction
Vamvoudakis et al. Non-zero sum games: Online learning solution of coupled Hamilton-Jacobi and coupled Riccati equations
Gao et al. Robust resilient control for parametric strict feedback systems with prescribed output and virtual tracking errors
CN113485099B (en) Online learning control method of nonlinear discrete time system
CN112346342B (en) Single-network self-adaptive evaluation design method of non-affine dynamic system
CN108181808B (en) System error-based parameter self-tuning method for MISO partial-format model-free controller
Wakitani et al. Design of a cmac-based pid controller using operating data
WO2019086243A1 (en) Randomized reinforcement learning for control of complex systems
CN108803314A (en) A kind of NEW TYPE OF COMPOSITE tracking and controlling method of Chemical Batch Process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190802

Assignee: Liaoning Hengyi special material Co.,Ltd.

Assignor: Liaoning Petrochemical University

Contract record no.: X2023210000276

Denomination of invention: A Multi individual Optimization Control Method Based on Non Policy Q-Learning

Granted publication date: 20220812

License type: Common License

Record date: 20231130
