CN110083063B - Multi-body optimization control method based on non-strategy Q learning - Google Patents


Info

Publication number
CN110083063B
CN110083063B
Authority
CN
China
Prior art keywords
learning
strategy
game
algorithm
zero
Prior art date
Legal status
Active
Application number
CN201910352788.5A
Other languages
Chinese (zh)
Other versions
CN110083063A (en)
Inventor
Li Jinna (李金娜)
Xiao Zhenfei (肖振飞)
Current Assignee
Liaoning Shihua University
Original Assignee
Liaoning Shihua University
Priority date
Filing date
Publication date
Application filed by Liaoning Shihua University filed Critical Liaoning Shihua University
Priority to CN201910352788.5A
Publication of CN110083063A
Application granted
Publication of CN110083063B

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B 13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B 13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a multi-body optimization control method based on non-strategy (off-policy) Q-learning, relates to an optimization control method, and proposes an off-policy Q-learning algorithm for the discrete-time linear non-zero-sum game problem. First, the non-zero-sum game optimization problem is formulated, and it is strictly proven that the value function defined from each individual's performance index is linear quadratic. Then, based on dynamic programming and Q-learning, an off-policy Q-learning algorithm is given that obtains an approximate optimal solution of the non-zero-sum game and achieves global Nash equilibrium of the system. Finally, a numerical simulation verifies the effectiveness of the method. The invention is used to solve the multi-body non-zero-sum game problem of linear discrete-time systems, and the effectiveness of the algorithm is verified by simulation; the invention integrates game theory with off-policy Q-learning and, within the non-zero-sum game framework, proposes an off-policy Q-learning algorithm that learns the optimal control strategies and achieves global Nash equilibrium of the entire system.

Description

Multi-body optimization control method based on non-strategy Q learning
Technical Field
The invention relates to an optimization control method, in particular to a multi-body optimization control method based on non-strategy Q learning.
Background
Adaptive dynamic programming (ADP) is a method for obtaining approximate optimal solutions and is widely used in present-day optimal control. It iteratively approximates the solution of the Hamilton-Jacobi-Bellman equation to obtain a near-optimal solution for the system. A large body of literature studies the optimal control of model-free systems with adaptive dynamic programming, for example: adaptive optimal control of continuous-time linear systems with completely unknown dynamics; H-infinity control of data-driven nonlinear distributed-parameter systems; constrained adaptive dynamic programming algorithms and their stability; data-driven policy-gradient adaptive dynamic programming for optimal control; and adaptive dynamic programming for optimal tracking control of an unknown nonlinear coal-gasification process. Learning optimal control strategies with reinforcement learning is also widely applied to optimize system performance while satisfying control-input constraints and meeting given performance specifications, for example: feedback control under reinforcement learning and approximate dynamic programming; reinforcement-learning-based feedback control that designs adaptive optimal controllers using natural decision methods; reinforcement-learning-based linear quadratic tracking control of partially unknown continuous-time systems; and integral-reinforcement-learning-based optimal tracking control of partially unknown nonlinear systems with input constraints.
An off-policy learning method learns an optimal control strategy without relying on a model, using only collected system data in the iterative updates. Compared with on-policy learning it has three significant advantages: 1) it overcomes insufficient exploration of the system; 2) it does not interfere with the operation of the system during learning, and the disturbance input need not be updated in a prescribed manner; 3) provided the persistent excitation condition is satisfied, the exact solution is obtained without bias even when probing noise is added to the system input. It is noted that much of the literature adopts on-policy learning to study optimal control of a system. Works that adopt off-policy learning include: the H-infinity control problem of continuous-time systems via model-free off-policy reinforcement learning; optimal operational control of dual-time-scale industrial processes via off-policy reinforcement learning; the H-infinity control problem of linear discrete-time systems under off-policy reinforcement learning; H-infinity control design via off-policy reinforcement learning; and optimal control of affine nonlinear discrete-time systems via off-policy interleaved Q-learning. Multi-agent cooperative control systems, that is, dynamic systems with multiple decision makers and multiple control inputs, are widespread in modern production and society. In a non-zero-sum game, each agent must adopt an optimal control strategy to optimize its own performance index. Works that solve game problems with off-policy learning include off-policy reinforcement learning for multi-agent graphical synchronization games, the optimal control of two-player unknown systems with disturbances under off-policy learning, and so on.
Researchers have already applied off-policy Q-learning to the optimal adaptive control of a single system. Whether off-policy Q-learning can be used to study the optimal control problem of games among multiple systems, and how to design an off-policy Q-learning method that achieves Nash equilibrium of multiple systems when the system model is unknown, are the questions addressed by the present invention; they have not been reported in the related literature.
The invention aims to provide a multi-body optimization control method based on non-strategy (off-policy) Q-learning. An off-policy Q-learning method is proposed to solve the multi-body non-zero-sum game problem of linear discrete-time systems, and its effectiveness is verified by simulation. The invention integrates game theory with off-policy Q-learning and, within the non-zero-sum game framework, proposes an off-policy Q-learning algorithm that learns the optimal control strategies and achieves global Nash equilibrium of the entire system.
The purpose of the invention is realized by the following technical scheme:
a multi-body optimization control method based on non-strategy Q learning firstly provides a non-zero and game optimization problem, strictly proves that a value function defined according to individual performance indexes is a linear quadratic form, then provides a non-strategy Q learning algorithm based on a dynamic programming and Q learning method, obtains an approximate optimal solution of the non-zero and game, and realizes the global Nash balance of a system; the algorithm does not require that the parameters of the system model are known, and can completely utilize measurable data to learn Nash equilibrium solution; finally, calculating the effectiveness of the simulation verification method;
the method comprises the following specific steps:
1) describe the discrete-time linear non-zero-sum game problem, and prove that each individual's value function is linear quadratic;
2) solve the non-zero-sum game, and present the off-policy Q-learning algorithm;
3) perform an example simulation, i.e., a program simulation with the newly proposed algorithm, to demonstrate the effectiveness of the algorithm and the convergence of the data.
According to the multi-body optimization control method based on non-strategy Q learning, to prove that the individual value function is linear quadratic, the following linear discrete-time system equation is considered:

[equation (1)]

where the system state evolves under the control inputs of the individuals, with system matrices of appropriate dimensions. A state-feedback controller is designed so that each individual i minimizes its own performance index:

[equation (2)]

where the performance index is quadratic in the state and in the control inputs, with given weighting matrices.
the multi-body optimization control method based on non-strategy Q learning comprises the steps of providing a model-free strategy Q learning algorithm,
Figure 4790DEST_PATH_IMAGE012
the Q function matrix H in equation (19) is learned to obtain optimum control gains for a plurality of individuals.
In the multi-body optimization control method based on non-strategy Q learning, the effectiveness of the off-policy Q-learning algorithm is demonstrated by the example simulation.
The invention has the advantages and effects that:
the invention integrates game theory and non-strategy Q learning method, provides the non-strategy Q learning method under the framework of non-zero and game, learns the optimal control strategy, and realizes the global Nash of the whole system
Figure 972746DEST_PATH_IMAGE013
And (4) equalizing. Firstly, defining controllers of a plurality of intelligent agents through dynamic programming, and then obtaining a game Bellman equation based on a non-policy Q function
Figure 6562DEST_PATH_IMAGE013
Figure 452586DEST_PATH_IMAGE014
Obtaining a non-strategy Q learning method, and finally verifying the effectiveness of the method by an algorithm.
Drawings
FIG. 1 shows the convergence of H under the on-policy Q-learning method;
FIG. 2 shows the convergence of K under the on-policy Q-learning method;
FIG. 3 shows the system state x for the first probing-noise scheme under the on-policy Q-learning method;
FIG. 4 shows the system state x for the second probing-noise scheme under the on-policy Q-learning method;
FIG. 5 shows the system state x for the third probing-noise scheme under the on-policy Q-learning method;
FIG. 6 shows the convergence of H under the off-policy Q-learning method;
FIG. 7 shows the convergence of K under the off-policy Q-learning method;
FIG. 8 shows the system state x under the off-policy Q-learning method.
Detailed Description
The present invention will be described in detail with reference to examples.
1. Problem formulation
The discrete-time linear non-zero-sum game problem is first described, and then the value function of each individual is proven to be linear quadratic.
Consider the following linear discrete-time system equation:

[equation (1)]

where the system state evolves under the control inputs of the n individuals, with system matrices of appropriate dimensions. A state-feedback controller is designed so that each individual i minimizes its own performance index:

[equation (2)]

where the performance index is quadratic in the state and in the control inputs, with given weighting matrices.

Problem 1: minimize the performance indices

[equation (3)]

subject to the constraints

[equation]

According to the performance index (3), the optimal value function and the optimal Q-function of each individual are defined respectively as

[equation]

and

[equation]

so that the relationship between the two is

[equation]

Theorem 1: For game problem 1, if the control inputs are admissible controls, the optimal value function and the optimal Q-function can be expressed as quadratic forms as follows:

[equation]

and

[equation]

where the kernel matrices of the quadratic forms satisfy

[equation]

Proof:

[equation]

where [definition]. Further,

[equation]  (11)

and

[equation]

where [definition]. Furthermore,

[equation]

and

[equation], [equation].

From equations (6), (12) and (13) it follows that

[equation]

where [definition]. This completes the proof.
2. Solving the non-zero-sum game

The invention mainly proposes an off-policy Q-learning method. It is well known that the basis of a game is Nash equilibrium.

Definition 1 (Nash equilibrium): If, for all admissible policies, the following n inequalities are satisfied,

[inequalities]

then this n-tuple of policies

[policy n-tuple]

constitutes a Nash equilibrium of the n-player game in normal form, and the corresponding n-tuple of values

[value n-tuple]

is the Nash equilibrium outcome of this n-player game.
From equations (5) and (6), the following Bellman equation based on the Q-function is obtained according to dynamic programming:

[equation]

Then, taking partial derivatives of the game Bellman equation of the optimal Q-function, the optimal control gain of each individual can be obtained:

[equation]

from which one obtains:

[equation]

Substituting the gain expression from equation (18) into the Riccati equation yields the equation satisfied by the optimal Q-function:

[equation]

It has been demonstrated in the literature that

[condition]

ensures that system (1) attains Nash equilibrium:

[equation]

[equation]  (20)
Note 1: it can be seen from the equations (18) and (20) that neither the bellman equation nor the ricatt equation for the optimal Q function, in which the matrices H are coupled to each other and the K values are also coupled to each other, is well solved. Therefore, the strategy Q learning algorithm is given below.
2.1 On-policy Q-learning algorithm
A model-free on-policy Q-learning algorithm is given below; it learns the Q-function matrix H in equation (19) to obtain the optimal control gains of the multiple individuals.
Algorithm 1: on-policy Q-learning algorithm
1. Initialization: give initial values of the control gains of the multiple individuals, where the superscript denotes the iteration index;
2. Policy evaluation: solve for the Q-function matrices in the Q-function Bellman equation:

[equation (21)]

3. Policy update:

[equation]

where the control gain term of the i-th individual can be expressed as:

[equation]  (23)

4. Stop the iteration when the difference between successive Q-function matrices is smaller than a prescribed small positive threshold.
Note 2: in policy updating, find
Figure 405024DEST_PATH_IMAGE070
And further can find
Figure 561199DEST_PATH_IMAGE071
Thereby can find
Figure 295937DEST_PATH_IMAGE072
. Namely:
Figure 904773DEST_PATH_IMAGE073
when in use
Figure 750369DEST_PATH_IMAGE074
When the time is about to be infinite,
Figure 761050DEST_PATH_IMAGE075
tend to be
Figure 666690DEST_PATH_IMAGE076
Which is
Figure 762821DEST_PATH_IMAGE077
Converge on
Figure 412109DEST_PATH_IMAGE078
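A minimal Python sketch of the on-policy loop of Algorithm 1, under the same standard LQ non-zero-sum form assumed in the earlier sketches: at each iteration, data are generated with the current gains plus probing noise, each player's Q-function matrix H_i is estimated by least squares from the Bellman identity, and the gains are updated from the partitioned H_i. The example data, dimensions and probing noise are illustrative assumptions; as Note 3 points out, the probing noise biases this on-policy estimate, which is what motivates the off-policy algorithm of the next subsection.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-player LQ non-zero-sum game (assumed data, as before).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Q = [np.eye(2), 2.0 * np.eye(2)]
R = [[np.array([[1.0]]), np.array([[0.5]])],
     [np.array([[0.5]]), np.array([[1.0]])]]
nx, nu, n = 2, 1, 2
m = nx + n * nu                            # dimension of z = [x; u1; u2]

def stage_cost(i, x, u):
    return float(x.T @ Q[i] @ x + sum(u[j].T @ R[i][j] @ u[j] for j in range(n)))

K = [np.zeros((nu, nx)) for _ in range(n)] # initial (stabilizing) gains, u_i = -K_i x

for it in range(10):                       # on-policy Q-learning iterations
    # Step 1: generate data WITH the current policy plus probing noise (on-policy).
    xs, us = [], []
    x = rng.standard_normal((nx, 1))
    for k in range(200):
        u = [-K[j] @ x + 0.1 * rng.standard_normal((nu, 1)) for j in range(n)]
        xs.append(x); us.append(u)
        x = A @ x + sum(B[j] @ u[j] for j in range(n))
    xs.append(x)

    # Step 2 (policy evaluation): least squares on z_k' H_i z_k = r_i + z_{k+1}' H_i z_{k+1},
    # where z_{k+1} uses the inputs actually applied; the probing noise therefore biases H_i.
    H = []
    for i in range(n):
        Phi, y = [], []
        for k in range(199):
            zk  = np.vstack([xs[k]] + us[k]).ravel()
            zk1 = np.vstack([xs[k + 1]] + us[k + 1]).ravel()
            Phi.append(np.kron(zk, zk) - np.kron(zk1, zk1))
            y.append(stage_cost(i, xs[k], us[k]))
        Hi = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)[0].reshape(m, m)
        H.append(0.5 * (Hi + Hi.T))        # keep the symmetric part

    # Step 3 (policy update): partition H_i; each player uses the others' current gains.
    K_new = []
    for i in range(n):
        ri = slice(nx + i * nu, nx + (i + 1) * nu)
        cross = sum(H[i][ri, nx + j * nu: nx + (j + 1) * nu] @ K[j]
                    for j in range(n) if j != i)
        K_new.append(np.linalg.solve(H[i][ri, ri], H[i][ri, :nx] - cross))
    K = K_new

print("on-policy K1 =", K[0])
print("on-policy K2 =", K[1])

Because the evaluation uses the noisy applied inputs on both sides of the Bellman identity, the estimated gains drift with the choice of probing noise; this is the bias that the example later exhibits for the on-policy algorithm in Table 1.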
Note 3: because the strategy Q learning algorithm has deviation, but the non-strategy Q learning algorithm has a plurality of advantages compared with the strategy Q learning algorithm, the deviation can be eliminated. Therefore, the following subsection proposes that the non-strategy Q learning algorithm has a plurality of advantages and can eliminate deviation. Therefore, the following subsection presents a non-strategic Q learning algorithm
A Q-function-based off-policy algorithm is provided; it is a model-free, data-driven algorithm used to solve the non-zero-sum game problem of multiple individuals.
From equation (21), it follows that:

[equation]

where [definition]. Adding auxiliary variables to the system equation (1) gives:

[equation (27)]

in which one set of inputs is the behavior control policies, used to generate the data, and the other is the target control policies of the individuals that need to be learned. When the state trajectory of the system is given by equation (27), one derives:

[equation]

Because the corresponding quantities satisfy relations (14) and (15), it can further be deduced that:

[equation]

which simplifies to:

[equation (29)]

where [definition]. Equation (29) can be rewritten as:

[equation]

where the regression and parameter quantities are constructed from the measured data, with the unknowns including the Q-function matrices and the control gains. Based on the above, the unknown parameters can be obtained in the following form:

[equation]  (32)
Algorithm 2: off-policy Q-learning algorithm
1. Data acquisition: apply the behavior control policies to the system, and collect and store the measured data as a sample set;
2. Initialization: give initial values of the control gains of the multiple individuals such that the system (1) is stable, where the superscript denotes the iteration index;
3. Q-learning step: using the data obtained in step 1, iteratively solve equation (31) to update the values of the Q-function matrices and the control gains;
4. If the difference between successive iterates is smaller than a prescribed very small value, stop; otherwise, increment the iteration index and return to step 3.
Note 4: the solution of equation (31) is equivalent to the solution of equation (21), and it is confirmed that it converges to the optimal solution
Figure 893491DEST_PATH_IMAGE117
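A minimal data-driven Python sketch of Algorithm 2 under the same assumed LQ non-zero-sum form: a fixed behavior policy with probing noise generates one data set, which is then reused at every iteration to evaluate the current target gains by least squares and to update them. Because the target policy, rather than the noisy applied input, appears on the right-hand side of the Bellman identity, the probing noise does not bias the estimate. The example data and symbols are illustrative assumptions, not the patent's simulation example.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-player LQ non-zero-sum game (assumed data, as before).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Q = [np.eye(2), 2.0 * np.eye(2)]
R = [[np.array([[1.0]]), np.array([[0.5]])],
     [np.array([[0.5]]), np.array([[1.0]])]]
nx, nu, n = 2, 1, 2
m = nx + n * nu

def stage_cost(i, x, u):
    return float(x.T @ Q[i] @ x + sum(u[j].T @ R[i][j] @ u[j] for j in range(n)))

# Step 1: data acquisition with a fixed behavior policy plus probing noise (done once).
Kb = [np.zeros((nu, nx)) for _ in range(n)]       # behavior gains
data = []                                         # (x_k, u_k, x_{k+1}) samples
x = rng.standard_normal((nx, 1))
for k in range(400):
    u = [-Kb[j] @ x + 0.2 * rng.standard_normal((nu, 1)) for j in range(n)]
    x_next = A @ x + sum(B[j] @ u[j] for j in range(n))
    data.append((x, u, x_next))
    x = x_next

# Steps 2-4: iterate on the SAME stored data, evaluating the current target gains K.
K = [np.zeros((nu, nx)) for _ in range(n)]        # initial (stabilizing) target gains
for it in range(20):
    H = []
    for i in range(n):
        Phi, y = [], []
        for (xk, uk, xk1) in data:
            zk  = np.vstack([xk] + uk).ravel()                                # behavior data
            zk1 = np.vstack([xk1] + [-K[j] @ xk1 for j in range(n)]).ravel()  # target policy
            Phi.append(np.kron(zk, zk) - np.kron(zk1, zk1))
            y.append(stage_cost(i, xk, uk))
        Hi = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)[0].reshape(m, m)
        H.append(0.5 * (Hi + Hi.T))

    K_new = []
    for i in range(n):
        ri = slice(nx + i * nu, nx + (i + 1) * nu)
        cross = sum(H[i][ri, nx + j * nu: nx + (j + 1) * nu] @ K[j]
                    for j in range(n) if j != i)
        K_new.append(np.linalg.solve(H[i][ri, ri], H[i][ri, :nx] - cross))
    if max(float(np.max(np.abs(K_new[i] - K[i]))) for i in range(n)) < 1e-10:
        K = K_new
        break                                      # stopping rule of step 4
    K = K_new

print("off-policy K1 =", K[0])
print("off-policy K2 =", K[1])

Re-running this sketch with a different probing-noise amplitude or waveform leaves the learned gains essentially unchanged, which mirrors the robustness to probing noise that the example attributes to the off-policy algorithm in Table 1; the gains can also be checked against the model-based baseline sketched after Note 1.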
4. Example simulation
In this section, an example is given of a program simulation using the newly proposed algorithm to demonstrate the effectiveness of the algorithm and the convergence of the data.
Consider a linear discrete-time system in which two individuals play a non-zero-sum game.
[system matrices of the example]

with sampling time [value]. The weighting matrices of the two individuals and the remaining design parameters are selected as [values]. First, the optimal value-function equation and the corresponding optimal Q-function equation, the associated kernel matrices, and the optimal control gains of the two agents are obtained; the true solution is computed with an iterative algorithm that depends on the system model:

[true-solution matrices and gains]
It is well known that probing noise must be added to guarantee a sufficient excitation condition so that equation (16) can be solved accurately. The probing noise added by the invention is of the following three types:
the first scheme is as follows:
Figure 411114DEST_PATH_IMAGE131
scheme II:
Figure 974951DEST_PATH_IMAGE132
the third scheme is as follows:
Figure 96491DEST_PATH_IMAGE133
wherein
Figure 403975DEST_PATH_IMAGE134
Giving control gains under three detection noises
Figure 512482DEST_PATH_IMAGE135
And
Figure 309536DEST_PATH_IMAGE136
table of values of (a).
Table 1: three game states under detection noise
Figure 856055DEST_PATH_IMAGE137
It can be seen from Table 1 that the off-policy Q-learning algorithm is not affected by the probing-noise disturbance, whereas the on-policy Q-learning algorithm is affected by it to a relatively large degree. This demonstrates the effectiveness of the off-policy Q-learning algorithm.
FIG. 1 and FIG. 2 show, under the on-policy Q-learning algorithm, the convergence of the Q-function matrices H1 and H2 and of the control gains K1 and K2, respectively. FIG. 3, FIG. 4 and FIG. 5 are convergence plots of the system state x under the three different probing noises for the on-policy Q-learning algorithm. FIG. 6 and FIG. 7 show, under the off-policy Q-learning algorithm, the convergence of the Q-function matrices H1 and H2 and of the control gains K1 and K2, respectively. FIG. 8 is a convergence plot of the system state x under the off-policy Q-learning algorithm.

Claims (2)

1. A multi-body optimization control method based on non-strategy Q learning, characterized in that the method first formulates a non-zero-sum game optimization problem and strictly proves that the value function defined from each individual's performance index is linear quadratic; it then presents a non-strategy (off-policy) Q-learning algorithm based on dynamic programming and Q-learning, obtains an approximate optimal solution of the non-zero-sum game, and achieves global Nash equilibrium of the system; the algorithm does not require the parameters of the system model to be known and can learn the Nash equilibrium solution entirely from measurable data; finally, a numerical simulation verifies the effectiveness of the method;
the method comprises the following specific steps:
1) describe the discrete-time linear non-zero-sum game problem, and prove that each individual's value function is linear quadratic;
2) solve the non-zero-sum game, and present the non-strategy Q-learning algorithm;
3) perform an example simulation, namely provide an example and carry out a program simulation with the newly proposed algorithm, to demonstrate the effectiveness of the algorithm and the convergence of the data;
the non-strategy Q learning algorithm gives out a strategy Q learning algorithm without a model,
Figure DEST_PATH_IMAGE001
(19)
q function matrix in learning equation (19)
Figure 948472DEST_PATH_IMAGE002
Thereby obtaining the optimal control gains of a plurality of individuals;
wherein
Figure 406610DEST_PATH_IMAGE003
to prove that the individual value function is linear quadratic, the following linear discrete-time system equation is considered:

[equation]  (1)

where the control inputs enter the dynamics through matrices of appropriate dimensions; a state-feedback controller is designed so that each individual i minimizes its own performance index:

[equation]  (2)

where [the weighting matrices are as given] and

[gain matrix]

is the controller gain matrix;

an n-tuple of policies

[policy n-tuple]

is sought that constitutes the Nash equilibrium of the n-player game in normal form;
the non-policy Q learning algorithm
1) Data acquisition: collecting and storing data
Figure 968731DEST_PATH_IMAGE015
And
Figure 911279DEST_PATH_IMAGE016
a sample set stored for data collection;
2) initialization: giving initial values of control gains for a plurality of bodies
Figure 224580DEST_PATH_IMAGE017
And must be made systematic
Figure 520563DEST_PATH_IMAGE018
Can be stable; wherein
Figure 930816DEST_PATH_IMAGE019
Figure 979019DEST_PATH_IMAGE020
Is an iteration index;
3) implementing a Q learning algorithm: using the data from the first step, iteratively solving by an algorithm
Figure 904249DEST_PATH_IMAGE021
Update
Figure 66240DEST_PATH_IMAGE022
A value of (d);
4) if it is not
Figure 65420DEST_PATH_IMAGE023
The process is stopped and the process is stopped,
Figure 287454DEST_PATH_IMAGE024
is a very small number; if not, then the mobile terminal can be switched to the normal mode,
Figure 637664DEST_PATH_IMAGE025
and returning to the third step;
wherein the content of the first and second substances,
Figure 665663DEST_PATH_IMAGE026
Figure 784929DEST_PATH_IMAGE027
Figure 912285DEST_PATH_IMAGE028
and is and
Figure 749791DEST_PATH_IMAGE029
Figure 315901DEST_PATH_IMAGE030
and is made of
Figure 552323DEST_PATH_IMAGE031
Figure 850580DEST_PATH_IMAGE032
Figure 237699DEST_PATH_IMAGE033
In order to be a matrix of the Q function,
Figure 545184DEST_PATH_IMAGE034
Figure 373463DEST_PATH_IMAGE035
Figure 108200DEST_PATH_IMAGE036
said
Figure 530086DEST_PATH_IMAGE037
Wherein the content of the first and second substances,
Figure 703578DEST_PATH_IMAGE038
Figure 586696DEST_PATH_IMAGE039
2. the method according to claim 1, wherein the example simulation proves the effectiveness of the non-strategic Q learning algorithm.
CN201910352788.5A 2019-04-29 2019-04-29 Multi-body optimization control method based on non-strategy Q learning Active CN110083063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910352788.5A CN110083063B (en) 2019-04-29 2019-04-29 Multi-body optimization control method based on non-strategy Q learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910352788.5A CN110083063B (en) 2019-04-29 2019-04-29 Multi-body optimization control method based on non-strategy Q learning

Publications (2)

Publication Number Publication Date
CN110083063A CN110083063A (en) 2019-08-02
CN110083063B (en) 2022-08-12

Family

ID=67417405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910352788.5A Active CN110083063B (en) 2019-04-29 2019-04-29 Multi-body optimization control method based on non-strategy Q learning

Country Status (1)

Country Link
CN (1) CN110083063B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782011B (en) * 2019-10-21 2023-11-24 辽宁石油化工大学 Distributed optimization control method of networked multi-agent system based on reinforcement learning
CN111882101A (en) * 2020-05-25 2020-11-03 北京信息科技大学 Control method based on supply chain system consistency problem under switching topology
CN111624882B (en) * 2020-06-01 2023-04-18 北京信息科技大学 Zero and differential game processing method for supply chain system based on reverse-thrust design method
CN112180730B (en) * 2020-10-10 2022-03-01 中国科学技术大学 Hierarchical optimal consistency control method and device for multi-agent system
CN112947084B (en) * 2021-02-08 2022-09-23 重庆大学 Model unknown multi-agent consistency control method based on reinforcement learning
CN113364386B (en) * 2021-05-26 2023-03-21 潍柴动力股份有限公司 H-infinity current control method and system based on reinforcement learning of permanent magnet synchronous motor
CN114200834B (en) * 2021-11-30 2023-06-30 辽宁石油化工大学 Optimal tracking control method for model-free off-track strategy in batch process in packet loss environment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107949025A (en) * 2017-11-02 2018-04-20 南京南瑞集团公司 A kind of network selecting method based on non-cooperative game
CN109121105A (en) * 2018-09-17 2019-01-01 河海大学 Operator's competition slice intensified learning method based on Markov Game

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107949025A (en) * 2017-11-02 2018-04-20 南京南瑞集团公司 A kind of network selecting method based on non-cooperative game
CN109121105A (en) * 2018-09-17 2019-01-01 河海大学 Operator's competition slice intensified learning method based on Markov Game

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
H∞ Control for Discrete-time Linear Systems by Integrating Off-policy Q-learning and Zero-sum Game; Jinna Li et al.; ICCA; 2018-08-23; pp. 817-822 *
Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles; Warren Dixon; IEEE; 2014-05-14; pp. 17-18, 195-235 *
Warren Dixon. Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. IEEE, 2014, pp. 17-18, 195-235. *
Dynamic policy reinforcement learning algorithm in hybrid multi-agent environments (混合多Agent环境下动态策略强化学习算法); Xiao Zheng et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 2009-07-31; Vol. 30, No. 7; full text *

Also Published As

Publication number Publication date
CN110083063A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN110083063B (en) Multi-body optimization control method based on non-strategy Q learning
Fu et al. Online solution of two-player zero-sum games for continuous-time nonlinear systems with completely unknown dynamics
Chen et al. Generalized Hamilton–Jacobi–Bellman formulation-based neural network control of affine nonlinear discrete-time systems
Wu et al. Fuzzy adaptive event-triggered control for a class of uncertain nonaffine nonlinear systems with full state constraints
CN107272403A (en) A kind of PID controller parameter setting algorithm based on improvement particle cluster algorithm
CN110083064B (en) Network optimal tracking control method based on non-strategy Q-learning
Nikdel et al. Improved Takagi–Sugeno fuzzy model-based control of flexible joint robot via Hybrid-Taguchi genetic algorithm
CN101390024A (en) Operation control method, operation control device and operation control system
Zhao et al. Neural network-based fixed-time sliding mode control for a class of nonlinear Euler-Lagrange systems
Hashemi et al. Integrated fault estimation and fault tolerant control for systems with generalized sector input nonlinearity
Mu et al. An ADDHP-based Q-learning algorithm for optimal tracking control of linear discrete-time systems with unknown dynamics
CN113325717B (en) Optimal fault-tolerant control method, system, processing equipment and storage medium based on interconnected large-scale system
CN116661307A (en) Nonlinear system actuator fault PPB-SIADP fault-tolerant control method
CN114839880A (en) Self-adaptive control method based on flexible joint mechanical arm
Zong et al. Input-to-state stability-modular command filtered back-stepping control of strict-feedback systems
CN111624882B (en) Zero and differential game processing method for supply chain system based on reverse-thrust design method
CN116880191A (en) Intelligent control method of process industrial production system based on time sequence prediction
Vamvoudakis et al. Non-zero sum games: Online learning solution of coupled Hamilton-Jacobi and coupled Riccati equations
Gao et al. Robust resilient control for parametric strict feedback systems with prescribed output and virtual tracking errors
CN113485099B (en) Online learning control method of nonlinear discrete time system
CN112346342B (en) Single-network self-adaptive evaluation design method of non-affine dynamic system
CN108181808B (en) System error-based parameter self-tuning method for MISO partial-format model-free controller
Wakitani et al. Design of a cmac-based pid controller using operating data
WO2019086243A1 (en) Randomized reinforcement learning for control of complex systems
CN108803314A (en) A kind of NEW TYPE OF COMPOSITE tracking and controlling method of Chemical Batch Process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190802

Assignee: Liaoning Hengyi special material Co.,Ltd.

Assignor: Liaoning Petrochemical University

Contract record no.: X2023210000276

Denomination of invention: A Multi individual Optimization Control Method Based on Non Policy Q-Learning

Granted publication date: 20220812

License type: Common License

Record date: 20231130
