CN112947084B - Model unknown multi-agent consistency control method based on reinforcement learning - Google Patents


Info

Publication number: CN112947084B (granted publication of CN112947084A)
Application number: CN202110184288.2A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: agent, control, following, equation, optimal
Inventors: 陈刚 (Chen Gang), 林卓龙 (Lin Zhuolong)
Assignee (original and current): Chongqing University
Application filed by Chongqing University
Legal status: Application granted; Expired - Fee Related

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems of the above kind, electric
    • G05B13/04: Adaptive control systems of the above kind, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]


Abstract

The invention relates to a reinforcement-learning-based consensus control method for multi-agent systems with unknown models, and belongs to the field of intelligent control. The design of the adaptive distributed observer in the invention consists of three steps. First, an adaptive distributed observer is designed to estimate the system matrix and the state of the leader system. Second, a method for computing the solution of the observer equation online is provided after the adaptive distributed observer is designed. Third, to eliminate a few extreme cases, the distributed consensus output regulation problem of the system is solved by combining adaptive state feedback and adaptive measurement output feedback control when no follower knows the system matrix of the leader. Based on the estimated state, the controller is designed with a reinforcement-learning-based method, and the optimal solution is obtained by an iterative method, realizing optimal control of the multi-agent system.

Description

Model unknown multi-agent consistency control method based on reinforcement learning
Technical Field
The invention belongs to the field of intelligent control and relates to a reinforcement-learning-based consensus control method for multi-agent systems with unknown models.
Background
Research on the consensus control problem of multi-agent systems dates back to the 1980s, and the earliest related multi-agent work grew out of research on mobile robots. The field of multi-agent consensus control has developed rapidly over the last fifteen years, and many new systems have been proposed, with applications extending from military operations to mobile sensor networks, commercial highways, air transportation, and emergency and disaster relief. However, under constraints on control quality, the distributed optimal consensus problem remains a major challenge in the control field. Distributed consensus of a multi-agent system not only requires that the behaviors of the individual agents agree, but also that the performance index of the whole system be optimized. More precisely, distributed consensus control of multi-agent systems should reach agreement at as low a cost as possible. Leading researchers in multi-agent control have proposed various approaches to the consensus control problem, such as linear quadratic regulation techniques, adaptive learning methods, model predictive control techniques, and fuzzy adaptive dynamic programming.
In recent decades, reinforcement learning (RL) has made it possible to design control protocols without knowledge or identification of the system dynamics, i.e., in a model-free way, and has therefore gained much attention and broad application prospects. Reinforcement learning is motivated by biological systems: it finds the optimal control strategy by optimizing cumulative rewards, interacting with a given unknown environment to learn the best policy that maximizes long-term performance. The RL algorithm is based on the idea that a successful control strategy should be remembered and then, by reinforcing this signal, made more likely to be used again. Since the beginning of reinforcement learning research, RL methods have received much attention in the field of intelligent agents. Most current mainstream reinforcement learning is realized on an actor-critic structure: the critic evaluates the performance of the current policy from measured data, and the actor finds an improved policy using the critic's evaluation. Compared with classical dynamic programming, the reinforcement learning method provides a feasible way to avoid the curse of dimensionality. On the other hand, compared with a traditional adaptive controller, the reinforcement learning method only needs to consider the dynamics of the tracking error, can minimize the transient errors introduced into the system, and at the same time guarantees the stability of the whole system. The main advantage of reinforcement learning (RL) algorithms for solving the optimal control problem is that they can obtain enough data information from the system without solving the system dynamics, and then improve iteratively between the two steps of policy evaluation and policy improvement based on a policy iteration technique.
In research on multi-agent consensus control, the system is often partially unknown, and the whole system achieves behavior consensus by constructing a communication network between the leader and the followers and a communication network among the followers, through which the followers obtain information about the leader's state. In most cases, however, the state of a system cannot be measured directly by sensors, while the input and output of the system can be measured in various ways. A popular approach today is to estimate the state of the system by constructing a full-dimensional observer. Consider, for example, the following simple linear system:

\dot{x} = Ax + Bu, \quad y = Cx

A simulated linear system identical to this system is constructed at the same time:

\dot{\omega} = A\omega + Bu, \quad \gamma = C\omega

where ω and γ are the state and output of the simulated system and serve as estimates for the original system. The error between the simulated system and the original system is defined as e = ω − x. To drive the state estimation error e to 0, it is converted into the measurable output estimation error γ − y: according to the general principle of feedback control, the output estimation error γ − y only needs to be fed back to the state of the simulated system, and the controller is then designed so that the output estimation error approaches 0, at which point the state estimation error also approaches 0. Introducing the output feedback matrix H of the state observer gives the following form:

\dot{\omega} = A\omega + Bu + H(y - \gamma), \quad \gamma = C\omega

Substituting the output equation of the original system and the output equation of the full-dimensional state observer into the state equation of the full-dimensional state observer gives:

\dot{\omega} = (A - HC)\omega + Bu + Hy

where A − HC is the system matrix of the full-dimensional observer. The key problem in designing a full-dimensional state observer is to ensure that the state estimation error approaches 0 under any initial condition, i.e.

\lim_{t \to \infty} \big( \omega(t) - x(t) \big) = 0

From the original system and the full-dimensional observer system one obtains:

\dot{e} = \dot{\omega} - \dot{x} = (A - HC)(\omega - x) = (A - HC)e

Solving the above equation gives

e(t) = e^{(A - HC)(t - t_0)}\, e(t_0), \quad t \ge t_0

From this solution, if x(t_0) = ω(t_0), then x(t) = ω(t) holds for all t; if x(t_0) ≠ ω(t_0), the feedback matrix H only needs to be adjusted so that the eigenvalues of the matrix A − HC have negative real parts, and the state estimation error then decays to zero.
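As a concrete illustration of the full-dimensional observer described above, the following sketch simulates a Luenberger observer for a small linear system; the matrices A, B, C, the gain H and the input signal are illustrative values chosen for this example, not taken from the patent.

```python
import numpy as np

# Illustrative second-order system x' = Ax + Bu, y = Cx (values chosen for this sketch)
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
# Observer gain H chosen so that A - HC is Hurwitz (checked below)
H = np.array([[4.0],
              [3.0]])
assert np.all(np.linalg.eigvals(A - H @ C).real < 0)

dt, T = 0.001, 10.0
x = np.array([[1.0], [-1.0]])   # true state (unknown to the observer)
w = np.zeros((2, 1))            # observer state omega, deliberately wrong initial guess

for k in range(int(T / dt)):
    u = np.array([[np.sin(0.5 * k * dt)]])   # any known input signal
    y = C @ x                                # measured output of the real system
    gamma = C @ w                            # output of the simulated system
    # Euler integration of plant and observer: w' = Aw + Bu + H(y - gamma)
    x = x + dt * (A @ x + B @ u)
    w = w + dt * (A @ w + B @ u + H @ (y - gamma))

print("final estimation error e = w - x:", (w - x).ravel())  # decays toward zero
```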
In a multi-agent system, the state values are difficult to observe accurately in real time. In this design, a corresponding adaptive distributed observer is designed for each follower; then, under the guidance of output regulation theory, the cooperative adaptive optimal output regulation problem is decomposed into a feedforward control design problem, which requires solving the nonlinear regulator equations, and an adaptive optimal feedback control problem. Finally, a cost function of the multi-agent system is established based on reinforcement learning, the HJB (Hamilton-Jacobi-Bellman) equation of this cost function is formulated and solved by a synchronous-reinforcement-learning-based method, and the optimal solution is finally obtained by an iterative method, realizing optimal control of the multi-agent system.
Disclosure of Invention
In view of the above, the present invention provides a reinforcement-learning-based consensus control method for multi-agent systems with unknown models. By establishing a multi-agent input structure based on the model state and designing a corresponding reinforcement learning algorithm to solve the HJB equation, the method addresses the optimal feedback control problem of the multi-agent system and the design of the optimal controller when the leader model is unknown and the follower states are not measurable.
In order to achieve the purpose, the invention provides the following technical scheme:
a model unknown multi-agent consistency control method based on reinforcement learning comprises the following steps:
s1: performing single agent optimal output control based on reinforcement learning;
s2: multi-agent consistency control based on reinforcement learning.
Optionally, the S1 specifically includes:
When designing the optimal controller of a single agent, an off-policy reinforcement learning algorithm is adopted to learn the solution of the tracking HJB equation online. Consider the following system model:

\dot{x} = f(x) + g(x)u + l(x)d    (1-1)

where x and u are the state and the control input of the system, respectively, and d is an external disturbance; it is assumed that f(x), g(x), l(x) are Lipschitz functions and f(0) = 0, so that the system is robustly stable;

assume that p(t) is the reference trajectory (the consistency condition to be achieved) and satisfies the following form:

\dot{p}(t) = h(p(t))    (1-2)

with h(0) = 0; the tracking error of the whole system is defined as:

e_d(t) = x(t) - p(t)    (1-3)

combining (1-1), (1-2) and (1-3) gives:

\dot{e}_d = f(x) + g(x)u + l(x)d - h(p)    (1-4)
the following virtual performance outputs are defined to meet the requirements:
Figure BDA0002942369580000041
defining a performance function for the system:
Figure BDA0002942369580000042
Suppose the cost is minimized at the optimal control input u*; then
Figure BDA0002942369580000043
The following bellman equation is given:
Figure BDA0002942369580000044
wherein,
Figure BDA0002942369580000045
is the augmented system constructed for the design;
According to the optimality conditions
Figure BDA0002942369580000046
And
Figure BDA0002942369580000047
the optimal control input and the optimal disturbance input are obtained:
Figure BDA0002942369580000048
Figure BDA0002942369580000049
where V* is the optimal value function defined in (1-7);
From the optimal inputs obtained in (1-10), the following trajectory-tracking HJB equation is obtained:
Figure BDA00029423695800000410
thus, the following single agent offline RL algorithm is obtained:
solving HJB equation based on RL algorithm
Step 11: initialization: given an admissible stabilizing control policy u_0;
Step 12: policy evaluation: for a control input u_i and a disturbance input d_i, solve the following Bellman equation:
Figure BDA0002942369580000051
Step 13: update the disturbance input d_i of the system:
Figure BDA0002942369580000052
Step 14: update the control input u_i of the system:
Figure BDA0002942369580000053
Step 15: step 11 is re-executed.
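For a linear system with quadratic cost, the policy iteration of steps 11-15 reduces to iterating on Lyapunov equations of the associated zero-sum game. The sketch below illustrates that special case; the matrices A, B, D, Q, R and the attenuation level gamma are example values chosen here, and the patent itself treats the more general nonlinear tracking problem.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# x' = Ax + Bu + Dd, cost integrand x'Qx + u'Ru - gamma^2 d'd (illustrative data)
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = np.array([[1.0]])
gamma = 5.0

K = np.zeros((1, 2))    # admissible stabilizing control policy u = -Kx (step 11)
Ld = np.zeros((1, 2))   # disturbance policy d = Ld x

for i in range(50):
    # Step 12 (policy evaluation): solve the Lyapunov/Bellman equation
    # (A - BK + D*Ld)'P + P(A - BK + D*Ld) + Q + K'RK - gamma^2 Ld'Ld = 0
    Ac = A - B @ K + D @ Ld
    rhs = -(Q + K.T @ R @ K - gamma**2 * Ld.T @ Ld)
    P = solve_continuous_lyapunov(Ac.T, rhs)
    # Step 13 (update the disturbance) and step 14 (update the control input)
    Ld_new = (1.0 / gamma**2) * D.T @ P
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) + np.linalg.norm(Ld_new - Ld) < 1e-9:
        break
    K, Ld = K_new, Ld_new

print("converged value matrix P:\n", P)
print("optimal feedback gain K:", K, " worst-case disturbance gain:", Ld)
```

The iteration alternates policy evaluation (a Lyapunov solve) with the two improvement steps, mirroring steps 12-14 above; in the general nonlinear case the same alternation is carried out on the HJB equation instead.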
Optionally, the S2 specifically includes:
s21: establishing a graph theory:
Let G = (V, E, A) be a weighted graph describing the information channels among the N agents. V = {v_1, v_2, ..., v_N} is a non-empty finite set of follower nodes; E ⊆ V × V is the edge set; A = [a_ij] is the weighted adjacency matrix, with a_ij > 0 when (v_i, v_j) ∈ E, a_ij = 0 when (v_i, v_j) ∉ E, and a_ii = 0 for all i = 1, 2, ..., N. Define N_i = {v_j ∈ V : (v_i, v_j) ∈ E} as the neighbor set of follower v_i, i.e., all followers in N_i send information directly to follower v_i. The matrix D = diag(d_1, d_2, ..., d_N) is the in-degree matrix, where d_i = \sum_{j ∈ N_i} a_{ij}, i = 1, 2, ..., N. The Laplacian matrix of the directed graph G = (V, E, A) is L = D - A = [l_ij], where l_ij = -a_ij for i ≠ j and l_ii = \sum_{j ≠ i} a_{ij}. The sum of each row of the Laplacian matrix L is zero, i.e., 1_N is a right eigenvector of L with eigenvalue zero. The graph contains a spanning tree if there exists a node v_i from which there is a directed path to every other node in the graph; the graph is strongly connected if there is a directed path from every node to every other node. For graphs containing a spanning tree, strong connectivity is a sufficient but not necessary condition;
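As an illustration of these graph-theoretic quantities, the short sketch below builds the in-degree matrix and the Laplacian L = D - A from a hypothetical weighted adjacency matrix and checks that every row of L sums to zero, so that 1_N is a right eigenvector with eigenvalue zero.

```python
import numpy as np

# Hypothetical weighted adjacency matrix for N = 4 followers; a_ij > 0 means
# follower v_i receives information from follower v_j, and a_ii = 0.
A = np.array([[0.0, 1.0, 0.0, 0.5],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

D_in = np.diag(A.sum(axis=1))   # in-degree matrix D = diag(d_1, ..., d_N), d_i = sum_j a_ij
L = D_in - A                    # Laplacian of the directed graph

print("Laplacian L:\n", L)
print("row sums (all zero):", L.sum(axis=1))
print("L @ ones =", L @ np.ones(4))   # 1_N is a right eigenvector with eigenvalue 0
```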
s22: problem description:
Consider a multi-agent system consisting of one leader and N followers, with a directed communication graph G. The dynamic model of the i-th follower is:

\dot{x}_i = f_i(x_i) + g_i(x_i) u_i    (2-1)

where x_i and u_i are the state and the input of the i-th follower, respectively, and f_i(·) and g_i(·) are the internal dynamics function and the input matrix function of the i-th follower; f_i(x_i) and g_i(x_i) are assumed unknown, f_i(0) = 0, and system (2-1) is robustly stable;
The dynamic model of the leader is:
Figure BDA0002942369580000061
wherein
Figure BDA0002942369580000062
is the state of the leader,
Figure BDA0002942369580000063
is unknown; D is a constant matrix; f is assumed to be differentiable and bounded, ||f(x_0)|| ≤ ρ_0;
According to the network topological relation between each follower and the adjacent agent, the local domain consistency error of the system is described as follows:
Figure BDA0002942369580000064
wherein
Figure BDA0002942369580000065
And b is a i ≧ 0, if and only if b i When the number is more than 0, the ith agent and the leader have communication; consistency information of multi-agent system is composed of consistency error e of local area i When t → ∞, e i → 0, the multi agent system will agree;
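The local neighborhood consensus error e_i of (2-3) can be computed directly from the adjacency weights and the pinning gains b_i; the sketch below uses hypothetical states and weights chosen only for illustration.

```python
import numpy as np

A = np.array([[0.0, 1.0, 0.0],     # hypothetical follower-to-follower weights a_ij
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
b = np.array([1.0, 0.0, 0.0])      # b_i > 0 only for followers that hear the leader
x = np.array([1.0, 2.0, 3.0])      # hypothetical follower states
x0 = 0.5                           # leader state

# e_i = sum_j a_ij (x_j - x_i) + b_i (x_0 - x_i)
e = np.array([A[i] @ (x - x[i]) + b[i] * (x0 - x[i]) for i in range(len(x))])
print("local neighborhood consensus errors:", e)
```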
s23: adaptive distributed observer
By designing an adaptive distributed observer for each follower, the problem that, in a multi-agent system, each follower must estimate the leader's state in real time even though that state is unknown to it is solved, and the regulation of each follower's state relative to the leader is converted into the regulation of its observer's state relative to the leader;
wherein, the self-adaptive distributed observer is as follows:
Figure BDA0002942369580000066
Figure BDA0002942369580000067
wherein x 0 =x,
Figure BDA0002942369580000068
D 0 =D,
Figure BDA0002942369580000069
and μ > 0; under the error description of the system, the observer satisfies
Figure BDA00029423695800000610
for i = 1, 2, …, N, and satisfies
Figure BDA00029423695800000611
The adaptive distributed observer contains a mechanism for estimating the matrix D, which is known only to the agents adjacent to the leader;
Using S_i, the estimate of S obtained from the solution of the adaptive computation equation, the following observer form is obtained:
Figure BDA00029423695800000612
Figure BDA0002942369580000071
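The exact observer equations (3-1)-(3-4) appear in the original only as images; the sketch below therefore assumes the standard adaptive-distributed-observer structure suggested by the surrounding text, in which each follower runs a consensus update on its estimate D_i of the leader matrix D and on its estimate of the leader state, with the true D and x_0 available only to the leader's neighbors. All numerical values and the communication graph are illustrative assumptions.

```python
import numpy as np

# Leader: x0' = D x0 (illustrative harmonic-oscillator leader)
D = np.array([[0.0, 1.0], [-1.0, 0.0]])
N = 3                                        # number of followers
# a[i][j] > 0: follower i+1 receives information from node j (node 0 is the leader)
a = np.array([[1.0, 0.0, 0.0, 0.0],          # follower 1 hears the leader only
              [0.0, 1.0, 0.0, 0.0],          # follower 2 hears follower 1
              [0.0, 0.0, 1.0, 0.0]])         # follower 3 hears follower 2
mu = 5.0
dt, steps = 0.001, 20000

x0 = np.array([1.0, 0.0])                    # leader state
Di = [np.zeros((2, 2)) for _ in range(N)]    # D_i: follower i's estimate of D
eta = [np.zeros(2) for _ in range(N)]        # eta_i: follower i's estimate of x0

for _ in range(steps):
    D_all = [D] + Di                         # node 0 carries the true matrix D
    eta_all = [x0] + eta                     # node 0 carries the true leader state
    newD, newEta = [], []
    for i in range(N):
        # assumed consensus updates on the matrix estimate and the state estimate
        dD = mu * sum(a[i][j] * (D_all[j] - Di[i]) for j in range(N + 1))
        dEta = Di[i] @ eta[i] + mu * sum(a[i][j] * (eta_all[j] - eta[i]) for j in range(N + 1))
        newD.append(Di[i] + dt * dD)
        newEta.append(eta[i] + dt * dEta)
    x0 = x0 + dt * (D @ x0)
    Di, eta = newD, newEta

print("matrix estimation errors:", [np.linalg.norm(Di[i] - D) for i in range(N)])
print("state estimation errors:", [np.linalg.norm(eta[i] - x0) for i in range(N)])
```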
s24: designing a multi-agent system controller based on reinforcement learning;
consider the following system model:
x k+1 =f(x k )+g(x k )u k (4-1)
where x_k is the state of the system and u_k is the control input of the system; the system model can also be written more concisely as x_{k+1} = F(x_k, u_k);
for each state x of a multi-agent system k The following control strategies are defined:
u k =h(x k ) (4-2)
This mapping is also called a feedback controller; in the field of feedback control there are many ways to design a feedback control policy, including optimal solutions of the Riccati equation, adaptive control, H-infinity control and classical frequency-domain control;
To obtain the optimal control policy of the system, the following cost function is designed for the system:

V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{\, i-k} \, r(x_i, h(x_i))    (4-3)
where the discount factor γ satisfies 0 < γ ≤ 1, and u_k = h(x_k) is the control policy in the design;
or, in standard quadratic form:

V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{\, i-k} \left( x_i^T Q x_i + u_i^T R u_i \right)    (4-4)
Suppose the minimum cost is attained at V*; the optimal cost function then satisfies:

V^*(x_k) = \min_{u_k} \left( r(x_k, u_k) + \gamma V^*(x_{k+1}) \right)    (4-5)
When the optimal control policy is adopted, the optimal control given by the system is:

u_k^* = h^*(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + \gamma V^*(x_{k+1}) \right)    (4-6)
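For the linear-quadratic special case of (4-4)-(4-6) with discount factor gamma = 1, the optimal value function is quadratic, V*(x) = x^T P x, and the Bellman optimality equation reduces to the discrete-time Riccati recursion. The sketch below iterates that recursion with illustrative matrices; it is a special case used for intuition, not the patent's general nonlinear setting.

```python
import numpy as np

# x_{k+1} = A x_k + B u_k with stage cost x'Qx + u'Ru (illustrative data, gamma = 1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

P = np.zeros((2, 2))
for _ in range(2000):
    # Value-iteration form of (4-5): P <- Q + A'P(A - BK), K = (R + B'PB)^{-1} B'PA
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ (A - B @ K)
    if np.linalg.norm(P_next - P) < 1e-10:
        P = P_next
        break
    P = P_next

K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("Riccati solution P:\n", P)
print("optimal feedback gain (u_k = -K x_k):", K)
```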
in the original system, the leader of the multi-agent system is considered to have the following model:
x k+1 =f(x k ) (4-7)
with a communication network diagram of a given system, the local coherence error of the system is defined as:
Figure BDA0002942369580000081
The consensus information of the multi-agent system is represented by the above local neighborhood consensus error: if e_i → 0 as t → ∞, the system tends to consensus;
An additional compensator is designed, independent of the individual subsystems, defined by the desired input-affine differential equation:
Figure BDA0002942369580000082
Combining the corresponding graph-theoretic results gives the global error form (4-10):

e = L'(x - x_0)    (4-10)
wherein,
Figure BDA0002942369580000083
and is provided with
Figure BDA0002942369580000084
satisfying b_ii = b_i and, for i ≠ j, b_ij = 0;
After the local error e is derived after the simultaneous operations (2-1) and (4-10), the local region consistency error is obtained relative to the graph G (x) as follows:
Figure BDA0002942369580000085
where f_e(t) = f(x(t)) - f(x(0)),
Figure BDA0002942369580000086
L_i denotes the i-th column vector of the Laplacian matrix; combining (4-10) and (4-11), the local neighborhood consensus error is expressed as:
Figure BDA0002942369580000087
wherein,
Figure BDA0002942369580000088
and satisfies the following conditions:
Figure BDA0002942369580000089
Figure BDA00029423695800000810
likewise, returning to the system model at the beginning of the designed continuous time:
Figure BDA00029423695800000811
Figure BDA00029423695800000812
given a cost function for continuous-time multi-agent system consistency control:
Figure BDA0002942369580000091
Then the associated tracking Bellman equation is obtained from the affine differential equations defined by (4-9) and (4-13) using the Leibniz rule:
Figure BDA0002942369580000092
where U (u) is a positive definite integrand on the control input u:
Figure BDA0002942369580000093
then (4-15) is expressed by the following equation:
Figure BDA0002942369580000094
then, the following Hamiltonian equation is defined:
Figure BDA0002942369580000095
Let V* denote the optimal control cost of the system; the optimal cost function is defined as follows:
Figure BDA0002942369580000096
Under the optimal cost V*, according to the Hamiltonian in (4-18), the following HJB equation is obtained:
Figure BDA0002942369580000097
when the stability condition is satisfied
Figure BDA0002942369580000098
Then, the following optimal control inputs are obtained:
Figure BDA0002942369580000099
the following strategy iteration algorithm is obtained:
the algorithm is as follows: HJB equation solving method based on strategy iteration method
Step 211: policy evaluation: given the control input u_i(x), solve for V_i(x) from the following Bellman equation:
Figure BDA0002942369580000101
Step 212: policy improvement: the control policy is updated by:
Figure BDA0002942369580000102
step 213: order to
Figure BDA0002942369580000103
and return to step 211 until convergence to the minimum value;
An integral reinforcement learning (IRL) algorithm is introduced into the policy iteration: based on the discrete-time system (4-1), for any integration interval T > 0, the value function of the continuous system (4-13) satisfies the following form:
Figure BDA0002942369580000104
The solution of the tracking Bellman equation is obtained with the integral reinforcement learning algorithm, so that the HJB equation can be solved by integral reinforcement learning even when the system dynamic model is unknown;
obtaining the following integral reinforcement learning algorithm based on strategy iteration:
the algorithm is as follows: strategy iteration-based HJB equation solving method by offline integral reinforcement learning algorithm
Step 221: policy evaluation: given the control input u_i(x), solve for V_i(x) from the following Bellman equation:
Figure BDA0002942369580000105
Step 222: strategy improvement: the control strategy is updated by:
Figure BDA0002942369580000106
step 223: order to
Figure BDA0002942369580000107
and return to step 221 until convergence to the minimum value;
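For a linear system with a quadratic value function V_i(x) = x^T P_i x, the integral Bellman equation above can be evaluated from measured trajectory data, so the drift dynamics never appear explicitly in the policy-evaluation step; the input matrix B is still used for policy improvement in this sketch. All system data below are illustrative assumptions, and the true A matrix is used only to generate the simulated measurements.

```python
import numpy as np

# x' = Ax + Bu with cost integrand x'Qx + u'Ru; A is used only to simulate data.
A = np.array([[-1.0, 2.0], [-2.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
dt, T = 0.001, 0.05                       # simulation step and reinforcement interval T

def quad_features(x):                     # basis for V(x) = x'Px: [x1^2, x1*x2, x2^2]
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

K = np.zeros((1, 2))                      # initial admissible policy u = -Kx
for it in range(15):
    Phi, b = [], []
    x = np.array([2.0, -1.0])             # restart the trajectory for each policy
    for _ in range(60):                   # 60 data windows of length T under the current policy
        x_start = x.copy()
        cost = 0.0
        for _ in range(int(T / dt)):      # simulate one interval, integrating the stage cost
            u = -K @ x
            cost += dt * (x @ Q @ x + u @ R @ u)
            x = x + dt * (A @ x + B @ u)
        # Integral Bellman equation: V(x(t)) - V(x(t+T)) equals the cost paid on [t, t+T]
        Phi.append(quad_features(x_start) - quad_features(x))
        b.append(cost)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(b), rcond=None)
    P = np.array([[theta[0], theta[1] / 2.0], [theta[1] / 2.0, theta[2]]])
    K_new = np.linalg.solve(R, B.T @ P)   # policy improvement (uses B only)
    if np.linalg.norm(K_new - K) < 1e-4:
        K = K_new
        break
    K = K_new

print("learned value matrix P:\n", P)
print("learned feedback gain K (u = -Kx):", K)
```

The least-squares fit over the quadratic features plays the role of the policy-evaluation step, and the quality of the fit depends on the data windows being sufficiently informative, which is why each policy is evaluated on a fresh trajectory.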
s25: design of self-adaptive distributed observer based on reinforcement learning algorithm to realize consistency distributed control of multiple intelligent agents
Multi-agent system:
x i (k+1)=f i (x(k))+g i (x(k))u(k)
y i (k)=cx i (k) (5-1)
where x_i, u_i and y_i respectively denote the state, the control input and the output of the i-th agent of the system;
the leader model takes the following form into account:
ν(k+1)=Eν(k)
Figure BDA0002942369580000111
in the leader model referred to herein,
Figure BDA0002942369580000112
is the state of the leader system; when agent i satisfies (ν_0, ν_i) ∈ E, i.e., when a communication connection exists between follower i and the leader,
Figure BDA0002942369580000113
represents a known constant matrix; Q satisfies Q(0) = 0, and
Figure BDA0002942369580000114
is an external reference signal;
The following observer is adopted:
Figure BDA0002942369580000115
Figure BDA0002942369580000116
where R_i(k) denotes the observed value of agent i relative to the leader at time k, and satisfies R_0(k) = ν(k), W_0(k) = W,
Figure BDA0002942369580000117
According to the system description, a cost function of the system is obtained according to the derivation of the optimal output problem formula of the linear system:
Figure BDA0002942369580000118
Figure BDA0002942369580000119
Figure BDA00029423695800001110
where i = 1, 2, ..., N, and γ_i is the discount factor,
Figure BDA00029423695800001111
c = [1, 0, 0, ..., 0]; the optimal feedback input of each follower is obtained by solving equation (5-4):
Figure BDA00029423695800001112
When solving for the optimal feedback input, (5-4) is written in quadratic form and expressed as the value function of the system:
Figure BDA00029423695800001113
the following bellman equation is obtained:
Figure BDA0002942369580000121
According to the above Bellman equation, the HJB equation of the nonlinear optimal feedback is defined as:
Figure BDA0002942369580000122
when the stability condition is satisfied
Figure BDA0002942369580000123
Then, the following optimal control inputs are obtained:
Figure BDA0002942369580000124
wherein,
Figure BDA0002942369580000125
solving the HJB equation by adopting strategy iteration of IRL;
obtaining the following strategy iteration-based online IRL multi-agent optimal feedback control algorithm:
the algorithm is as follows: online IRL algorithm solution HJB equation based on strategy iteration
Step 231: initialization: selecting a control input
Figure BDA0002942369580000126
and repeat the following steps until the system converges;
Figure BDA0002942369580000127
step 232: strategy improvement: the control strategy is updated by:
Figure BDA0002942369580000128
Step 233: let u_i(k) = u_{i+1}(k) and return to step 231 until V_i(k) converges to the minimum value;
on the basis of the system (5-1) and (5-2) models, consider the following first-order multi-agent system:
Figure BDA0002942369580000129
where x_i(k) and u_i(k) respectively represent the state and the control input of the i-th agent at time k; τ_ij ≥ 0 denotes the communication time delay of data from agent j to agent i, and τ_i ≥ 0 denotes the input time delay of agent i; consider a first-order discrete multi-agent system of n agents whose network topology is a static directed weighted graph containing a globally reachable node; if the following condition is satisfied:
Figure BDA0002942369580000131
then max_i { d_i (2 τ_i + 1) } < 1 holds, and the system is able to achieve asymptotic consensus, where
Figure BDA0002942369580000132
assume that a multi-agent system contains 5 nodes, and their corresponding adjacency matrices are as follows:
Figure BDA0002942369580000133
according to the setting, the input time lag of the intelligent agent should satisfy
Figure BDA0002942369580000134
Suppose τ_13 = 1 s, τ_21 = 0.75 s, τ_32 = 1.8 s, τ_42 = 2 s and τ_51; when the input time delay τ is 0.8 s and the initial state of the agents is randomly generated as x(0) = (2.5, 3, 2, 3.5, 5), the agents finally reach consensus asymptotically; when the input time delay is changed to 3 s, the system still achieves consensus.
The invention has the beneficial effects that:
1. The design combines related dynamic programming techniques and provides a policy-iteration-based reinforcement learning method for a single agent;
2. The invention designs an adaptive distributed observer that estimates the unmeasured states of the multi-agent system from the output information of the observer system, overcoming the drawback that a traditional full-dimensional observer cannot exchange information in real time;
3. The invention provides a method for computing the solution of the observer equation online, avoiding the repeated differentiation of the observer equation and the multi-step construction of a Lyapunov function required by traditional methods;
4. By means of the output regulation technique, the invention approximates the constructed value function with an integral-reinforcement-learning-based policy iteration algorithm to solve the HJB equation, overcoming the difficulty traditional methods have in solving the HJB equation for the optimal solution;
5. The invention designs the controller with a reinforcement-learning-based method and obtains the optimal solution by iteration; integrating existing knowledge of reinforcement learning, it constructs a method that solves the distributed consensus output regulation problem of the system by adaptive state feedback and adaptive measurement output feedback control when no follower knows the state of the leader system, realizing distributed consensus control of the multi-agent system;
6. The invention provides a reinforcement-learning-based consensus control method under different time delays: it considers the constrained-input problem, selects the required performance function as a non-quadratic penalty function, and then considers the communication time delay between the followers and the leader and the input time delay among the followers.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of an intelligent system node connection topology;
fig. 2 is a simulation result diagram.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only for the purpose of illustrating the invention and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Aiming at the distributed consensus control problem of multi-agent systems, the invention solves the consensus output regulation problem of a linear multi-agent system by designing an adaptive distributed observer for each follower, initially on the premise that each follower knows the system matrix S of the leader system. The design of the adaptive distributed observer consists of three steps. First, an adaptive distributed observer is designed to estimate the system matrix and the state of the leader system. Second, a method for computing the solution of the observer equation online is provided after the adaptive distributed observer is designed. Third, to eliminate a few extreme cases, the distributed consensus output regulation problem of the system is solved by combining adaptive state feedback and adaptive measurement output feedback control when no follower knows the system matrix of the leader.
After the adaptive distributed observer is constructed, the unmeasured states of the multi-agent system are estimated from the output information of the observer system; then, based on the estimated states, an optimal controller is designed with a reinforcement learning method. The constrained-input problem is also considered, and the required performance function can be selected as a non-quadratic penalty function; then the communication time delay between the followers and the leader and the input time delay among the followers are considered, and the consensus problem under different time delays is verified. The invention designs an ADP-based reinforcement learning controller: a cost function of the reinforcement-learning-based multi-agent system is established, the HJB (Hamilton-Jacobi-Bellman) equation of this cost function is formulated and solved by a synchronous-reinforcement-learning-based method, and the optimal solution is finally obtained by an iterative method, realizing optimal control of the multi-agent system.
In order to achieve the purpose, the technical scheme of the invention is as follows:
the multi-agent distributed consistency control method adopts the self-adaptive distributed observer to estimate the state of the system, adopts a reinforcement learning based method to design the controller according to the estimated state, and obtains the optimal solution through an iterative method to realize the optimal control of the multi-agent system.
First part is based on single intelligent agent optimal output control of reinforcement learning
When designing the optimal controller of a single agent, only the solution of the tracking HJB equation needs to be learned online with an off-policy reinforcement learning algorithm, without any knowledge of the system dynamics.
Consider the following system model:
Figure BDA0002942369580000151
where x, u are the status and control inputs of the system, respectively, and d is an external disturbance. Assuming that f (x), g (x), l (x) are Lipchitz functions, and f (0) ═ 0, the system is robust and stable.
Let p (t) be the consistency condition that needs to be achieved and satisfy the following form:
Figure BDA0002942369580000152
and h (0) ═ 0, the tracking error of the whole system is defined as:
Figure BDA0002942369580000153
simultaneous reaction of (1-1), (1-2) and (1-3) to obtain:
Figure BDA0002942369580000154
the following virtual performance outputs are defined to meet the requirements:
Figure BDA0002942369580000155
the following performance function (cost function) is defined for the system:
Figure BDA0002942369580000161
suppose the system is in u * The initial satisfaction cost is minimized, then
Figure BDA0002942369580000162
The following bellman equation is given:
Figure BDA0002942369580000163
wherein,
Figure BDA0002942369580000164
is an amplification system designed for the system.
According to the optimum conditions
Figure BDA0002942369580000165
And
Figure BDA0002942369580000166
the optimal control input and the optimal disturbance input can be obtained:
Figure BDA0002942369580000167
Figure BDA0002942369580000168
wherein V * Is the optimization value function defined in (1-7).
From the optimal input conditions obtained in (1-10), the following trajectory tracking hjb (hamilton Jacobi bellman) equation can be obtained:
Figure BDA0002942369580000169
thus, the following single agent offline RL algorithm is obtained:
Figure BDA00029423695800001610
Figure BDA0002942369580000171
second part reinforcement learning based multi-agent consistency control
Firstly, graph theory:
let G ═ V, E, a be a weighted graph, which describes the information channels between N agents. V is follower node { V 1 ,v 2 ,…v N A non-empty finite set of };
Figure BDA0002942369580000172
is an edge set;
Figure BDA0002942369580000173
is a weighted adjacency matrix, and when (v) i ,v j ) E is E, a ij Is greater than 0; if it is
Figure BDA0002942369580000174
then a_ij = 0, and a_ii = 0 for all i = 1, 2, ..., N. Define N_i = {v_j ∈ V : (v_i, v_j) ∈ E} as the neighbor set of follower v_i, i.e., all followers in N_i can send information directly to follower v_i. The matrix D = diag(d_1, d_2, ..., d_N) is the in-degree matrix, where
Figure BDA0002942369580000175
1, 2. Then, the laplacian matrix L ═ D-a ═ L of the directed graph G ═ (V, E, a) ij ]Wherein l is ij =-a ij
Figure BDA0002942369580000176
It follows that the sum of each row of the Laplacian matrix L is zero, i.e., 1_N is a right eigenvector of the Laplacian matrix L with eigenvalue zero. In the invention only a simple graph is considered. The graph contains a spanning tree if there exists a node v_i from which there is a directed path to every other node in the graph; the graph is strongly connected if there is a directed path from every node to every other node. For graphs containing a spanning tree, strong connectivity is a sufficient but not necessary condition.
Secondly, problem description:
considering a multi-agent system consisting of one leader and N followers, and considering a form with a directed graph of communication g (x), the kinetic model of the ith follower is:
Figure BDA0002942369580000177
wherein
Figure BDA0002942369580000178
And
Figure BDA0002942369580000179
respectively the status and input of the ith follower,
Figure BDA00029423695800001710
and
Figure BDA00029423695800001711
internal function and input matrix function of the ith follower, respectively, and assume f i (x i ),g i (x i ) Is unknown, has f i (0) The system (2-1) has robust stability at 0.
The leader's kinetic model was:
Figure BDA0002942369580000181
wherein
Figure BDA0002942369580000182
Is the state of the leader and is,
Figure BDA0002942369580000183
is unknown; D is a constant matrix; f is assumed to be differentiable and bounded, ||f(x_0)|| ≤ ρ_0.
According to the network topological relation between each follower and the adjacent agent, the local field consistency error of the system can be described as follows;
Figure BDA0002942369580000184
wherein
Figure BDA0002942369580000185
and b_i ≥ 0, with b_i > 0 if and only if there is communication between the i-th agent and the leader. As can be seen from the above equation, the consensus information of the multi-agent system can be represented by the local neighborhood consensus error e_i: when e_i → 0 as t → ∞, the multi-agent system reaches consensus.
Adaptive distributed observer
In the invention, by designing the adaptive distributed observer for each follower, the problem that in a multi-agent system, in the case of unknown state of a leader, the follower can estimate the state of the leader in real time can be solved, and the state of the follower relative to the leader can be converted into the state of the adaptive distributed observer relative to the leader in the design.
The involved adaptive distributed observer is as follows:
Figure BDA0002942369580000186
Figure BDA0002942369580000187
wherein x 0 =x,
Figure BDA0002942369580000188
D 0 =D,
Figure BDA0002942369580000189
Mu is more than 0; under the error description of the system, satisfy
Figure BDA00029423695800001810
i is 1,2, …, N, and satisfies
Figure BDA00029423695800001811
This observer contains a mechanism for estimating the matrix D, hence the name adaptive distributed observer; as seen from (1-3), D is known only to the agents adjacent to the leader, so this observer is more stable and accurate than a generic distributed observer.
Since the control law needs to use the designed regulator equation to provide suitable feedforward control to achieve the control objective, and the solution of that equation depends on the system matrix S of the leader system, while not every follower knows the matrix S, it is proposed to obtain S_i (the estimate of S) from the solution of the adaptive computation equation, so the following observer form can be obtained:
Figure BDA0002942369580000191
fourth, multi-agent system controller design based on reinforcement learning
Before studying consensus control of multi-agent systems, a specific dynamic model needs to be discussed. Research on multi-agent systems is mostly based on adaptive dynamic programming (ADP), and most ADP research is carried out on systems operating in discrete time (DT). Therefore, the nonlinear discrete-time system is considered first and some methods for optimal control of discrete-time systems are summarized; then an online reinforcement learning scheme is designed for the discrete-time system by combining the reinforcement learning method with the linear quadratic regulation (LQR) technique. Consider the following system model:
x k+1 =f(x k )+g(x k )u k (4-1)
where x_k is the state of the system and u_k is the control input of the system; the system model can also be written more concisely as x_{k+1} = F(x_k, u_k).
For each state x of a multi-agent system k The following control strategies are defined:
u k =h(x k ) (4-2)
This mapping is also called a feedback controller; in the field of feedback control there are many ways to design a feedback control policy, including optimal solutions of the Riccati equation, adaptive control, H-infinity control and classical frequency-domain control. In the reinforcement learning scheme of the invention, the control policy is learned in real time from the stimulus of the environment.
To obtain the optimal control strategy for the system, the following cost function is now designed for the system:
Figure BDA0002942369580000194
wherein, the discount factor 0 is more than gamma and less than or equal to 1, u k =h(x k ) Is the control strategy in the design. It can also be given in the following more general standard quadratic form:
Figure BDA0002942369580000195
Suppose the minimum cost is attained at V*; the optimal cost function is:
Figure BDA0002942369580000201
when the optimal control strategy is taken, the optimal control value given by the system is as follows:
Figure BDA0002942369580000202
in the original system, the leader of the multi-agent system is considered to have the following model:
x k+1 =f(x k ) (4-7)
with a communication network diagram of a given system, the local coherence error of the system is defined as:
Figure BDA0002942369580000203
As can be seen from the above formula, the consensus information of the multi-agent system can be represented by the above local neighborhood consensus error: if e_i → 0 as t → ∞, the system tends to consensus.
To overcome the drawback that reinforcement-learning-based adaptive dynamic programming methods depend strongly on the model of each agent system, an additional compensator is designed that does not depend on the individual subsystems and is defined by the desired input-affine differential equation as follows:
Figure BDA0002942369580000204
and combining the knowledge of the corresponding graph theory to obtain a global error form of (4-10):
e=L'(x-x 0 ) (4-10)
wherein,
Figure BDA0002942369580000205
and is provided with
Figure BDA0002942369580000206
satisfying b_ii = b_i and, for i ≠ j, b_ij = 0
After the local error e is derived after the simultaneous operations (2-1) and (4-10), the local region consistency error is obtained relative to the graph G (x) as follows:
Figure BDA0002942369580000207
wherein, f e (t)=f(x(t))-f(x(0)),
Figure BDA0002942369580000208
L_i denotes the i-th column vector of the Laplacian matrix. Combining (4-10) and (4-11), the local neighborhood consensus error can be expressed as:
Figure BDA0002942369580000209
wherein,
Figure BDA0002942369580000211
and satisfies the following conditions:
Figure BDA0002942369580000212
Figure BDA0002942369580000213
likewise, returning to the system model at the beginning of the designed continuous time:
Figure BDA0002942369580000214
Figure BDA0002942369580000215
given a cost function for continuous-time multi-agent system consistency control:
Figure BDA0002942369580000216
Then the associated tracking Bellman equation can be obtained from the affine differential equations defined by (4-9) and (4-13) using the Leibniz rule:
Figure BDA0002942369580000217
where U (u) is a positive definite integrand on the control input u:
Figure BDA0002942369580000218
then (4-15) can be expressed by the following equation:
Figure BDA0002942369580000219
then, the following Hamiltonian (Hamiltonian) equation is defined:
Figure BDA00029423695800002110
Let V* denote the optimal control cost of the system; the optimal cost function is defined as follows:
Figure BDA00029423695800002111
at an optimum cost V * Next, from the Hamiltonian equation in (4-18), the following HJB (Hamilton Jacobi Bellman) equation can be obtained:
Figure BDA0002942369580000221
when the stability condition is satisfied
Figure BDA0002942369580000222
Then, the following optimal control inputs can be obtained:
Figure BDA0002942369580000223
the following strategy iteration algorithm is obtained:
Figure BDA0002942369580000224
To realize optimal consensus control of the multi-agent system when the system dynamics are unknown, an integral reinforcement learning algorithm can be introduced into the above policy iteration: based on the discrete-time system (4-1), for any integration interval T > 0, the value function of the continuous system (4-13) satisfies the following form:
Figure BDA0002942369580000225
then, tracking the solution of the Bellman equation by using an integral reinforcement learning algorithm, and solving the HJB equation by using integral reinforcement chemistry can be realized under the condition that a system dynamic model is unknown.
Obtaining the following integral reinforcement learning algorithm based on strategy iteration:
Figure BDA0002942369580000231
fifthly, designing a self-adaptive distributed observer based on a reinforcement learning algorithm to realize the consistency distributed control of multiple intelligent agents
In this section, the unmeasured states of the multi-agent system are estimated from the output information of the observer system designed in the third section; then, based on the estimated states, an optimal controller is designed with a reinforcement learning method. The constrained-input problem is also considered, and the required performance function can be selected as a non-quadratic penalty function; then the communication time delay between the followers and the leader and the input time delay among the followers are considered, and the consensus problem under different time delays is verified in this section.
Consider the following multi-agent system:
x i (k+1)=f i (x(k))+g i (x(k))u(k)
y i (k)=cx i (k) (5-1)
where x_i, u_i and y_i respectively represent the state, the control input and the output of the i-th agent of the system.
The leader model takes the following form into account:
ν(k+1)=Eν(k)
Figure BDA0002942369580000232
in the leader model referred to in the above,
Figure BDA0002942369580000233
is the state of the leader system; when agent i satisfies (ν_0, ν_i) ∈ E, i.e., when there is a communication connection between follower i and the leader,
Figure BDA0002942369580000241
Representing a known constant matrix. Q satisfies the condition that Q (0) is 0,
Figure BDA0002942369580000242
is an external reference signal.
Considering that the adaptive distributed observer designed in the third part has unpredictability in parameter design and is not suitable for discrete time system, a simpler and more easily designed observer is adopted in the present part:
Figure BDA0002942369580000243
Figure BDA0002942369580000244
wherein R is i (k) Represents an observed value of agent i relative to the leader at time k, and satisfies R 0 (k)=ν(k),W 0 (k)=W,
Figure BDA0002942369580000245
According to the above system description, a cost function of the system can be obtained according to the derivation of the optimal output problem formula of the linear system:
Figure BDA0002942369580000246
Figure BDA0002942369580000247
Figure BDA0002942369580000248
where i = 1, 2, ..., N, and γ_i is the discount factor,
Figure BDA0002942369580000249
c = [1, 0, 0, ..., 0]; by solving equation (5-4), the optimal feedback input of each follower can be obtained:
Figure BDA00029423695800002410
When solving for the optimal feedback input, (5-4) can be written in quadratic form and expressed as the value function of the system:
Figure BDA00029423695800002411
the following bellman equation is obtained:
Figure BDA00029423695800002412
This is a consistency equation satisfied by the value function; according to the above Bellman equation, the HJB equation of the nonlinear optimal feedback can be defined as:
Figure BDA00029423695800002413
When the stability condition is satisfied
Figure BDA0002942369580000251
Then, the following optimal control inputs can be obtained:
Figure BDA0002942369580000252
wherein,
Figure BDA0002942369580000253
because the HJB equation is difficult to solve, the strategy iteration of the IRL is adopted in the algorithm to solve the HJB equation.
Obtaining the following strategy iteration-based online IRL multi-agent optimal feedback control algorithm:
Figure BDA0002942369580000254
now, the problem of communication time lag between the system follower and the leader and the problem of input time lag between the followers are considered, and the consistency problem under different time lags is verified in this section.
On the basis of the original system (5-1) and (5-2) models, the following first-order multi-agent system is considered:
Figure BDA0002942369580000255
wherein,
Figure BDA0002942369580000256
respectively representing the state and the control input of the i-th agent at time k; τ_ij ≥ 0 denotes the communication time delay of data from agent j to agent i, and τ_i ≥ 0 denotes the input time delay of agent i. Consider a first-order discrete multi-agent system of n agents whose network topology is a static directed weighted graph containing a globally reachable node; if the following condition is satisfied:
Figure BDA0002942369580000257
then max_i { d_i (2 τ_i + 1) } < 1 holds, and the system is able to achieve asymptotic consensus, where
Figure BDA0002942369580000261
suppose a multi-agent system comprises 5 nodes, the connection topology of which is shown in FIG. 1, which is a directed weighted graph with node 1 being globally reachable. From the network topology of fig. 1, the corresponding adjacency matrix can be obtained as follows:
Figure BDA0002942369580000262
according to the setting, the input time lag of the intelligent agent should satisfy
Figure BDA0002942369580000263
Now the following assumptions are made: τ_13 = 1 s, τ_21 = 0.75 s, τ_32 = 1.8 s, τ_42 = 2 s and τ_51; when the input time delay τ is 0.5 s and the initial state of the agents is randomly generated as x(0) = (2.5, 3, 2, 3.5, 5), the simulation result is shown in Fig. 2, and it can be seen that the agents eventually reach consensus asymptotically. Then the input time delay is changed to 3 s, and the system can still achieve consensus.
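The consensus protocol and the adjacency matrix used in this example are given in the original only as images, so the sketch below assumes a standard first-order delayed protocol, a hypothetical set of edge weights satisfying max_i { d_i (2 τ_i + 1) } < 1, a sampling period of 0.05 s for converting the delays given in seconds into steps, and an assumed value for τ_51; only the delays listed in the text and the initial state x(0) = (2.5, 3, 2, 3.5, 5) are taken from the patent.

```python
import numpy as np

n_steps = 4000
dt = 0.05                      # assumed sampling period (s), so 1 s = 20 steps
# Communication delays tau_ij (from agent j to agent i), taken from the text and
# converted to steps; tau_51 is not given in the text, so a value is assumed here.
tau_c = {(0, 2): 20, (1, 0): 15, (2, 1): 36, (3, 1): 40, (4, 0): 16}
tau_i = 10                     # input delay of every agent: 0.5 s = 10 steps (assumed uniform)
# Hypothetical adjacency weights on the same edges, chosen so that d_i*(2*tau_i+1) < 1
a = np.zeros((5, 5))
for (i, j) in tau_c:
    a[i, j] = 0.04

x_hist = np.tile(np.array([2.5, 3.0, 2.0, 3.5, 5.0]), (n_steps + 1, 1))  # x(0) from the text
u_hist = np.zeros((n_steps + 1, 5))

for k in range(n_steps):
    for i in range(5):
        # assumed protocol: u_i(k) = sum_j a_ij ( x_j(k - tau_ij) - x_i(k) )
        u_hist[k, i] = sum(a[i, j] * (x_hist[max(k - tau_c[(i, j)], 0), j] - x_hist[k, i])
                           for j in range(5) if a[i, j] > 0)
        # assumed dynamics with input delay: x_i(k+1) = x_i(k) + u_i(k - tau_i)
        x_hist[k + 1, i] = x_hist[k, i] + u_hist[max(k - tau_i, 0), i]

print("final states:", x_hist[-1])
print("spread:", x_hist[-1].max() - x_hist[-1].min())   # shrinks toward zero (consensus)
```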
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (1)

1. A model unknown multi-agent consistency control method based on reinforcement learning is characterized in that: the method comprises the following steps:
s1: performing optimal output control of the single agent based on reinforcement learning;
s2: multi-agent consistency control based on reinforcement learning;
the S1 specifically includes:
when designing the optimal controller of a single agent, an off-policy reinforcement learning algorithm is adopted to learn the solution of the tracking HJB equation online, and the following system model is considered:
Figure FDA0003791058280000011
where x, u are the status and control inputs of the system, respectively, and d is external interference; assuming that f (x), g (x), l (x) are Lipchitz functions, and f (0) ═ 0, the system is robust and stable;
assume that p (t) is a condition of consistency that needs to be achieved and satisfies the following form:
Figure FDA0003791058280000012
and h (0) is equal to 0, the tracking error of the whole system is defined as:
Figure FDA0003791058280000013
simultaneous reaction of (1-1), (1-2) and (1-3) to obtain:
Figure FDA0003791058280000014
the following virtual performance outputs are defined to meet the requirements:
Figure FDA0003791058280000015
defining a performance function for the system:
Figure FDA0003791058280000016
suppose the system is in u * The initial satisfaction cost is minimized, then
Figure FDA0003791058280000017
The following bellman equation is given:
Figure FDA0003791058280000018
wherein,
Figure FDA0003791058280000019
is an amplification system designed for the system;
according to the optimum conditions
Figure FDA0003791058280000021
And
Figure FDA0003791058280000022
obtaining an optimal control input and an optimal interference input:
Figure FDA0003791058280000023
Figure FDA0003791058280000024
wherein V * Is the optimized value function defined in (1-7);
The following trajectory-tracking HJB equation is obtained from the optimal input conditions in (1-10):
Figure FDA0003791058280000025
Thus, the following single-agent offline RL algorithm is obtained:
Algorithm: solving the HJB equation based on the RL algorithm
Step 11: Initialization: given an admissible stabilizing control policy u_0.
Step 12: Policy evaluation: for a control input u_i and a disturbance input d_i, solve the following Bellman equation:
Figure FDA0003791058280000026
Figure FDA0003791058280000027
Step 13: update the disturbance d_i of the system:
Figure FDA0003791058280000028
Step 14: update the control input u_i of the system:
Figure FDA0003791058280000029
Step 15: re-execution of step 11
Step S2 specifically comprises:
S21: graph-theoretic preliminaries:
Let G = (V, E, A) be a weighted graph describing the information channels between the N agents; V = {v_1, v_2, …, v_N} is a non-empty finite set of follower nodes;
Figure FDA0003791058280000031
is an edge set;
Figure FDA0003791058280000032
is the weighted adjacency matrix, with a_ij > 0 when (v_i, v_j) ∈ E; if
Figure FDA0003791058280000033
then a_ij = 0, and a_ii = 0 for all i = 1, 2, …, N. N_i = {v_j ∈ V : (v_i, v_j) ∈ E} denotes the neighbor follower set of follower v_i, i.e. all followers in N_i send information directly to follower v_i. The matrix D = diag(d_1, d_2, …, d_N) is defined as the in-degree matrix, where
Figure FDA0003791058280000034
The Laplacian matrix of the directed graph G = (V, E, A) is L = D − A = [l_ij], where l_ij = −a_ij for i ≠ j and
Figure FDA0003791058280000035
The sum of each row of the Laplacian matrix L is zero, i.e. 1_N is a right eigenvector of L with corresponding eigenvalue zero. The graph contains a spanning tree if there is a node v_i from which there is a directed path to every other node of the graph; the graph is strongly connected if there is a directed path from every node to every other node. For a graph to contain a spanning tree, strong connectivity is a sufficient but not necessary condition;
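A small numerical illustration of these graph-theoretic objects follows; the weighted adjacency matrix is hypothetical, since the patent's own matrix appears only as an image.

```python
import numpy as np

# a_ij > 0 means follower i receives information from follower j (hypothetical weights)
A = np.array([[0.0, 0.0, 0.3, 0.0, 0.0],
              [0.4, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.6, 0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0, 0.0, 0.0]])
D = np.diag(A.sum(axis=1))         # in-degree matrix, d_i = sum_j a_ij
Lap = D - A                        # Laplacian L = D - A, l_ij = -a_ij for i != j

print(Lap @ np.ones(5))            # every row sums to zero: 1_N is a right eigenvector
print(np.linalg.eigvals(Lap))      # one eigenvalue is (numerically) zero
```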
S22: problem description:
Consider a multi-agent system consisting of one leader and N followers with a directed communication graph G; the dynamic model of the ith follower is:
Figure FDA0003791058280000036
wherein
Figure FDA0003791058280000037
And
Figure FDA0003791058280000038
are the state and input of the ith follower, respectively,
Figure FDA0003791058280000039
and
Figure FDA00037910582800000310
are the internal dynamics function and the input matrix function of the ith follower, respectively; it is assumed that f_i(x_i) and g_i(x_i) are unknown, f_i(0) = 0, and the system (2-1) is robustly stable;
The leader's dynamic model is:
Figure FDA00037910582800000311
wherein
Figure FDA00037910582800000312
is the state of the leader,
Figure FDA00037910582800000313
is unknown, D is a constant matrix, and f is differentiable and bounded, ||f(x_0)|| ≤ ρ_0;
According to the network topological relation between each follower and the adjacent agent, the local domain consistency error of the system is described as follows:
Figure FDA00037910582800000314
wherein
Figure FDA00037910582800000315
and b_i ≥ 0, with b_i > 0 if and only if the ith agent communicates with the leader; the consensus information of the multi-agent system consists of the local consistency errors e_i: when t → ∞ and e_i → 0, the multi-agent system reaches consensus;
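The local consistency error can be computed directly from the adjacency weights and the pinning gains b_i. The sketch below assumes the common sign convention e_i = Σ_j a_ij(x_i − x_j) + b_i(x_i − x_0), since the exact expression (2-3) appears only as an image, and checks that it coincides with the global form e = (L + B)(x − x_0).

```python
import numpy as np

A = np.array([[0.0, 0.0, 0.3, 0.0, 0.0],      # hypothetical follower adjacency weights
              [0.4, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.6, 0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0, 0.0, 0.0]])
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0])       # only follower 1 communicates with the leader
x = np.array([2.5, 3.0, 2.0, 3.5, 5.0])       # follower states (scalar for simplicity)
x0 = 1.0                                       # leader state

e_local = np.array([sum(A[i, j] * (x[i] - x[j]) for j in range(5)) + b[i] * (x[i] - x0)
                    for i in range(5)])

Lp = np.diag(A.sum(axis=1)) - A + np.diag(b)   # L' = L + B with B = diag(b_i)
e_global = Lp @ (x - x0)
print(np.allclose(e_local, e_global))          # True: the two expressions coincide
```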
S23: adaptive distributed observer:
By designing an adaptive distributed observer for each follower, the problem of each follower estimating the leader's state in real time when that state is not directly available is solved, and tracking of the leader by each follower is converted into tracking of its adaptive distributed observer;
The adaptive distributed observer is as follows:
Figure FDA0003791058280000041
Figure FDA0003791058280000042
where x_0 = x,
Figure FDA0003791058280000043
D_0 = D,
Figure FDA0003791058280000044
and μ > 0; under the error description of the system, it is satisfied that
Figure FDA0003791058280000045
and it satisfies
Figure FDA0003791058280000046
The adaptive distributed observer includes a mechanism for estimating the matrix D, which is known only to the followers adjacent to the leader;
Using the estimate S_i of S obtained from the solution of the adaptive computation equation, the following observer form is obtained:
Figure FDA0003791058280000047
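As a minimal sketch of this idea, the Euler-discretized loop below implements an adaptive distributed observer of the standard (Cai-Huang-type) form: each follower estimates the leader matrix D and the leader state by neighbourhood consensus, and only the follower pinned to the leader uses D and x_0 directly. The gains μ1 and μ2, the chain graph and the leader dynamics are assumptions made for illustration; the patent's own update laws are those of (3-1)-(3-3).

```python
import numpy as np

dt, T = 0.01, 20.0
D = np.array([[0.0, 1.0], [-1.0, 0.0]])        # leader dynamics dx0/dt = D x0 (bounded orbits)
x0 = np.array([1.0, 0.0])

N = 4
A = np.zeros((N, N))
A[1, 0] = A[2, 1] = A[3, 2] = 1.0              # follower chain 1 -> 2 -> 3 -> 4
b = np.array([1.0, 0.0, 0.0, 0.0])             # only follower 1 is pinned to the leader
mu1, mu2 = 5.0, 5.0                            # observer gains (assumed)

D_hat = [np.zeros((2, 2)) for _ in range(N)]   # each follower's estimate of D
eta = [np.zeros(2) for _ in range(N)]          # each follower's estimate of the leader state

for _ in range(int(T / dt)):
    x0 = x0 + dt * (D @ x0)                    # leader integrated forward (Euler)
    new_D, new_eta = [], []
    for i in range(N):
        dD = mu1 * (sum(A[i, j] * (D_hat[j] - D_hat[i]) for j in range(N))
                    + b[i] * (D - D_hat[i]))
        de = D_hat[i] @ eta[i] + mu2 * (sum(A[i, j] * (eta[j] - eta[i]) for j in range(N))
                                        + b[i] * (x0 - eta[i]))
        new_D.append(D_hat[i] + dt * dD)
        new_eta.append(eta[i] + dt * de)
    D_hat, eta = new_D, new_eta

print([round(float(np.linalg.norm(eta[i] - x0)), 4) for i in range(N)])  # errors decay toward zero
```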
S24: design of the multi-agent system controller based on reinforcement learning;
consider the following system model:
x_{k+1} = f(x_k) + g(x_k)u_k    (4-1)
wherein,
Figure FDA0003791058280000048
is the state of the system,
Figure FDA0003791058280000049
is the control input of the system; the system model can also be written more concisely as x_{k+1} = F(x_k, u_k);
For each state x_k of the multi-agent system, the following control policy is defined:
u_k = h(x_k)    (4-2)
This mapping is also called a feedback controller; in the field of feedback control, many feedback control strategies have been designed, including optimal solutions of the Riccati equation, adaptive control, H-infinity control and classical frequency-domain control;
in order to obtain the optimal control strategy of the system, the following cost function is designed for the system:
Figure FDA0003791058280000051
where the discount factor satisfies 0 < γ ≤ 1, and u_k = h(x_k) is the control policy being designed;
or given in standard quadratic form:
Figure FDA0003791058280000052
Suppose V* is the minimum cost paid by the system; the optimal cost policy is then as follows:
Figure FDA0003791058280000053
when the optimal control strategy is taken, the optimal control value given by the system is as follows:
Figure FDA0003791058280000054
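For the linear-quadratic special case of (4-3)-(4-6) the Bellman recursion can be carried out in closed form as a discounted Riccati value iteration. The sketch below, with hypothetical A, B, Q, R and γ, computes the matrix P of the optimal value V*(x) = xᵀPx and the gain K of the optimal policy u_k = h(x_k) = −K x_k.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])     # hypothetical linear dynamics (stand-in for f, g)
B = np.array([[0.005], [0.1]])
Q, R, gamma = np.eye(2), np.array([[0.1]]), 0.95

P = np.zeros((2, 2))
for _ in range(500):                       # Bellman recursion on V(x) = x' P x
    K = gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    P = Q + gamma * A.T @ P @ (A - B @ K)

print(K)                                   # optimal feedback gain, u_k = -K x_k
```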
in the original system, the leader of the multi-agent system is considered to have the following model:
x_{k+1} = f(x_k)    (4-7)
Given the communication network graph of the system, the local consistency error of the system is defined as:
Figure FDA0003791058280000055
The consensus information of the multi-agent system is represented by the above local consistency error, i.e., when t → ∞ and e_i → 0, the system reaches consensus;
An additional compensator, independent of the individual subsystems, is designed and defined by the expected input-affine differential equation:
Figure FDA0003791058280000056
Combining this with the corresponding graph theory, the global error form (4-10) is obtained:
e = L'(x − x_0)    (4-10)
wherein,
Figure FDA0003791058280000057
and
Figure FDA0003791058280000058
satisfying b_ii = b_i and b_ij = 0 when i ≠ j;
Differentiating the local error e and combining (2-1) and (4-10), the local consistency error with respect to the graph G is obtained as follows:
Figure FDA0003791058280000061
where f_e(t) = f(x(t)) − f(x(0)),
Figure FDA0003791058280000062
L_i denotes the ith column vector of the Laplacian matrix; combining (4-10) and (4-11), the local consistency error is expressed as:
Figure FDA0003791058280000063
wherein,
Figure FDA0003791058280000064
and satisfies the following conditions:
Figure FDA0003791058280000065
Figure FDA0003791058280000066
Likewise, returning to the originally designed continuous-time system model:
Figure FDA0003791058280000067
Figure FDA0003791058280000068
given a cost function for continuous-time multi-agent system consistency control:
Figure FDA0003791058280000069
The corresponding tracking Bellman equation is then obtained from the affine differential equations defined in (4-9) and (4-13) by using the Leibniz rule:
Figure FDA00037910582800000610
where U(u) is a positive-definite integrand of the control input u:
Figure FDA00037910582800000611
then (4-15) is expressed by the following equation:
Figure FDA00037910582800000612
then, the following Hamiltonian equation is defined:
Figure FDA00037910582800000613
Figure FDA0003791058280000071
Let V* be the optimal control cost of the system; the optimal cost function is then defined as follows:
Figure FDA0003791058280000072
Under the optimal cost V*, according to the Hamiltonian equation in (4-18), the following HJB equation is obtained:
Figure FDA0003791058280000073
when the stability condition is satisfied
Figure FDA0003791058280000074
Then, the following optimal control inputs are obtained:
Figure FDA0003791058280000075
The following policy iteration algorithm is obtained:
Algorithm: solving the HJB equation based on the policy iteration method
Step 211: Policy evaluation: given a control input u_i(x), solve for V_i(x) via the Bellman equation:
Figure FDA0003791058280000076
Step 212: Policy improvement: the control policy is updated by:
Figure FDA0003791058280000077
Step 213: let
Figure FDA0003791058280000078
and return to Step 211 until convergence to the minimum value;
An integral reinforcement learning algorithm is introduced into the policy iteration algorithm; based on the discrete-time system (4-1), for any integration interval T > 0, the value function of the continuous-time system (4-13) satisfies the following form:
Figure FDA0003791058280000079
The solution of the tracking Bellman equation is learned with the integral reinforcement learning algorithm, and the HJB equation is solved by integral reinforcement learning even when the system dynamic model is unknown;
The following policy-iteration-based integral reinforcement learning algorithm is obtained:
Algorithm: offline integral reinforcement learning algorithm for solving the HJB equation based on policy iteration
Step 221: Policy evaluation: given a control input u_i(x), solve for V_i(x) via the Bellman equation:
Figure FDA0003791058280000081
Step 222: Policy improvement: the control policy is updated by:
Figure FDA0003791058280000082
Step 223: let
Figure FDA0003791058280000083
and return to Step 221 until convergence to the minimum value;
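The sketch below illustrates Steps 221-223 on a scalar plant: the policy-evaluation step uses only measured trajectory data over intervals of length T (the drift a is never used by the learner), while the policy-improvement step still uses the input channel b, i.e. g(x), as in standard integral reinforcement learning. The plant parameters, cost weights, interval length and the absence of a discount are all illustrative assumptions.

```python
import numpy as np

a, b = 1.0, 1.0                   # hypothetical plant dx/dt = a x + b u, used only to simulate data
q, r = 1.0, 1.0                   # stage cost q x^2 + r u^2
T, dt = 0.05, 1e-3                # reinforcement interval and integration step

k = 3.0                           # admissible stabilizing initial gain, u = -k x
for _ in range(8):
    x = 1.0                       # restart the trajectory for each evaluation
    Phi, y = [], []
    for _ in range(12):           # collect data intervals under the current policy
        x_start, cost = x, 0.0
        for _ in range(int(T / dt)):
            u = -k * x
            cost += (q * x**2 + r * u**2) * dt
            x += (a * x + b * u) * dt
        Phi.append(x_start**2 - x**2)      # V(x(t)) - V(x(t+T)) with V(x) = p x^2
        y.append(cost)
    p = np.linalg.lstsq(np.array(Phi)[:, None], np.array(y), rcond=None)[0][0]
    k = b * p / r                 # improvement: u = -(1/2) R^{-1} g dV/dx = -(b p / r) x

p_are = (a * r + np.sqrt((a * r) ** 2 + q * r * b**2)) / b**2
print("learned gain:", k, " ARE gain:", b * p_are / r)    # both approach 1 + sqrt(2)
```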
S25: an adaptive distributed observer is designed based on the reinforcement learning algorithm to realize distributed multi-agent consistency control; consider the multi-agent system:
x_i(k+1) = f_i(x(k)) + g_i(x(k))u_i(k)
y_i(k) = c x_i(k)    (5-1)
where x_i, u_i and y_i respectively denote the state, control input and output of the ith agent of the system;
The leader model is considered in the following form:
ν(k+1)=Eν(k)
Figure FDA0003791058280000084
in the leader model referred to herein,
Figure FDA0003791058280000085
is the state of the leader system; when agent i satisfies (v_0, v_i) ∈ E, a communication connection exists between follower i and the leader,
Figure FDA0003791058280000086
represents a known constant matrix; Q satisfies Q(0) = 0, and
Figure FDA0003791058280000087
is an external reference signal;
The observer is designed as:
Figure FDA0003791058280000088
Figure FDA0003791058280000089
where R_i(k) denotes the observed value of agent i relative to the leader at time k and satisfies R_0(k) = ν(k), W_0(k) = W,
Figure FDA0003791058280000091
According to the system description and the derivation of the optimal output problem of linear systems, the cost function of the system is obtained:
Figure FDA0003791058280000092
Figure FDA0003791058280000093
Figure FDA0003791058280000094
where i = 1, 2, …, N and γ_i is the discount factor,
Figure FDA0003791058280000095
c = [1, 0, 0, …, 0]; the optimal feedback input of each follower is obtained by solving equation (5-4):
Figure FDA0003791058280000096
To solve for the optimal feedback input, (5-4) is written as a quadratic function and expressed as the value function of the system:
Figure FDA0003791058280000097
The following Bellman equation is obtained:
Figure FDA0003791058280000098
According to the above Bellman equation, the HJB equation for nonlinear optimal feedback is defined as:
Figure FDA0003791058280000099
when the stability condition is satisfied
Figure FDA00037910582800000910
Then, the following optimal control inputs are obtained:
Figure FDA00037910582800000911
wherein,
Figure FDA00037910582800000912
The HJB equation is solved by adopting IRL-based policy iteration;
The following policy-iteration-based online IRL multi-agent optimal feedback control algorithm is obtained:
Algorithm: online IRL algorithm for solving the HJB equation based on policy iteration
Step 231: Initialization: select a control input
Figure FDA0003791058280000101
and repeat the following steps until the system converges;
Figure FDA0003791058280000102
Step 232: Policy improvement: the control policy is updated by:
Figure FDA0003791058280000103
Step 233: let u_i(k) = u_{i+1}(k) and return to Step 231 until V_i(k) converges to the minimum value;
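As a simple data-driven analogue of Steps 231-233, the sketch below runs least-squares policy iteration on a quadratic Q-function for a scalar tracking-error model e(k+1) = a·e(k) + b·u(k). The learner never uses a or b, only transition data collected under the current policy plus exploration noise, and the learned gain is compared with the discounted Riccati solution. The scalar model, cost weights and discount factor are illustrative assumptions, not the patent's nonlinear multi-agent implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.0, 0.5                   # hypothetical error dynamics, used only to generate data
q, r, gamma = 1.0, 0.1, 0.9       # stage cost q e^2 + r u^2 and discount factor

def phi(e, u):                    # features of Q(e, u) = h11 e^2 + 2 h12 e u + h22 u^2
    return np.array([e**2, 2.0 * e * u, u**2])

K = 0.0                           # initial admissible policy u = -K e
for _ in range(12):
    X, y = [], []
    for _ in range(20):           # several short rollouts with exploration noise
        e = 2.0 * rng.standard_normal()
        for _ in range(40):
            u = -K * e + 0.1 * rng.standard_normal()
            cost = q * e**2 + r * u**2
            e_next = a * e + b * u
            X.append(phi(e, u) - gamma * phi(e_next, -K * e_next))   # Bellman-residual features
            y.append(cost)
            e = e_next
    _, h12, h22 = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
    K = h12 / h22                 # greedy policy improvement from the learned Q-function

p = 0.0                           # model-based check: discounted Riccati recursion
for _ in range(2000):
    p = q + gamma * a * p * a - (gamma * a * p * b) ** 2 / (r + gamma * b * p * b)
print("learned K:", K, " Riccati K*:", gamma * a * b * p / (r + gamma * b * p * b))
```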
On the basis of the system models (5-1) and (5-2), consider the following first-order multi-agent system:
Figure FDA0003791058280000104
wherein,
Figure FDA0003791058280000105
respectively denoting the state and control input of the ith agent at time k; τ_ij ≥ 0 denotes the communication time lag of data from agent j to agent i, and τ_i ≥ 0 denotes the input time lag of agent i; consider a first-order discrete multi-agent system comprising n agents whose network topology is a static directed weighted graph containing a globally reachable node; if the following condition is satisfied
Figure FDA0003791058280000106
then, provided max{d_i(2τ_i + 1)} < 1, the system can achieve asymptotic consensus, where
Figure FDA0003791058280000107
Assume that the multi-agent system contains 5 nodes, with the corresponding adjacency matrix as follows:
Figure FDA0003791058280000108
according to this setting, the input time lag of the agents should satisfy
Figure FDA0003791058280000109
Suppose the communication time lags are τ_13 = 1 s, τ_21 = 0.75 s, τ_32 = 1.8 s, τ_42 = 2 s, τ_51; when the input time lag τ is 0.8 s and the initial state of the agents is randomly generated as x(0) = (2.5, 3, 2, 3.5, 5), the agents finally reach consensus asymptotically; the input time lag is then changed to 3 s, and the system still achieves consistency.
CN202110184288.2A 2021-02-08 2021-02-08 Model unknown multi-agent consistency control method based on reinforcement learning Expired - Fee Related CN112947084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110184288.2A CN112947084B (en) 2021-02-08 2021-02-08 Model unknown multi-agent consistency control method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112947084A CN112947084A (en) 2021-06-11
CN112947084B true CN112947084B (en) 2022-09-23

Family

ID=76245480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110184288.2A Expired - Fee Related CN112947084B (en) 2021-02-08 2021-02-08 Model unknown multi-agent consistency control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112947084B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011144382A1 (en) * 2010-05-17 2011-11-24 Technische Universität München Hybrid oltp and olap high performance database system
CN105847438A (en) * 2016-05-26 2016-08-10 重庆大学 Event trigger based multi-agent consistency control method
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107179777A (en) * 2017-06-03 2017-09-19 复旦大学 Multiple agent cluster Synergistic method and multiple no-manned plane cluster cooperative system
CN107589672A (en) * 2017-09-27 2018-01-16 三峡大学 The intelligent power generation control method of isolated island intelligent power distribution virtual wolf pack control strategy off the net
CN110018687A (en) * 2019-04-09 2019-07-16 大连海事大学 Unmanned water surface ship optimal track following control method based on intensified learning method
CN110083063A (en) * 2019-04-29 2019-08-02 辽宁石油化工大学 A kind of multiple body optimal control methods based on non-strategy Q study
CN110308659A (en) * 2019-08-05 2019-10-08 沈阳航空航天大学 Uncertain multi-agent system mixing with time delay and switching topology triggers consistent control method
CN110782011A (en) * 2019-10-21 2020-02-11 辽宁石油化工大学 Networked multi-agent system distributed optimization control method based on reinforcement learning
CN110780668A (en) * 2019-04-09 2020-02-11 北京航空航天大学 Distributed formation surround tracking control method and system for multiple unmanned boats
CN111531538A (en) * 2020-05-08 2020-08-14 哈尔滨工业大学 Consistency control method and device for multi-mechanical arm system under switching topology
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning
CN111722531A (en) * 2020-05-12 2020-09-29 天津大学 Online model-free optimal control method for switching linear system
CN111948937A (en) * 2020-07-20 2020-11-17 电子科技大学 Multi-gradient recursive reinforcement learning fuzzy control method and system of multi-agent system
CN112052585A (en) * 2020-09-02 2020-12-08 南京邮电大学 Design method of distributed dimensionality reduction observer of linear time invariant system
CN112180730A (en) * 2020-10-10 2021-01-05 中国科学技术大学 Hierarchical optimal consistency control method and device for multi-agent system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9134707B2 (en) * 2012-03-30 2015-09-15 Board Of Regents, The University Of Texas System Optimal online adaptive controller
CN107728471A (en) * 2017-09-01 2018-02-23 南京理工大学 For a kind of packet uniformity control method for mixing heterogeneous multi-agent system
US10734811B2 (en) * 2017-11-27 2020-08-04 Ihi Inc. System and method for optimal control of energy storage system
CN109946975B (en) * 2019-04-12 2020-04-24 北京理工大学 Reinforced learning optimal tracking control method of unknown servo system
US11674384B2 (en) * 2019-05-20 2023-06-13 Schlumberger Technology Corporation Controller optimization via reinforcement learning on asset avatar
CN111880567B (en) * 2020-07-31 2022-09-16 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Data-Driven Reinforcement Learning Design for Multi-agent Systems with Unknown Disturbances; Xiangnan Zhong et al.; 2018 International Joint Conference on Neural Networks; 2018-10-15; full text *
Optimal tracking agent: a new framework of reinforcement learning for multiagent systems; Cao, WH et al.; Concurrency and Computation: Practice & Experience; 2013-09-25; vol. 25, no. 14; full text *
Reinforcement Learning Control for Consensus of the Leader-Follower Multi-Agent Systems; Chiang, ML et al.; IEEE 7th Data Driven Control and Learning Systems Conference; 2018-12-31; full text *
Resilient adaptive optimal control of distributed multi-agent systems using reinforcement learning; Moghadam, R et al.; IET Control Theory and Applications; 2018-11-06; vol. 12, no. 16; full text *
Decentralized reinforcement learning optimal control of reconfigurable modular robots under dynamic constraints; Dong Bo et al.; Journal of Jilin University; 2014-12-31; vol. 44, no. 5; full text *
A survey of data-driven optimal consensus of multi-agent systems based on reinforcement learning; Li Jinna et al.; Chinese Journal of Intelligent Science and Technology; 2020-12-31; vol. 2, no. 4; full text *
Observer-based consensus control and fault detection of multi-agent systems; Chen Gang et al.; Control Theory & Applications; 2014-05-31; vol. 31, no. 5; full text *
Model-free adaptive dynamic programming and its application in cooperative control of multi-agent systems; Yang Yongliang; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2018-03-15; no. 03; full text *

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220923