CN112947084B - Model unknown multi-agent consistency control method based on reinforcement learning - Google Patents
- Publication number
- CN112947084B CN112947084B CN202110184288.2A CN202110184288A CN112947084B CN 112947084 B CN112947084 B CN 112947084B CN 202110184288 A CN202110184288 A CN 202110184288A CN 112947084 B CN112947084 B CN 112947084B
- Authority
- CN
- China
- Prior art keywords
- agent
- control
- following
- equation
- optimal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Abstract
The invention relates to a reinforcement-learning-based consistency control method for multi-agent systems with unknown models, and belongs to the field of intelligent control. The scheme adopted in the invention for designing the adaptive distributed observer consists of three steps. First, an adaptive distributed observer is designed to estimate the leader's system matrix and state. Second, a method for computing the solution of the observer equation online is provided. Third, to handle the few remaining extreme cases, the distributed consistency output regulation problem is solved by combining adaptive state feedback with adaptive measurement output feedback control when no follower knows the leader's system matrix. Based on the estimated state, the controller is designed by a reinforcement-learning-based method and the optimal solution is obtained by an iterative method, thereby realizing optimal control of the multi-agent system.
Description
Technical Field
The invention belongs to the field of intelligent control, and relates to a reinforcement-learning-based consistency control method for multi-agent systems whose models are unknown.
Background
Research on the consistency control problem of multi-agent systems dates back to the 1980s, and research on multi-agent technology began with work on mobile robots. The field of multi-agent consistency control has developed rapidly over the last fifteen years, and applications have extended from military operations to mobile sensor networks, commercial highways, air transportation, and emergency and disaster relief. Under control-quality constraints, however, the distributed optimal consistency problem remains a major challenge in the control field. Distributed consistency of a multi-agent system requires not only that the behavior of all agents agree but also that the performance index of the whole system be optimized; more rigorously, distributed consistency control seeks agreement at as low a cost as possible. Leading researchers in multi-agent control have proposed various approaches to the consistency control problem, such as linear quadratic regulation techniques, adaptive learning methods, model predictive control techniques, and fuzzy adaptive dynamic programming.
In recent decades, reinforcement learning (RL) has attracted much attention and shows broad application prospects because control protocols can be designed without knowledge or identification of the system dynamics, i.e., without a model. Reinforcement learning is inspired by biological systems: it finds the optimal control strategy by optimizing cumulative rewards, interacting with an unknown environment to learn the policy that maximizes long-term performance. RL algorithms rest on the idea that a successful control action should be remembered and, through a reinforcement signal, made more likely to be used again. Since the beginning of reinforcement learning research, the RL method has received much attention in the field of intelligent-agent research. Mainstream work on reinforcement learning is usually realized with an actor-critic structure: the critic evaluates the performance of the current policy from measured data, and the actor finds an improved policy using the critic's evaluation. Compared with classical dynamic programming, the reinforcement learning method provides a feasible way to avoid the curse of dimensionality. Compared with traditional adaptive controllers, a reinforcement-learning method needs to consider only the dynamics of the tracking error, can minimize the transient response that introduces errors into the system, and at the same time guarantees the stability of the whole system. The main advantage of RL algorithms in solving optimal control problems is that sufficient data can be obtained from the system without solving the system dynamics, after which the solution is improved iteratively by alternating policy evaluation and policy improvement steps, based on the policy iteration technique.
In research on multi-agent consistency control, the system is partially unknown; by constructing a communication network between the leader and the followers and a communication network among the followers, the followers can obtain the leader's state and the whole system achieves behavioral consistency. In most cases, the state of the system cannot be measured directly by sensors, but the inputs and outputs of the system can be measured by various methods. A popular approach is to estimate the system state by constructing a full-dimensional observer. Consider, for example, a simple linear system dx/dt = Ax + Bu, y = Cx.
A simulated linear system with the same structure is constructed at the same time: dω/dt = Aω + Bu, γ = Cω,
where ω and γ are the state and output of the simulated system and also serve as estimates for the original system. Define the estimation error between the simulated and original systems as e = ω − x. To make the state estimation error e tend to 0, it is converted into the measurable output estimation error γ − y: by the general principle of feedback control, it suffices to feed the output estimation error γ − y back to the state of the simulated system and design the controller so that the output estimation error approaches 0, at which point the state estimation error also approaches 0. Introducing the observer output-feedback matrix H gives the full-dimensional state observer: dω/dt = Aω + Bu + H(y − γ).
the output equation of the original system and the output equation of the full-dimensional state observer are brought into the state equation of the full-dimensional state observer to obtain:
where A − HC is the system matrix of the full-dimensional observer. The key problem in designing a full-dimensional state observer is to ensure that the state estimation error approaches 0 under any initial condition, i.e., e(t) → 0 as t → ∞.
Subtracting the original system from the full-dimensional observer system gives the error dynamics de/dt = (A − HC)e. Solving this equation yields e(t) = exp[(A − HC)(t − t0)]·e(t0), i.e., e(t) = exp[(A − HC)(t − t0)]·(ω(t0) − x(t0)). From this solution: if x(t0) = ω(t0), then x(t) = ω(t) holds for all t; if x(t0) ≠ ω(t0), the feedback matrix H only needs to be adjusted so that the eigenvalues of the matrix A − HC all have negative real parts, and then e(t) → 0.
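The observer argument above can be checked numerically. The following sketch (the matrices A, B, C, H and the step size are illustrative assumptions, not taken from the patent) simulates a two-state plant and its full-dimensional observer and confirms that the state estimation error decays once A − HC is Hurwitz:

```python
import numpy as np

# Assumed 2-state plant  dx/dt = A x + B u,  y = C x  (Euler-discretized).
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
# H is chosen so that A - H C is Hurwitz: eig(A - H C) = -4 +/- i*sqrt(7).
H = np.array([[5.0], [6.0]])

dt, steps = 0.001, 20000
x = np.array([[1.0], [-1.0]])   # true state
w = np.zeros((2, 1))            # observer state omega, wrong initial guess
u = np.zeros((1, 1))            # zero input for simplicity

for _ in range(steps):
    y, gamma = C @ x, C @ w     # plant output and observer output
    x = x + dt * (A @ x + B @ u)
    # Observer: d(omega)/dt = A omega + B u + H (y - gamma)
    w = w + dt * (A @ w + B @ u + H @ (y - gamma))

err = float(np.linalg.norm(x - w))   # state estimation error ||x - omega||
```

Since the error obeys de/dt = (A − HC)e regardless of the input, the mismatch between the wrong initial guess and the true state is driven to zero.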
In a multi-agent system, the state values are difficult to observe accurately in real time. In this design, a corresponding adaptive distributed observer is designed for each follower; then, under the guidance of output regulation theory, the cooperative adaptive optimal output regulation problem is decomposed into a feedforward control design problem, which solves the nonlinear regulator equations, and an adaptive optimal feedback control problem. Finally, a cost function of the multi-agent system is established on the basis of reinforcement learning, the Hamilton-Jacobi-Bellman (HJB) equation of this cost function is formulated and solved by a method based on synchronous reinforcement learning, and the optimal solution is finally obtained by an iterative method, realizing optimal control of the multi-agent system.
Disclosure of Invention
In view of the above, the present invention provides a reinforcement-learning-based consistency control method for multi-agent systems with unknown models. By establishing a state-based input structure for the multi-agent system and designing a corresponding reinforcement-learning algorithm to solve the HJB equation, the method solves the optimal feedback control problem and the optimal controller design problem for multi-agent systems in which the leader model is unknown and the follower states are not measurable.
In order to achieve the purpose, the invention provides the following technical scheme:
a model unknown multi-agent consistency control method based on reinforcement learning comprises the following steps:
s1: performing single agent optimal output control based on reinforcement learning;
s2: multi-agent consistency control based on reinforcement learning.
Optionally, the S1 specifically includes:
When the optimal controller of a single agent is designed, an off-policy reinforcement learning algorithm is adopted to learn the solution of the tracking HJB equation online. Consider the following system model: dx/dt = f(x) + g(x)u + l(x)d (1-1),
where x and u are the state and control input of the system, respectively, and d is an external disturbance; it is assumed that f(x), g(x), l(x) are Lipschitz functions and f(0) = 0, so that the system is robustly stable;
assume that p (t) is a condition of consistency that needs to be achieved and satisfies the following form:
with h(0) = 0; the tracking error of the whole system is defined as e(t) = x(t) − p(t) (1-3);
combining (1-1), (1-2) and (1-3) yields:
the following virtual performance outputs are defined to meet the requirements:
defining a performance function for the system:
Suppose that the cost is minimized under the optimal control u*; then
the following Bellman equation is given:
wherein,
is the augmented system constructed for the original system;
according to the optimization conditionsAndobtaining an optimal control input and an optimal interference input:
where V* is the optimal value function defined in (1-7);
and (4) obtaining the following HJB equation of track tracking according to the optimal input conditions obtained in the step (1-10):
thus, the following single agent offline RL algorithm is obtained:
solving HJB equation based on RL algorithm
Step 11: initialization: give an admissible stabilizing control strategy u 0 ;
step 12: policy evaluation: for a control input u i and a disturbance input d i , solve the following Bellman equation:
step 13: update the disturbance input d i of the system:
step 14: update the control input u i of the system:
step 15: return to step 12 and repeat until convergence.
Optionally, the S2 specifically includes:
S21: graph-theoretic preliminaries:
let G ═ (V, E, a) be a weighted graph, which describes the information channels between N agents; v is follower node { V 1 ,v 2 ,…v N A non-empty finite set of };is an edge set;is a weighted adjacency matrix, and when (v) i ,v j ) E is epsilon, a ij Is greater than 0; if it isa ij 0 and for all i 1,2, … N, a ij 0; definition of N i ={v j ∈V:(v i ,v j ) E represents follower v i Neighbor follower set of (i.e. N) i All followers in (1) directly send information to the followers v i Definition ofMatrix D ═ diag (D) 1 ,d 2 ,...,d N ) Is an in-degree matrix, wherein1,2, N; laplacian matrix L ═ D-a ═ L for directed graph G ═ (V, E, a) ij ]Wherein l is ij =-a ij ,The sum of each row of the Laplace matrix L is zero, i.e., 1 N A right eigenvector of the laplacian matrix L is set, and its corresponding eigenvalue is zero; for a spanning tree, if there is only one node v i Indicating that there is a directed path from one node to any other node in the graph; there is a directed path from each node to any other node; for graphs with spanning trees, strong connectivity is a sufficiently unnecessary condition;
s22: problem description:
Consider a multi-agent system consisting of one leader and N followers with a directed communication graph G; the dynamic model of the ith follower is:
where x i and u i are the state and input of the ith follower, respectively, and f i and g i are the internal dynamics function and the input matrix function of the ith follower; f i (x i ) and g i (x i ) are assumed unknown, with f i (0) = 0, and system (2-1) is robustly stable;
the leader's kinetic model was:
where x 0 is the state of the leader, f(x 0 ) is unknown, differentiable and bounded with ||f(x 0 )|| ≤ ρ 0 , and D is a constant matrix;
according to the network topology between each follower and its neighboring agents, the local neighborhood consistency error of the system is described as: e i = Σ j∈N i a ij (x i − x j ) + b i (x i − x 0 ) (2-3),
where b i ≥ 0, and b i > 0 if and only if the ith agent communicates with the leader; the consistency information of the multi-agent system is characterized by the local neighborhood consistency error e i : if e i → 0 as t → ∞, the multi-agent system reaches consistency;
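The local neighborhood consistency error defined above can be computed directly. A minimal sketch, with an assumed 3-follower chain, pinning gains b i and states:

```python
import numpy as np

# Local neighborhood consistency error for follower i:
#   e_i = sum_j a_ij (x_i - x_j) + b_i (x_i - x_0).
# The 3-follower chain, pinning gains b and states below are assumed values.
Adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
b = np.array([1.0, 0.0, 0.0])   # only follower 1 communicates with the leader
x = np.array([2.0, 3.0, 5.0])   # follower states
x0 = 1.0                        # leader state

e = np.array([Adj[i] @ (x[i] - x) + b[i] * (x[i] - x0) for i in range(3)])
# e_i -> 0 for all i as t -> infinity means the system reaches consistency.
```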
s23: adaptive distributed observer
By designing an adaptive distributed observer for each follower, the problem of a follower estimating the leader's state in real time when that state is unknown is solved, and the relation of the follower's state to the leader is converted into the relation of the adaptive distributed observer's state to the leader;
wherein, the self-adaptive distributed observer is as follows:
where x 0 = x, D 0 = D, and μ > 0; under the error description of the system, the estimates satisfy the stated convergence conditions for i = 1, 2, ..., N; the adaptive distributed observer contains a mechanism for estimating the matrix D, which only the neighbors of the leader know;
using estimated values S of S i From the solution of the adaptive calculation equation, the following observer form is obtained:
s24: designing a multi-agent system controller based on reinforcement learning;
consider the following system model:
x k+1 = f(x k ) + g(x k )u k (4-1)
where x k is the state of the system and u k is the control input; the system model can also be written more concisely as x k+1 = F(x k , u k );
for each state x of a multi-agent system k The following control strategies are defined:
u k = h(x k ) (4-2)
this mapping is also called a feedback controller; in the field of feedback control, feedback control strategies are designed in many ways, including the optimal solution of the Riccati equation, adaptive control, H∞ control, and classical frequency-domain control;
to obtain the optimal control strategy of the system, the following cost function is designed for the system:
where the discount factor satisfies 0 < γ ≤ 1, and u k = h(x k ) is the control strategy being designed;
or given in standard quadratic form:
suppose the system attains the minimum cost V*; the optimal cost function is:
when the optimal control strategy is taken, the optimal control value given by the system is as follows:
in the original system, the leader of the multi-agent system is considered to have the following model:
x k+1 = f(x k ) (4-7)
given the communication network graph of the system, the local neighborhood consistency error of the system is defined as:
the consensus information of the multi-agent system is represented by the above local neighborhood consistency error, i.e., if e i → 0 as t → ∞, the system tends to consistency;
an additional compensator, independent of the individual subsystems, is designed and defined by the desired input-affine differential equation:
combining the relevant graph theory gives the global error form (4-10):
e=L'(x-x 0 ) (4-10)
Differentiating the local error e and combining (2-1) and (4-10), the local neighborhood consistency error relative to the graph G is obtained as:
where f e (t) = f(x(t)) − f(x(0)), and L i denotes the ith column vector of the Laplacian matrix; combining (4-10) and (4-11), the local neighborhood consistency error is expressed as:
likewise, returning to the continuous-time system model designed at the beginning:
given a cost function for continuous-time multi-agent system consistency control:
the tracking Bellman equation is then obtained from the affine differential equations defined by (4-9) and (4-13) using the Leibniz rule:
where U(u) is a positive definite integrand of the control input u:
then (4-15) is expressed by the following equation:
then, the following Hamiltonian equation is defined:
let V* be the optimal control cost of the system; the optimal cost function is defined as:
under the optimal cost V*, according to the Hamiltonian equation in (4-18), the following HJB equation is obtained:
the following strategy iteration algorithm is obtained:
the algorithm is as follows: HJB equation solving method based on strategy iteration method
Step 211: policy evaluation: given the control input u i (x), solve for V i (x) from the following Bellman equation:
Step 212: policy improvement: the control strategy is updated by:
an integral reinforcement learning (IRL) algorithm is introduced into the policy iteration algorithm; for the discrete-time system (4-1) and any integration interval T > 0, the value function of the continuous system (4-13) satisfies the following form:
the solution of the tracking Bellman equation is obtained with the integral reinforcement learning algorithm, so the HJB equation can be solved by integral reinforcement learning even when the system dynamic model is unknown;
obtaining the following integral reinforcement learning algorithm based on strategy iteration:
the algorithm is as follows: strategy iteration-based HJB equation solving method by offline integral reinforcement learning algorithm
Step 221: policy evaluation: given the control input u i (x), solve for V i (x) from the following Bellman equation:
Step 222: strategy improvement: the control strategy is updated by:
S25: design of the adaptive distributed observer based on the reinforcement learning algorithm to realize distributed consistency control of the multi-agent system
Multi-agent system:
x i (k+1) = f i (x(k)) + g i (x(k))u(k)
y i (k) = c x i (k) (5-1)
where x i , u i , y i denote the state, control input and output of the ith agent of the system, respectively;
the leader model takes the following form into account:
ν(k+1)=Eν(k)
in the leader model referred to herein,is the state of the leader system when agent i satisfies (v) 0 ,ν i ) E epsilon, when a communication connection exists between the follower i and the leader,representing a known constant matrix; q satisfies the condition that Q (0) is 0,is an external reference signal;
the observer is arranged:
where R i (k) denotes the observed value of agent i relative to the leader at time k, and satisfies R 0 (k) = ν(k), W 0 (k) = W;
According to the system description, the cost function of the system is obtained from the derivation of the optimal output problem for linear systems:
where i = 1, 2, ..., N, γ i is the discount factor, and c = [1, 0, 0, ..., 0]; the optimal feedback input of each follower is obtained by solving equation (5-4); when solving for the optimal feedback input, (5-4) is written in quadratic-function form and expressed as the value function of the system:
the following bellman equation is obtained:
according to the above bellman equation, the HJB equation in nonlinear optimal feedback is defined as:
solving the HJB equation by adopting strategy iteration of IRL;
obtaining the following strategy iteration-based online IRL multi-agent optimal feedback control algorithm:
the algorithm is as follows: online IRL algorithm solution HJB equation based on strategy iteration
Step 231: initialization: select a control input and repeat the following steps until the system converges;
step 232: strategy improvement: the control strategy is updated by:
step 233: let u i (k) = u i+1 (k) and return to step 232 until V i (k) converges to its minimum value;
on the basis of the system (5-1) and (5-2) models, consider the following first-order multi-agent system:
where x i (k) and u i (k) denote the state and control input of the ith agent at time k, respectively; τ ij ≥ 0 denotes the communication time lag for data sent from agent j to agent i, and τ i ≥ 0 is the input time lag of agent i. Consider a first-order discrete multi-agent system comprising n agents whose network topology is a static directed weighted graph containing a globally reachable node; if the condition max{d i (2τ i + 1)} < 1 is satisfied, the system reaches consistency.
Assume that the multi-agent system contains 5 nodes, with the corresponding adjacency matrix as follows:
Suppose τ 13 = 1 s, τ 21 = 0.75 s, τ 32 = 1.8 s, τ 42 = 2 s, τ 51 = 0.8 s; with the input time lag τ = 0.8 s and the agents' initial states randomly generated as x(0) = (2.5, 3, 2, 3.5, 5), the agents finally reach consistency asymptotically; when the input time lag is changed to 3 s, the system still achieves consistency.
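The delayed first-order consensus behavior described above can be reproduced qualitatively. The sketch below uses an assumed directed-cycle topology, step gain and integer-step lags (illustrative values, not the patent's adjacency matrix or exact lags) and shows the states contracting to a common value despite the delays:

```python
import numpy as np

# First-order discrete consensus with time lags (illustrative sketch):
#   x_i(k+1) = (1 - eps) * x_i(k) + eps * x_j(k - tau_i),
# where j is agent i's in-neighbor on an assumed directed cycle, so the graph
# contains a globally reachable node. eps and the lags tau_i are assumed.
n, eps, steps = 5, 0.3, 5000
tau = [1, 1, 2, 2, 3]                          # per-agent lags in steps
x0 = np.array([2.5, 3.0, 2.0, 3.5, 5.0])       # initial states from the example
hist = [x0.copy()]                             # state history for delayed look-ups

for _ in range(steps):
    x = hist[-1]
    nxt = x.copy()
    for i in range(n):
        j = (i - 1) % n                        # in-neighbor on the cycle
        past = hist[max(0, len(hist) - 1 - tau[i])]
        nxt[i] = (1 - eps) * x[i] + eps * past[j]
    hist.append(nxt)

spread = float(hist[-1].max() - hist[-1].min())  # -> 0: consistency reached
```

Because every update is a convex combination of current and past states, the states stay within the initial range while the spread contracts geometrically to zero.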
The invention has the beneficial effects that:
1. the design combines relevant dynamic programming techniques and provides a policy-iteration-based reinforcement learning method for a single agent;
2. the invention designs an adaptive distributed observer that estimates the unmeasured states of the multi-agent system from the output information of the observer system, remedying the drawback that a traditional full-dimensional observer cannot exchange information in real time;
3. the invention provides a method for computing the solution of the observer equation online, avoiding the multi-order differentiation of the observer equation and the multi-step construction of Lyapunov functions required by traditional methods;
4. the invention provides a method that, with the help of output regulation techniques, approximates the constructed value function by an integral-reinforcement-learning-based policy iteration algorithm to solve the HJB equation, overcoming the difficulty traditional methods have in obtaining the optimal solution of the HJB equation;
5. the invention designs the controller by a reinforcement-learning-based method and obtains the optimal solution by an iterative method; integrating existing reinforcement-learning knowledge, it constructs a method that solves the distributed consistency output regulation problem by adaptive state feedback and adaptive measurement output feedback control when no follower knows the state of the leader system, realizing distributed consistency control of the multi-agent system.
6. The invention provides a reinforcement-learning-based consistency control method under different time lags: the constrained-input problem is considered, the required performance function is chosen as a non-quadratic penalty function, and the communication time lags between the followers and the leader as well as the input time lags among the followers are then taken into account.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of an intelligent system node connection topology;
fig. 2 is a simulation result diagram.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are intended only to illustrate the invention and not to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and they do not represent the size of the actual product; it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
To address the distributed consistency control problem of multi-agent systems, the invention solves the consistency output regulation problem of a linear multi-agent system by designing an adaptive distributed observer for each follower, initially under the premise that each follower knows the system matrix S of the leader system. The observer design scheme adopted in the invention consists of three steps. First, an adaptive distributed observer is designed to estimate the system matrix and the state of the leader system. Second, after the adaptive distributed observer is designed, a method for computing the solution of the observer equation online is provided. Third, to cover the few remaining extreme cases in which a follower does not know the matrix of the leader system, the distributed consistency output regulation problem is solved by integrating adaptive state feedback with adaptive measurement output feedback control.
After the adaptive distributed observer is constructed, the unmeasured states of the multi-agent system are estimated from the output information of the observer system. Based on the estimated states, an optimal controller is then designed by a reinforcement learning method; to handle the constrained-input problem, the required performance function can be chosen as a non-quadratic penalty function. The communication time lag between each follower and the leader, and the input time lag between followers, are then considered, and consistency under different time lags is verified. The invention designs a reinforcement learning controller based on the ADP method: the cost function of the reinforcement-learning-based multi-agent system is established, the HJB (Hamilton-Jacobi-Bellman) equation of that cost function is formed, the HJB equation is solved by a synchronous reinforcement learning method, and the optimal solution is finally obtained by an iterative method to realize optimal control of the multi-agent system.
In order to achieve the purpose, the technical scheme of the invention is as follows:
the multi-agent distributed consistency control method uses an adaptive distributed observer to estimate the state of the system, designs the controller from the estimated state by a reinforcement-learning-based method, and obtains the optimal solution through an iterative method to realize optimal control of the multi-agent system.
First part: optimal output control of a single agent based on reinforcement learning
When designing the optimal controller of a single agent, the solution of the tracking HJB equation only needs to be learned online with an off-policy reinforcement learning algorithm, without any knowledge of the system dynamics.
Consider the following system model:
ẋ = f(x) + g(x)u + l(x)d (1-1)
where x and u are the state and control input of the system, respectively, and d is an external disturbance. Assume f(x), g(x), and l(x) are Lipschitz functions and f(0) = 0, so that the system is robustly stable.
Let p(t) be the reference trajectory defining the consistency behavior to be achieved, satisfying
ṗ = h(p) (1-2)
with h(0) = 0. The tracking error of the whole system is defined as
e_d = x - p (1-3)
Combining (1-1), (1-2), and (1-3) yields:
the following virtual performance outputs are defined to meet the requirements:
the following performance function (cost function) is defined for the system:
Suppose the cost is minimized at the optimal control u*; then the following Bellman equation holds:
wherein,
is the augmented system constructed for the design.
From the optimality conditions ∂H/∂u = 0 and ∂H/∂d = 0, the optimal control input and the optimal disturbance input can be obtained:
where V* is the optimal value function defined in (1-7).
From the optimal inputs obtained in (1-10), the following trajectory-tracking HJB (Hamilton-Jacobi-Bellman) equation can be obtained:
thus, the following single agent offline RL algorithm is obtained:
second part reinforcement learning based multi-agent consistency control
Firstly, graph theory:
let G ═ V, E, a be a weighted graph, which describes the information channels between N agents. V is follower node { V 1 ,v 2 ,…v N A non-empty finite set of };is an edge set;is a weighted adjacency matrix, and when (v) i ,v j ) E is E, a ij Is greater than 0; if it isa ij 0 and for all i 1,2, … N, a ij 0. Definition of N i ={v j ∈V:(v i ,v j ) E represents follower v i Neighbor follower set of (i.e. N) i All followers in (1) can directly send information to the follower v i In the definition matrix D ═ diag (D) 1 ,d 2 ,...,d N ) Is an in-degree matrix, wherein1, 2. Then, the laplacian matrix L ═ D-a ═ L of the directed graph G ═ (V, E, a) ij ]Wherein l is ij =-a ij ,It is thus easy to know that the sum of each row of the laplace matrix L is zero, i.e. 1 N Is a right eigenvector of the laplacian matrix L, whose corresponding eigenvalue is zero. In the invention, only a simple graph is considered, only one spanning tree is provided, if only one node v is provided i Indicating that there is a directed path from one node to any other node in the graph. Although the directed graph is strongly connected, there is a directed path from each node to any other node. For graphs with spanning trees, strong connectivity is a sufficiently unnecessary condition.
Secondly, problem description:
Consider a multi-agent system consisting of one leader and N followers with a directed communication graph G. The dynamic model of the ith follower is:
where x_i and u_i are the state and input of the ith follower, respectively, and f_i(x_i) and g_i(x_i) are the internal dynamics function and the input matrix function of the ith follower, respectively. It is assumed that f_i(x_i) and g_i(x_i) are unknown, f_i(0) = 0, and system (2-1) is robustly stable.
The leader's dynamic model is:
where x_0 is the state of the leader, f(x_0) is unknown, and D is a constant matrix; f(x_0) is assumed to be differentiable and bounded, ||f(x_0)|| ≤ ρ_0.
According to the network topology between each follower and its neighboring agents, the local neighborhood consistency error of the system can be described as
e_i = Σ_{j∈N_i} a_ij (x_j - x_i) + b_i (x_0 - x_i)
where b_i ≥ 0, and b_i > 0 if and only if there is communication between the ith agent and the leader. As the above equation shows, the consistency information of the multi-agent system can be represented by the local neighborhood consistency error e_i: when e_i → 0 as t → ∞, the multi-agent system reaches consistency.
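A minimal numerical sketch of this error, for scalar agent states and a hypothetical three-follower chain in which only the first follower hears the leader, may help fix the definition (the function name and data are illustrative, not from the patent):

```python
import numpy as np

def local_consensus_error(A, b, x, x0):
    """Local neighborhood consistency error of the text,
    e_i = sum_j a_ij*(x_j - x_i) + b_i*(x0 - x_i),
    for scalar agent states."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(A[i], x - x[i]) + b[i] * (x0 - x[i])
                     for i in range(len(x))])

A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
b = np.array([1.0, 0.0, 0.0])        # only follower 1 hears the leader
print(local_consensus_error(A, b, [2.0, 2.0, 2.0], 2.0))  # -> [0. 0. 0.]
```

At exact consensus (every x_i equal to the leader state x0) the error vector vanishes, which is the e_i → 0 condition of the text.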
Thirdly, adaptive distributed observer
In the invention, an adaptive distributed observer is designed for each follower, so that in a multi-agent system each follower can estimate the leader's state in real time even when that state is unknown; in the design, the state of a follower relative to the leader is converted into the state of its adaptive distributed observer relative to the leader.
The involved adaptive distributed observer is as follows:
where x_0 = x, D_0 = D, and μ > 0. Under the error description of the system, the observer estimation errors converge for i = 1, 2, ..., N. This observer contains a mechanism for estimating the matrix D, hence the name adaptive distributed observer; it requires the leader's information only at the leader's neighboring agents, and is therefore more stable and accurate than a generic distributed observer.
Since the control law must use the designed observer to provide a suitable feedforward control to achieve the control objective, and the solution of the corresponding equation depends on the leader's system matrix S, which followers outside the leader's child nodes do not know, it is proposed to compute S_i (an estimate of S) adaptively from the solution of the calculation equation, giving the following observer form:
fourth, multi-agent system controller design based on reinforcement learning
Before studying the consistency control of multi-agent systems, a specific dynamic model must be discussed. Research on multi-agent systems is mostly based on adaptive dynamic programming (ADP), and most ADP research addresses systems operating in discrete time (DT). Therefore a nonlinear discrete-time system is considered first and some methods for optimal control of discrete-time systems are summarized; an online reinforcement learning scheme is then designed for the discrete-time system by combining a reinforcement learning method with the linear quadratic regulator (LQR) technique. Consider the following system model:
x k+1 =f(x k )+g(x k )u k (4-1)
where x_k is the state of the system and u_k is the control input; the system model can also be written more compactly as x_{k+1} = F(x_k, u_k).
For each state x_k of the multi-agent system, the following control strategy is defined:
u k =h(x k ) (4-2)
A mapping of this form is also called a feedback controller. In the field of feedback control there are many ways to design a feedback control strategy, including the optimal solution of the Riccati equation, adaptive control, H∞ control, and classical frequency-domain control. In the reinforcement learning scheme of the invention, the control strategy is learned in real time from the stimuli of the environment.
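Of the design routes just listed, the Riccati-based one is the most mechanical to illustrate. The sketch below computes a discrete-time LQR feedback gain by iterating the Riccati difference equation to a fixed point, on a hypothetical discretized double integrator (data not taken from the patent):

```python
import numpy as np

def dlqr(Ad, Bd, Q, R, iters=500):
    """Discrete-time LQR gain via fixed-point iteration of the
    Riccati difference equation (an illustrative sketch)."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + Bd.T @ P @ Bd, Bd.T @ P @ Ad)
        P = Q + Ad.T @ P @ (Ad - Bd @ K)
    return K, P

# Hypothetical discretized double integrator, sample time 0.1 s.
Ad = np.array([[1.0, 0.1],
               [0.0, 1.0]])
Bd = np.array([[0.005],
               [0.1]])
K, P = dlqr(Ad, Bd, np.eye(2), np.array([[1.0]]))

# The closed loop x_{k+1} = (Ad - Bd K) x_k is stable:
print(max(abs(np.linalg.eigvals(Ad - Bd @ K))) < 1.0)
```

The feedback law u_k = -K x_k is exactly a mapping u_k = h(x_k) of the form defined above.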
To obtain the optimal control strategy for the system, the following cost function is now designed for the system:
where the discount factor satisfies 0 < γ ≤ 1, and u_k = h(x_k) is the control strategy being designed. The cost can also be given in the following more general standard quadratic form:
Suppose V* is the minimum cost incurred by the system; the optimal cost function is then:
when the optimal control strategy is taken, the optimal control value given by the system is as follows:
in the original system, the leader of the multi-agent system is considered to have the following model:
x k+1 =f(x k ) (4-7)
with a communication network diagram of a given system, the local coherence error of the system is defined as:
as can be seen from the above formula, the consensus information of the multi-agent system can be represented by the above system consensus error of the local area, i.e., when t → 0, e i → 0, the system tends to be consistent.
To overcome the strong dependence of the reinforcement-learning-based adaptive dynamic programming method on the model of each agent, an additional compensator is designed that does not depend on the individual subsystems and is defined by the desired input-affine differential equation:
Combining the corresponding graph-theoretic relations, the global error form (4-10) is obtained:
e=L'(x-x 0 ) (4-10)
Differentiating the local error e and combining (2-1) with (4-10), the local neighborhood consistency error relative to the graph G is obtained as:
where f_e(t) = f(x(t)) - f(x(0)), and L_i denotes the ith column vector of the Laplacian matrix. Combining (4-10) and (4-11), the local neighborhood consistency error can be expressed as:
Likewise, returning to the continuous-time system model designed at the outset:
given a cost function for continuous-time multi-agent system consistency control:
The corresponding tracking Bellman equation can then be obtained from the affine differential equations defined in (4-9) and (4-13), using the Leibniz rule:
where U(u) is a positive definite integrand function of the control input u:
then (4-15) can be expressed by the following equation:
then, the following Hamiltonian (Hamiltonian) equation is defined:
Let V* denote the optimal control cost of the system; the optimal cost function is defined as:
Under the optimal cost V*, the following HJB (Hamilton-Jacobi-Bellman) equation can be obtained from the Hamiltonian equation in (4-18):
when the stability condition is satisfiedThen, the following optimal control inputs can be obtained:
the following strategy iteration algorithm is obtained:
To achieve optimal consistency control of the multi-agent system when the system dynamics are unknown, an integral reinforcement learning algorithm can be introduced under the above policy iteration method: for the discrete-time system (4-1) and any integration interval T > 0, the value function of the continuous system (4-13) satisfies the following form:
then, tracking the solution of the Bellman equation by using an integral reinforcement learning algorithm, and solving the HJB equation by using integral reinforcement chemistry can be realized under the condition that a system dynamic model is unknown.
Obtaining the following integral reinforcement learning algorithm based on strategy iteration:
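The integral reinforcement learning idea, evaluating a policy purely from the integral of the running cost along a trajectory and then improving it, can be sketched on a scalar plant. Here the drift coefficient a is used only to simulate the data; the learner itself uses only the input gain b, the measured states, and the accumulated cost (all numeric values are hypothetical):

```python
# Integral RL on the scalar plant dx/dt = a*x + b*u.
# For policy u = -k*x, V(x) = p*x^2 and the IRL Bellman equation
#   p*x(0)^2 = integral_0^T (q*x^2 + r*u^2) dt + p*x(T)^2
# lets p be identified from trajectory data alone.
a, b, q, r = 1.0, 1.0, 1.0, 1.0       # a is hidden from the learner
dt, T = 1e-4, 0.5

def rollout(k, x0):
    """Simulate the closed loop over [0, T]; return x(T) and the integral cost."""
    x, cost = x0, 0.0
    for _ in range(int(T / dt)):
        u = -k * x
        cost += (q * x * x + r * u * u) * dt
        x += (a * x + b * u) * dt     # Euler step (simulation only)
    return x, cost

k = 2.0                               # initial stabilizing gain
for _ in range(10):
    xT, cost = rollout(k, x0=1.0)
    p = cost / (1.0 - xT * xT)        # IRL policy evaluation from data
    k = b * p / r                     # policy improvement

print(round(p, 3))                    # approaches 1 + sqrt(2) ~ 2.414
```

The evaluation step never uses a, which is the point of the integral formulation: the model-dependent part of the Bellman equation is replaced by measured cost integrals.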
fifthly, designing a self-adaptive distributed observer based on a reinforcement learning algorithm to realize the consistency distributed control of multiple intelligent agents
In this section, the output information of the observer system designed in the third section is used to estimate the unmeasured states of the multi-agent system. Based on the estimated states, an optimal controller is then designed by a reinforcement learning method; considering the constrained-input problem, the required performance function can be chosen as a non-quadratic penalty function. The communication time lag between the followers and the leader and the input time lag between the followers are then considered, and the consistency problem under different time lags is verified in this section.
Consider the following multi-agent system:
x_i(k+1) = f_i(x(k)) + g_i(x(k))u_i(k)
y_i(k) = cx_i(k) (5-1)
where x_i, u_i, and y_i respectively denote the state, control input, and output of the ith agent of the system.
The leader model takes the following form:
ν(k+1)=Eν(k)
In the leader model above, ν is the state of the leader system, E denotes a known constant matrix, and agent i satisfies (ν_0, ν_i) ∈ ε when there is a communication connection between follower i and the leader. Q satisfies Q(0) = 0 and provides the external reference signal.
Since the adaptive distributed observer designed in the third part involves unpredictable parameter design and is not suitable for discrete-time systems, a simpler and more easily designed observer is adopted in this part:
where R_i(k) denotes the observed value of agent i relative to the leader at time k, and R_0(k) = ν(k), W_0(k) = W.
According to the above system description, the cost function of the system can be obtained by following the derivation of the optimal output problem for linear systems:
where i = 1, 2, ..., N, γ_i is the discount factor, and c = [1, 0, 0, ..., 0]. Solving equation (5-4) gives the optimal feedback input of each follower. To solve for the optimal feedback input, (5-4) can be written as a quadratic function and expressed as the system value function:
the following bellman equation is obtained:
This is the consistency equation satisfied by the value function; from the above Bellman equation, the HJB equation of the nonlinear optimal feedback can be defined as:
because the HJB equation is difficult to solve, the strategy iteration of the IRL is adopted in the algorithm to solve the HJB equation.
Obtaining the following strategy iteration-based online IRL multi-agent optimal feedback control algorithm:
now, the problem of communication time lag between the system follower and the leader and the problem of input time lag between the followers are considered, and the consistency problem under different time lags is verified in this section.
On the basis of the original system (5-1) and (5-2) models, the following first-order multi-agent system is considered:
where x_i and u_i denote the state and control input of the ith agent at time k, τ_ij ≥ 0 denotes the communication time lag of data from agent j to agent i, and τ_i ≥ 0 denotes the input time lag of agent i. Consider a first-order discrete multi-agent system of n agents whose network topology is a static directed weighted graph containing a globally reachable node; consistency is achieved if
max_i { d_i (2τ_i + 1) } < 1.
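The quoted sufficient condition can be checked mechanically. The helper below adopts one plausible reading, with d_i taken as the in-degree (row sum of the adjacency matrix) and τ_i the input delay of agent i in sampling steps; both the reading and the example weights are illustrative assumptions:

```python
import numpy as np

def delay_condition_holds(A, tau):
    """Check max_i d_i*(2*tau_i + 1) < 1 with in-degrees d_i = sum_j a_ij
    and per-agent input delays tau_i (in sampling steps)."""
    d = A.sum(axis=1)
    return float(np.max(d * (2 * np.asarray(tau) + 1))) < 1.0

A = 0.05 * np.array([[0, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0]])
print(delay_condition_holds(A, tau=[2, 2, 2]))   # 0.1*(2*2+1) = 0.5 < 1 -> True
print(delay_condition_holds(A, tau=[6, 6, 6]))   # 0.1*13 = 1.3 >= 1 -> False
```

The condition trades off coupling strength against delay: heavier edge weights leave room for less delay, and vice versa.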
Suppose the multi-agent system comprises 5 nodes whose connection topology is shown in FIG. 1, a directed weighted graph in which node 1 is globally reachable. From the network topology of FIG. 1, the corresponding adjacency matrix can be obtained as follows:
according to the setting, the input time lag of the intelligent agent should satisfyNow, the following assumptions are made: tau is 13 =1s,τ 21 =0.75s,τ 32 =1.8s,τ 42 =2s,τ 51 When the input time lag τ is 0.5s and the initial state of the agent is randomly generated as x (0) — (2.5,3,2,3.5,5), the simulation result is shown in fig. 2, and it can be seen that the agents eventually gradually agree. Then, the input skew is changed to 3s, and the system can still achieve coincidence.
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (1)
1. A model unknown multi-agent consistency control method based on reinforcement learning is characterized in that: the method comprises the following steps:
s1: performing optimal output control of the single agent based on reinforcement learning;
s2: multi-agent consistency control based on reinforcement learning;
the S1 specifically includes:
when designing the optimal controller of a single intelligent agent, a non-strategy reinforcement learning algorithm is adopted to learn and track the solution of the HJB equation on line, and the following system models are considered:
where x and u are the state and control input of the system, respectively, and d is an external disturbance; assuming f(x), g(x), l(x) are Lipschitz functions and f(0) = 0, the system is robustly stable;
assume that p(t) defines the consistency behavior to be achieved and satisfies the following form:
with h(0) = 0, the tracking error of the whole system is defined as:
combining (1-1), (1-2) and (1-3) yields:
the following virtual performance outputs are defined to meet the requirements:
defining a performance function for the system:
suppose the cost is minimized at the optimal control u*; then the following Bellman equation holds:
wherein,
is the augmented system constructed for the design;
from the optimality conditions ∂H/∂u = 0 and ∂H/∂d = 0, the optimal control input and the optimal disturbance input are obtained:
where V* is the optimal value function defined in (1-7);
the following trajectory-tracking HJB equation is obtained from the optimal input conditions in (1-10):
thus, the following single agent offline RL algorithm is obtained:
solving HJB equation based on RL algorithm
Step 11: initialization: give an admissible stabilizing control strategy u_0;
Step 12: policy evaluation: for a control input u_i and a disturbance input d_i, solve the following Bellman equation:
Step 13: update the disturbance d_i of the system:
Step 14: update the input u_i of the system:
Step 15: return to Step 12 and repeat until convergence;
The S2 specifically includes:
s21: establishing a graph theory:
let G = (V, E, A) be a weighted graph describing the information channels among the N agents; V = {v_1, v_2, ..., v_N} is a non-empty finite set of follower nodes; E ⊆ V × V is the edge set; A = [a_ij] is the weighted adjacency matrix, where a_ij > 0 when (v_i, v_j) ∈ E and a_ij = 0 otherwise, with a_ii = 0 for all i = 1, 2, ..., N; define N_i = {v_j ∈ V : (v_i, v_j) ∈ E} as the neighbor follower set of v_i, i.e. every follower in N_i sends information directly to follower v_i; define the matrix D = diag(d_1, d_2, ..., d_N) as the in-degree matrix, where d_i = Σ_j a_ij; the Laplacian matrix of the directed graph G = (V, E, A) is L = D - A = [l_ij], where l_ij = -a_ij for i ≠ j and l_ii = Σ_{j≠i} a_ij; the sum of each row of the Laplacian L is zero, i.e. 1_N is a right eigenvector of L with eigenvalue zero; the graph has a spanning tree if there exists a root node v_i from which there is a directed path to every other node in the graph; a strongly connected graph has a directed path from each node to every other node; for graphs with a spanning tree, strong connectivity is a sufficient but not a necessary condition;
s22: problem description:
considering a multi-agent system consisting of one leader and N followers with a directed communication graph G, the dynamic model of the ith follower is:
where x_i and u_i are the state and input of the ith follower, respectively, and f_i(x_i) and g_i(x_i) are the internal dynamics function and the input matrix function of the ith follower, respectively; it is assumed that f_i(x_i), g_i(x_i) are unknown, f_i(0) = 0, and system (2-1) is robustly stable;
the leader's dynamic model is:
where x_0 is the state of the leader, f(x_0) is unknown, and D is a constant matrix; f(x_0) is differentiable and bounded, ||f(x_0)|| ≤ ρ_0;
according to the network topology between each follower and its neighboring agents, the local neighborhood consistency error of the system is described as:
e_i = Σ_{j∈N_i} a_ij (x_j - x_i) + b_i (x_0 - x_i)
where b_i ≥ 0, and b_i > 0 if and only if there is communication between the ith agent and the leader; the consistency information of the multi-agent system is represented by the local consistency error e_i: when t → ∞, e_i → 0 and the multi-agent system reaches consistency;
s23: adaptive distributed observer
by designing an adaptive distributed observer for each follower, each follower can estimate the leader's state in real time even when that state is unknown, and the state of a follower relative to the leader is converted into the state of the adaptive distributed observer relative to the leader;
the adaptive distributed observer is as follows:
where x_0 = x, D_0 = D, and μ > 0; under the error description of the system, the observer estimation errors converge; the adaptive distributed observer comprises a mechanism for estimating the matrix D, using the leader's information known only at the leader's neighboring agents;
using the estimate S_i of S obtained from the solution of the adaptive calculation equation, the following observer form is obtained:
s24: designing a multi-agent system controller based on reinforcement learning;
consider the following system model:
x k+1 =f(x k )+g(x k )u k (4-1)
where x_k is the state of the system and u_k is the control input; the system model can also be written more compactly as x_{k+1} = F(x_k, u_k);
for each state x of a multi-agent system k The following control strategies are defined:
u k =h(x k ) (4-2)
a mapping of this form is also called a feedback controller; in the field of feedback control there are many feedback control design methods, including the optimal solution of the Riccati equation, adaptive control, H∞ control and classical frequency-domain control;
in order to obtain the optimal control strategy of the system, the following cost function is designed for the system:
where the discount factor satisfies 0 < γ ≤ 1 and u_k = h(x_k) is the control strategy being designed;
or given in standard quadratic form:
suppose V* is the minimum cost incurred by the system; the optimal cost function is then:
when the optimal control strategy is taken, the optimal control value given by the system is as follows:
in the original system, the leader of the multi-agent system is considered to have the following model:
x k+1 =f(x k ) (4-7)
with a communication network diagram of a given system, the local coherence error of the system is defined as:
the consensus information of the multi-agent system is represented by the above local consistency error, i.e., when t → ∞, e_i → 0 and the system tends to consistency;
designing an additional compensator, independent of the individual subsystems, defined by the desired input-affine differential equation:
and combining the knowledge of the corresponding graph theory to obtain a global error form of (4-10):
e=L'(x-x 0 ) (4-10)
differentiating the local error e and combining (2-1) with (4-10), the local neighborhood consistency error relative to the graph G is obtained as:
where f_e(t) = f(x(t)) - f(x(0)) and L_i denotes the ith column vector of the Laplacian matrix; combining (4-10) and (4-11), the local consistency error is expressed as:
likewise, returning to the continuous-time system model designed at the outset:
given a cost function for continuous-time multi-agent system consistency control:
the corresponding tracking Bellman equation is obtained from the affine differential equations defined in (4-9) and (4-13) using the Leibniz rule:
where U(u) is a positive definite integrand function of the control input u:
then (4-15) is expressed by the following equation:
then, the following Hamiltonian equation is defined:
let V* denote the optimal control cost of the system; the optimal cost function is defined as:
under the optimal cost V*, the following HJB equation is obtained from the Hamiltonian equation in (4-18):
the following strategy iteration algorithm is obtained:
the algorithm is as follows: HJB equation solving method based on strategy iteration method
Step 211: policy evaluation: given a control input u_i(x), solve for V_i(x) from the Bellman equation;
Step 212: policy improvement: the control strategy is updated by:
an integral reinforcement learning algorithm is introduced into the policy iteration algorithm; for the discrete-time system (4-1) and any integration interval T > 0, the value function of the continuous system (4-13) satisfies the following form:
the solution of the tracking Bellman equation is found with the integral reinforcement learning algorithm, so that the HJB equation can be solved by integral reinforcement learning even when the system dynamic model is unknown;
obtaining the following integral reinforcement learning algorithm based on strategy iteration:
the algorithm is as follows: solving the HJB equation by an offline integral reinforcement learning algorithm based on policy iteration
Step 221: policy evaluation: given a control input u_i(x), solve for V_i(x) from the Bellman equation;
Step 222: strategy improvement: the control strategy is updated by:
s25: designing the adaptive distributed observer based on a reinforcement learning algorithm to realize multi-agent consistency distributed control; consider the multi-agent system:
x i (k+1)=f i (x(k))+g i (x(k))u i (k)
y i (k)=cx i (k) (5-1)
where x_i, u_i and y_i respectively denote the state, control input and output of the ith agent of the system;
the leader model takes the following form:
ν(k+1)=Eν(k)
in the leader model above, ν is the state of the leader system, E denotes a known constant matrix, and agent i satisfies (ν_0, ν_i) ∈ ε when there is a communication connection between follower i and the leader; Q satisfies Q(0) = 0 and provides the external reference signal;
the observer is designed as:
wherein R is i (k) Represents an observed value of agent i relative to the leader at time k, and satisfies R 0 (k)=ν(k),W 0 (k)=W,
According to the system description, a cost function of the system is obtained according to the derivation of the optimal output problem formula of the linear system:
where i = 1, 2, ..., N, γ_i is the discount factor and c = [1, 0, 0, ..., 0]; the optimal feedback input of each follower is obtained by solving equation (5-4); when solving for the optimal feedback input, (5-4) is written as a quadratic function, expressed as the system value function:
the following bellman equation is obtained:
according to the above bellman equation, the HJB equation in nonlinear optimal feedback is defined as:
solving the HJB equation by adopting strategy iteration of IRL;
obtaining the following strategy iteration-based online IRL multi-agent optimal feedback control algorithm:
the algorithm is as follows: online IRL algorithm solution HJB equation based on strategy iteration
Step 231: initialization: select a control input and repeat the following steps until the system converges;
step 232: strategy improvement: the control strategy is updated by:
Step 233: let u_i(k) = u_{i+1}(k) and return to Step 232 until V_i(k) converges to the minimum value;
on the basis of the system (5-1) and (5-2) models, consider the following first order multi-agent system:
where x_i and u_i denote the state and control input of the ith agent at time k; τ_ij ≥ 0 denotes the communication time lag of data from agent j to agent i, and τ_i ≥ 0 denotes the input time lag of agent i; for a first-order discrete multi-agent system of n agents whose network topology is a static directed weighted graph containing a globally reachable node, consistency is achieved if max_i { d_i (2τ_i + 1) } < 1;
assume that a multi-agent system contains 5 nodes, and their corresponding adjacency matrices are as follows:
suppose τ_13 = 1 s, τ_21 = 0.75 s, τ_32 = 1.8 s, τ_42 = 2 s, τ_51 = 0.8 s, and the initial state of the agents is randomly generated as x(0) = (2.5, 3, 2, 3.5, 5); the agents finally and gradually reach consistency; when the input time lag is changed to 3 s, the system still achieves consistency.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110184288.2A | 2021-02-08 | 2021-02-08 | Model unknown multi-agent consistency control method based on reinforcement learning |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN112947084A | 2021-06-11 |
| CN112947084B | 2022-09-23 |
Family

ID=76245480

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110184288.2A (Expired - Fee Related) CN112947084B (en) | Model unknown multi-agent consistency control method based on reinforcement learning | 2021-02-08 | 2021-02-08 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN112947084B (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011144382A1 (en) * | 2010-05-17 | 2011-11-24 | Technische Universität München | Hybrid OLTP and OLAP high-performance database system |
CN105847438A (en) * | 2016-05-26 | 2016-08-10 | Chongqing University | Event-triggered multi-agent consistency control method |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | China Three Gorges University | Intelligent power generation control method based on multi-agent reinforcement learning with a time-warp concept |
CN107179777A (en) * | 2017-06-03 | 2017-09-19 | Fudan University | Multi-agent swarm coordination method and multi-UAV swarm cooperative system |
CN107589672A (en) * | 2017-09-27 | 2018-01-16 | China Three Gorges University | Intelligent power generation control method for islanded smart distribution networks based on a virtual wolf-pack control strategy |
CN110018687A (en) * | 2019-04-09 | 2019-07-16 | Dalian Maritime University | Optimal trajectory tracking control method for unmanned surface vessels based on reinforcement learning |
CN110083063A (en) * | 2019-04-29 | 2019-08-02 | Liaoning Shihua University | Multi-agent optimal control method based on off-policy Q-learning |
CN110308659A (en) * | 2019-08-05 | 2019-10-08 | Shenyang Aerospace University | Hybrid-triggered consistency control method for uncertain multi-agent systems with time delays and switching topology |
CN110782011A (en) * | 2019-10-21 | 2020-02-11 | Liaoning Shihua University | Distributed optimal control method for networked multi-agent systems based on reinforcement learning |
CN110780668A (en) * | 2019-04-09 | 2020-02-11 | Beihang University | Distributed formation surround tracking control method and system for multiple unmanned boats |
CN111531538A (en) * | 2020-05-08 | 2020-08-14 | Harbin Institute of Technology | Consistency control method and device for multi-manipulator systems under switching topology |
CN111694365A (en) * | 2020-07-01 | 2020-09-22 | Wuhan University of Technology | Unmanned ship formation path tracking method based on deep reinforcement learning |
CN111722531A (en) * | 2020-05-12 | 2020-09-29 | Tianjin University | Online model-free optimal control method for switched linear systems |
CN111948937A (en) * | 2020-07-20 | 2020-11-17 | University of Electronic Science and Technology of China | Multi-gradient recursive reinforcement learning fuzzy control method and system for multi-agent systems |
CN112052585A (en) * | 2020-09-02 | 2020-12-08 | Nanjing University of Posts and Telecommunications | Design method for distributed reduced-order observers of linear time-invariant systems |
CN112180730A (en) * | 2020-10-10 | 2021-01-05 | University of Science and Technology of China | Hierarchical optimal consistency control method and device for multi-agent systems |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9134707B2 (en) * | 2012-03-30 | 2015-09-15 | Board Of Regents, The University Of Texas System | Optimal online adaptive controller |
CN107728471A (en) * | 2017-09-01 | 2018-02-23 | Nanjing University of Science and Technology | Group consistency control method for hybrid heterogeneous multi-agent systems |
US10734811B2 (en) * | 2017-11-27 | 2020-08-04 | Ihi Inc. | System and method for optimal control of energy storage system |
CN109946975B (en) * | 2019-04-12 | 2020-04-24 | Beijing Institute of Technology | Reinforcement learning optimal tracking control method for an unknown servo system |
US11674384B2 (en) * | 2019-05-20 | 2023-06-13 | Schlumberger Technology Corporation | Controller optimization via reinforcement learning on asset avatar |
CN111880567B (en) * | 2020-07-31 | 2022-09-16 | National University of Defense Technology | Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning |
- 2021-02-08: application CN202110184288.2A filed in China (CN); granted as CN112947084B, now not active (Expired - Fee Related)
Non-Patent Citations (8)
Title |
---|
Data-Driven Reinforcement Learning Design for Multi-agent Systems with Unknown Disturbances; Xiangnan Zhong et al.; 2018 International Joint Conference on Neural Networks; 2018-10-15; full text *
Optimal tracking agent: a new framework of reinforcement learning for multiagent systems; Cao, W. H. et al.; Concurrency and Computation: Practice & Experience; 2013-09-25; Vol. 25, No. 14; full text *
Reinforcement Learning Control for Consensus of the Leader-Follower Multi-Agent Systems; Chiang, M. L. et al.; IEEE 7th Data Driven Control and Learning Systems Conference; 2018-12-31; full text *
Resilient adaptive optimal control of distributed multi-agent systems using reinforcement learning; Moghadam, R. et al.; IET Control Theory and Applications; 2018-11-06; Vol. 12, No. 16; full text *
Decentralized reinforcement learning optimal control of reconfigurable modular robots under dynamic constraints; Dong Bo et al.; Journal of Jilin University; 2014-12-31; Vol. 44, No. 5; full text *
A survey of data-driven optimal consensus of multi-agent systems based on reinforcement learning; Li Jinna et al.; Chinese Journal of Intelligent Science and Technology; 2020-12-31; Vol. 2, No. 4; full text *
Observer-based consensus control and fault detection of multi-agent systems; Chen Gang et al.; Control Theory & Applications; 2014-05-31; Vol. 31, No. 5; full text *
Model-free adaptive dynamic programming and its application in multi-agent cooperative control; Yang Yongliang; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2018-03-15; No. 03; full text *
Also Published As
Publication number | Publication date |
---|---|
CN112947084A (en) | 2021-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112947084B (en) | Model unknown multi-agent consistency control method based on reinforcement learning | |
Peng et al. | Data-driven optimal tracking control of discrete-time multi-agent systems with two-stage policy iteration algorithm | |
Liu et al. | Adaptive neural output feedback tracking control for a class of uncertain discrete-time nonlinear systems | |
Zhao et al. | Distributed adaptive fixed-time consensus tracking for second-order multi-agent systems using modified terminal sliding mode | |
CN112180734A (en) | Multi-agent consistency method based on distributed adaptive event triggering | |
CN113900380B (en) | Robust output formation tracking control method and system for heterogeneous cluster system | |
CN114237041B (en) | Space-ground cooperative fixed time fault tolerance control method based on preset performance | |
Lu et al. | On robust control of uncertain chaotic systems: a sliding-mode synthesis via chaotic optimization | |
Song et al. | Multi-objective optimal control for a class of unknown nonlinear systems based on finite-approximation-error ADP algorithm | |
Liu et al. | Data-driven bipartite consensus tracking for nonlinear multiagent systems with prescribed performance | |
CN114063438B (en) | Data-driven multi-agent system PID control protocol self-learning method | |
CN114626307B (en) | Distributed consistent target state estimation method based on variational Bayes | |
CN110362103B (en) | Distributed autonomous underwater vehicle attitude collaborative optimization control method | |
Gao et al. | Tracking control of the nodes for the complex dynamical network with the auxiliary links dynamics | |
Song et al. | Adaptive dynamic programming: single and multiple controllers | |
CN114498612A (en) | Distributed multi-region power transmission network collaborative economic dispatching method and computing device | |
Rao et al. | Optimal control of nonlinear system based on deterministic policy gradient with eligibility traces | |
CN113095513A (en) | Double-layer fair federal learning method, device and storage medium | |
Wang et al. | Suboptimal leader-to-coordination control for nonlinear systems with switching topologies: A learning-based method | |
Martinez-Piazuelo et al. | Distributed Nash equilibrium seeking in strongly contractive aggregative population games | |
CN107491841A (en) | Nonlinear optimization method and storage medium | |
Wang et al. | Decentralized adaptive consensus control of uncertain nonlinear systems under directed topologies | |
JPH07200512A | Optimization problem solving device | |
Han et al. | Prescribed time bipartite output consensus tracking for heterogeneous multi-agent systems with external disturbances | |
CN113359474B (en) | Extensible distributed multi-agent consistency control method based on gradient feedback |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 2022-09-23 |