CN112947084B - Model unknown multi-agent consistency control method based on reinforcement learning - Google Patents


Info

Publication number: CN112947084B (granted publication of CN112947084A)
Application number: CN202110184288.2A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: agent, control, following, equation, optimal
Inventors: 陈刚 (Chen Gang), 林卓龙 (Lin Zhuolong)
Assignee (original and current): Chongqing University
Application filed by Chongqing University
Legal status: Application granted; Expired - Fee Related

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems of the above kind, electric
    • G05B13/04: Adaptive control systems of the above kind, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]


Abstract

The invention relates to a reinforcement-learning-based consensus control method for multi-agent systems with unknown models, and belongs to the field of intelligent control. The design of the adaptive distributed observer in the invention consists of three steps. First, an adaptive distributed observer is designed to estimate the system matrix and the state of the leader system. Second, a method for computing the solution of the observer equation online is provided after the adaptive distributed observer is designed. Third, to eliminate a few extreme cases, the distributed consensus output regulation problem of the system is solved by combining adaptive state feedback and adaptive measurement output feedback control when no follower knows the system matrix of the leader. Based on the estimated state, the controller is designed with a reinforcement-learning-based method, and the optimal solution is obtained by an iterative method, realizing optimal control of the multi-agent system.

Description

Model unknown multi-agent consistency control method based on reinforcement learning
Technical Field
The invention belongs to the field of intelligent control and relates to a reinforcement-learning-based consensus control method for multi-agent systems with unknown models.
Background
Research on the consensus control problem of multi-agent systems dates back to the 1980s, and the earliest related multi-agent work grew out of research on mobile robots. The field of multi-agent consensus control has developed rapidly over the last fifteen years, and many new systems have been proposed, with applications extending from military operations to mobile sensor networks, commercial highways, air transportation, and emergency and disaster relief. However, under constraints on control quality, the distributed optimal consensus problem remains a major challenge in the control field. Distributed consensus of a multi-agent system not only requires that the behaviors of the individual agents agree, but also that the performance index of the whole system be optimized. More precisely, distributed consensus control of multi-agent systems should reach agreement at as low a cost as possible. Leading researchers in multi-agent control have proposed various approaches to the consensus control problem, such as linear quadratic regulation techniques, adaptive learning methods, model predictive control techniques, and fuzzy adaptive dynamic programming.
In recent decades, reinforcement learning (RL) has made it possible to design control protocols without knowledge or identification of the system dynamics, i.e., in a model-free way, and has therefore gained much attention and broad application prospects. Reinforcement learning is motivated by biological systems: it finds the optimal control strategy by optimizing cumulative rewards, interacting with a given unknown environment to learn the best policy that maximizes long-term performance. The RL algorithm is based on the idea that a successful control strategy should be remembered and then, by reinforcing this signal, made more likely to be used again. Since the beginning of reinforcement learning research, RL methods have received much attention in the field of intelligent agents. Most current mainstream reinforcement learning is realized on an actor-critic structure: the critic evaluates the performance of the current policy from measured data, and the actor finds an improved policy using the critic's evaluation. Compared with classical dynamic programming, the reinforcement learning method provides a feasible way to avoid the curse of dimensionality. On the other hand, compared with a traditional adaptive controller, the reinforcement learning method only needs to consider the dynamics of the tracking error, can minimize the transient errors introduced into the system, and at the same time guarantees the stability of the whole system. The main advantage of reinforcement learning (RL) algorithms for solving the optimal control problem is that they can obtain enough data information from the system without solving the system dynamics, and then improve iteratively between the two steps of policy evaluation and policy improvement based on a policy iteration technique.
In research on multi-agent consensus control, the system is often partially unknown, and the whole system achieves behavior consensus by constructing a communication network between the leader and the followers and a communication network among the followers, through which the followers obtain information about the leader's state. In most cases, however, the state of a system cannot be measured directly by sensors, while the input and output of the system can be measured in various ways. A popular approach today is to estimate the state of the system by constructing a full-dimensional observer. Consider, for example, the following simple linear system:

\dot{x} = Ax + Bu, \quad y = Cx

A simulated linear system identical to this system is constructed at the same time:

\dot{\omega} = A\omega + Bu, \quad \gamma = C\omega

where ω and γ are the state and output of the simulated system and serve as estimates for the original system. The error between the simulated system and the original system is defined as e = ω − x. To drive the state estimation error e to 0, it is converted into the measurable output estimation error γ − y: according to the general principle of feedback control, the output estimation error γ − y only needs to be fed back to the state of the simulated system, and the controller is then designed so that the output estimation error approaches 0, at which point the state estimation error also approaches 0. Introducing the output feedback matrix H of the state observer gives the following form:

\dot{\omega} = A\omega + Bu + H(y - \gamma), \quad \gamma = C\omega

Substituting the output equation of the original system and the output equation of the full-dimensional state observer into the state equation of the full-dimensional state observer gives:

\dot{\omega} = (A - HC)\omega + Bu + Hy

where A − HC is the system matrix of the full-dimensional observer. The key problem in designing a full-dimensional state observer is to ensure that the state estimation error approaches 0 under any initial condition, i.e.

\lim_{t \to \infty} \big( \omega(t) - x(t) \big) = 0

From the original system and the full-dimensional observer system one obtains:

\dot{e} = \dot{\omega} - \dot{x} = (A - HC)(\omega - x) = (A - HC)e

Solving the above equation gives

e(t) = e^{(A - HC)(t - t_0)}\, e(t_0), \quad t \ge t_0

From this solution, if x(t_0) = ω(t_0), then x(t) = ω(t) holds for all t; if x(t_0) ≠ ω(t_0), the feedback matrix H only needs to be adjusted so that the eigenvalues of the matrix A − HC have negative real parts, and the state estimation error then decays to zero.
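As a concrete illustration of the full-dimensional observer described above, the following sketch simulates a Luenberger observer for a small linear system; the matrices A, B, C, the gain H and the input signal are illustrative values chosen for this example, not taken from the patent.

```python
import numpy as np

# Illustrative second-order system x' = Ax + Bu, y = Cx (values chosen for this sketch)
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
# Observer gain H chosen so that A - HC is Hurwitz (checked below)
H = np.array([[4.0],
              [3.0]])
assert np.all(np.linalg.eigvals(A - H @ C).real < 0)

dt, T = 0.001, 10.0
x = np.array([[1.0], [-1.0]])   # true state (unknown to the observer)
w = np.zeros((2, 1))            # observer state omega, deliberately wrong initial guess

for k in range(int(T / dt)):
    u = np.array([[np.sin(0.5 * k * dt)]])   # any known input signal
    y = C @ x                                # measured output of the real system
    gamma = C @ w                            # output of the simulated system
    # Euler integration of plant and observer: w' = Aw + Bu + H(y - gamma)
    x = x + dt * (A @ x + B @ u)
    w = w + dt * (A @ w + B @ u + H @ (y - gamma))

print("final estimation error e = w - x:", (w - x).ravel())  # decays toward zero
```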
In a multi-agent system, the state values are difficult to observe accurately in real time. In this design, a corresponding adaptive distributed observer is designed for each follower; then, under the guidance of output regulation theory, the cooperative adaptive optimal output regulation problem is decomposed into a feedforward control design problem, which requires solving the nonlinear regulator equations, and an adaptive optimal feedback control problem. Finally, a cost function of the multi-agent system is established based on reinforcement learning, the HJB (Hamilton-Jacobi-Bellman) equation of this cost function is formulated and solved by a synchronous-reinforcement-learning-based method, and the optimal solution is finally obtained by an iterative method, realizing optimal control of the multi-agent system.
Disclosure of Invention
In view of the above, the present invention provides a reinforcement-learning-based consensus control method for multi-agent systems with unknown models. By establishing a multi-agent input structure based on the model state and designing a corresponding reinforcement learning algorithm to solve the HJB equation, the method addresses the optimal feedback control problem of the multi-agent system and the design of the optimal controller when the leader model is unknown and the follower states are not measurable.
In order to achieve the purpose, the invention provides the following technical scheme:
a model unknown multi-agent consistency control method based on reinforcement learning comprises the following steps:
s1: performing single agent optimal output control based on reinforcement learning;
s2: multi-agent consistency control based on reinforcement learning.
Optionally, the S1 specifically includes:
When designing the optimal controller of a single agent, an off-policy reinforcement learning algorithm is adopted to learn the solution of the tracking HJB equation online. Consider the following system model:

\dot{x} = f(x) + g(x)u + l(x)d    (1-1)

where x and u are the state and the control input of the system, respectively, and d is an external disturbance; it is assumed that f(x), g(x), l(x) are Lipschitz functions and f(0) = 0, so that the system is robustly stable;

assume that p(t) is the reference trajectory (the consistency condition to be achieved) and satisfies the following form:

\dot{p}(t) = h(p(t))    (1-2)

with h(0) = 0; the tracking error of the whole system is defined as:

e_d(t) = x(t) - p(t)    (1-3)

combining (1-1), (1-2) and (1-3) gives:

\dot{e}_d = f(x) + g(x)u + l(x)d - h(p)    (1-4)
the following virtual performance outputs are defined to meet the requirements:
Figure BDA0002942369580000041
defining a performance function for the system:
Figure BDA0002942369580000042
Suppose the cost is minimized at the optimal control input u*; then
Figure BDA0002942369580000043
The following bellman equation is given:
Figure BDA0002942369580000044
wherein,
Figure BDA0002942369580000045
is the augmented system constructed for the design;
According to the optimality conditions
Figure BDA0002942369580000046
And
Figure BDA0002942369580000047
the optimal control input and the optimal disturbance input are obtained:
Figure BDA0002942369580000048
Figure BDA0002942369580000049
where V* is the optimal value function defined in (1-7);
From the optimal inputs obtained in (1-10), the following trajectory-tracking HJB equation is obtained:
Figure BDA00029423695800000410
thus, the following single agent offline RL algorithm is obtained:
solving HJB equation based on RL algorithm
Step 11: initialization: given an admissible stabilizing control policy u_0;
Step 12: policy evaluation: for a control input u_i and a disturbance input d_i, solve the following Bellman equation:
Figure BDA0002942369580000051
Step 13: update the disturbance input d_i of the system:
Figure BDA0002942369580000052
Step 14: update the control input u_i of the system:
Figure BDA0002942369580000053
Step 15: step 11 is re-executed.
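For a linear system with quadratic cost, the policy iteration of steps 11-15 reduces to iterating on Lyapunov equations of the associated zero-sum game. The sketch below illustrates that special case; the matrices A, B, D, Q, R and the attenuation level gamma are example values chosen here, and the patent itself treats the more general nonlinear tracking problem.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# x' = Ax + Bu + Dd, cost integrand x'Qx + u'Ru - gamma^2 d'd (illustrative data)
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = np.array([[1.0]])
gamma = 5.0

K = np.zeros((1, 2))    # admissible stabilizing control policy u = -Kx (step 11)
Ld = np.zeros((1, 2))   # disturbance policy d = Ld x

for i in range(50):
    # Step 12 (policy evaluation): solve the Lyapunov/Bellman equation
    # (A - BK + D*Ld)'P + P(A - BK + D*Ld) + Q + K'RK - gamma^2 Ld'Ld = 0
    Ac = A - B @ K + D @ Ld
    rhs = -(Q + K.T @ R @ K - gamma**2 * Ld.T @ Ld)
    P = solve_continuous_lyapunov(Ac.T, rhs)
    # Step 13 (update the disturbance) and step 14 (update the control input)
    Ld_new = (1.0 / gamma**2) * D.T @ P
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) + np.linalg.norm(Ld_new - Ld) < 1e-9:
        break
    K, Ld = K_new, Ld_new

print("converged value matrix P:\n", P)
print("optimal feedback gain K:", K, " worst-case disturbance gain:", Ld)
```

The iteration alternates policy evaluation (a Lyapunov solve) with the two improvement steps, mirroring steps 12-14 above; in the general nonlinear case the same alternation is carried out on the HJB equation instead.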
Optionally, the S2 specifically includes:
s21: establishing a graph theory:
Let G = (V, E, A) be a weighted graph describing the information channels among the N agents. V = {v_1, v_2, ..., v_N} is a non-empty finite set of follower nodes; E ⊆ V × V is the edge set; A = [a_ij] is the weighted adjacency matrix, with a_ij > 0 when (v_i, v_j) ∈ E, a_ij = 0 when (v_i, v_j) ∉ E, and a_ii = 0 for all i = 1, 2, ..., N. Define N_i = {v_j ∈ V : (v_i, v_j) ∈ E} as the neighbor set of follower v_i, i.e., all followers in N_i send information directly to follower v_i. The matrix D = diag(d_1, d_2, ..., d_N) is the in-degree matrix, where d_i = \sum_{j ∈ N_i} a_{ij}, i = 1, 2, ..., N. The Laplacian matrix of the directed graph G = (V, E, A) is L = D - A = [l_ij], where l_ij = -a_ij for i ≠ j and l_ii = \sum_{j ≠ i} a_{ij}. The sum of each row of the Laplacian matrix L is zero, i.e., 1_N is a right eigenvector of L with eigenvalue zero. The graph contains a spanning tree if there exists a node v_i from which there is a directed path to every other node in the graph; the graph is strongly connected if there is a directed path from every node to every other node. For graphs containing a spanning tree, strong connectivity is a sufficient but not necessary condition;
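As an illustration of these graph-theoretic quantities, the short sketch below builds the in-degree matrix and the Laplacian L = D - A from a hypothetical weighted adjacency matrix and checks that every row of L sums to zero, so that 1_N is a right eigenvector with eigenvalue zero.

```python
import numpy as np

# Hypothetical weighted adjacency matrix for N = 4 followers; a_ij > 0 means
# follower v_i receives information from follower v_j, and a_ii = 0.
A = np.array([[0.0, 1.0, 0.0, 0.5],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

D_in = np.diag(A.sum(axis=1))   # in-degree matrix D = diag(d_1, ..., d_N), d_i = sum_j a_ij
L = D_in - A                    # Laplacian of the directed graph

print("Laplacian L:\n", L)
print("row sums (all zero):", L.sum(axis=1))
print("L @ ones =", L @ np.ones(4))   # 1_N is a right eigenvector with eigenvalue 0
```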
s22: problem description:
Consider a multi-agent system consisting of one leader and N followers, with a directed communication graph G. The dynamic model of the i-th follower is:

\dot{x}_i = f_i(x_i) + g_i(x_i) u_i    (2-1)

where x_i and u_i are the state and the input of the i-th follower, respectively, and f_i(·) and g_i(·) are the internal dynamics function and the input matrix function of the i-th follower; f_i(x_i) and g_i(x_i) are assumed unknown, f_i(0) = 0, and system (2-1) is robustly stable;
The dynamic model of the leader is:
Figure BDA0002942369580000061
wherein
Figure BDA0002942369580000062
is the state of the leader,
Figure BDA0002942369580000063
is unknown; D is a constant matrix; f is assumed to be differentiable and bounded, ||f(x_0)|| ≤ ρ_0;
According to the network topological relation between each follower and the adjacent agent, the local domain consistency error of the system is described as follows:
Figure BDA0002942369580000064
wherein
Figure BDA0002942369580000065
And b is a i ≧ 0, if and only if b i When the number is more than 0, the ith agent and the leader have communication; consistency information of multi-agent system is composed of consistency error e of local area i When t → ∞, e i → 0, the multi agent system will agree;
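The local neighborhood consensus error e_i of (2-3) can be computed directly from the adjacency weights and the pinning gains b_i; the sketch below uses hypothetical states and weights chosen only for illustration.

```python
import numpy as np

A = np.array([[0.0, 1.0, 0.0],     # hypothetical follower-to-follower weights a_ij
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
b = np.array([1.0, 0.0, 0.0])      # b_i > 0 only for followers that hear the leader
x = np.array([1.0, 2.0, 3.0])      # hypothetical follower states
x0 = 0.5                           # leader state

# e_i = sum_j a_ij (x_j - x_i) + b_i (x_0 - x_i)
e = np.array([A[i] @ (x - x[i]) + b[i] * (x0 - x[i]) for i in range(len(x))])
print("local neighborhood consensus errors:", e)
```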
s23: adaptive distributed observer
By designing an adaptive distributed observer for each follower, the problem that, in a multi-agent system, each follower must estimate the leader's state in real time even though that state is unknown to it is solved, and the regulation of each follower's state relative to the leader is converted into the regulation of its observer's state relative to the leader;
wherein, the self-adaptive distributed observer is as follows:
Figure BDA0002942369580000066
Figure BDA0002942369580000067
wherein x 0 =x,
Figure BDA0002942369580000068
D 0 =D,
Figure BDA0002942369580000069
and μ > 0; under the error description of the system, the observer satisfies
Figure BDA00029423695800000610
for i = 1, 2, …, N, and satisfies
Figure BDA00029423695800000611
The adaptive distributed observer contains a mechanism for estimating the matrix D, which is known only to the agents adjacent to the leader;
Using S_i, the estimate of S obtained from the solution of the adaptive computation equation, the following observer form is obtained:
Figure BDA00029423695800000612
Figure BDA0002942369580000071
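The exact observer equations (3-1)-(3-4) appear in the original only as images; the sketch below therefore assumes the standard adaptive-distributed-observer structure suggested by the surrounding text, in which each follower runs a consensus update on its estimate D_i of the leader matrix D and on its estimate of the leader state, with the true D and x_0 available only to the leader's neighbors. All numerical values and the communication graph are illustrative assumptions.

```python
import numpy as np

# Leader: x0' = D x0 (illustrative harmonic-oscillator leader)
D = np.array([[0.0, 1.0], [-1.0, 0.0]])
N = 3                                        # number of followers
# a[i][j] > 0: follower i+1 receives information from node j (node 0 is the leader)
a = np.array([[1.0, 0.0, 0.0, 0.0],          # follower 1 hears the leader only
              [0.0, 1.0, 0.0, 0.0],          # follower 2 hears follower 1
              [0.0, 0.0, 1.0, 0.0]])         # follower 3 hears follower 2
mu = 5.0
dt, steps = 0.001, 20000

x0 = np.array([1.0, 0.0])                    # leader state
Di = [np.zeros((2, 2)) for _ in range(N)]    # D_i: follower i's estimate of D
eta = [np.zeros(2) for _ in range(N)]        # eta_i: follower i's estimate of x0

for _ in range(steps):
    D_all = [D] + Di                         # node 0 carries the true matrix D
    eta_all = [x0] + eta                     # node 0 carries the true leader state
    newD, newEta = [], []
    for i in range(N):
        # assumed consensus updates on the matrix estimate and the state estimate
        dD = mu * sum(a[i][j] * (D_all[j] - Di[i]) for j in range(N + 1))
        dEta = Di[i] @ eta[i] + mu * sum(a[i][j] * (eta_all[j] - eta[i]) for j in range(N + 1))
        newD.append(Di[i] + dt * dD)
        newEta.append(eta[i] + dt * dEta)
    x0 = x0 + dt * (D @ x0)
    Di, eta = newD, newEta

print("matrix estimation errors:", [np.linalg.norm(Di[i] - D) for i in range(N)])
print("state estimation errors:", [np.linalg.norm(eta[i] - x0) for i in range(N)])
```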
s24: designing a multi-agent system controller based on reinforcement learning;
consider the following system model:
x k+1 =f(x k )+g(x k )u k (4-1)
where x_k is the state of the system and u_k is the control input of the system; the system model can also be written more concisely as x_{k+1} = F(x_k, u_k);
for each state x of a multi-agent system k The following control strategies are defined:
u k =h(x k ) (4-2)
This mapping is also called a feedback controller; in the field of feedback control there are many ways to design a feedback control policy, including optimal solutions of the Riccati equation, adaptive control, H-infinity control and classical frequency-domain control;
To obtain the optimal control policy of the system, the following cost function is designed for the system:

V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{\, i-k} \, r(x_i, h(x_i))    (4-3)
where the discount factor γ satisfies 0 < γ ≤ 1, and u_k = h(x_k) is the control policy in the design;
or, in standard quadratic form:

V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{\, i-k} \left( x_i^T Q x_i + u_i^T R u_i \right)    (4-4)
Suppose the minimum cost is attained at V*; the optimal cost function then satisfies:

V^*(x_k) = \min_{u_k} \left( r(x_k, u_k) + \gamma V^*(x_{k+1}) \right)    (4-5)
When the optimal control policy is adopted, the optimal control given by the system is:

u_k^* = h^*(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + \gamma V^*(x_{k+1}) \right)    (4-6)
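For the linear-quadratic special case of (4-4)-(4-6) with discount factor gamma = 1, the optimal value function is quadratic, V*(x) = x^T P x, and the Bellman optimality equation reduces to the discrete-time Riccati recursion. The sketch below iterates that recursion with illustrative matrices; it is a special case used for intuition, not the patent's general nonlinear setting.

```python
import numpy as np

# x_{k+1} = A x_k + B u_k with stage cost x'Qx + u'Ru (illustrative data, gamma = 1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

P = np.zeros((2, 2))
for _ in range(2000):
    # Value-iteration form of (4-5): P <- Q + A'P(A - BK), K = (R + B'PB)^{-1} B'PA
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ (A - B @ K)
    if np.linalg.norm(P_next - P) < 1e-10:
        P = P_next
        break
    P = P_next

K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("Riccati solution P:\n", P)
print("optimal feedback gain (u_k = -K x_k):", K)
```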
in the original system, the leader of the multi-agent system is considered to have the following model:
x k+1 =f(x k ) (4-7)
with a communication network diagram of a given system, the local coherence error of the system is defined as:
Figure BDA0002942369580000081
The consensus information of the multi-agent system is represented by the above local neighborhood consensus error: if e_i → 0 as t → ∞, the system tends to consensus;
An additional compensator is designed, independent of the individual subsystems, defined by the desired input-affine differential equation:
Figure BDA0002942369580000082
Combining the corresponding graph-theoretic results gives the global error form (4-10):

e = L'(x - x_0)    (4-10)
wherein,
Figure BDA0002942369580000083
and is provided with
Figure BDA0002942369580000084
satisfying b_ii = b_i and, for i ≠ j, b_ij = 0;
After the local error e is derived after the simultaneous operations (2-1) and (4-10), the local region consistency error is obtained relative to the graph G (x) as follows:
Figure BDA0002942369580000085
where f_e(t) = f(x(t)) - f(x(0)),
Figure BDA0002942369580000086
L_i denotes the i-th column vector of the Laplacian matrix; combining (4-10) and (4-11), the local neighborhood consensus error is expressed as:
Figure BDA0002942369580000087
wherein,
Figure BDA0002942369580000088
and satisfies the following conditions:
Figure BDA0002942369580000089
Figure BDA00029423695800000810
likewise, returning to the system model at the beginning of the designed continuous time:
Figure BDA00029423695800000811
Figure BDA00029423695800000812
given a cost function for continuous-time multi-agent system consistency control:
Figure BDA0002942369580000091
Then the associated tracking Bellman equation is obtained from the affine differential equations defined by (4-9) and (4-13) using the Leibniz rule:
Figure BDA0002942369580000092
where U (u) is a positive definite integrand on the control input u:
Figure BDA0002942369580000093
then (4-15) is expressed by the following equation:
Figure BDA0002942369580000094
then, the following Hamiltonian equation is defined:
Figure BDA0002942369580000095
Let V* denote the optimal control cost of the system; the optimal cost function is defined as follows:
Figure BDA0002942369580000096
Under the optimal cost V*, according to the Hamiltonian in (4-18), the following HJB equation is obtained:
Figure BDA0002942369580000097
when the stability condition is satisfied
Figure BDA0002942369580000098
Then, the following optimal control inputs are obtained:
Figure BDA0002942369580000099
the following strategy iteration algorithm is obtained:
the algorithm is as follows: HJB equation solving method based on strategy iteration method
Step 211: policy evaluation: given the control input u_i(x), solve for V_i(x) from the following Bellman equation:
Figure BDA0002942369580000101
Step 212: policy improvement: the control policy is updated by:
Figure BDA0002942369580000102
step 213: order to
Figure BDA0002942369580000103
and return to step 211 until convergence to the minimum value;
An integral reinforcement learning (IRL) algorithm is introduced into the policy iteration: based on the discrete-time system (4-1), for any integration interval T > 0, the value function of the continuous system (4-13) satisfies the following form:
Figure BDA0002942369580000104
The solution of the tracking Bellman equation is obtained with the integral reinforcement learning algorithm, so that the HJB equation can be solved by integral reinforcement learning even when the system dynamic model is unknown;
obtaining the following integral reinforcement learning algorithm based on strategy iteration:
the algorithm is as follows: strategy iteration-based HJB equation solving method by offline integral reinforcement learning algorithm
Step 221: policy evaluation: given the control input u_i(x), solve for V_i(x) from the following Bellman equation:
Figure BDA0002942369580000105
Step 222: strategy improvement: the control strategy is updated by:
Figure BDA0002942369580000106
step 223: order to
Figure BDA0002942369580000107
and return to step 221 until convergence to the minimum value;
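For a linear system with a quadratic value function V_i(x) = x^T P_i x, the integral Bellman equation above can be evaluated from measured trajectory data, so the drift dynamics never appear explicitly in the policy-evaluation step; the input matrix B is still used for policy improvement in this sketch. All system data below are illustrative assumptions, and the true A matrix is used only to generate the simulated measurements.

```python
import numpy as np

# x' = Ax + Bu with cost integrand x'Qx + u'Ru; A is used only to simulate data.
A = np.array([[-1.0, 2.0], [-2.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
dt, T = 0.001, 0.05                       # simulation step and reinforcement interval T

def quad_features(x):                     # basis for V(x) = x'Px: [x1^2, x1*x2, x2^2]
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

K = np.zeros((1, 2))                      # initial admissible policy u = -Kx
for it in range(15):
    Phi, b = [], []
    x = np.array([2.0, -1.0])             # restart the trajectory for each policy
    for _ in range(60):                   # 60 data windows of length T under the current policy
        x_start = x.copy()
        cost = 0.0
        for _ in range(int(T / dt)):      # simulate one interval, integrating the stage cost
            u = -K @ x
            cost += dt * (x @ Q @ x + u @ R @ u)
            x = x + dt * (A @ x + B @ u)
        # Integral Bellman equation: V(x(t)) - V(x(t+T)) equals the cost paid on [t, t+T]
        Phi.append(quad_features(x_start) - quad_features(x))
        b.append(cost)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(b), rcond=None)
    P = np.array([[theta[0], theta[1] / 2.0], [theta[1] / 2.0, theta[2]]])
    K_new = np.linalg.solve(R, B.T @ P)   # policy improvement (uses B only)
    if np.linalg.norm(K_new - K) < 1e-4:
        K = K_new
        break
    K = K_new

print("learned value matrix P:\n", P)
print("learned feedback gain K (u = -Kx):", K)
```

The least-squares fit over the quadratic features plays the role of the policy-evaluation step, and the quality of the fit depends on the data windows being sufficiently informative, which is why each policy is evaluated on a fresh trajectory.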
s25: design of self-adaptive distributed observer based on reinforcement learning algorithm to realize consistency distributed control of multiple intelligent agents
Multi-agent system:
x i (k+1)=f i (x(k))+g i (x(k))u(k)
y i (k)=cx i (k) (5-1)
where x_i, u_i and y_i respectively denote the state, the control input and the output of the i-th agent of the system;
the leader model takes the following form into account:
ν(k+1)=Eν(k)
Figure BDA0002942369580000111
in the leader model referred to herein,
Figure BDA0002942369580000112
is the state of the leader system; when agent i satisfies (ν_0, ν_i) ∈ E, i.e., when a communication connection exists between follower i and the leader,
Figure BDA0002942369580000113
represents a known constant matrix; Q satisfies Q(0) = 0, and
Figure BDA0002942369580000114
is an external reference signal;
The following observer is adopted:
Figure BDA0002942369580000115
Figure BDA0002942369580000116
where R_i(k) denotes the observed value of agent i relative to the leader at time k, and satisfies R_0(k) = ν(k), W_0(k) = W,
Figure BDA0002942369580000117
According to the system description, a cost function of the system is obtained according to the derivation of the optimal output problem formula of the linear system:
Figure BDA0002942369580000118
Figure BDA0002942369580000119
Figure BDA00029423695800001110
where i = 1, 2, ..., N, and γ_i is the discount factor,
Figure BDA00029423695800001111
c = [1, 0, 0, ..., 0]; the optimal feedback input of each follower is obtained by solving equation (5-4):
Figure BDA00029423695800001112
When solving for the optimal feedback input, (5-4) is written in quadratic form and expressed as the value function of the system:
Figure BDA00029423695800001113
the following bellman equation is obtained:
Figure BDA0002942369580000121
According to the above Bellman equation, the HJB equation of the nonlinear optimal feedback is defined as:
Figure BDA0002942369580000122
when the stability condition is satisfied
Figure BDA0002942369580000123
Then, the following optimal control inputs are obtained:
Figure BDA0002942369580000124
wherein,
Figure BDA0002942369580000125
solving the HJB equation by adopting strategy iteration of IRL;
obtaining the following strategy iteration-based online IRL multi-agent optimal feedback control algorithm:
the algorithm is as follows: online IRL algorithm solution HJB equation based on strategy iteration
Step 231: initialization: selecting a control input
Figure BDA0002942369580000126
and repeat the following steps until the system converges;
Figure BDA0002942369580000127
step 232: strategy improvement: the control strategy is updated by:
Figure BDA0002942369580000128
Step 233: let u_i(k) = u_{i+1}(k) and return to step 231 until V_i(k) converges to the minimum value;
on the basis of the system (5-1) and (5-2) models, consider the following first-order multi-agent system:
Figure BDA0002942369580000129
where x_i(k) and u_i(k) respectively represent the state and the control input of the i-th agent at time k; τ_ij ≥ 0 denotes the communication time delay of data from agent j to agent i, and τ_i ≥ 0 denotes the input time delay of agent i; consider a first-order discrete multi-agent system of n agents whose network topology is a static directed weighted graph containing a globally reachable node; if the following condition is satisfied:
Figure BDA0002942369580000131
then max_i { d_i (2 τ_i + 1) } < 1 holds, and the system is able to achieve asymptotic consensus, where
Figure BDA0002942369580000132
assume that a multi-agent system contains 5 nodes, and their corresponding adjacency matrices are as follows:
Figure BDA0002942369580000133
according to the setting, the input time lag of the intelligent agent should satisfy
Figure BDA0002942369580000134
Suppose τ_13 = 1 s, τ_21 = 0.75 s, τ_32 = 1.8 s, τ_42 = 2 s and τ_51; when the input time delay τ is 0.8 s and the initial state of the agents is randomly generated as x(0) = (2.5, 3, 2, 3.5, 5), the agents finally reach consensus asymptotically; when the input time delay is changed to 3 s, the system still achieves consensus.
The invention has the beneficial effects that:
1. The design combines related dynamic programming techniques and provides a policy-iteration-based reinforcement learning method for a single agent;
2. The invention designs an adaptive distributed observer that estimates the unmeasured states of the multi-agent system from the output information of the observer system, overcoming the drawback that a traditional full-dimensional observer cannot exchange information in real time;
3. The invention provides a method for computing the solution of the observer equation online, avoiding the repeated differentiation of the observer equation and the multi-step construction of a Lyapunov function required by traditional methods;
4. By means of the output regulation technique, the invention approximates the constructed value function with an integral-reinforcement-learning-based policy iteration algorithm to solve the HJB equation, overcoming the difficulty traditional methods have in solving the HJB equation for the optimal solution;
5. The invention designs the controller with a reinforcement-learning-based method and obtains the optimal solution by iteration; integrating existing knowledge of reinforcement learning, it constructs a method that solves the distributed consensus output regulation problem of the system by adaptive state feedback and adaptive measurement output feedback control when no follower knows the state of the leader system, realizing distributed consensus control of the multi-agent system;
6. The invention provides a reinforcement-learning-based consensus control method under different time delays: it considers the constrained-input problem, selects the required performance function as a non-quadratic penalty function, and then considers the communication time delay between the followers and the leader and the input time delay among the followers.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of an intelligent system node connection topology;
fig. 2 is a simulation result diagram.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only for the purpose of illustrating the invention and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Aiming at the distributed consensus control problem of multi-agent systems, the invention solves the consensus output regulation problem of a linear multi-agent system by designing an adaptive distributed observer for each follower, initially on the premise that each follower knows the system matrix S of the leader system. The design of the adaptive distributed observer consists of three steps. First, an adaptive distributed observer is designed to estimate the system matrix and the state of the leader system. Second, a method for computing the solution of the observer equation online is provided after the adaptive distributed observer is designed. Third, to eliminate a few extreme cases, the distributed consensus output regulation problem of the system is solved by combining adaptive state feedback and adaptive measurement output feedback control when no follower knows the system matrix of the leader.
After the adaptive distributed observer is constructed, the unmeasured states of the multi-agent system are estimated from the output information of the observer system; then, based on the estimated states, an optimal controller is designed with a reinforcement learning method. The constrained-input problem is also considered, and the required performance function can be selected as a non-quadratic penalty function; then the communication time delay between the followers and the leader and the input time delay among the followers are considered, and the consensus problem under different time delays is verified. The invention designs an ADP-based reinforcement learning controller: a cost function of the reinforcement-learning-based multi-agent system is established, the HJB (Hamilton-Jacobi-Bellman) equation of this cost function is formulated and solved by a synchronous-reinforcement-learning-based method, and the optimal solution is finally obtained by an iterative method, realizing optimal control of the multi-agent system.
In order to achieve the purpose, the technical scheme of the invention is as follows:
the multi-agent distributed consistency control method adopts the self-adaptive distributed observer to estimate the state of the system, adopts a reinforcement learning based method to design the controller according to the estimated state, and obtains the optimal solution through an iterative method to realize the optimal control of the multi-agent system.
First part is based on single intelligent agent optimal output control of reinforcement learning
When designing the optimal controller of a single agent, only the solution of the tracking HJB equation needs to be learned online with an off-policy reinforcement learning algorithm, without any knowledge of the system dynamics.
Consider the following system model:
Figure BDA0002942369580000151
where x, u are the status and control inputs of the system, respectively, and d is an external disturbance. Assuming that f (x), g (x), l (x) are Lipchitz functions, and f (0) ═ 0, the system is robust and stable.
Let p (t) be the consistency condition that needs to be achieved and satisfy the following form:
Figure BDA0002942369580000152
and h (0) ═ 0, the tracking error of the whole system is defined as:
Figure BDA0002942369580000153
simultaneous reaction of (1-1), (1-2) and (1-3) to obtain:
Figure BDA0002942369580000154
the following virtual performance outputs are defined to meet the requirements:
Figure BDA0002942369580000155
the following performance function (cost function) is defined for the system:
Figure BDA0002942369580000161
suppose the system is in u * The initial satisfaction cost is minimized, then
Figure BDA0002942369580000162
The following bellman equation is given:
Figure BDA0002942369580000163
wherein,
Figure BDA0002942369580000164
is an amplification system designed for the system.
According to the optimum conditions
Figure BDA0002942369580000165
And
Figure BDA0002942369580000166
the optimal control input and the optimal disturbance input can be obtained:
Figure BDA0002942369580000167
Figure BDA0002942369580000168
wherein V * Is the optimization value function defined in (1-7).
From the optimal input conditions obtained in (1-10), the following trajectory tracking hjb (hamilton Jacobi bellman) equation can be obtained:
Figure BDA0002942369580000169
thus, the following single agent offline RL algorithm is obtained:
Figure BDA00029423695800001610
Figure BDA0002942369580000171
second part reinforcement learning based multi-agent consistency control
Firstly, graph theory:
let G ═ V, E, a be a weighted graph, which describes the information channels between N agents. V is follower node { V 1 ,v 2 ,…v N A non-empty finite set of };
Figure BDA0002942369580000172
is an edge set;
Figure BDA0002942369580000173
is a weighted adjacency matrix, and when (v) i ,v j ) E is E, a ij Is greater than 0; if it is
Figure BDA0002942369580000174
then a_ij = 0, and a_ii = 0 for all i = 1, 2, ..., N. Define N_i = {v_j ∈ V : (v_i, v_j) ∈ E} as the neighbor set of follower v_i, i.e., all followers in N_i can send information directly to follower v_i. The matrix D = diag(d_1, d_2, ..., d_N) is the in-degree matrix, where
Figure BDA0002942369580000175
1, 2. Then, the laplacian matrix L ═ D-a ═ L of the directed graph G ═ (V, E, a) ij ]Wherein l is ij =-a ij
Figure BDA0002942369580000176
It follows that the sum of each row of the Laplacian matrix L is zero, i.e., 1_N is a right eigenvector of the Laplacian matrix L with eigenvalue zero. In the invention only a simple graph is considered. The graph contains a spanning tree if there exists a node v_i from which there is a directed path to every other node in the graph; the graph is strongly connected if there is a directed path from every node to every other node. For graphs containing a spanning tree, strong connectivity is a sufficient but not necessary condition.
Secondly, problem description:
considering a multi-agent system consisting of one leader and N followers, and considering a form with a directed graph of communication g (x), the kinetic model of the ith follower is:
Figure BDA0002942369580000177
wherein
Figure BDA0002942369580000178
And
Figure BDA0002942369580000179
respectively the status and input of the ith follower,
Figure BDA00029423695800001710
and
Figure BDA00029423695800001711
internal function and input matrix function of the ith follower, respectively, and assume f i (x i ),g i (x i ) Is unknown, has f i (0) The system (2-1) has robust stability at 0.
The leader's kinetic model was:
Figure BDA0002942369580000181
wherein
Figure BDA0002942369580000182
Is the state of the leader and is,
Figure BDA0002942369580000183
is unknown; D is a constant matrix; f is assumed to be differentiable and bounded, ||f(x_0)|| ≤ ρ_0.
According to the network topological relation between each follower and the adjacent agent, the local field consistency error of the system can be described as follows;
Figure BDA0002942369580000184
wherein
Figure BDA0002942369580000185
and b_i ≥ 0, with b_i > 0 if and only if there is communication between the i-th agent and the leader. As can be seen from the above equation, the consensus information of the multi-agent system can be represented by the local neighborhood consensus error e_i: when e_i → 0 as t → ∞, the multi-agent system reaches consensus.
Adaptive distributed observer
In the invention, by designing the adaptive distributed observer for each follower, the problem that in a multi-agent system, in the case of unknown state of a leader, the follower can estimate the state of the leader in real time can be solved, and the state of the follower relative to the leader can be converted into the state of the adaptive distributed observer relative to the leader in the design.
The involved adaptive distributed observer is as follows:
Figure BDA0002942369580000186
Figure BDA0002942369580000187
wherein x 0 =x,
Figure BDA0002942369580000188
D 0 =D,
Figure BDA0002942369580000189
Mu is more than 0; under the error description of the system, satisfy
Figure BDA00029423695800001810
i is 1,2, …, N, and satisfies
Figure BDA00029423695800001811
This observer contains a mechanism for estimating the matrix D, hence the name adaptive distributed observer; as seen from (1-3), D is known only to the agents adjacent to the leader, so this observer is more stable and accurate than a generic distributed observer.
Since the control law needs to use the designed regulator equation to provide suitable feedforward control to achieve the control objective, and the solution of that equation depends on the system matrix S of the leader system, while not every follower knows the matrix S, it is proposed to obtain S_i (the estimate of S) from the solution of the adaptive computation equation, so the following observer form can be obtained:
Figure BDA0002942369580000191
fourth, multi-agent system controller design based on reinforcement learning
Before studying consensus control of multi-agent systems, a specific dynamic model needs to be discussed. Research on multi-agent systems is mostly based on adaptive dynamic programming (ADP), and most ADP research is carried out on systems operating in discrete time (DT). Therefore, the nonlinear discrete-time system is considered first and some methods for optimal control of discrete-time systems are summarized; then an online reinforcement learning scheme is designed for the discrete-time system by combining the reinforcement learning method with the linear quadratic regulation (LQR) technique. Consider the following system model:
x k+1 =f(x k )+g(x k )u k (4-1)
where x_k is the state of the system and u_k is the control input of the system; the system model can also be written more concisely as x_{k+1} = F(x_k, u_k).
For each state x of a multi-agent system k The following control strategies are defined:
u k =h(x k ) (4-2)
This mapping is also called a feedback controller; in the field of feedback control there are many ways to design a feedback control policy, including optimal solutions of the Riccati equation, adaptive control, H-infinity control and classical frequency-domain control. In the reinforcement learning scheme of the invention, the control policy is learned in real time from the stimulus of the environment.
To obtain the optimal control strategy for the system, the following cost function is now designed for the system:
Figure BDA0002942369580000194
wherein, the discount factor 0 is more than gamma and less than or equal to 1, u k =h(x k ) Is the control strategy in the design. It can also be given in the following more general standard quadratic form:
Figure BDA0002942369580000195
Suppose the minimum cost is attained at V*; the optimal cost function is:
Figure BDA0002942369580000201
when the optimal control strategy is taken, the optimal control value given by the system is as follows:
Figure BDA0002942369580000202
in the original system, the leader of the multi-agent system is considered to have the following model:
x k+1 =f(x k ) (4-7)
with a communication network diagram of a given system, the local coherence error of the system is defined as:
Figure BDA0002942369580000203
As can be seen from the above formula, the consensus information of the multi-agent system can be represented by the above local neighborhood consensus error: if e_i → 0 as t → ∞, the system tends to consensus.
To overcome the drawback that reinforcement-learning-based adaptive dynamic programming methods depend strongly on the model of each agent system, an additional compensator is designed that does not depend on the individual subsystems and is defined by the desired input-affine differential equation as follows:
Figure BDA0002942369580000204
and combining the knowledge of the corresponding graph theory to obtain a global error form of (4-10):
e=L'(x-x 0 ) (4-10)
wherein,
Figure BDA0002942369580000205
and is provided with
Figure BDA0002942369580000206
satisfying b_ii = b_i and, for i ≠ j, b_ij = 0
After the local error e is derived after the simultaneous operations (2-1) and (4-10), the local region consistency error is obtained relative to the graph G (x) as follows:
Figure BDA0002942369580000207
wherein, f e (t)=f(x(t))-f(x(0)),
Figure BDA0002942369580000208
L_i denotes the i-th column vector of the Laplacian matrix. Combining (4-10) and (4-11), the local neighborhood consensus error can be expressed as:
Figure BDA0002942369580000209
wherein,
Figure BDA0002942369580000211
and satisfies the following conditions:
Figure BDA0002942369580000212
Figure BDA0002942369580000213
likewise, returning to the system model at the beginning of the designed continuous time:
Figure BDA0002942369580000214
Figure BDA0002942369580000215
given a cost function for continuous-time multi-agent system consistency control:
Figure BDA0002942369580000216
Then the associated tracking Bellman equation can be obtained from the affine differential equations defined by (4-9) and (4-13) using the Leibniz rule:
Figure BDA0002942369580000217
where U (u) is a positive definite integrand on the control input u:
Figure BDA0002942369580000218
then (4-15) can be expressed by the following equation:
Figure BDA0002942369580000219
then, the following Hamiltonian (Hamiltonian) equation is defined:
Figure BDA00029423695800002110
Let V* denote the optimal control cost of the system; the optimal cost function is defined as follows:
Figure BDA00029423695800002111
at an optimum cost V * Next, from the Hamiltonian equation in (4-18), the following HJB (Hamilton Jacobi Bellman) equation can be obtained:
Figure BDA0002942369580000221
when the stability condition is satisfied
Figure BDA0002942369580000222
Then, the following optimal control inputs can be obtained:
Figure BDA0002942369580000223
the following strategy iteration algorithm is obtained:
Figure BDA0002942369580000224
To realize optimal consensus control of the multi-agent system when the system dynamics are unknown, an integral reinforcement learning algorithm can be introduced into the above policy iteration: based on the discrete-time system (4-1), for any integration interval T > 0, the value function of the continuous system (4-13) satisfies the following form:
Figure BDA0002942369580000225
then, tracking the solution of the Bellman equation by using an integral reinforcement learning algorithm, and solving the HJB equation by using integral reinforcement chemistry can be realized under the condition that a system dynamic model is unknown.
Obtaining the following integral reinforcement learning algorithm based on strategy iteration:
Figure BDA0002942369580000231
fifthly, designing a self-adaptive distributed observer based on a reinforcement learning algorithm to realize the consistency distributed control of multiple intelligent agents
In this section, the unmeasured states of the multi-agent system are estimated from the output information of the observer system designed in the third section; then, based on the estimated states, an optimal controller is designed with a reinforcement learning method. The constrained-input problem is also considered, and the required performance function can be selected as a non-quadratic penalty function; then the communication time delay between the followers and the leader and the input time delay among the followers are considered, and the consensus problem under different time delays is verified in this section.
Consider the following multi-agent system:
x i (k+1)=f i (x(k))+g i (x(k))u(k)
y i (k)=cx i (k) (5-1)
where x_i, u_i and y_i respectively represent the state, the control input and the output of the i-th agent of the system.
The leader model takes the following form into account:
ν(k+1)=Eν(k)
Figure BDA0002942369580000232
in the leader model referred to in the above,
Figure BDA0002942369580000233
is the state of the leader system; when agent i satisfies (ν_0, ν_i) ∈ E, i.e., when there is a communication connection between follower i and the leader,
Figure BDA0002942369580000241
Representing a known constant matrix. Q satisfies the condition that Q (0) is 0,
Figure BDA0002942369580000242
is an external reference signal.
Considering that the adaptive distributed observer designed in the third part has unpredictability in parameter design and is not suitable for discrete time system, a simpler and more easily designed observer is adopted in the present part:
Figure BDA0002942369580000243
Figure BDA0002942369580000244
wherein R is i (k) Represents an observed value of agent i relative to the leader at time k, and satisfies R 0 (k)=ν(k),W 0 (k)=W,
Figure BDA0002942369580000245
According to the above system description, a cost function of the system can be obtained according to the derivation of the optimal output problem formula of the linear system:
Figure BDA0002942369580000246
Figure BDA0002942369580000247
Figure BDA0002942369580000248
where i = 1, 2, ..., N, and γ_i is the discount factor,
Figure BDA0002942369580000249
c = [1, 0, 0, ..., 0]; by solving equation (5-4), the optimal feedback input of each follower can be obtained:
Figure BDA00029423695800002410
When solving for the optimal feedback input, (5-4) can be written in quadratic form and expressed as the value function of the system:
Figure BDA00029423695800002411
the following bellman equation is obtained:
Figure BDA00029423695800002412
This is a consistency equation satisfied by the value function; according to the above Bellman equation, the HJB equation of the nonlinear optimal feedback can be defined as:
Figure BDA00029423695800002413
When the stability condition is satisfied
Figure BDA0002942369580000251
Then, the following optimal control inputs can be obtained:
Figure BDA0002942369580000252
wherein,
Figure BDA0002942369580000253
because the HJB equation is difficult to solve, the strategy iteration of the IRL is adopted in the algorithm to solve the HJB equation.
Obtaining the following strategy iteration-based online IRL multi-agent optimal feedback control algorithm:
Figure BDA0002942369580000254
now, the problem of communication time lag between the system follower and the leader and the problem of input time lag between the followers are considered, and the consistency problem under different time lags is verified in this section.
On the basis of the original system (5-1) and (5-2) models, the following first-order multi-agent system is considered:
Figure BDA0002942369580000255
wherein,
Figure BDA0002942369580000256
respectively representing the state and the control input of the i-th agent at time k; τ_ij ≥ 0 denotes the communication time delay of data from agent j to agent i, and τ_i ≥ 0 denotes the input time delay of agent i. Consider a first-order discrete multi-agent system of n agents whose network topology is a static directed weighted graph containing a globally reachable node; if the following condition is satisfied:
Figure BDA0002942369580000257
then max_i { d_i (2 τ_i + 1) } < 1 holds, and the system is able to achieve asymptotic consensus, where
Figure BDA0002942369580000261
suppose a multi-agent system comprises 5 nodes, the connection topology of which is shown in FIG. 1, which is a directed weighted graph with node 1 being globally reachable. From the network topology of fig. 1, the corresponding adjacency matrix can be obtained as follows:
Figure BDA0002942369580000262
according to the setting, the input time lag of the intelligent agent should satisfy
Figure BDA0002942369580000263
Now the following assumptions are made: τ_13 = 1 s, τ_21 = 0.75 s, τ_32 = 1.8 s, τ_42 = 2 s and τ_51; when the input time delay τ is 0.5 s and the initial state of the agents is randomly generated as x(0) = (2.5, 3, 2, 3.5, 5), the simulation result is shown in Fig. 2, and it can be seen that the agents eventually reach consensus asymptotically. Then the input time delay is changed to 3 s, and the system can still achieve consensus.
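The consensus protocol and the adjacency matrix used in this example are given in the original only as images, so the sketch below assumes a standard first-order delayed protocol, a hypothetical set of edge weights satisfying max_i { d_i (2 τ_i + 1) } < 1, a sampling period of 0.05 s for converting the delays given in seconds into steps, and an assumed value for τ_51; only the delays listed in the text and the initial state x(0) = (2.5, 3, 2, 3.5, 5) are taken from the patent.

```python
import numpy as np

n_steps = 4000
dt = 0.05                      # assumed sampling period (s), so 1 s = 20 steps
# Communication delays tau_ij (from agent j to agent i), taken from the text and
# converted to steps; tau_51 is not given in the text, so a value is assumed here.
tau_c = {(0, 2): 20, (1, 0): 15, (2, 1): 36, (3, 1): 40, (4, 0): 16}
tau_i = 10                     # input delay of every agent: 0.5 s = 10 steps (assumed uniform)
# Hypothetical adjacency weights on the same edges, chosen so that d_i*(2*tau_i+1) < 1
a = np.zeros((5, 5))
for (i, j) in tau_c:
    a[i, j] = 0.04

x_hist = np.tile(np.array([2.5, 3.0, 2.0, 3.5, 5.0]), (n_steps + 1, 1))  # x(0) from the text
u_hist = np.zeros((n_steps + 1, 5))

for k in range(n_steps):
    for i in range(5):
        # assumed protocol: u_i(k) = sum_j a_ij ( x_j(k - tau_ij) - x_i(k) )
        u_hist[k, i] = sum(a[i, j] * (x_hist[max(k - tau_c[(i, j)], 0), j] - x_hist[k, i])
                           for j in range(5) if a[i, j] > 0)
        # assumed dynamics with input delay: x_i(k+1) = x_i(k) + u_i(k - tau_i)
        x_hist[k + 1, i] = x_hist[k, i] + u_hist[max(k - tau_i, 0), i]

print("final states:", x_hist[-1])
print("spread:", x_hist[-1].max() - x_hist[-1].min())   # shrinks toward zero (consensus)
```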
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (1)

1. A model unknown multi-agent consistency control method based on reinforcement learning is characterized in that: the method comprises the following steps:
s1: performing optimal output control of the single agent based on reinforcement learning;
s2: multi-agent consistency control based on reinforcement learning;
the S1 specifically includes:
when designing the optimal controller of a single agent, an off-policy reinforcement learning algorithm is adopted to learn the solution of the tracking HJB equation online, and the following system model is considered:
Figure FDA0003791058280000011
where x, u are the status and control inputs of the system, respectively, and d is external interference; assuming that f (x), g (x), l (x) are Lipchitz functions, and f (0) ═ 0, the system is robust and stable;
assume that p (t) is a condition of consistency that needs to be achieved and satisfies the following form:
Figure FDA0003791058280000012
and h (0) is equal to 0, the tracking error of the whole system is defined as:
Figure FDA0003791058280000013
simultaneous reaction of (1-1), (1-2) and (1-3) to obtain:
Figure FDA0003791058280000014
the following virtual performance outputs are defined to meet the requirements:
Figure FDA0003791058280000015
defining a performance function for the system:
Figure FDA0003791058280000016
suppose the system is in u * The initial satisfaction cost is minimized, then
Figure FDA0003791058280000017
The following bellman equation is given:
Figure FDA0003791058280000018
wherein,
Figure FDA0003791058280000019
is an amplification system designed for the system;
according to the optimum conditions
Figure FDA0003791058280000021
And
Figure FDA0003791058280000022
obtaining an optimal control input and an optimal interference input:
Figure FDA0003791058280000023
Figure FDA0003791058280000024
wherein V * Is the optimized value function defined in (1-7);
The following trajectory-tracking HJB equation is obtained from the optimal input conditions in (1-10):
Figure FDA0003791058280000025
Thus, the following single-agent offline RL algorithm is obtained:
Algorithm: solving the HJB equation based on the RL algorithm
Step 11: Initialization: given an admissible stabilizing control policy u_0.
Step 12: Policy evaluation: for a control input u_i and a disturbance input d_i, solve the following Bellman equation:
Figure FDA0003791058280000026
Figure FDA0003791058280000027
Step 13: update the disturbance d_i of the system:
Figure FDA0003791058280000028
Step 14: update the control input u_i of the system:
Figure FDA0003791058280000029
Step 15: re-execution of step 11
Step S2 specifically comprises:
S21: graph-theoretic preliminaries:
Let G = (V, E, A) be a weighted graph describing the information channels between the N agents; V = {v_1, v_2, …, v_N} is a non-empty finite set of follower nodes;
Figure FDA0003791058280000031
is an edge set;
Figure FDA0003791058280000032
is the weighted adjacency matrix, with a_ij > 0 when (v_i, v_j) ∈ E; if
Figure FDA0003791058280000033
then a_ij = 0, and a_ii = 0 for all i = 1, 2, …, N. N_i = {v_j ∈ V : (v_i, v_j) ∈ E} denotes the neighbor follower set of follower v_i, i.e. all followers in N_i send information directly to follower v_i. The matrix D = diag(d_1, d_2, …, d_N) is defined as the in-degree matrix, where
Figure FDA0003791058280000034
The Laplacian matrix of the directed graph G = (V, E, A) is L = D − A = [l_ij], where l_ij = −a_ij for i ≠ j and
Figure FDA0003791058280000035
The sum of each row of the Laplacian matrix L is zero, i.e. 1_N is a right eigenvector of L with corresponding eigenvalue zero. The graph contains a spanning tree if there is a node v_i from which there is a directed path to every other node of the graph; the graph is strongly connected if there is a directed path from every node to every other node. For a graph to contain a spanning tree, strong connectivity is a sufficient but not necessary condition;
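A small numerical illustration of these graph-theoretic objects follows; the weighted adjacency matrix is hypothetical, since the patent's own matrix appears only as an image.

```python
import numpy as np

# a_ij > 0 means follower i receives information from follower j (hypothetical weights)
A = np.array([[0.0, 0.0, 0.3, 0.0, 0.0],
              [0.4, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.6, 0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0, 0.0, 0.0]])
D = np.diag(A.sum(axis=1))         # in-degree matrix, d_i = sum_j a_ij
Lap = D - A                        # Laplacian L = D - A, l_ij = -a_ij for i != j

print(Lap @ np.ones(5))            # every row sums to zero: 1_N is a right eigenvector
print(np.linalg.eigvals(Lap))      # one eigenvalue is (numerically) zero
```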
S22: problem description:
Consider a multi-agent system consisting of one leader and N followers with a directed communication graph G; the dynamic model of the ith follower is:
Figure FDA0003791058280000036
wherein
Figure FDA0003791058280000037
And
Figure FDA0003791058280000038
are the state and input of the ith follower, respectively,
Figure FDA0003791058280000039
and
Figure FDA00037910582800000310
are the internal dynamics function and the input matrix function of the ith follower, respectively; it is assumed that f_i(x_i) and g_i(x_i) are unknown, f_i(0) = 0, and the system (2-1) is robustly stable;
The leader's dynamic model is:
Figure FDA00037910582800000311
wherein
Figure FDA00037910582800000312
is the state of the leader,
Figure FDA00037910582800000313
is unknown, D is a constant matrix, and f is differentiable and bounded, ||f(x_0)|| ≤ ρ_0;
According to the network topological relation between each follower and the adjacent agent, the local domain consistency error of the system is described as follows:
Figure FDA00037910582800000314
wherein
Figure FDA00037910582800000315
and b_i ≥ 0, with b_i > 0 if and only if the ith agent communicates with the leader; the consensus information of the multi-agent system consists of the local consistency errors e_i: when t → ∞ and e_i → 0, the multi-agent system reaches consensus;
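The local consistency error can be computed directly from the adjacency weights and the pinning gains b_i. The sketch below assumes the common sign convention e_i = Σ_j a_ij(x_i − x_j) + b_i(x_i − x_0), since the exact expression (2-3) appears only as an image, and checks that it coincides with the global form e = (L + B)(x − x_0).

```python
import numpy as np

A = np.array([[0.0, 0.0, 0.3, 0.0, 0.0],      # hypothetical follower adjacency weights
              [0.4, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.6, 0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0, 0.0, 0.0]])
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0])       # only follower 1 communicates with the leader
x = np.array([2.5, 3.0, 2.0, 3.5, 5.0])       # follower states (scalar for simplicity)
x0 = 1.0                                       # leader state

e_local = np.array([sum(A[i, j] * (x[i] - x[j]) for j in range(5)) + b[i] * (x[i] - x0)
                    for i in range(5)])

Lp = np.diag(A.sum(axis=1)) - A + np.diag(b)   # L' = L + B with B = diag(b_i)
e_global = Lp @ (x - x0)
print(np.allclose(e_local, e_global))          # True: the two expressions coincide
```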
S23: adaptive distributed observer:
By designing an adaptive distributed observer for each follower, the problem of each follower estimating the leader's state in real time when that state is not directly available is solved, and tracking of the leader by each follower is converted into tracking of its adaptive distributed observer;
The adaptive distributed observer is as follows:
Figure FDA0003791058280000041
Figure FDA0003791058280000042
where x_0 = x,
Figure FDA0003791058280000043
D_0 = D,
Figure FDA0003791058280000044
and μ > 0; under the error description of the system, it is satisfied that
Figure FDA0003791058280000045
and it satisfies
Figure FDA0003791058280000046
The adaptive distributed observer includes a mechanism for estimating the matrix D, which is known only to the followers adjacent to the leader;
Using the estimate S_i of S obtained from the solution of the adaptive computation equation, the following observer form is obtained:
Figure FDA0003791058280000047
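As a minimal sketch of this idea, the Euler-discretized loop below implements an adaptive distributed observer of the standard (Cai-Huang-type) form: each follower estimates the leader matrix D and the leader state by neighbourhood consensus, and only the follower pinned to the leader uses D and x_0 directly. The gains μ1 and μ2, the chain graph and the leader dynamics are assumptions made for illustration; the patent's own update laws are those of (3-1)-(3-3).

```python
import numpy as np

dt, T = 0.01, 20.0
D = np.array([[0.0, 1.0], [-1.0, 0.0]])        # leader dynamics dx0/dt = D x0 (bounded orbits)
x0 = np.array([1.0, 0.0])

N = 4
A = np.zeros((N, N))
A[1, 0] = A[2, 1] = A[3, 2] = 1.0              # follower chain 1 -> 2 -> 3 -> 4
b = np.array([1.0, 0.0, 0.0, 0.0])             # only follower 1 is pinned to the leader
mu1, mu2 = 5.0, 5.0                            # observer gains (assumed)

D_hat = [np.zeros((2, 2)) for _ in range(N)]   # each follower's estimate of D
eta = [np.zeros(2) for _ in range(N)]          # each follower's estimate of the leader state

for _ in range(int(T / dt)):
    x0 = x0 + dt * (D @ x0)                    # leader integrated forward (Euler)
    new_D, new_eta = [], []
    for i in range(N):
        dD = mu1 * (sum(A[i, j] * (D_hat[j] - D_hat[i]) for j in range(N))
                    + b[i] * (D - D_hat[i]))
        de = D_hat[i] @ eta[i] + mu2 * (sum(A[i, j] * (eta[j] - eta[i]) for j in range(N))
                                        + b[i] * (x0 - eta[i]))
        new_D.append(D_hat[i] + dt * dD)
        new_eta.append(eta[i] + dt * de)
    D_hat, eta = new_D, new_eta

print([round(float(np.linalg.norm(eta[i] - x0)), 4) for i in range(N)])  # errors decay toward zero
```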
S24: design of the multi-agent system controller based on reinforcement learning;
consider the following system model:
x_{k+1} = f(x_k) + g(x_k)u_k    (4-1)
wherein,
Figure FDA0003791058280000048
is the state of the system,
Figure FDA0003791058280000049
is the control input of the system; the system model can also be written more concisely as x_{k+1} = F(x_k, u_k);
For each state x_k of the multi-agent system, the following control policy is defined:
u_k = h(x_k)    (4-2)
This mapping is also called a feedback controller; in the field of feedback control, many feedback control strategies have been designed, including optimal solutions of the Riccati equation, adaptive control, H-infinity control and classical frequency-domain control;
in order to obtain the optimal control strategy of the system, the following cost function is designed for the system:
Figure FDA0003791058280000051
where the discount factor satisfies 0 < γ ≤ 1, and u_k = h(x_k) is the control policy being designed;
or given in standard quadratic form:
Figure FDA0003791058280000052
Suppose V* is the minimum cost paid by the system; the optimal cost policy is then as follows:
Figure FDA0003791058280000053
when the optimal control strategy is taken, the optimal control value given by the system is as follows:
Figure FDA0003791058280000054
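For the linear-quadratic special case of (4-3)-(4-6) the Bellman recursion can be carried out in closed form as a discounted Riccati value iteration. The sketch below, with hypothetical A, B, Q, R and γ, computes the matrix P of the optimal value V*(x) = xᵀPx and the gain K of the optimal policy u_k = h(x_k) = −K x_k.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])     # hypothetical linear dynamics (stand-in for f, g)
B = np.array([[0.005], [0.1]])
Q, R, gamma = np.eye(2), np.array([[0.1]]), 0.95

P = np.zeros((2, 2))
for _ in range(500):                       # Bellman recursion on V(x) = x' P x
    K = gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    P = Q + gamma * A.T @ P @ (A - B @ K)

print(K)                                   # optimal feedback gain, u_k = -K x_k
```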
in the original system, the leader of the multi-agent system is considered to have the following model:
x_{k+1} = f(x_k)    (4-7)
Given the communication network graph of the system, the local consistency error of the system is defined as:
Figure FDA0003791058280000055
The consensus information of the multi-agent system is represented by the above local consistency error, i.e., when t → ∞ and e_i → 0, the system reaches consensus;
An additional compensator, independent of the individual subsystems, is designed and defined by the expected input-affine differential equation:
Figure FDA0003791058280000056
Combining this with the corresponding graph theory, the global error form (4-10) is obtained:
e = L'(x − x_0)    (4-10)
wherein,
Figure FDA0003791058280000057
and
Figure FDA0003791058280000058
satisfying b_ii = b_i and b_ij = 0 when i ≠ j;
Differentiating the local error e and combining (2-1) and (4-10), the local consistency error with respect to the graph G is obtained as follows:
Figure FDA0003791058280000061
where f_e(t) = f(x(t)) − f(x(0)),
Figure FDA0003791058280000062
L_i denotes the ith column vector of the Laplacian matrix; combining (4-10) and (4-11), the local consistency error is expressed as:
Figure FDA0003791058280000063
wherein,
Figure FDA0003791058280000064
and satisfies the following conditions:
Figure FDA0003791058280000065
Figure FDA0003791058280000066
Likewise, returning to the originally designed continuous-time system model:
Figure FDA0003791058280000067
Figure FDA0003791058280000068
given a cost function for continuous-time multi-agent system consistency control:
Figure FDA0003791058280000069
The corresponding tracking Bellman equation is then obtained from the affine differential equations defined in (4-9) and (4-13) by using the Leibniz rule:
Figure FDA00037910582800000610
where U(u) is a positive-definite integrand of the control input u:
Figure FDA00037910582800000611
then (4-15) is expressed by the following equation:
Figure FDA00037910582800000612
then, the following Hamiltonian equation is defined:
Figure FDA00037910582800000613
Figure FDA0003791058280000071
Let V* be the optimal control cost of the system; the optimal cost function is then defined as follows:
Figure FDA0003791058280000072
Under the optimal cost V*, according to the Hamiltonian equation in (4-18), the following HJB equation is obtained:
Figure FDA0003791058280000073
when the stability condition is satisfied
Figure FDA0003791058280000074
Then, the following optimal control inputs are obtained:
Figure FDA0003791058280000075
The following policy iteration algorithm is obtained:
Algorithm: solving the HJB equation based on the policy iteration method
Step 211: Policy evaluation: given a control input u_i(x), solve for V_i(x) via the Bellman equation:
Figure FDA0003791058280000076
Step 212: Policy improvement: the control policy is updated by:
Figure FDA0003791058280000077
Step 213: let
Figure FDA0003791058280000078
and return to Step 211 until convergence to the minimum value;
An integral reinforcement learning algorithm is introduced into the policy iteration algorithm; based on the discrete-time system (4-1), for any integration interval T > 0, the value function of the continuous-time system (4-13) satisfies the following form:
Figure FDA0003791058280000079
The solution of the tracking Bellman equation is learned with the integral reinforcement learning algorithm, and the HJB equation is solved by integral reinforcement learning even when the system dynamic model is unknown;
The following policy-iteration-based integral reinforcement learning algorithm is obtained:
Algorithm: offline integral reinforcement learning algorithm for solving the HJB equation based on policy iteration
Step 221: Policy evaluation: given a control input u_i(x), solve for V_i(x) via the Bellman equation:
Figure FDA0003791058280000081
Step 222: Policy improvement: the control policy is updated by:
Figure FDA0003791058280000082
Step 223: let
Figure FDA0003791058280000083
and return to Step 221 until convergence to the minimum value;
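The sketch below illustrates Steps 221-223 on a scalar plant: the policy-evaluation step uses only measured trajectory data over intervals of length T (the drift a is never used by the learner), while the policy-improvement step still uses the input channel b, i.e. g(x), as in standard integral reinforcement learning. The plant parameters, cost weights, interval length and the absence of a discount are all illustrative assumptions.

```python
import numpy as np

a, b = 1.0, 1.0                   # hypothetical plant dx/dt = a x + b u, used only to simulate data
q, r = 1.0, 1.0                   # stage cost q x^2 + r u^2
T, dt = 0.05, 1e-3                # reinforcement interval and integration step

k = 3.0                           # admissible stabilizing initial gain, u = -k x
for _ in range(8):
    x = 1.0                       # restart the trajectory for each evaluation
    Phi, y = [], []
    for _ in range(12):           # collect data intervals under the current policy
        x_start, cost = x, 0.0
        for _ in range(int(T / dt)):
            u = -k * x
            cost += (q * x**2 + r * u**2) * dt
            x += (a * x + b * u) * dt
        Phi.append(x_start**2 - x**2)      # V(x(t)) - V(x(t+T)) with V(x) = p x^2
        y.append(cost)
    p = np.linalg.lstsq(np.array(Phi)[:, None], np.array(y), rcond=None)[0][0]
    k = b * p / r                 # improvement: u = -(1/2) R^{-1} g dV/dx = -(b p / r) x

p_are = (a * r + np.sqrt((a * r) ** 2 + q * r * b**2)) / b**2
print("learned gain:", k, " ARE gain:", b * p_are / r)    # both approach 1 + sqrt(2)
```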
S25: an adaptive distributed observer is designed based on the reinforcement learning algorithm to realize distributed multi-agent consistency control; consider the multi-agent system:
x_i(k+1) = f_i(x(k)) + g_i(x(k))u_i(k)
y_i(k) = c x_i(k)    (5-1)
where x_i, u_i and y_i respectively denote the state, control input and output of the ith agent of the system;
The leader model is considered in the following form:
ν(k+1)=Eν(k)
Figure FDA0003791058280000084
in the leader model referred to herein,
Figure FDA0003791058280000085
is the state of the leader system; when agent i satisfies (v_0, v_i) ∈ E, a communication connection exists between follower i and the leader,
Figure FDA0003791058280000086
represents a known constant matrix; Q satisfies Q(0) = 0, and
Figure FDA0003791058280000087
is an external reference signal;
The observer is designed as:
Figure FDA0003791058280000088
Figure FDA0003791058280000089
where R_i(k) denotes the observed value of agent i relative to the leader at time k and satisfies R_0(k) = ν(k), W_0(k) = W,
Figure FDA0003791058280000091
According to the system description and the derivation of the optimal output problem of linear systems, the cost function of the system is obtained:
Figure FDA0003791058280000092
Figure FDA0003791058280000093
Figure FDA0003791058280000094
where i = 1, 2, …, N and γ_i is the discount factor,
Figure FDA0003791058280000095
c = [1, 0, 0, …, 0]; the optimal feedback input of each follower is obtained by solving equation (5-4):
Figure FDA0003791058280000096
To solve for the optimal feedback input, (5-4) is written as a quadratic function and expressed as the value function of the system:
Figure FDA0003791058280000097
The following Bellman equation is obtained:
Figure FDA0003791058280000098
According to the above Bellman equation, the HJB equation for nonlinear optimal feedback is defined as:
Figure FDA0003791058280000099
when the stability condition is satisfied
Figure FDA00037910582800000910
Then, the following optimal control inputs are obtained:
Figure FDA00037910582800000911
wherein,
Figure FDA00037910582800000912
The HJB equation is solved by adopting IRL-based policy iteration;
The following policy-iteration-based online IRL multi-agent optimal feedback control algorithm is obtained:
Algorithm: online IRL algorithm for solving the HJB equation based on policy iteration
Step 231: Initialization: select a control input
Figure FDA0003791058280000101
and repeat the following steps until the system converges;
Figure FDA0003791058280000102
Step 232: Policy improvement: the control policy is updated by:
Figure FDA0003791058280000103
Step 233: let u_i(k) = u_{i+1}(k) and return to Step 231 until V_i(k) converges to the minimum value;
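As a simple data-driven analogue of Steps 231-233, the sketch below runs least-squares policy iteration on a quadratic Q-function for a scalar tracking-error model e(k+1) = a·e(k) + b·u(k). The learner never uses a or b, only transition data collected under the current policy plus exploration noise, and the learned gain is compared with the discounted Riccati solution. The scalar model, cost weights and discount factor are illustrative assumptions, not the patent's nonlinear multi-agent implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.0, 0.5                   # hypothetical error dynamics, used only to generate data
q, r, gamma = 1.0, 0.1, 0.9       # stage cost q e^2 + r u^2 and discount factor

def phi(e, u):                    # features of Q(e, u) = h11 e^2 + 2 h12 e u + h22 u^2
    return np.array([e**2, 2.0 * e * u, u**2])

K = 0.0                           # initial admissible policy u = -K e
for _ in range(12):
    X, y = [], []
    for _ in range(20):           # several short rollouts with exploration noise
        e = 2.0 * rng.standard_normal()
        for _ in range(40):
            u = -K * e + 0.1 * rng.standard_normal()
            cost = q * e**2 + r * u**2
            e_next = a * e + b * u
            X.append(phi(e, u) - gamma * phi(e_next, -K * e_next))   # Bellman-residual features
            y.append(cost)
            e = e_next
    _, h12, h22 = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
    K = h12 / h22                 # greedy policy improvement from the learned Q-function

p = 0.0                           # model-based check: discounted Riccati recursion
for _ in range(2000):
    p = q + gamma * a * p * a - (gamma * a * p * b) ** 2 / (r + gamma * b * p * b)
print("learned K:", K, " Riccati K*:", gamma * a * b * p / (r + gamma * b * p * b))
```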
On the basis of the system models (5-1) and (5-2), consider the following first-order multi-agent system:
Figure FDA0003791058280000104
wherein,
Figure FDA0003791058280000105
respectively denoting the state and control input of the ith agent at time k; τ_ij ≥ 0 denotes the communication time lag of data from agent j to agent i, and τ_i ≥ 0 denotes the input time lag of agent i; consider a first-order discrete multi-agent system comprising n agents whose network topology is a static directed weighted graph containing a globally reachable node; if the following condition is satisfied
Figure FDA0003791058280000106
then, provided max{d_i(2τ_i + 1)} < 1, the system can achieve asymptotic consensus, where
Figure FDA0003791058280000107
Assume that the multi-agent system contains 5 nodes, with the corresponding adjacency matrix as follows:
Figure FDA0003791058280000108
according to this setting, the input time lag of the agents should satisfy
Figure FDA0003791058280000109
Suppose the communication time lags are τ_13 = 1 s, τ_21 = 0.75 s, τ_32 = 1.8 s, τ_42 = 2 s, τ_51; when the input time lag τ is 0.8 s and the initial state of the agents is randomly generated as x(0) = (2.5, 3, 2, 3.5, 5), the agents finally reach consensus asymptotically; the input time lag is then changed to 3 s, and the system still achieves consistency.
CN202110184288.2A 2021-02-08 2021-02-08 Model unknown multi-agent consistency control method based on reinforcement learning Expired - Fee Related CN112947084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110184288.2A CN112947084B (en) 2021-02-08 2021-02-08 Model unknown multi-agent consistency control method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112947084A CN112947084A (en) 2021-06-11
CN112947084B true CN112947084B (en) 2022-09-23

Family

ID=76245480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110184288.2A Expired - Fee Related CN112947084B (en) 2021-02-08 2021-02-08 Model unknown multi-agent consistency control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112947084B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011144382A1 (en) * 2010-05-17 2011-11-24 Technische Universität München Hybrid oltp and olap high performance database system
CN105847438A (en) * 2016-05-26 2016-08-10 重庆大学 Event trigger based multi-agent consistency control method
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107179777A (en) * 2017-06-03 2017-09-19 复旦大学 Multiple agent cluster Synergistic method and multiple no-manned plane cluster cooperative system
CN107589672A (en) * 2017-09-27 2018-01-16 三峡大学 The intelligent power generation control method of isolated island intelligent power distribution virtual wolf pack control strategy off the net
CN110018687A (en) * 2019-04-09 2019-07-16 大连海事大学 Unmanned water surface ship optimal track following control method based on intensified learning method
CN110083063A (en) * 2019-04-29 2019-08-02 辽宁石油化工大学 A kind of multiple body optimal control methods based on non-strategy Q study
CN110308659A (en) * 2019-08-05 2019-10-08 沈阳航空航天大学 Uncertain multi-agent system mixing with time delay and switching topology triggers consistent control method
CN110782011A (en) * 2019-10-21 2020-02-11 辽宁石油化工大学 Networked multi-agent system distributed optimization control method based on reinforcement learning
CN110780668A (en) * 2019-04-09 2020-02-11 北京航空航天大学 Distributed formation surround tracking control method and system for multiple unmanned boats
CN111531538A (en) * 2020-05-08 2020-08-14 哈尔滨工业大学 Consistency control method and device for multi-mechanical arm system under switching topology
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning
CN111722531A (en) * 2020-05-12 2020-09-29 天津大学 Online model-free optimal control method for switching linear system
CN111948937A (en) * 2020-07-20 2020-11-17 电子科技大学 Multi-gradient recursive reinforcement learning fuzzy control method and system of multi-agent system
CN112052585A (en) * 2020-09-02 2020-12-08 南京邮电大学 Design method of distributed dimensionality reduction observer of linear time invariant system
CN112180730A (en) * 2020-10-10 2021-01-05 中国科学技术大学 Hierarchical optimal consistency control method and device for multi-agent system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9134707B2 (en) * 2012-03-30 2015-09-15 Board Of Regents, The University Of Texas System Optimal online adaptive controller
CN107728471A (en) * 2017-09-01 2018-02-23 南京理工大学 For a kind of packet uniformity control method for mixing heterogeneous multi-agent system
US10734811B2 (en) * 2017-11-27 2020-08-04 Ihi Inc. System and method for optimal control of energy storage system
CN109946975B (en) * 2019-04-12 2020-04-24 北京理工大学 Reinforced learning optimal tracking control method of unknown servo system
US11674384B2 (en) * 2019-05-20 2023-06-13 Schlumberger Technology Corporation Controller optimization via reinforcement learning on asset avatar
CN111880567B (en) * 2020-07-31 2022-09-16 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Data-Driven Reinforcement Learning Design for Multi-agent Systems with Unknown Disturbances; Xiangnan Zhong et al.; 2018 International Joint Conference on Neural Networks; 2018-10-15; full text *
Optimal tracking agent: a new framework of reinforcement learning for multiagent systems; Cao, WH et al.; Concurrency and Computation: Practice & Experience; 2013-09-25; vol. 25, no. 14; full text *
Reinforcement Learning Control for Consensus of the Leader-Follower Multi-Agent Systems; Chiang, ML et al.; IEEE 7th Data Driven Control and Learning Systems Conference; 2018-12-31; full text *
Resilient adaptive optimal control of distributed multi-agent systems using reinforcement learning; Moghadam, R et al.; IET Control Theory and Applications; 2018-11-06; vol. 12, no. 16; full text *
Decentralized reinforcement learning optimal control of reconfigurable modular robots under dynamic constraints; Dong Bo et al.; Journal of Jilin University; 2014-12-31; vol. 44, no. 5; full text *
A survey of data-driven optimal consensus of multi-agent systems based on reinforcement learning; Li Jinna et al.; Chinese Journal of Intelligent Science and Technology; 2020-12-31; vol. 2, no. 4; full text *
Observer-based consensus control and fault detection of multi-agent systems; Chen Gang et al.; Control Theory & Applications; 2014-05-31; vol. 31, no. 5; full text *
Model-free adaptive dynamic programming and its application in cooperative control of multi-agent systems; Yang Yongliang; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2018-03-15; no. 03; full text *

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220923