CN116185020A - Multi-agent formation control method based on single-critic reinforcement learning structure - Google Patents

Multi-agent formation control method based on single-critic reinforcement learning structure

Info

Publication number
CN116185020A
Authority
CN
China
Prior art keywords
agent
formation
reinforcement learning
optimal
value function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310081638.1A
Other languages
Chinese (zh)
Inventor
黄捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202310081638.1A priority Critical patent/CN116185020A/en
Publication of CN116185020A publication Critical patent/CN116185020A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a multi-agent formation control method based on a single-critic reinforcement learning structure, which comprises the following steps: constructing a communication structure for each agent of the multi-agent system; constructing the tracking error of each agent relative to the leader agent, and from it the error describing each agent with respect to the leader and its neighbor agents, namely the formation error; constructing, based on optimal control, a cost function and a value function related to the formation error and the optimal control input; expanding the value function to construct the corresponding HJB equation; taking the partial derivative of the HJB equation with respect to the control to obtain the expression of the optimal control input in terms of the optimal value function; decomposing the optimal value function to obtain a decomposed form of the optimal control input; and introducing a single-critic reinforcement learning structure and solving the decomposed optimal value function and optimal control input in combination with a neural network. The method helps reduce estimation errors and computation time.

Description

Multi-agent formation control method based on single-critic reinforcement learning structure
Technical Field
The invention belongs to the technical field of multi-agent formation control, and particularly relates to a multi-agent formation control method based on a single-critic reinforcement learning structure.
Background
A multi-agent system comprises autonomous, interacting entities that share a common environment; the agents can sense the environment and act on it. Formation control is one application field of multi-agent systems, covering formation scenarios such as satellites, underwater robots and unmanned aerial vehicle flight. Many formation control methods have been proposed, such as the leader-follower method and the virtual structure method. The leader-follower formation control method, a simple and scalable formation control algorithm, is widely used in multi-agent formation; its strategy is to designate one agent as the leader with a prescribed trajectory, and then design controllers so that the other follower agents track the leader's trajectory.
Optimal control is an effective method for balancing control performance against control resource consumption; it achieves the control goal by minimizing a cost function. Dynamic programming, as a method in optimal control, has wide application value; its basic idea is to decompose the solution of a large problem into many small problems. However, the difficulty of its backward solution process and the curse of dimensionality have hindered its further application and development. Adaptive dynamic programming, which combines optimal control with a reinforcement learning structure, overcomes these drawbacks of dynamic programming and can estimate unknown equations through function approximation. Combining the actor-critic reinforcement learning structure with optimal control can effectively address the difficulty of solving the Hamilton-Jacobi-Bellman equation in the optimal controller. However, this approach involves iterating both the actor and critic networks, which introduces additional computational error and longer computation time.
Therefore, designing a formation control method for multi-agent systems based on a reinforcement learning structure that reduces computation time and computational error remains an open problem. To address it, the invention removes the actor network and redesigns the update strategy of the critic network, so that the critic network evaluates and corrects the performance in time while executing the control behavior. The method can effectively reduce computation time and estimation errors and ensures that the formation behavior of the nonlinear multi-agent system is completed successfully.
Disclosure of Invention
The invention aims to provide a multi-agent formation control method based on a single-critic reinforcement learning structure, which helps reduce estimation errors and computation time.
In order to achieve the above purpose, the invention adopts the following technical scheme: a multi-agent formation control method based on a single-critic reinforcement learning structure, comprising the following steps:
step one: based on graph theory in applied mathematics, construct the communication structure of the agents of the multi-agent system; the system is considered to be a first-order multi-agent system in which each agent obtains only the position information of its neighbor agents; meanwhile, one leader agent exists in the system, and the other agents, as followers, move along the trajectory of the leader agent during operation;
step two: for each agent in the system, construct the tracking error of the agent relative to the leader agent from the neighbor information it obtains, and from the tracking error construct the error describing the agent with respect to the leader and its neighbor agents, namely the formation error;
step three: construct, based on optimal control, a cost function and a value function related to the formation error and the optimal control input;
step four: based on the Taylor formula and the value function obtained in step three, expand the value function to obtain the corresponding Hamilton-Jacobi-Bellman equation;
step five: for the Hamilton-Jacobi-Bellman equation obtained in step four, take the partial derivative with respect to the control to obtain the expression of the optimal control input in terms of the optimal value function;
step six: decompose the optimal value function to obtain its expression in terms of the formation error and an unknown function, and obtain the decomposed form of the optimal control input from the optimal control input expression of step five;
step seven: introduce a single-critic reinforcement learning structure and solve the decomposed optimal value function and optimal control input obtained in step six in combination with a neural network, wherein the neural network approximates the unknown nonlinear terms in the multi-agent system, and the critic network carries out the formation control of the agent system while the effect of the formation control is evaluated and improved.
Further, the single-critic reinforcement learning structure removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
Further, in step one, the model of the multi-agent system is expressed as:
$\dot{x}_i(t) = f_i(x_i(t)) + u_i(t)$
where x_i(t) denotes the position of the i-th agent in the system, u_i(t) denotes the control input of the i-th agent, and f_i(·) denotes an unknown nonlinear function that is assumed to be Lipschitz continuous;
the model of the leader agent is:
$\dot{p}_l = v_l$
where p_l and v_l denote the trajectory and velocity of the leader, i.e., the desired trajectory and velocity of the formation motion; the tracking error of each agent relative to the leader is set as:
$z_i = x_i - p_l - \zeta_i$
where ζ_i denotes the desired offset between the leader agent and the i-th follower agent, describing the formation shape of the system;
from the structure of the tracking error, the formation error is defined as:
$e_i = \sum_{j \in \Lambda_i} a_{ij}(z_i - z_j) + b_i z_i$
where a_ij is the entry in row i, column j of the adjacency matrix from graph theory, b_i is the connection weight between the i-th follower agent and the leader agent, and Λ_i denotes the neighbor set of the i-th agent.
Further, in step three, combining the defined formation error, the cost function is constructed from the formation error e_i and the control input u_i, with weights built from C = diag{c_1, c_2, …, c_i, …, c_n}, two set constants w_1 and w_2, the identity matrix I_m of appropriate dimension, and the tensor (Kronecker) product ⊗;
according to the obtained cost function, the corresponding value function is established and the optimal control input u_i^* is introduced; the corresponding optimal value function is finally obtained as the integral, from the current time to infinity, of the cost evaluated along the trajectory driven by the optimal control input, where τ denotes the integration variable.
Further, in step four, the Hamilton-Jacobi-Bellman equation is established from the cost function and the gradient of the optimal value function along the formation-error dynamics; taking the partial derivative of this equation with respect to the control input and setting it to zero yields the expression of the optimal control input in terms of the gradient of the optimal value function.
further, for unknown nonlinear term f existing within the multi-intelligent system i (x i ) Approximate estimation is performed by introducing a neural network:
Figure BDA0004067605080000038
wherein ,
Figure BDA0004067605080000039
representing an ideal neural network weight matrix; s is S fi (x i ) Representing a basis function vector; e-shaped article fi (x i ) Representing an approximation error;
due to
Figure BDA00040676050800000310
For theoretical analysis only but in practice an unknown matrix, so an estimated matrix is introduced +.>
Figure BDA00040676050800000311
Estimating to obtain the approximation of the neural network identifier>
Figure BDA0004067605080000041
The following are provided:
Figure BDA0004067605080000042
from the resulting approximation function
Figure BDA0004067605080000043
And obtaining estimated values of other variables.
Further, by splitting parameters, the optimal value function and the optimal control input are converted into a decomposed form in which the optimal value function is written as a known term in the formation error, scaled by a constant k_i > 0, plus a remaining unknown function, and the optimal control input is rewritten accordingly using the optimal control expression of step five.
Further, after the single-critic reinforcement learning structure is introduced, the optimal value function and the optimal control input are approximated by the critic network, where Ŵ_ci denotes the introduced estimate of the critic network parameter matrix and S_i denotes the neural network radial basis function vector; the update law of the critic network parameter matrix Ŵ_ci is expressed in terms of k_ci, the learning rate of the critic network, and an auxiliary term Φ_i.
compared with the prior art, the invention has the following beneficial effects: aiming at the additional computational error and longer computation time caused by the actor-critic dual-network iteration in multi-agent formation control methods based on the traditional actor-critic reinforcement learning structure, the invention provides a multi-agent formation control method based on a single-critic reinforcement learning structure. The method can effectively reduce the computation time and the estimation error and ensures that the formation behavior of the nonlinear multi-agent system is completed successfully.
Drawings
FIG. 1 is a block diagram of a conventional actor-critic reinforcement learning structure in the prior art;
FIG. 2 is a block diagram of the single-critic reinforcement learning structure in an embodiment of the present invention;
FIG. 3 is the communication topology of the nonlinear multi-agent system in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the multi-agent formation trajectories in an embodiment of the invention;
FIG. 5 is a schematic diagram of the multi-agent formation velocity trajectories in an embodiment of the invention;
FIG. 6 is a comparison of the position error with the position error of the conventional actor-critic method in an embodiment of the present invention;
FIG. 7 is a comparison of the velocity error with the velocity error of the conventional actor-critic method in an embodiment of the present invention;
FIG. 8 is a comparison of the computation times of the actor-critic and single-critic reinforcement learning structures in an embodiment of the present invention;
FIG. 9 is a flow chart of the method implementation of an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
This embodiment starts from the formation requirements of nonlinear multi-agent systems and provides a multi-agent formation control method based on a single-critic reinforcement learning structure for a nonlinear multi-agent system. As shown in fig. 9, the method comprises the following steps:
Step one: based on graph theory in applied mathematics, the communication structure of the agents of the multi-agent system is constructed; the system is considered to be a first-order multi-agent system in which each agent obtains only the position information of its neighbor agents. Meanwhile, one leader agent exists in the system, and the other agents, as followers, move along the trajectory of the leader agent during operation.
Step two: for each agent in the system, the tracking error of the agent relative to the leader agent is constructed from the neighbor information it obtains, and from the tracking error the error of the agent with respect to the leader and its neighbor agents, namely the formation error, is constructed.
Step three: a cost function and a value function associated with the formation error and the optimal control input are constructed based on optimal control.
Step four: based on the Taylor formula and the value function obtained in step three, the value function is expanded to obtain the corresponding Hamilton-Jacobi-Bellman (HJB) equation.
Step five: for the HJB equation obtained in step four, the partial derivative with respect to the control is taken to obtain the expression of the optimal control input in terms of the optimal value function.
Step six: the optimal value function is decomposed to obtain its expression in terms of the formation error and an unknown function, and the decomposed form of the optimal control input is obtained from the optimal control input expression of step five.
Step seven: a single-critic reinforcement learning structure is introduced, and the decomposed optimal value function and optimal control input obtained in step six are solved in combination with a neural network, wherein the neural network approximates the unknown nonlinear terms in the multi-agent system, and the critic network carries out the formation control of the agent system while the effect of the formation control is evaluated and improved.
The single-critic reinforcement learning structure of this embodiment is shown in fig. 2. The single-critic reinforcement learning structure removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
In step one, the model of the multi-agent system is expressed as:
$\dot{x}_i(t) = f_i(x_i(t)) + u_i(t)$
where x_i(t) denotes the position of the i-th agent in the system, u_i(t) denotes the control input of the i-th agent, and f_i(·) denotes an unknown nonlinear function, assumed here to be Lipschitz continuous.
The expected trajectory of the leader agent is given by:
$\dot{p}_l = v_l$
where p_l and v_l denote the trajectory and velocity of the leader, i.e., the desired trajectory and velocity of the formation motion.
In this embodiment, a communication topology diagram of the nonlinear multi-agent system is shown in fig. 3.
In step two, according to the constructed leader-follower multi-agent system model, the tracking error of each agent relative to the leader is set as:
$z_i = x_i - p_l - \zeta_i$
where ζ_i denotes the desired offset between the leader agent and the i-th follower agent, describing the formation shape of the system.
From the structure of the tracking error, the formation error is defined as:
$e_i = \sum_{j \in \Lambda_i} a_{ij}(z_i - z_j) + b_i z_i$
where a_ij is the entry in row i, column j of the adjacency matrix from graph theory, b_i is the connection weight between the i-th follower agent and the leader agent, and Λ_i denotes the neighbor set of the i-th agent.
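Differentiating this formation error along the agent model and the leader model given above yields the error dynamics; this short derivation is added here for clarity, with d_i := Σ_{j∈Λ_i} a_{ij} introduced only as shorthand:

$\dot{e}_i = \sum_{j \in \Lambda_i} a_{ij}(\dot{z}_i - \dot{z}_j) + b_i \dot{z}_i = (d_i + b_i)\bigl(f_i(x_i) + u_i - v_l\bigr) - \sum_{j \in \Lambda_i} a_{ij}\bigl(f_j(x_j) + u_j - v_l\bigr)$

so the control input u_i enters the formation-error dynamics through the scalar coefficient d_i + b_i; this coefficient reappears in the illustrative expressions below.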
In step three, combining knowledge of optimal control with the defined formation error, the cost function is constructed from the formation error and the control input, with weights built from C = diag{c_1, c_2, …, c_i, …, c_n}, two set constants w_1 and w_2, the identity matrix I_m of appropriate dimension, and the tensor (Kronecker) product ⊗.
According to the obtained cost function, the corresponding value function is established and the optimal control input u_i^* is introduced; the corresponding optimal value function is finally obtained as the integral, from the current time to infinity, of the cost evaluated along the trajectory driven by the optimal control input, where τ denotes the integration variable.
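As a concrete illustration (the exact weighting in the filing appears only as an equation image; the quadratic form below is an assumed standard choice consistent with the constants w_1, w_2 and the identity matrix I_m named above):

$r_i(e_i, u_i) = w_1\, e_i^{\top} I_m\, e_i + w_2\, u_i^{\top} I_m\, u_i, \qquad V_i^{*}\bigl(e_i(t)\bigr) = \min_{u_i} \int_{t}^{\infty} r_i\bigl(e_i(\tau), u_i(\tau)\bigr)\, \mathrm{d}\tau$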
In step four, a distributed solution is set up from the established optimal value function, giving the Hamilton-Jacobi-Bellman equation, which balances the cost function against the gradient of the optimal value function along the formation-error dynamics. Taking the partial derivative of this equation with respect to the control input u_i and setting it to zero yields the expression of the optimal control input in terms of the gradient of the optimal value function.
from this equation, it can be seen that the optimal control input required in this embodiment is a quantity related to the derivative of the optimal value function. However, due to the non-linearities of the multi-agent system and the unknown model, the derivative terms of the optimal value function are actually difficult to solve, which also results in a difficult solution of the optimal control input.
Step five: neural network algorithms have been shown to have a powerful approximation that can approximate nonlinear functions. For multiple agentsUnknown nonlinear term f existing in the system i (x i ) Approximate estimation is performed by introducing a neural network:
Figure BDA0004067605080000078
wherein ,
Figure BDA0004067605080000079
representing an ideal neural network weight matrix; s is S fi (x i ) Representing a basis function vector; e-shaped article fi (x i ) Representing the approximation error.
Due to
Figure BDA0004067605080000081
For theoretical analysis only but in practice an unknown matrix, so an estimated matrix is introduced +.>
Figure BDA0004067605080000082
Estimating to obtain the approximation of the neural network identifier>
Figure BDA0004067605080000083
The following are provided:
Figure BDA0004067605080000084
Figure BDA0004067605080000085
representing a pair of actual nonlinear functions f generated by introducing a neural network method i (x i ) Is a function of the approximation of (a).
From the resulting approximation function f̂_i(x_i), the related variables and their corresponding estimated values are then obtained.
The estimated matrix Ŵ_fi needs to be updated; by design, the corresponding update law is expressed in terms of T_i, a positive definite matrix, and θ_i, a positive constant.
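A minimal numerical sketch of such a neural-network identifier is given below. The Gaussian basis, the identification error x̃_i used to drive the update, and the σ-modification form of the law are assumptions for illustration; the filing only states that the law involves a positive definite matrix T_i and a positive constant θ_i.

import numpy as np

class RBFIdentifier:
    """Identifier f_hat_i(x) = W_hat_fi' S_fi(x) for the unknown drift f_i(x)."""

    def __init__(self, centers, width, T, theta):
        self.centers = centers              # (N, m) Gaussian RBF centers (assumed basis)
        self.width = width                  # common RBF width
        self.T = T                          # positive definite gain matrix T_i
        self.theta = theta                  # positive constant theta_i
        self.W = np.zeros(centers.shape)    # estimated weight matrix W_hat_fi, shape (N, m)

    def basis(self, x):
        """Basis function vector S_fi(x), shape (N,)."""
        return np.exp(-np.sum((self.centers - x) ** 2, axis=1) / (2.0 * self.width ** 2))

    def predict(self, x):
        """Approximation f_hat_i(x) = W_hat_fi' S_fi(x), shape (m,)."""
        return self.W.T @ self.basis(x)

    def update(self, x, x_tilde, dt):
        """Assumed law dW/dt = T_i (S_fi(x) x_tilde' - theta_i W), driven by the identification error x_tilde."""
        dW = self.T @ (np.outer(self.basis(x), x_tilde) - self.theta * self.W)
        self.W += dt * dW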
Step six: from the above-mentioned approximation variables obtained through the neural network, the optimum value function is divided to obtain a divided optimum value function expression form as follows:
Figure BDA00040676050800000812
wherein ,ki Representing a constant term greater than zero; and is also provided with
Figure BDA00040676050800000813
The expression of (2) is
Figure BDA00040676050800000814
Expressed by this isolated value function, combined with an optimal control expression pattern
Figure BDA00040676050800000815
The isolated expression pattern for optimal control was obtained as follows:
Figure BDA00040676050800000816
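As an illustration of the splitting idea (the exact decomposition appears only as an equation image in the filing; the form below is a commonly used assumption in single-critic adaptive dynamic programming, with V_i^o denoting the remaining unknown part and d_i + b_i the assumed input coefficient from above):

$V_i^{*}(e_i) = \tfrac{1}{2}\, k_i\, e_i^{\top} e_i + V_i^{o}(e_i), \qquad u_i^{*} = -\frac{d_i + b_i}{2 w_2}\Bigl(k_i\, e_i + \nabla V_i^{o}(e_i)\Bigr)$

The known term k_i e_i then acts as a direct stabilizing feedback, so that only the remaining unknown part V_i^o needs to be approximated by the critic network introduced in step seven.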
step seven: by introducing a reinforcement learning structure based on a single commentator, performing approximate evaluation on the segmented optimal value function and the optimal control input, and obtaining the following expression:
Figure BDA00040676050800000817
Figure BDA00040676050800000818
wherein ,
Figure BDA00040676050800000819
representing an introduced estimated critic parameter matrix; s is S i Representing the neural network radial basis functions. In the traditional actor commentator reinforcement learning structure, an actor network is required to execute control actions in a controller, and the commentator network only needs to evaluate the control actions in an optimal value function and feed back to the actor network for correction. In the invention, the actor network is removed by design, and the critics neural network is required to bear the responsibility of the actor network in the traditional method, namely, to execute the control action besides evaluating the control action.
The parameter matrix Ŵ_ci of the critic network also needs to be updated; its update law is expressed in terms of k_ci, the learning rate of the critic network, and an auxiliary term Φ_i.
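A sketch of one single-critic iteration is given below. The approximation V̂_i = ½ k_i e'e + Ŵ_ci' S_i(e), the normalized gradient step on the HJB residual, and the regressor used for Φ_i are assumptions made for this illustration; the filing names only the learning rate k_ci and the term Φ_i.

import numpy as np

def critic_step(e, W_c, centers, width, k_i, w2, gain, k_ci, cost, e_dot, dt):
    """One single-critic iteration: compute the control and update the critic weights.

    gain  : assumed input coefficient d_i + b_i of the formation-error dynamics
    cost  : callable r_i(e, u) returning the instantaneous cost (assumed quadratic)
    e_dot : current formation-error derivative (from measurements or the identifier)
    """
    diff = centers - e                                            # (N, m)
    S = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * width ** 2))   # basis S_i(e), shape (N,)
    dS = (diff / width ** 2) * S[:, None]                         # gradient of S_i w.r.t. e, shape (N, m)

    grad_V = k_i * e + dS.T @ W_c          # gradient of the split value-function estimate
    u_hat = -gain / (2.0 * w2) * grad_V    # control executed directly by the critic

    delta = cost(e, u_hat) + grad_V @ e_dot                       # HJB residual under the current estimate
    Phi = dS @ e_dot                                              # assumed regressor playing the role of Phi_i
    W_c = W_c - dt * k_ci * delta * Phi / (1.0 + Phi @ Phi)       # normalized gradient correction
    return u_hat, W_c

Because the same network both produces u_hat and is corrected from the residual delta, no separate actor network is needed.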
To demonstrate that the optimal control input based on the single-critic reinforcement learning structure provided by this embodiment can realize the formation motion of a nonlinear multi-agent system, corresponding simulation experiments were carried out. The nonlinear multi-agent system takes the first-order form given above, with agent-dependent parameters h_i = -0.7, 0.1, -0.5, 0.1, and the initial positions of the four follower agents are set as x_i(0) = [4, 4]^T, [-4, 4]^T, [4, -4]^T, [-4, -4]^T. The leader agent is assigned a desired motion trajectory, with initial position [0, 0]^T.
The information exchange between the agents needs to use the related knowledge of graph theory, and the matrix A is a communication weight matrix for describing the follower agents and the neighbor follower agents, and is expressed as follows:
Figure BDA0004067605080000097
the matrix B is a communication weight matrix for expressing the follower agent and the leader agent, and is expressed as follows:
B=diag{1,0,0,0}
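A minimal setup sketch for this example is given below. The adjacency matrix A and the desired offsets ζ_i are placeholders, since their exact values appear only as images in the filing; B, the follower initial positions and the leader initial position are as stated above.

import numpy as np

n = 4                                                    # four follower agents in the plane
B = np.diag([1.0, 0.0, 0.0, 0.0])                        # only follower 1 is connected to the leader
A = np.array([[0, 0, 0, 0],                              # placeholder adjacency matrix (assumed chain
              [1, 0, 0, 0],                              #  1 -> 2 -> 3 -> 4, for illustration only)
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

x0 = np.array([[4.0, 4.0], [-4.0, 4.0], [4.0, -4.0], [-4.0, -4.0]])    # follower initial positions
p_l0 = np.array([0.0, 0.0])                                            # leader initial position
zeta = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])  # assumed desired offsets

def formation_errors(x, p_l):
    """z_i = x_i - p_l - zeta_i and e_i = sum_j a_ij (z_i - z_j) + b_i z_i."""
    z = x - p_l - zeta
    e = B.diagonal()[:, None] * z
    for i in range(n):
        for j in range(n):
            e[i] += A[i, j] * (z[i] - z[j])
    return z, e

z, e = formation_errors(x0, p_l0)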
Fig. 3 shows the communication topology of the nonlinear multi-agent system in an embodiment of the present invention; the multi-agent system of the conventional actor-critic method used for comparison adopts the same communication topology for inter-agent communication, which facilitates the comparison. Fig. 4 shows the multi-agent formation trajectories of an embodiment of the present invention, where it can be seen that the four follower agents follow the trajectory of the leader agent well. Fig. 5 shows the formation velocity trajectories of the multi-agent system, in which the four follower agents keep up with the velocity of the leader agent. Fig. 6 compares the position error of the proposed method with that of the conventional actor-critic method; the position error of the proposed method is smaller than that obtained by the conventional method. Fig. 7 compares the velocity errors of the two methods, which are relatively close. Fig. 8 compares the computation times of the actor-critic method and the single-critic reinforcement learning structure; the computation time of the single-critic reinforcement learning structure provided by the invention is shorter than that of the actor-critic method, and the time saving grows as the number of iterations increases.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any person skilled in the art may use the disclosed technical content to make modifications or alterations into equivalent embodiments. However, any simple modification, equivalent variation or alteration of the above embodiments made according to the technical substance of the present invention still falls within the protection scope of the technical solution of the present invention.

Claims (8)

1. A multi-agent formation control method based on a single-critic reinforcement learning structure, characterized by comprising the following steps:
step one: based on graph theory in applied mathematics, constructing the communication structure of the agents of the multi-agent system; the system is considered to be a first-order multi-agent system in which each agent obtains only the position information of its neighbor agents; meanwhile, one leader agent exists in the system, and the other agents, as followers, move along the trajectory of the leader agent during operation;
step two: for each agent in the system, constructing the tracking error of the agent relative to the leader agent from the neighbor information it obtains, and from the tracking error constructing the error describing the agent with respect to the leader and its neighbor agents, namely the formation error;
step three: constructing, based on optimal control, a cost function and a value function related to the formation error and the optimal control input;
step four: based on the Taylor formula and the value function obtained in step three, expanding the value function to obtain the corresponding Hamilton-Jacobi-Bellman equation;
step five: for the Hamilton-Jacobi-Bellman equation obtained in step four, taking the partial derivative with respect to the control to obtain the expression of the optimal control input in terms of the optimal value function;
step six: decomposing the optimal value function to obtain its expression in terms of the formation error and an unknown function, and obtaining the decomposed form of the optimal control input from the optimal control input expression of step five;
step seven: introducing a single-critic reinforcement learning structure and solving the decomposed optimal value function and optimal control input obtained in step six in combination with a neural network, wherein the neural network approximates the unknown nonlinear terms in the multi-agent system, and the critic network carries out the formation control of the agent system while the effect of the formation control is evaluated and improved.
2. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 1, characterized in that the single-critic reinforcement learning structure removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
3. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 1, characterized in that in step one, the model of the multi-agent system is expressed as:
$\dot{x}_i(t) = f_i(x_i(t)) + u_i(t)$
where x_i(t) denotes the position of the i-th agent in the system, u_i(t) denotes the control input of the i-th agent, and f_i(·) denotes an unknown nonlinear function that is assumed to be Lipschitz continuous;
the model of the leader agent is:
$\dot{p}_l = v_l$
where p_l and v_l denote the trajectory and velocity of the leader, i.e., the desired trajectory and velocity of the formation motion;
the tracking error of each agent relative to the leader is set as:
$z_i = x_i - p_l - \zeta_i$
where ζ_i denotes the desired offset between the leader agent and the i-th follower agent, describing the formation shape of the system;
from the structure of the tracking error, the formation error is defined as:
$e_i = \sum_{j \in \Lambda_i} a_{ij}(z_i - z_j) + b_i z_i$
where a_ij is the entry in row i, column j of the adjacency matrix from graph theory, b_i is the connection weight between the i-th follower agent and the leader agent, and Λ_i denotes the neighbor set of the i-th agent.
4. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 3, characterized in that in step three, combining the defined formation error, the cost function is constructed from the formation error and the control input, with weights built from C = diag{c_1, c_2, …, c_i, …, c_n}, two set constants w_1 and w_2, the identity matrix I_m of appropriate dimension, and the tensor (Kronecker) product ⊗;
according to the obtained cost function, the corresponding value function is established and the optimal control input u_i^* is introduced; the corresponding optimal value function is finally obtained as the integral, from the current time to infinity, of the cost evaluated along the trajectory driven by the optimal control input, where τ denotes the integration variable.
5. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 4, characterized in that in step four, the Hamilton-Jacobi-Bellman equation is established from the cost function and the gradient of the optimal value function along the formation-error dynamics; taking the partial derivative of this equation with respect to the control input and setting it to zero yields the expression of the optimal control input in terms of the gradient of the optimal value function.
6. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 5, characterized in that the unknown nonlinear term f_i(x_i) present in the multi-agent system is approximated by introducing a neural network:
$f_i(x_i) = W_{fi}^{*\top} S_{fi}(x_i) + \epsilon_{fi}(x_i)$
where W_fi^* denotes the ideal neural network weight matrix, S_fi(x_i) denotes the basis function vector, and ε_fi(x_i) denotes the approximation error;
since W_fi^* is used only for theoretical analysis and is in practice an unknown matrix, an estimated matrix Ŵ_fi is introduced, giving the neural network identifier approximation
$\hat{f}_i(x_i) = \hat{W}_{fi}^{\top} S_{fi}(x_i)$
and the estimates of the other variables are obtained from the resulting approximation function f̂_i(x_i).
7. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 6, characterized in that, by splitting parameters, the optimal value function and the optimal control input are converted into a decomposed form in which the optimal value function is written as a known term in the formation error, scaled by a constant k_i > 0, plus a remaining unknown function, and the optimal control input is rewritten accordingly.
8. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 7, characterized in that, after the single-critic reinforcement learning structure is introduced, the optimal value function and the optimal control input are approximated by the critic network, where Ŵ_ci denotes the introduced estimate of the critic network parameter matrix and S_i denotes the neural network radial basis function vector; the update law of the critic network parameter matrix Ŵ_ci is expressed in terms of k_ci, the learning rate of the critic network, and an auxiliary term Φ_i.
CN202310081638.1A 2023-01-19 2023-01-19 Multi-agent formation control method based on single-critic reinforcement learning structure Pending CN116185020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310081638.1A CN116185020A (en) 2023-01-19 2023-01-19 Multi-agent formation control method based on single-critic reinforcement learning structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310081638.1A CN116185020A (en) 2023-01-19 2023-01-19 Multi-agent formation control method based on single-critic reinforcement learning structure

Publications (1)

Publication Number Publication Date
CN116185020A 2023-05-30

Family

ID=86435960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310081638.1A Pending CN116185020A (en) Multi-agent formation control method based on single-critic reinforcement learning structure

Country Status (1)

Country Link
CN (1) CN116185020A (en)

Similar Documents

Publication Publication Date Title
CN110597061B (en) Multi-agent fully-distributed active-disturbance-rejection time-varying formation control method
Badgwell et al. Reinforcement learning–overview of recent progress and implications for process control
CN110181508B (en) Three-dimensional route planning method and system for underwater robot
CN111897224B (en) Multi-agent formation control method based on actor-critic reinforcement learning and fuzzy logic
Xu et al. Two-layer distributed hybrid affine formation control of networked Euler–Lagrange systems
CN110658821A (en) Multi-robot anti-interference grouping time-varying formation control method and system
CN112947575B (en) Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning
Shi et al. A learning approach to image-based visual servoing with a bagging method of velocity calculations
CN112947086B (en) Self-adaptive compensation method for actuator faults in formation control of heterogeneous multi-agent system consisting of unmanned aerial vehicle and unmanned vehicle
Wang et al. Command filter based globally stable adaptive neural control for cooperative path following of multiple underactuated autonomous underwater vehicles with partial knowledge of the reference speed
CN113900380A (en) Robust output formation tracking control method and system for heterogeneous cluster system
CN114237041B (en) Space-ground cooperative fixed time fault tolerance control method based on preset performance
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
CN113472242A (en) Anti-interference self-adaptive fuzzy sliding film cooperative control method based on multiple intelligent agents
Belmonte-Baeza et al. Meta reinforcement learning for optimal design of legged robots
CN111798494A (en) Maneuvering target robust tracking method under generalized correlation entropy criterion
Kim et al. TOAST: Trajectory Optimization and Simultaneous Tracking Using Shared Neural Network Dynamics
CN111176324B (en) Method for avoiding dynamic obstacle by multi-unmanned aerial vehicle distributed collaborative formation
Fan et al. Spatiotemporal path tracking via deep reinforcement learning of robot for manufacturing internal logistics
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
CN116449703A (en) AUH formation cooperative control method under finite time frame
CN116185020A (en) Multi-agent formation control method based on single-critic reinforcement learning structure
CN114063438B (en) Data-driven multi-agent system PID control protocol self-learning method
CN114200830B (en) Multi-agent consistency reinforcement learning control method
Hwang et al. Fuzzy adaptive finite-time cooperative control with input saturation for nonlinear multiagent systems and its application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination