CN116185020A - Multi-agent formation control method based on single-critic reinforcement learning structure - Google Patents

Multi-agent formation control method based on single-critic reinforcement learning structure

Info

Publication number
CN116185020A
Authority
CN
China
Prior art keywords
agent
formation
reinforcement learning
optimal
value function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310081638.1A
Other languages
Chinese (zh)
Inventor
黄捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202310081638.1A priority Critical patent/CN116185020A/en
Publication of CN116185020A publication Critical patent/CN116185020A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a multi-agent formation control method based on a single-critic reinforcement learning structure, which comprises the following steps: constructing a communication structure for each agent of the multi-agent system; constructing the tracking error of each agent relative to the leader agent, and from it the error describing each agent with respect to the leader and its neighbor agents, namely the formation error; constructing, based on optimal control, a cost function and a value function related to the formation error and the optimal control input; expanding the value function to construct the corresponding HJB equation; taking the partial derivative of the HJB equation with respect to the control to obtain the expression of the optimal control input in terms of the optimal value function; decomposing the optimal value function to obtain a decomposed form of the optimal control input; and introducing a single-critic reinforcement learning structure and solving the decomposed optimal value function and optimal control input in combination with a neural network. The method helps reduce estimation errors and computation time.

Description

Multi-agent formation control method based on single-critic reinforcement learning structure
Technical Field
The invention belongs to the technical field of multi-agent formation control, and particularly relates to a multi-agent formation control method based on a single-critic reinforcement learning structure.
Background
A multi-agent system comprises autonomous, interacting entities that share a common environment; the agents can sense the environment and act on it. Formation control is one application field of multi-agent systems, covering formation scenarios such as satellites, underwater robots and unmanned aerial vehicle flight. Many formation control methods have been proposed, such as the leader-follower method and the virtual structure method. The leader-follower formation control method, a simple and scalable formation control algorithm, is widely used in multi-agent formation; its strategy is to designate one agent as the leader with a prescribed trajectory, and then design controllers so that the other follower agents track the leader's trajectory.
Optimal control is an effective method for balancing control performance against control resource consumption; it achieves the control goal by minimizing a cost function. Dynamic programming, as a method in optimal control, has wide application value; its basic idea is to decompose the solution of a large problem into many small problems. However, the difficulty of its backward solution process and the curse of dimensionality have hindered its further application and development. Adaptive dynamic programming, which combines optimal control with a reinforcement learning structure, overcomes these drawbacks of dynamic programming and can estimate unknown equations through function approximation. Combining the actor-critic reinforcement learning structure with optimal control can effectively address the difficulty of solving the Hamilton-Jacobi-Bellman equation in the optimal controller. However, this approach involves iterating both the actor and critic networks, which introduces additional computational error and longer computation time.
Therefore, designing a formation control method for multi-agent systems based on a reinforcement learning structure that reduces computation time and computational error remains an open problem. To address it, the invention removes the actor network and redesigns the update strategy of the critic network, so that the critic network evaluates and corrects the performance in time while executing the control behavior. The method can effectively reduce computation time and estimation errors and ensures that the formation behavior of the nonlinear multi-agent system is completed successfully.
Disclosure of Invention
The invention aims to provide a multi-agent formation control method based on a single-critic reinforcement learning structure, which helps reduce estimation errors and computation time.
In order to achieve the above purpose, the invention adopts the following technical scheme: a multi-agent formation control method based on a single-critic reinforcement learning structure, comprising the following steps:
step one: based on graph theory in applied mathematics, construct the communication structure of the agents of the multi-agent system; the system is considered to be a first-order multi-agent system in which each agent obtains only the position information of its neighbor agents; meanwhile, one leader agent exists in the system, and the other agents, as followers, move along the trajectory of the leader agent during operation;
step two: for each agent in the system, construct the tracking error of the agent relative to the leader agent from the neighbor information it obtains, and from the tracking error construct the error describing the agent with respect to the leader and its neighbor agents, namely the formation error;
step three: construct, based on optimal control, a cost function and a value function related to the formation error and the optimal control input;
step four: based on the Taylor formula and the value function obtained in step three, expand the value function to obtain the corresponding Hamilton-Jacobi-Bellman equation;
step five: for the Hamilton-Jacobi-Bellman equation obtained in step four, take the partial derivative with respect to the control to obtain the expression of the optimal control input in terms of the optimal value function;
step six: decompose the optimal value function to obtain its expression in terms of the formation error and an unknown function, and obtain the decomposed form of the optimal control input from the optimal control input expression of step five;
step seven: introduce a single-critic reinforcement learning structure and solve the decomposed optimal value function and optimal control input obtained in step six in combination with a neural network, wherein the neural network approximates the unknown nonlinear terms in the multi-agent system, and the critic network carries out the formation control of the agent system while the effect of the formation control is evaluated and improved.
Further, the single-critic reinforcement learning structure removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
Further, in step one, the model of the multi-agent system is expressed as:
$\dot{x}_i(t) = f_i(x_i(t)) + u_i(t)$
where x_i(t) denotes the position of the i-th agent in the system, u_i(t) denotes the control input of the i-th agent, and f_i(·) denotes an unknown nonlinear function that is assumed to be Lipschitz continuous;
the model of the leader agent is:
$\dot{p}_l = v_l$
where p_l and v_l denote the trajectory and velocity of the leader, i.e., the desired trajectory and velocity of the formation motion; the tracking error of each agent relative to the leader is set as:
$z_i = x_i - p_l - \zeta_i$
where ζ_i denotes the desired offset between the leader agent and the i-th follower agent, describing the formation shape of the system;
from the structure of the tracking error, the formation error is defined as:
$e_i = \sum_{j \in \Lambda_i} a_{ij}(z_i - z_j) + b_i z_i$
where a_ij is the entry in row i, column j of the adjacency matrix from graph theory, b_i is the connection weight between the i-th follower agent and the leader agent, and Λ_i denotes the neighbor set of the i-th agent.
Further, in step three, combining the defined formation error, the cost function is constructed from the formation error e_i and the control input u_i, with weights built from C = diag{c_1, c_2, …, c_i, …, c_n}, two set constants w_1 and w_2, the identity matrix I_m of appropriate dimension, and the tensor (Kronecker) product ⊗;
according to the obtained cost function, the corresponding value function is established and the optimal control input u_i^* is introduced; the corresponding optimal value function is finally obtained as the integral, from the current time to infinity, of the cost evaluated along the trajectory driven by the optimal control input, where τ denotes the integration variable.
Further, in step four, the Hamilton-Jacobi-Bellman equation is established from the cost function and the gradient of the optimal value function along the formation-error dynamics; taking the partial derivative of this equation with respect to the control input and setting it to zero yields the expression of the optimal control input in terms of the gradient of the optimal value function.
further, for unknown nonlinear term f existing within the multi-intelligent system i (x i ) Approximate estimation is performed by introducing a neural network:
Figure BDA0004067605080000038
wherein ,
Figure BDA0004067605080000039
representing an ideal neural network weight matrix; s is S fi (x i ) Representing a basis function vector; e-shaped article fi (x i ) Representing an approximation error;
due to
Figure BDA00040676050800000310
For theoretical analysis only but in practice an unknown matrix, so an estimated matrix is introduced +.>
Figure BDA00040676050800000311
Estimating to obtain the approximation of the neural network identifier>
Figure BDA0004067605080000041
The following are provided:
Figure BDA0004067605080000042
from the resulting approximation function
Figure BDA0004067605080000043
And obtaining estimated values of other variables.
Further, by splitting parameters, the optimal value function and the optimal control input are converted into a decomposed form in which the optimal value function is written as a known term in the formation error, scaled by a constant k_i > 0, plus a remaining unknown function, and the optimal control input is rewritten accordingly using the optimal control expression of step five.
Further, after the single-critic reinforcement learning structure is introduced, the optimal value function and the optimal control input are approximated by the critic network, where Ŵ_ci denotes the introduced estimate of the critic network parameter matrix and S_i denotes the neural network radial basis function vector; the update law of the critic network parameter matrix Ŵ_ci is expressed in terms of k_ci, the learning rate of the critic network, and an auxiliary term Φ_i.
compared with the prior art, the invention has the following beneficial effects: aiming at the additional computational error and longer computation time caused by the actor-critic dual-network iteration in multi-agent formation control methods based on the traditional actor-critic reinforcement learning structure, the invention provides a multi-agent formation control method based on a single-critic reinforcement learning structure. The method can effectively reduce the computation time and the estimation error and ensures that the formation behavior of the nonlinear multi-agent system is completed successfully.
Drawings
FIG. 1 is a block diagram of a conventional actor-critic reinforcement learning structure in the prior art;
FIG. 2 is a block diagram of the single-critic reinforcement learning structure in an embodiment of the present invention;
FIG. 3 is the communication topology of the nonlinear multi-agent system in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the multi-agent formation trajectories in an embodiment of the invention;
FIG. 5 is a schematic diagram of the multi-agent formation velocity trajectories in an embodiment of the invention;
FIG. 6 is a comparison of the position error with the position error of the conventional actor-critic method in an embodiment of the present invention;
FIG. 7 is a comparison of the velocity error with the velocity error of the conventional actor-critic method in an embodiment of the present invention;
FIG. 8 is a comparison of the computation times of the actor-critic and single-critic reinforcement learning structures in an embodiment of the present invention;
FIG. 9 is a flow chart of the method implementation of an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
This embodiment starts from the formation requirements of nonlinear multi-agent systems and provides a multi-agent formation control method based on a single-critic reinforcement learning structure for a nonlinear multi-agent system. As shown in fig. 9, the method comprises the following steps:
Step one: based on graph theory in applied mathematics, the communication structure of the agents of the multi-agent system is constructed; the system is considered to be a first-order multi-agent system in which each agent obtains only the position information of its neighbor agents. Meanwhile, one leader agent exists in the system, and the other agents, as followers, move along the trajectory of the leader agent during operation.
Step two: for each agent in the system, the tracking error of the agent relative to the leader agent is constructed from the neighbor information it obtains, and from the tracking error the error of the agent with respect to the leader and its neighbor agents, namely the formation error, is constructed.
Step three: a cost function and a value function associated with the formation error and the optimal control input are constructed based on optimal control.
Step four: based on the Taylor formula and the value function obtained in step three, the value function is expanded to obtain the corresponding Hamilton-Jacobi-Bellman (HJB) equation.
Step five: for the HJB equation obtained in step four, the partial derivative with respect to the control is taken to obtain the expression of the optimal control input in terms of the optimal value function.
Step six: the optimal value function is decomposed to obtain its expression in terms of the formation error and an unknown function, and the decomposed form of the optimal control input is obtained from the optimal control input expression of step five.
Step seven: a single-critic reinforcement learning structure is introduced, and the decomposed optimal value function and optimal control input obtained in step six are solved in combination with a neural network, wherein the neural network approximates the unknown nonlinear terms in the multi-agent system, and the critic network carries out the formation control of the agent system while the effect of the formation control is evaluated and improved.
The single-critic reinforcement learning structure of this embodiment is shown in fig. 2. The single-critic reinforcement learning structure removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
In step one, the model of the multi-agent system is expressed as:
$\dot{x}_i(t) = f_i(x_i(t)) + u_i(t)$
where x_i(t) denotes the position of the i-th agent in the system, u_i(t) denotes the control input of the i-th agent, and f_i(·) denotes an unknown nonlinear function, assumed here to be Lipschitz continuous.
The expected trajectory of the leader agent is given by:
$\dot{p}_l = v_l$
where p_l and v_l denote the trajectory and velocity of the leader, i.e., the desired trajectory and velocity of the formation motion.
In this embodiment, a communication topology diagram of the nonlinear multi-agent system is shown in fig. 3.
In step two, according to the constructed leader-follower multi-agent system model, the tracking error of each agent relative to the leader is set as:
$z_i = x_i - p_l - \zeta_i$
where ζ_i denotes the desired offset between the leader agent and the i-th follower agent, describing the formation shape of the system.
From the structure of the tracking error, the formation error is defined as:
$e_i = \sum_{j \in \Lambda_i} a_{ij}(z_i - z_j) + b_i z_i$
where a_ij is the entry in row i, column j of the adjacency matrix from graph theory, b_i is the connection weight between the i-th follower agent and the leader agent, and Λ_i denotes the neighbor set of the i-th agent.
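Differentiating this formation error along the agent model and the leader model given above yields the error dynamics; this short derivation is added here for clarity, with d_i := Σ_{j∈Λ_i} a_{ij} introduced only as shorthand:

$\dot{e}_i = \sum_{j \in \Lambda_i} a_{ij}(\dot{z}_i - \dot{z}_j) + b_i \dot{z}_i = (d_i + b_i)\bigl(f_i(x_i) + u_i - v_l\bigr) - \sum_{j \in \Lambda_i} a_{ij}\bigl(f_j(x_j) + u_j - v_l\bigr)$

so the control input u_i enters the formation-error dynamics through the scalar coefficient d_i + b_i; this coefficient reappears in the illustrative expressions below.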
In step three, combining knowledge of optimal control with the defined formation error, the cost function is constructed from the formation error and the control input, with weights built from C = diag{c_1, c_2, …, c_i, …, c_n}, two set constants w_1 and w_2, the identity matrix I_m of appropriate dimension, and the tensor (Kronecker) product ⊗.
According to the obtained cost function, the corresponding value function is established and the optimal control input u_i^* is introduced; the corresponding optimal value function is finally obtained as the integral, from the current time to infinity, of the cost evaluated along the trajectory driven by the optimal control input, where τ denotes the integration variable.
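As a concrete illustration (the exact weighting in the filing appears only as an equation image; the quadratic form below is an assumed standard choice consistent with the constants w_1, w_2 and the identity matrix I_m named above):

$r_i(e_i, u_i) = w_1\, e_i^{\top} I_m\, e_i + w_2\, u_i^{\top} I_m\, u_i, \qquad V_i^{*}\bigl(e_i(t)\bigr) = \min_{u_i} \int_{t}^{\infty} r_i\bigl(e_i(\tau), u_i(\tau)\bigr)\, \mathrm{d}\tau$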
In step four, a distributed solution is set up from the established optimal value function, giving the Hamilton-Jacobi-Bellman equation, which balances the cost function against the gradient of the optimal value function along the formation-error dynamics. Taking the partial derivative of this equation with respect to the control input u_i and setting it to zero yields the expression of the optimal control input in terms of the gradient of the optimal value function.
from this equation, it can be seen that the optimal control input required in this embodiment is a quantity related to the derivative of the optimal value function. However, due to the non-linearities of the multi-agent system and the unknown model, the derivative terms of the optimal value function are actually difficult to solve, which also results in a difficult solution of the optimal control input.
Step five: neural network algorithms have been shown to have a powerful approximation that can approximate nonlinear functions. For multiple agentsUnknown nonlinear term f existing in the system i (x i ) Approximate estimation is performed by introducing a neural network:
Figure BDA0004067605080000078
wherein ,
Figure BDA0004067605080000079
representing an ideal neural network weight matrix; s is S fi (x i ) Representing a basis function vector; e-shaped article fi (x i ) Representing the approximation error.
Due to
Figure BDA0004067605080000081
For theoretical analysis only but in practice an unknown matrix, so an estimated matrix is introduced +.>
Figure BDA0004067605080000082
Estimating to obtain the approximation of the neural network identifier>
Figure BDA0004067605080000083
The following are provided:
Figure BDA0004067605080000084
Figure BDA0004067605080000085
representing a pair of actual nonlinear functions f generated by introducing a neural network method i (x i ) Is a function of the approximation of (a).
From the resulting approximation function f̂_i(x_i), the related variables and their corresponding estimated values are then obtained.
The estimated matrix Ŵ_fi needs to be updated; by design, the corresponding update law is expressed in terms of T_i, a positive definite matrix, and θ_i, a positive constant.
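A minimal numerical sketch of such a neural-network identifier is given below. The Gaussian basis, the identification error x̃_i used to drive the update, and the σ-modification form of the law are assumptions for illustration; the filing only states that the law involves a positive definite matrix T_i and a positive constant θ_i.

import numpy as np

class RBFIdentifier:
    """Identifier f_hat_i(x) = W_hat_fi' S_fi(x) for the unknown drift f_i(x)."""

    def __init__(self, centers, width, T, theta):
        self.centers = centers              # (N, m) Gaussian RBF centers (assumed basis)
        self.width = width                  # common RBF width
        self.T = T                          # positive definite gain matrix T_i
        self.theta = theta                  # positive constant theta_i
        self.W = np.zeros(centers.shape)    # estimated weight matrix W_hat_fi, shape (N, m)

    def basis(self, x):
        """Basis function vector S_fi(x), shape (N,)."""
        return np.exp(-np.sum((self.centers - x) ** 2, axis=1) / (2.0 * self.width ** 2))

    def predict(self, x):
        """Approximation f_hat_i(x) = W_hat_fi' S_fi(x), shape (m,)."""
        return self.W.T @ self.basis(x)

    def update(self, x, x_tilde, dt):
        """Assumed law dW/dt = T_i (S_fi(x) x_tilde' - theta_i W), driven by the identification error x_tilde."""
        dW = self.T @ (np.outer(self.basis(x), x_tilde) - self.theta * self.W)
        self.W += dt * dW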
Step six: from the above-mentioned approximation variables obtained through the neural network, the optimum value function is divided to obtain a divided optimum value function expression form as follows:
Figure BDA00040676050800000812
wherein ,ki Representing a constant term greater than zero; and is also provided with
Figure BDA00040676050800000813
The expression of (2) is
Figure BDA00040676050800000814
Expressed by this isolated value function, combined with an optimal control expression pattern
Figure BDA00040676050800000815
The isolated expression pattern for optimal control was obtained as follows:
Figure BDA00040676050800000816
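As an illustration of the splitting idea (the exact decomposition appears only as an equation image in the filing; the form below is a commonly used assumption in single-critic adaptive dynamic programming, with V_i^o denoting the remaining unknown part and d_i + b_i the assumed input coefficient from above):

$V_i^{*}(e_i) = \tfrac{1}{2}\, k_i\, e_i^{\top} e_i + V_i^{o}(e_i), \qquad u_i^{*} = -\frac{d_i + b_i}{2 w_2}\Bigl(k_i\, e_i + \nabla V_i^{o}(e_i)\Bigr)$

The known term k_i e_i then acts as a direct stabilizing feedback, so that only the remaining unknown part V_i^o needs to be approximated by the critic network introduced in step seven.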
step seven: by introducing a reinforcement learning structure based on a single commentator, performing approximate evaluation on the segmented optimal value function and the optimal control input, and obtaining the following expression:
Figure BDA00040676050800000817
Figure BDA00040676050800000818
wherein ,
Figure BDA00040676050800000819
representing an introduced estimated critic parameter matrix; s is S i Representing the neural network radial basis functions. In the traditional actor commentator reinforcement learning structure, an actor network is required to execute control actions in a controller, and the commentator network only needs to evaluate the control actions in an optimal value function and feed back to the actor network for correction. In the invention, the actor network is removed by design, and the critics neural network is required to bear the responsibility of the actor network in the traditional method, namely, to execute the control action besides evaluating the control action.
The parameter matrix Ŵ_ci of the critic network also needs to be updated; its update law is expressed in terms of k_ci, the learning rate of the critic network, and an auxiliary term Φ_i.
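A sketch of one single-critic iteration is given below. The approximation V̂_i = ½ k_i e'e + Ŵ_ci' S_i(e), the normalized gradient step on the HJB residual, and the regressor used for Φ_i are assumptions made for this illustration; the filing names only the learning rate k_ci and the term Φ_i.

import numpy as np

def critic_step(e, W_c, centers, width, k_i, w2, gain, k_ci, cost, e_dot, dt):
    """One single-critic iteration: compute the control and update the critic weights.

    gain  : assumed input coefficient d_i + b_i of the formation-error dynamics
    cost  : callable r_i(e, u) returning the instantaneous cost (assumed quadratic)
    e_dot : current formation-error derivative (from measurements or the identifier)
    """
    diff = centers - e                                            # (N, m)
    S = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * width ** 2))   # basis S_i(e), shape (N,)
    dS = (diff / width ** 2) * S[:, None]                         # gradient of S_i w.r.t. e, shape (N, m)

    grad_V = k_i * e + dS.T @ W_c          # gradient of the split value-function estimate
    u_hat = -gain / (2.0 * w2) * grad_V    # control executed directly by the critic

    delta = cost(e, u_hat) + grad_V @ e_dot                       # HJB residual under the current estimate
    Phi = dS @ e_dot                                              # assumed regressor playing the role of Phi_i
    W_c = W_c - dt * k_ci * delta * Phi / (1.0 + Phi @ Phi)       # normalized gradient correction
    return u_hat, W_c

Because the same network both produces u_hat and is corrected from the residual delta, no separate actor network is needed.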
To demonstrate that the optimal control input based on the single-critic reinforcement learning structure provided by this embodiment can realize the formation motion of a nonlinear multi-agent system, corresponding simulation experiments were carried out. The nonlinear multi-agent system takes the first-order form given above, with agent-dependent parameters h_i = -0.7, 0.1, -0.5, 0.1, and the initial positions of the four follower agents are set as x_i(0) = [4, 4]^T, [-4, 4]^T, [4, -4]^T, [-4, -4]^T. The leader agent is assigned a desired motion trajectory, with initial position [0, 0]^T.
The information exchange between the agents needs to use the related knowledge of graph theory, and the matrix A is a communication weight matrix for describing the follower agents and the neighbor follower agents, and is expressed as follows:
Figure BDA0004067605080000097
the matrix B is a communication weight matrix for expressing the follower agent and the leader agent, and is expressed as follows:
B=diag{1,0,0,0}
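A minimal setup sketch for this example is given below. The adjacency matrix A and the desired offsets ζ_i are placeholders, since their exact values appear only as images in the filing; B, the follower initial positions and the leader initial position are as stated above.

import numpy as np

n = 4                                                    # four follower agents in the plane
B = np.diag([1.0, 0.0, 0.0, 0.0])                        # only follower 1 is connected to the leader
A = np.array([[0, 0, 0, 0],                              # placeholder adjacency matrix (assumed chain
              [1, 0, 0, 0],                              #  1 -> 2 -> 3 -> 4, for illustration only)
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

x0 = np.array([[4.0, 4.0], [-4.0, 4.0], [4.0, -4.0], [-4.0, -4.0]])    # follower initial positions
p_l0 = np.array([0.0, 0.0])                                            # leader initial position
zeta = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])  # assumed desired offsets

def formation_errors(x, p_l):
    """z_i = x_i - p_l - zeta_i and e_i = sum_j a_ij (z_i - z_j) + b_i z_i."""
    z = x - p_l - zeta
    e = B.diagonal()[:, None] * z
    for i in range(n):
        for j in range(n):
            e[i] += A[i, j] * (z[i] - z[j])
    return z, e

z, e = formation_errors(x0, p_l0)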
Fig. 3 shows the communication topology of the nonlinear multi-agent system in an embodiment of the present invention; the multi-agent system of the conventional actor-critic method used for comparison adopts the same communication topology for inter-agent communication, which facilitates the comparison. Fig. 4 shows the multi-agent formation trajectories of an embodiment of the present invention, where it can be seen that the four follower agents follow the trajectory of the leader agent well. Fig. 5 shows the formation velocity trajectories of the multi-agent system, in which the four follower agents keep up with the velocity of the leader agent. Fig. 6 compares the position error of the proposed method with that of the conventional actor-critic method; the position error of the proposed method is smaller than that obtained by the conventional method. Fig. 7 compares the velocity errors of the two methods, which are relatively close. Fig. 8 compares the computation times of the actor-critic method and the single-critic reinforcement learning structure; the computation time of the single-critic reinforcement learning structure provided by the invention is shorter than that of the actor-critic method, and the time saving grows as the number of iterations increases.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any person skilled in the art may use the disclosed technical content to make modifications or alterations into equivalent embodiments. However, any simple modification, equivalent variation or alteration of the above embodiments made according to the technical substance of the present invention still falls within the protection scope of the technical solution of the present invention.

Claims (8)

1. A multi-agent formation control method based on a single-critic reinforcement learning structure, characterized by comprising the following steps:
step one: based on graph theory in applied mathematics, constructing the communication structure of the agents of the multi-agent system; the system is considered to be a first-order multi-agent system in which each agent obtains only the position information of its neighbor agents; meanwhile, one leader agent exists in the system, and the other agents, as followers, move along the trajectory of the leader agent during operation;
step two: for each agent in the system, constructing the tracking error of the agent relative to the leader agent from the neighbor information it obtains, and from the tracking error constructing the error describing the agent with respect to the leader and its neighbor agents, namely the formation error;
step three: constructing, based on optimal control, a cost function and a value function related to the formation error and the optimal control input;
step four: based on the Taylor formula and the value function obtained in step three, expanding the value function to obtain the corresponding Hamilton-Jacobi-Bellman equation;
step five: for the Hamilton-Jacobi-Bellman equation obtained in step four, taking the partial derivative with respect to the control to obtain the expression of the optimal control input in terms of the optimal value function;
step six: decomposing the optimal value function to obtain its expression in terms of the formation error and an unknown function, and obtaining the decomposed form of the optimal control input from the optimal control input expression of step five;
step seven: introducing a single-critic reinforcement learning structure and solving the decomposed optimal value function and optimal control input obtained in step six in combination with a neural network, wherein the neural network approximates the unknown nonlinear terms in the multi-agent system, and the critic network carries out the formation control of the agent system while the effect of the formation control is evaluated and improved.
2. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 1, characterized in that the single-critic reinforcement learning structure removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
3. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 1, characterized in that in step one, the model of the multi-agent system is expressed as:
$\dot{x}_i(t) = f_i(x_i(t)) + u_i(t)$
where x_i(t) denotes the position of the i-th agent in the system, u_i(t) denotes the control input of the i-th agent, and f_i(·) denotes an unknown nonlinear function that is assumed to be Lipschitz continuous;
the model of the leader agent is:
$\dot{p}_l = v_l$
where p_l and v_l denote the trajectory and velocity of the leader, i.e., the desired trajectory and velocity of the formation motion;
the tracking error of each agent relative to the leader is set as:
$z_i = x_i - p_l - \zeta_i$
where ζ_i denotes the desired offset between the leader agent and the i-th follower agent, describing the formation shape of the system;
from the structure of the tracking error, the formation error is defined as:
$e_i = \sum_{j \in \Lambda_i} a_{ij}(z_i - z_j) + b_i z_i$
where a_ij is the entry in row i, column j of the adjacency matrix from graph theory, b_i is the connection weight between the i-th follower agent and the leader agent, and Λ_i denotes the neighbor set of the i-th agent.
4. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 3, characterized in that in step three, combining the defined formation error, the cost function is constructed from the formation error and the control input, with weights built from C = diag{c_1, c_2, …, c_i, …, c_n}, two set constants w_1 and w_2, the identity matrix I_m of appropriate dimension, and the tensor (Kronecker) product ⊗;
according to the obtained cost function, the corresponding value function is established and the optimal control input u_i^* is introduced; the corresponding optimal value function is finally obtained as the integral, from the current time to infinity, of the cost evaluated along the trajectory driven by the optimal control input, where τ denotes the integration variable.
5. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 4, characterized in that in step four, the Hamilton-Jacobi-Bellman equation is established from the cost function and the gradient of the optimal value function along the formation-error dynamics; taking the partial derivative of this equation with respect to the control input and setting it to zero yields the expression of the optimal control input in terms of the gradient of the optimal value function.
6. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 5, characterized in that the unknown nonlinear term f_i(x_i) present in the multi-agent system is approximated by introducing a neural network:
$f_i(x_i) = W_{fi}^{*\top} S_{fi}(x_i) + \epsilon_{fi}(x_i)$
where W_fi^* denotes the ideal neural network weight matrix, S_fi(x_i) denotes the basis function vector, and ε_fi(x_i) denotes the approximation error;
since W_fi^* is used only for theoretical analysis and is in practice an unknown matrix, an estimated matrix Ŵ_fi is introduced, giving the neural network identifier approximation
$\hat{f}_i(x_i) = \hat{W}_{fi}^{\top} S_{fi}(x_i)$
and the estimates of the other variables are obtained from the resulting approximation function f̂_i(x_i).
7. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 6, characterized in that, by splitting parameters, the optimal value function and the optimal control input are converted into a decomposed form in which the optimal value function is written as a known term in the formation error, scaled by a constant k_i > 0, plus a remaining unknown function, and the optimal control input is rewritten accordingly.
8. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 7, characterized in that, after the single-critic reinforcement learning structure is introduced, the optimal value function and the optimal control input are approximated by the critic network, where Ŵ_ci denotes the introduced estimate of the critic network parameter matrix and S_i denotes the neural network radial basis function vector; the update law of the critic network parameter matrix Ŵ_ci is expressed in terms of k_ci, the learning rate of the critic network, and an auxiliary term Φ_i.
CN202310081638.1A 2023-01-19 2023-01-19 Multi-agent formation control method based on single-critic reinforcement learning structure Pending CN116185020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310081638.1A CN116185020A (en) 2023-01-19 2023-01-19 Multi-agent formation control method based on single-critic reinforcement learning structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310081638.1A CN116185020A (en) 2023-01-19 2023-01-19 Multi-agent formation control method based on single-critic reinforcement learning structure

Publications (1)

Publication Number Publication Date
CN116185020A 2023-05-30

Family

ID=86435960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310081638.1A Pending CN116185020A (en) Multi-agent formation control method based on single-critic reinforcement learning structure

Country Status (1)

Country Link
CN (1) CN116185020A (en)

Similar Documents

Publication Publication Date Title
CN110597061B (en) Multi-agent fully-distributed active-disturbance-rejection time-varying formation control method
Badgwell et al. Reinforcement learning–overview of recent progress and implications for process control
CN110181508B (en) Three-dimensional route planning method and system for underwater robot
CN111897224B (en) Multi-agent formation control method based on actor-critic reinforcement learning and fuzzy logic
Xu et al. Two-layer distributed hybrid affine formation control of networked Euler–Lagrange systems
CN110658821A (en) Multi-robot anti-interference grouping time-varying formation control method and system
CN112947575B (en) Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning
Shi et al. A learning approach to image-based visual servoing with a bagging method of velocity calculations
CN112947086B (en) Self-adaptive compensation method for actuator faults in formation control of heterogeneous multi-agent system consisting of unmanned aerial vehicle and unmanned vehicle
Wang et al. Command filter based globally stable adaptive neural control for cooperative path following of multiple underactuated autonomous underwater vehicles with partial knowledge of the reference speed
CN113900380A (en) Robust output formation tracking control method and system for heterogeneous cluster system
CN114237041B (en) Space-ground cooperative fixed time fault tolerance control method based on preset performance
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
CN113472242A (en) Anti-interference self-adaptive fuzzy sliding film cooperative control method based on multiple intelligent agents
Belmonte-Baeza et al. Meta reinforcement learning for optimal design of legged robots
CN111798494A (en) Maneuvering target robust tracking method under generalized correlation entropy criterion
Kim et al. TOAST: Trajectory Optimization and Simultaneous Tracking Using Shared Neural Network Dynamics
CN111176324B (en) Method for avoiding dynamic obstacle by multi-unmanned aerial vehicle distributed collaborative formation
Fan et al. Spatiotemporal path tracking via deep reinforcement learning of robot for manufacturing internal logistics
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
CN116449703A (en) AUH formation cooperative control method under finite time frame
CN116185020A (en) Multi-agent formation control method based on single-critic reinforcement learning structure
CN114063438B (en) Data-driven multi-agent system PID control protocol self-learning method
CN114200830B (en) Multi-agent consistency reinforcement learning control method
Hwang et al. Fuzzy adaptive finite-time cooperative control with input saturation for nonlinear multiagent systems and its application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination