CN116185020A - Multi-agent formation control method based on a single-critic reinforcement learning structure - Google Patents
Multi-agent formation control method based on a single-critic reinforcement learning structure
- Publication number
- CN116185020A (application number CN202310081638.1A)
- Authority
- CN
- China
- Prior art keywords
- agent
- formation
- reinforcement learning
- optimal
- value function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0219—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Abstract
The invention relates to a multi-agent formation control method based on a single-critic reinforcement learning structure, comprising the following steps: constructing the communication structure of each agent in the multi-agent system; constructing the tracking error of each agent relative to the leader agent, and from it constructing an error describing each agent's relation to the leader and to its neighbor agents, namely the formation error; constructing, based on optimal control, a cost function and a value function related to the formation error and the optimal control input; expanding the value function to construct the corresponding Hamilton-Jacobi-Bellman (HJB) equation; taking the partial derivative of the HJB equation with respect to the control input to obtain the optimal control input expressed in terms of the optimal value function; decomposing the optimal value function to obtain a decomposed form of the optimal control input; and introducing a single-critic reinforcement learning structure and solving the decomposed optimal value function and optimal control input with a neural network. The method helps reduce estimation error and computation time.
Description
Technical Field
The invention belongs to the technical field of multi-agent formation control, and in particular relates to a multi-agent formation control method based on a single-critic reinforcement learning structure.
Background
A multi-agent system comprises autonomous, interacting entities that share a common environment; each agent can sense the environment and act accordingly. Formation control is one application field of multi-agent systems, with application scenarios such as satellite formations, underwater robots, and unmanned aerial vehicle flight. Extensive research has produced many formation control methods, such as the leader-follower method and the virtual structure method. The leader-follower method, a simple formation control algorithm with good scalability, is currently widely applied to multi-agent formation; its strategy is to set one agent as the leader with a prescribed trajectory, and then design controllers so that the other, follower agents track the leader's trajectory.
Optimal control is an effective method for balancing control performance against control resource consumption; it achieves the control goal by minimizing a cost function. Dynamic programming, one method of optimal control, has wide application value; its basic idea is to decompose the solution of a large problem into a number of smaller subproblems. However, the backward-in-time nature of its solution process and the curse of dimensionality have hindered its further application and development. Adaptive dynamic programming combines the optimal control method with a reinforcement learning structure, overcomes these shortcomings of dynamic programming, and can estimate unknown equations through function approximation. Combining the actor-critic reinforcement learning structure with optimal control can effectively address the difficulty of solving the Hamilton-Jacobi-Bellman (HJB) equation in the optimal controller. However, this approach requires iterating both the actor and critic networks, which introduces additional computational error and lengthens computation time.
Designing a reinforcement-learning-based formation control method for multi-agent systems that reduces computation time and computation error therefore remains an open problem. To address it, the invention removes the actor network and redesigns the critic network's update strategy, so that the critic evaluates and corrects performance in time while also executing the control behavior. The method effectively reduces computation time and estimation error, and ensures that the formation behavior of the nonlinear multi-agent system is successfully completed.
Disclosure of Invention
The invention aims to provide a multi-agent formation control method based on a single-critic reinforcement learning structure that helps reduce estimation error and computation time.
To achieve the above purpose, the invention adopts the following technical scheme: a multi-agent formation control method based on a single-critic reinforcement learning structure, comprising the following steps:
step one: based on graph theory in applied mathematics, constructing the communication structure of each agent in the multi-agent system; the system is considered a first-order multi-agent system, and each agent obtains only the position information of its neighbor agents; meanwhile, one leader agent exists in the system, and the other agents, as followers, move along the leader's trajectory during operation;
step two: for each agent in the system, constructing its tracking error relative to the leader agent from the obtained neighbor information, and from the tracking errors constructing the error describing the agent's relation to the leader and to its neighbor agents, namely the formation error;
step three: constructing, based on optimal control, a cost function and a value function related to the formation error and the optimal control input;
step four: based on the Taylor formula and the value function obtained in step three, expanding the value function to obtain the corresponding Hamilton-Jacobi-Bellman (HJB) equation;
step five: for the HJB equation obtained in step four, taking the partial derivative with respect to the optimal control to obtain the optimal control input expressed in terms of the optimal value function;
step six: decomposing the optimal value function to obtain its expression in terms of the formation error and an unknown function, and from the expression of step five obtaining the decomposed form of the optimal control input;
step seven: introducing the single-critic reinforcement learning structure and solving the decomposed optimal value function and optimal control input obtained in step six with a neural network, wherein the neural network approximates the unknown nonlinear terms in the multi-agent system and the critic network performs the formation control of the agent system while evaluating and improving its effect.
Further, the single-critic reinforcement learning structure removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
Further, in step one, the model of the multi-agent system is expressed as:
ẋ_i(t) = f_i(x_i(t)) + u_i(t)
where x_i(t) denotes the position of the ith agent in the system; u_i(t) denotes the control input of the ith agent; f_i(·) denotes an unknown nonlinear function, assumed to be Lipschitz continuous.
The model of the leader agent is:
ṗ_l(t) = v_l(t)
where p_l and v_l denote the trajectory and speed of the leader, respectively, i.e., the desired trajectory and speed of the formation motion. The tracking error of each agent relative to the leader is set as:
z_i = x_i - p_l - ζ_i
where ζ_i denotes the desired offset between the leader agent and the ith follower agent, describing the formation shape of the system.
From the structure of the tracking errors, the formation error is defined as:
e_i = Σ_{j∈Λ_i} a_ij (z_i - z_j) + b_i z_i
where a_ij is the entry in row i, column j of the adjacency matrix from graph theory; b_i is the connection weight between the ith follower agent and the leader agent; Λ_i denotes the neighbor set of the ith agent.
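Under the definitions above, the tracking and formation errors can be computed directly from the adjacency matrix; a minimal sketch follows (the function name and array layout are assumptions for illustration):

```python
import numpy as np

def formation_errors(x, p_l, zeta, A, b):
    """Tracking errors z_i = x_i - p_l - zeta_i and formation errors
    e_i = sum_j a_ij (z_i - z_j) + b_i z_i.
    x, zeta: (n, m) follower positions/offsets; p_l: (m,) leader position;
    A: (n, n) adjacency matrix; b: (n,) leader connection weights."""
    z = x - p_l - zeta                        # tracking errors
    n = x.shape[0]
    e = np.zeros_like(z)
    for i in range(n):
        for j in range(n):
            e[i] += A[i, j] * (z[i] - z[j])   # neighbor-relative terms
        e[i] += b[i] * z[i]                   # leader-relative term
    return z, e
```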
Further, in step three, the cost function is expressed, in combination with the defined formation errors, as a quadratic form of the formation errors and control inputs weighted by C = diag{c_1, c_2, …, c_i, …, c_n} and two set constants w_1 and w_2, where I_m is an identity matrix of appropriate dimension and ⊗ is the tensor (Kronecker) product.
From the obtained cost function, the corresponding value function is established, and the optimal control input u_i* is introduced; the corresponding optimal value function is finally obtained as the integral of the cost along the optimally controlled trajectory, where τ denotes the integration variable.
Further, in step four, the Hamilton-Jacobi-Bellman equation is established as follows:
further, for unknown nonlinear term f existing within the multi-intelligent system i (x i ) Approximate estimation is performed by introducing a neural network:
wherein ,representing an ideal neural network weight matrix; s is S fi (x i ) Representing a basis function vector; e-shaped article fi (x i ) Representing an approximation error;
due toFor theoretical analysis only but in practice an unknown matrix, so an estimated matrix is introduced +.>Estimating to obtain the approximation of the neural network identifier>The following are provided:
Further, by means of a decomposition parameter, the optimal value function and the optimal control input are converted into the following form:
where k_i denotes a constant term greater than zero.
Further, after the single-critic reinforcement learning structure is introduced, the optimal value function and the optimal control input are expressed as follows:
where Ŵ_ci denotes the introduced estimated critic network parameter matrix and S_i denotes the neural network radial basis function vector. The update law of the critic network parameter matrix Ŵ_ci is expressed as follows:
where k_ci denotes the learning rate of the critic network, and the specific expression of φ_i is as follows:
Compared with the prior art, the invention has the following beneficial effects: aiming at the additional computational error and longer computation time caused by the actor-critic dual-network iteration in multi-agent formation control methods based on the traditional actor-critic reinforcement learning structure, the invention provides a multi-agent formation control method based on a single-critic reinforcement learning structure. The method effectively reduces computation time and estimation error, and ensures that the formation behavior of the nonlinear multi-agent system is successfully completed.
Drawings
FIG. 1 is a block diagram of the traditional actor-critic reinforcement learning structure in the prior art;
FIG. 2 is a block diagram of the single-critic reinforcement learning structure in an embodiment of the present invention;
FIG. 3 is the communication topology of the nonlinear multi-agent system in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the multi-agent formation trajectories in an embodiment of the invention;
FIG. 5 is a schematic diagram of the multi-agent formation speed trajectories in an embodiment of the invention;
FIG. 6 compares the position errors of the proposed method with those of the traditional actor-critic method in an embodiment of the present invention;
FIG. 7 compares the velocity errors of the proposed method with those of the traditional actor-critic method in an embodiment of the present invention;
FIG. 8 compares the computation times of the actor-critic and single-critic reinforcement learning structures in an embodiment of the present invention;
FIG. 9 is a flow chart of the method implementation in an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
This embodiment starts from the formation requirements of nonlinear multi-agent systems and provides a multi-agent formation control method based on a single-critic reinforcement learning structure. As shown in fig. 9, the method comprises the following steps:
step one: based on graph theory in application mathematics, the communication structure of each intelligent agent of the multi-intelligent agent system is constructed, the system is considered to be a first-order multi-intelligent agent system, and each intelligent agent only obtains the position information of the neighbor intelligent agent. Meanwhile, one navigator intelligent body exists in the system, and other intelligent bodies serve as followers to move along the track of the navigator intelligent body in the running process.
Step two: for each agent in the system, the tracking error of the agent relative to the navigator agent is constructed according to the neighbor agent information obtained by the agent, and the errors of the agent and the navigator and the neighbor agent, namely the formation error, are constructed according to the tracking error.
Step three: a cost function and a value function associated with the formation error and the optimal control input are constructed based on the optimal control.
Step four: and (3) based on the Taylor formula and the value function obtained in the step three, performing expansion solution on the value function to obtain a corresponding Hamiltonian-Gu Kebi-Bellman (HJB) equation.
Step five: and aiming at the HJB equation obtained in the step four, solving the bias guide about the optimal control to obtain the expression form of the optimal control input about the optimal value function.
Step six: dividing the optimal value function to obtain the expression form of the optimal value function about formation errors and unknown functions, and obtaining the divided optimal control input form according to the optimal control input expression form of the step five.
Step seven: and D, introducing a single commentator reinforcement learning structure, and solving the segmented optimal value function and optimal control input obtained in the step six by combining a neural network, wherein the neural network approximates unknown nonlinear terms in the multi-agent system, the commentator network carries out formation control of the agent system, and the effect of the formation control is evaluated and improved.
The single-critic reinforcement learning structure of this embodiment is shown in fig. 2. It removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
In step one, the model of the multi-agent system is expressed as:
ẋ_i(t) = f_i(x_i(t)) + u_i(t)
where x_i(t) denotes the position of the ith agent in the system; u_i(t) denotes the control input of the ith agent; f_i(·) denotes an unknown nonlinear function, assumed here to be Lipschitz continuous.
The expected trajectory of the leader agent evolves according to:
ṗ_l(t) = v_l(t)
where p_l and v_l denote the trajectory and speed of the leader, respectively, i.e., the desired trajectory and speed of the formation motion.
In this embodiment, a communication topology diagram of the nonlinear multi-agent system is shown in fig. 3.
In step two, according to the constructed leader-follower multi-agent system model, the tracking error of each agent relative to the leader is set as:
z_i = x_i - p_l - ζ_i
where ζ_i denotes the desired offset between the leader agent and the ith follower agent, describing the formation shape of the system.
From the structure of the tracking errors, the formation error is defined as:
e_i = Σ_{j∈Λ_i} a_ij (z_i - z_j) + b_i z_i
where a_ij is the entry in row i, column j of the adjacency matrix from graph theory; b_i is the connection weight between the ith follower agent and the leader agent; Λ_i denotes the neighbor set of the ith agent.
In step three, combining optimal control theory with the defined formation errors, the cost function is expressed as a quadratic form of the formation errors and control inputs weighted by C = diag{c_1, c_2, …, c_i, …, c_n} and two set constants w_1 and w_2, where I_m is an identity matrix of appropriate dimension and ⊗ is the tensor (Kronecker) product.
From the obtained cost function, the corresponding value function is established, and the optimal control input u_i* is introduced; the corresponding optimal value function is finally obtained as the integral of the cost along the optimally controlled trajectory, where τ denotes the integration variable.
In step four, a distributed solution method is established from the constructed optimal value function, and the Hamilton-Jacobi-Bellman equation is obtained as follows:
Taking the partial derivative of the above equation with respect to the optimal control input yields the expression of the optimal control input:
From this expression it can be seen that the optimal control input required in this embodiment depends on the derivative of the optimal value function. However, owing to the nonlinearity of the multi-agent system and the unknown model, the derivative terms of the optimal value function are in practice difficult to solve, which in turn makes the optimal control input difficult to solve.
Step five: neural network algorithms have been shown to have a powerful approximation that can approximate nonlinear functions. For multiple agentsUnknown nonlinear term f existing in the system i (x i ) Approximate estimation is performed by introducing a neural network:
wherein ,representing an ideal neural network weight matrix; s is S fi (x i ) Representing a basis function vector; e-shaped article fi (x i ) Representing the approximation error.
Due toFor theoretical analysis only but in practice an unknown matrix, so an estimated matrix is introduced +.>Estimating to obtain the approximation of the neural network identifier>The following are provided:
representing a pair of actual nonlinear functions f generated by introducing a neural network method i (x i ) Is a function of the approximation of (a).
From the resulting approximation functionCan thus get->Related oneSome variables-> and />And the corresponding estimated values.
Estimating the matrixThe update is needed, and the corresponding update law can be expressed as the following form by design:
wherein ,Ti Representing a positive definite matrix; θ i Representing a positive constant.
Step six: using the approximation variables obtained through the neural network, the optimal value function is decomposed, giving the decomposed optimal value function in the following form:
where k_i denotes a constant term greater than zero.
Combining this decomposed value function with the expression for the optimal control, the decomposed form of the optimal control is obtained as follows:
step seven: by introducing a reinforcement learning structure based on a single commentator, performing approximate evaluation on the segmented optimal value function and the optimal control input, and obtaining the following expression:
wherein ,representing an introduced estimated critic parameter matrix; s is S i Representing the neural network radial basis functions. In the traditional actor commentator reinforcement learning structure, an actor network is required to execute control actions in a controller, and the commentator network only needs to evaluate the control actions in an optimal value function and feed back to the actor network for correction. In the invention, the actor network is removed by design, and the critics neural network is required to bear the responsibility of the actor network in the traditional method, namely, to execute the control action besides evaluating the control action.
The parameter matrix Ŵ_ci of the critic network must also be updated, with an update law expressed in the following form:
where k_ci denotes the learning rate of the critic network, and the specific expression of φ_i is as follows:
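The patent's exact regressor φ_i appears only in its figures and is not reproduced here; as a hedged sketch, critic update laws of this family commonly take a normalized gradient-descent form on the squared HJB residual, for example:

```python
import numpy as np

def critic_weight_step(W_c, phi, delta, k_c, dt):
    """One Euler step of a generic critic update
    W_c_dot = -k_c * phi * delta / (1 + phi^T phi),
    where phi is the critic regressor and delta the HJB residual.
    Illustrative form only, not the patent's exact law."""
    denom = 1.0 + float(phi @ phi)
    return W_c - k_c * phi * delta / denom * dt
```

The normalization by 1 + φᵀφ is a common design choice that keeps the step size bounded regardless of the regressor's magnitude.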
in order to prove that the optimal control input based on the single commentator reinforcement learning structure provided by the embodiment can realize the formation movement behavior of the nonlinear multi-agent system, corresponding simulation experiments are carried out, and the expression form of the nonlinear multi-agent system is as follows:
wherein ,hi =-0.7,0.1,-0.5,0.1;And the initial positions of the four follower intelligent agents are set as x i (0)=[4,4] T ,[-4,4] T ,[4,-4] T ,[-4,-4] T 。
The expected movement track set by the leader agent is:
wherein the initial position of the leader agent is [0,0] T 。
The information exchange between agents relies on graph theory. The matrix A is the communication weight matrix describing the follower agents and their neighbor follower agents, expressed as follows:
The matrix B is the communication weight matrix between the follower agents and the leader agent, expressed as:
B = diag{1, 0, 0, 0}
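The given quantities of the embodiment (B = diag{1,0,0,0}, the four initial follower positions, and the leader's initial position) can be assembled as below. The adjacency matrix A shown here is an illustrative placeholder for a topology in which each follower exchanges information with two neighbors, since the patent's numerical A appears only in its figures:

```python
import numpy as np

# Leader-connection weights, as given: only follower 1 hears the leader.
B = np.diag([1.0, 0.0, 0.0, 0.0])

# Initial follower positions, as given in the embodiment.
x0 = np.array([[4.0, 4.0], [-4.0, 4.0], [4.0, -4.0], [-4.0, -4.0]])

# Leader initial position, as given.
p0 = np.array([0.0, 0.0])

# Placeholder ring-topology adjacency matrix (illustrative values only).
A = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]])
```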
Fig. 3 shows the communication topology of the nonlinear multi-agent system in an embodiment of the present invention; the multi-agent system of the traditional actor-critic method used for comparison adopts the same communication topology for inter-agent communication, which facilitates the comparison. Fig. 4 shows the multi-agent formation trajectories of the embodiment, where the four follower agents can be seen to follow the trajectory of the leader agent well. Fig. 5 shows the formation speed trajectories, in which the four follower agents keep up with the speed of the leader agent. Fig. 6 compares the position errors of the embodiment with those of the traditional actor-critic method; the position errors of the proposed method are smaller than those obtained by the traditional method. Fig. 7 compares the speed errors of the embodiment with those of the traditional method; the speed errors of the two methods are relatively close. Fig. 8 compares the computation time of the actor-critic method with that of the single-critic reinforcement learning structure; the computation time of the proposed single-critic method is shorter, and the time saved grows as the number of iterations increases.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any person skilled in the art may use the disclosed technical content to make modifications or variations into equivalent embodiments. Any simple modification, equivalent variation, or alteration of the above embodiments made according to the technical substance of the present invention still falls within the protection scope of the technical solution of the present invention.
Claims (8)
1. A multi-agent formation control method based on a single-critic reinforcement learning structure, characterized by comprising the following steps:
step one: based on graph theory in applied mathematics, constructing the communication structure of each agent in the multi-agent system; the system is considered a first-order multi-agent system, and each agent obtains only the position information of its neighbor agents; meanwhile, one leader agent exists in the system, and the other agents, as followers, move along the leader's trajectory during operation;
step two: for each agent in the system, constructing its tracking error relative to the leader agent from the obtained neighbor information, and from the tracking errors constructing the error describing the agent's relation to the leader and to its neighbor agents, namely the formation error;
step three: constructing, based on optimal control, a cost function and a value function related to the formation error and the optimal control input;
step four: based on the Taylor formula and the value function obtained in step three, expanding the value function to obtain the corresponding Hamilton-Jacobi-Bellman equation;
step five: for the Hamilton-Jacobi-Bellman equation obtained in step four, taking the partial derivative with respect to the optimal control to obtain the optimal control input expressed in terms of the optimal value function;
step six: decomposing the optimal value function to obtain its expression in terms of the formation error and an unknown function, and from the expression of step five obtaining the decomposed form of the optimal control input;
step seven: introducing the single-critic reinforcement learning structure and solving the decomposed optimal value function and optimal control input obtained in step six with a neural network, wherein the neural network approximates the unknown nonlinear terms in the multi-agent system and the critic network performs the formation control of the agent system while evaluating and improving its effect.
2. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 1, wherein the single-critic reinforcement learning structure removes the need for the actor network of the traditional actor-critic reinforcement learning method, thereby effectively reducing the approximation error of the system and the computation time.
3. The multi-agent formation control method based on the single-commentator reinforcement learning structure according to claim 1, wherein in the first step, a model of the multi-agent system is expressed as:
wherein x_i(t) represents the position of the ith agent in the system; u_i(t) represents the control input of the ith agent in the system; f_i(·) represents an unknown nonlinear function, assumed to be Lipschitz continuous;
the model of the leader agent is as follows:

ṗ_l(t) = v_l(t)
wherein p_l and v_l represent the trajectory and velocity of the leader, respectively, i.e., the desired trajectory and velocity of the formation motion;
the tracking error of each agent relative to the leader is set as follows:
z_i = x_i - p_l - ζ_i
wherein ζ_i represents the desired relative position between the leader agent and the ith follower agent, describing the formation shape of the system;
from the structure of the tracking error, the formation error is defined as follows:

e_i = Σ_{j∈Λ_i} a_ij (z_i - z_j) + b_i z_i
wherein a_ij is the entry in row i, column j of the adjacency matrix in graph theory; b_i is the connection weight between the ith follower agent and the leader agent; Λ_i denotes the neighbor set of the ith agent.
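The tracking and formation errors defined above can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def formation_errors(x, p_l, zeta, A, b):
    """Tracking and formation errors for a leader-follower formation.

    x    : (n, m) follower positions x_i
    p_l  : (m,)   leader position
    zeta : (n, m) desired offsets ζ_i defining the formation shape
    A    : (n, n) adjacency matrix with entries a_ij
    b    : (n,)   leader connection weights b_i
    Returns (z, e): tracking errors z_i = x_i - p_l - ζ_i and
    formation errors e_i = Σ_j a_ij (z_i - z_j) + b_i z_i.
    """
    z = x - p_l - zeta                    # tracking errors
    deg = A.sum(axis=1, keepdims=True)    # row sums Σ_j a_ij
    e = deg * z - A @ z + b[:, None] * z  # Σ_j a_ij (z_i - z_j) + b_i z_i
    return z, e
```

The sum over j ∈ Λ_i is implemented through the adjacency matrix, since a_ij = 0 for non-neighbors.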
4. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 3, wherein in step three, the expression of the cost function is obtained by combining the defined formation error:
wherein C = diag{c_1, c_2, …, c_i, …, c_n}; w_1 and w_2 are two preset constants; I_m is an identity matrix of appropriate dimension; ⊗ is the Kronecker product symbol;
according to the obtained cost function, the corresponding value function is established, and the optimal control input u_i* is introduced to finally obtain the optimal value function, expressed as follows:
where τ represents the integration variable.
5. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 4, wherein in step four, the Hamilton-Jacobi-Bellman equation is established as follows:
6. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 5, wherein the unknown nonlinear term f_i(x_i) in the multi-agent system is approximately estimated by introducing a neural network:

f_i(x_i) = W_fi*ᵀ S_fi(x_i) + ε_fi(x_i)

wherein W_fi* represents the ideal neural network weight matrix; S_fi(x_i) represents the basis function vector; ε_fi(x_i) represents the approximation error;
since W_fi* is used only for theoretical analysis and is in practice an unknown matrix, an estimated matrix Ŵ_fi is introduced, giving the neural network identifier approximation f̂_i(x_i) as follows:

f̂_i(x_i) = Ŵ_fiᵀ S_fi(x_i)
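The identifier above can be sketched with Gaussian radial basis functions; the specific basis form, centers, and widths here are assumptions for illustration, not the patent's exact choices:

```python
import numpy as np

def rbf_basis(x, centers, width=1.0):
    """Basis function vector S_fi(x), assumed Gaussian: exp(-|x - c|^2 / 2σ^2)."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

def identifier(x, W_hat, centers, width=1.0):
    """Neural network identifier estimate f_hat(x) = W_hat^T S_fi(x),
    using the estimated weight matrix W_hat in place of the unknown ideal weights."""
    return W_hat.T @ rbf_basis(x, centers, width)
```

Only the estimated weights Ŵ_fi appear in the implementation; the ideal weights W_fi* and the approximation error ε_fi exist only in the analysis.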
7. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 6, wherein the optimal value function and the optimal control input are converted, by splitting the parameters, into the following expressions:
8. The multi-agent formation control method based on the single-critic reinforcement learning structure according to claim 7, wherein the expressions of the optimal value function and the optimal control input after the single-critic reinforcement learning structure is introduced are as follows:
wherein Ŵ_ci represents the introduced estimated critic network parameter matrix; S_i represents the neural network radial basis function; the update law of the critic network parameter matrix Ŵ_ci is expressed as follows:
wherein k_ci represents the learning rate of the critic network, and φ_i is given by:
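Since the exact regressor φ_i is not reproduced above, the update law can only be sketched generically; the normalized gradient-descent form below is a common choice in single-critic adaptive dynamic programming and is an assumption, not the patent's exact law:

```python
import numpy as np

def critic_update(W_c, phi, delta, k_c=0.5, dt=0.01):
    """One Euler step of an assumed normalized gradient-descent critic update:
        W_c_dot = -k_c * delta * phi / (1 + phi^T phi)^2
    where phi is the critic regressor (the patent's φ_i), delta is the
    Bellman/HJB residual being driven to zero, and k_c is the learning
    rate corresponding to k_ci.
    """
    norm = (1.0 + phi @ phi) ** 2   # normalization keeps the step bounded
    return W_c - dt * k_c * delta * phi / norm
```

The (1 + φᵀφ)² normalization is a standard device to keep the weight update bounded regardless of the regressor magnitude.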
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310081638.1A CN116185020A (en) | 2023-01-19 | 2023-01-19 | Multi-agent formation control method based on single commentator reinforcement learning structure |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116185020A true CN116185020A (en) | 2023-05-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||