CN112434792A - Reinforcement learning algorithm for cooperative communication and control of a multi-agent system - Google Patents
- Publication number
- CN112434792A (application number CN202011278974.8A)
- Authority
- CN
- China
- Prior art keywords
- agent
- communication
- control
- message
- strategy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Abstract
This patent discloses a reinforcement learning method for communication and control in a multi-agent system. The method targets multi-agent systems that share information by sending and receiving messages over a communication network with a given topology. Through training, the system constructs a communication strategy and a control strategy on each agent, so that agents extract effective low-dimensional communication information from the high-dimensional raw input of their sensing devices and the system as a whole achieves efficient information sharing and cooperative control. The method reduces the design complexity of communication and control strategies for multi-agent systems with complex dynamics and high-dimensional observations, while also reducing the communication load between agents.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and machine learning, and relates to a reinforcement learning algorithm for cooperative communication and control of a multi-agent system.
Background
A multi-agent system is made up of multiple interacting agents, each with certain sensing, computing, and actuation capabilities and able to communicate with other agents over a communication network. The objective of multi-agent cooperative communication and control is to design reasonable communication and control strategies so that the agents cooperate with one another and thereby complete tasks that a single agent could hardly complete alone. In practical applications the agents can be embodied as different entities, such as aircraft, mobile robots, traffic lights, and power network nodes, and, once given reasonable communication and control strategies, they play an important role in applications such as aircraft formation flying, cooperative transport by mobile robots, intelligent control of urban traffic networks, and smart grid control. Existing multi-agent control algorithms establish a differential-equation model of each agent by analyzing its kinematics, design a corresponding communication protocol and controller based on control theory, and thus solve problems such as consensus, formation, and optimization. However, when an agent's kinematics are complex and its sensing data are high-dimensional, it is difficult to describe the agent with differential equations, so such agents cannot be controlled with existing control-theoretic methods.
Disclosure of Invention
In view of the above problems, the present invention provides a reinforcement learning algorithm for cooperative communication and control of a multi-agent system. The algorithm targets a multi-agent system that shares information by sending and receiving messages over a communication network with a given topology, so that through training the system constructs a communication strategy and a control strategy on each agent, the whole system achieves efficient information sharing, and the cooperative control task is finally completed. The multi-agent system comprises N agents; any agent is denoted agent-i, where i is the agent's unique serial number in the system. Each agent observes the external environment; the observation is denoted o_i^t, a real vector, where the superscript t and subscript i indicate that agent-i observes the external environment at time t. Each agent executes control actions that affect its own state and the external environment; the control action is denoted a_i^t, a real vector, where the superscript t and subscript i indicate the control action executed by agent-i at time t. The multi-agent system communicates over a specific communication network: each agent can send a message to one determined agent and receive a message from one specific agent. The sending and receiving of messages among the agents obey a specific communication topology, namely a directed ring: agent-i can only receive the message sent by agent-(i-1), while it sends its own message to agent-(i+1);
the message received by agent-i is denoted m_i^t, where the superscript t and subscript i indicate the message received by agent-i at time t; this is also the message sent by agent-(i-1) at time t;
the message sent by agent-i is denoted m_{i+1}^t, where the superscript t and subscript i+1 indicate the message sent by agent-i at time t; this is also the message received by agent-(i+1) at time t;
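The directed ring topology described above can be captured by a small helper function (illustrative only; the 1-based agent numbering follows the text, and the wrap-around behaviour matches the variant described later for agent-1 and agent-N):

```python
def ring_neighbors(i: int, n: int) -> tuple[int, int]:
    """For agent-i in a directed ring of n agents, return (sender, receiver):
    agent-i receives a message from `sender` and sends its own message to
    `receiver`, with agent-1 receiving from agent-n and agent-n sending to
    agent-1 (1-based numbering as in the patent text)."""
    sender = n if i == 1 else i - 1
    receiver = 1 if i == n else i + 1
    return sender, receiver
```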
the communication strategy constructed on agent-i is represented by a neural network, specifically

m_{i+1}^t = π_{θ_i}(o_i^t)

where θ_i denotes all weight parameters of the neural network; the input of the communication strategy is agent-i's observation o_i^t, and its output is the transmitted message m_{i+1}^t. For agent-N in particular, m_1^t = π_{θ_N}(o_N^t);
The control strategy constructed on the agent-i is represented by a neural network, specifically represented as
Wherein muiThe inputs of the control strategy are the observed results of the agent-i for all weight parameters of the neural networkAnd the received messageThe output of the control strategy being a control action
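As a concrete illustration, the two per-agent networks might be sketched in PyTorch as below; the layer sizes, hidden widths, and tanh activations are assumptions, since the patent does not specify network architectures:

```python
import torch
import torch.nn as nn

class CommPolicy(nn.Module):
    """Communication strategy pi_{theta_i}: observation o_i -> outgoing message.
    The tanh output bounds the low-dimensional message (an assumption)."""
    def __init__(self, obs_dim: int, msg_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, msg_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class ControlPolicy(nn.Module):
    """Control strategy pi_{mu_i}: (observation o_i, received message m_i) -> action a_i."""
    def __init__(self, obs_dim: int, msg_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, msg: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, msg], dim=-1))
```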
The reinforcement learning algorithm for the communication and control of the multi-agent system comprises the following algorithm steps:
step S1, constructing a global evaluator, represented by a neural network and specifically denoted Q_w(O^t, A^t), where w denotes all weight parameters of the neural network, O^t = {o_1^t, o_2^t, ..., o_N^t} is the set of all agents' observations of the external environment at time t, and A^t = {a_1^t, a_2^t, ..., a_N^t} is the set of all agents' control actions at time t;
step S2, initializing the global evaluator neural network, all agents' communication strategy neural networks and all agents' control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t) with weight parameters w'; its structure is identical to the global evaluator neural network and its initial weight parameters are also identical;
step S3, each agent interacts with the environment over a number of time steps while sending and receiving messages, and the interaction experience data are stored;
step S4, randomly extracting K sets of experience data from the experience data memory: [e_1, e_2, ..., e_K], where any extracted set is denoted e_j = (O_j, A_j, r_j, O'_j), the subscript j being its sequence number among the K extracted sets; O_j can be expanded as O_j = {o_{j1}, o_{j2}, ..., o_{jN}}, and likewise A_j = {a_{j1}, a_{j2}, ..., a_{jN}} and O'_j = {o'_{j1}, o'_{j2}, ..., o'_{jN}};
step S5, training the global evaluator Q_w(O^t, A^t) using the extracted experience data;
step S6, training the communication strategy of each agent using the extracted experience data; the weight parameters θ_i of agent-i's communication strategy network are updated by

θ_i ← θ_i + α ∇_{θ_i} Q_w(O_j, Ã_j^ρ)

where α is the update rate of the network weight parameters and the subscript ρ is the number of the agent that receives the message sent by agent-i; specifically, ρ = i+1 for agents 1 ... (N-1), and ρ = 1 for agent-N. The message m_ρ in the formula is computed by substituting the corresponding observation into agent-i's communication strategy:

m_ρ = π_{θ_i}(o_{ji})

Here Ã_j^ρ is a set of control actions in which all actions other than a_ρ are taken from the experience data A_j, and a_ρ is computed by

a_ρ = π_{μ_ρ}(o_{jρ}, m_ρ)
step S7, training the control strategy of each agent using the extracted experience data; the weight parameters μ_i of agent-i's control strategy network are updated by

μ_i ← μ_i + α ∇_{μ_i} Q_w(O_j, Ã_j^i)

where the message m_i is computed by substituting the corresponding observation into agent-(i-1)'s communication strategy:

m_i = π_{θ_{i-1}}(o_{j,i-1})

Here Ã_j^i is a set of control actions in which all actions other than a_i are taken from the experience data A_j, and a_i is computed by

a_i = π_{μ_i}(o_{ji}, m_i)
other unexplained symbols in the formula have the same meaning as the symbols in the formula described in step S6;
and S8, repeating steps S3 to S7 with the new communication and control strategies, iteratively updating each agent's communication strategy and control strategy neural networks until the weight parameters of all agents' communication strategy and control strategy neural networks converge; the algorithm then ends, yielding the cooperative communication and control strategies of the multi-agent system.
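Steps S1 to S8 together form a centralized-critic, decentralized-actor training scheme in the spirit of MADDPG. The following PyTorch sketch is an illustrative reconstruction of one training iteration, not the patent's exact procedure: the network shapes (`mlp`, `OBS`, `MSG`, `ACT`), the SGD optimizers, and the hyperparameter values are all assumptions.

```python
import torch
import torch.nn as nn

N, OBS, MSG, ACT, K = 3, 8, 2, 3, 32  # assumed dimensions and batch size

def mlp(i, o, h=64):
    return nn.Sequential(nn.Linear(i, h), nn.ReLU(), nn.Linear(h, o))

# Steps S1/S2: global evaluator Q_w, its target Q_w', and per-agent policies.
critic = mlp(N * (OBS + ACT), 1)
target = mlp(N * (OBS + ACT), 1)
target.load_state_dict(critic.state_dict())     # identical initial weights
comm = [mlp(OBS, MSG) for _ in range(N)]        # pi_{theta_i}
ctrl = [mlp(OBS + MSG, ACT) for _ in range(N)]  # pi_{mu_i}

def Q(net, O, A):
    """Evaluate Q(O, A); O: [K, N, OBS], A: [K, N, ACT]."""
    return net(torch.cat([O.flatten(1), A.flatten(1)], dim=-1))

def actions_from_policies(O):
    """Recompute all messages and actions from current policies (ring topology)."""
    msgs = [comm[(i - 1) % N](O[:, (i - 1) % N]) for i in range(N)]  # m_i from agent i-1
    acts = [ctrl[i](torch.cat([O[:, i], msgs[i]], dim=-1)) for i in range(N)]
    return torch.stack(acts, dim=1)

def train_step(O, A, r, O_next, gamma=0.99, alpha=1e-3, eta=0.01):
    # Step S5: regress the critic toward y_j = r_j + gamma * Q_w'(O', A').
    with torch.no_grad():
        y = r + gamma * Q(target, O_next, actions_from_policies(O_next))
    critic_loss = ((y - Q(critic, O, A)) ** 2).mean()
    opt_c = torch.optim.SGD(critic.parameters(), lr=alpha)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    for i in range(N):
        # Step S7: update mu_i via agent i's own recomputed action.
        A_new = A.clone()
        m = comm[(i - 1) % N](O[:, (i - 1) % N]).detach()  # m_i, sent by agent i-1
        A_new[:, i] = ctrl[i](torch.cat([O[:, i], m], dim=-1))
        loss_ctrl = -Q(critic, O, A_new).mean()            # ascend Q
        opt = torch.optim.SGD(ctrl[i].parameters(), lr=alpha)
        opt.zero_grad(); loss_ctrl.backward(); opt.step()

        # Step S6: update theta_i via the receiver's (rho = i+1) recomputed action.
        rho = (i + 1) % N
        A_new = A.clone()
        m_rho = comm[i](O[:, i])                           # message sent by agent i
        A_new[:, rho] = ctrl[rho](torch.cat([O[:, rho], m_rho], dim=-1))
        loss_comm = -Q(critic, O, A_new).mean()
        opt = torch.optim.SGD(comm[i].parameters(), lr=alpha)
        opt.zero_grad(); loss_comm.backward(); opt.step()

    # Step S5-3 style soft target update: w' <- eta * w + (1 - eta) * w'.
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target.parameters()):
            tp.mul_(1 - eta).add_(eta * p)
    return critic_loss.item()
```

In a full run this step would sit inside the outer loop of step S8, alternating with fresh experience collection (step S3).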
As a further improvement of the invention, agent-1 can only receive the message sent by agent-N while sending its message to agent-2; agent-N can only receive the message sent by agent-(N-1) and sends its message to agent-1.
As a further improvement of the present invention, step S3 specifically comprises steps S3-1 to S3-6:

step S3-1, each agent-i observes the environment and obtains the observation o_i^t;

step S3-2, agent-i computes the message m_{i+1}^t = π_{θ_i}(o_i^t) using its communication strategy and then sends the message to agent-(i+1);

step S3-3, agent-i receives the message m_i^t, obtains the control action a_i^t = π_{μ_i}(o_i^t, m_i^t) through its control strategy, and then executes that action;

step S3-4, agent-i observes the environment to obtain the next-step observation o'_i^t and obtains the scalar feedback reward r^t given by the external environment;
step S3-5, collecting from steps S3-1 to S3-4 the observation set O^t = {o_1^t, ..., o_N^t}, the next-step observation set O'^t = {o'_1^t, ..., o'_N^t}, the action set A^t = {a_1^t, ..., a_N^t}, and the feedback reward r^t, combining the three sets and the scalar into one experience tuple, denoted e_k = (O^t, A^t, r^t, O'^t)_k, and storing it in the experience data memory, where the subscript k is the sequence number of this tuple in the experience data memory;
step S3-6, repeating steps S3-1 to S3-5 until sufficient interaction experience data have been obtained.
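The interaction loop of steps S3-1 to S3-6 can be sketched in plain Python; the environment callable, the policy callables, and the buffer capacity below are illustrative placeholders, not taken from the patent:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience memory storing tuples e_k = (O, A, r, O_next)  (step S3-5)."""
    def __init__(self, capacity: int = 100_000):
        self.data = deque(maxlen=capacity)

    def push(self, O, A, r, O_next):
        self.data.append((O, A, r, O_next))

    def sample(self, k: int):
        """Step S4: draw K experience tuples uniformly at random."""
        return random.sample(self.data, k)

def collect_step(obs, comm_policies, ctrl_policies, env_step, buffer):
    """One interaction step for N agents on a directed ring (steps S3-1 to S3-5).
    `obs[i]` is agent i's observation (0-based here); the policies are plain
    callables; `env_step` applies the joint action and returns (next_obs, reward)."""
    n = len(obs)
    msgs_out = [comm_policies[i](obs[i]) for i in range(n)]          # S3-2
    msgs_in = [msgs_out[(i - 1) % n] for i in range(n)]              # i receives from i-1
    acts = [ctrl_policies[i](obs[i], msgs_in[i]) for i in range(n)]  # S3-3
    next_obs, reward = env_step(acts)                                # S3-4
    buffer.push(list(obs), acts, reward, next_obs)                   # S3-5
    return next_obs
```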
As a further improvement of the invention, step S5 specifically comprises steps S5-1 to S5-3:
step S5-1, computing the gradient of the global evaluator loss function L(w):

L(w) = (1/K) Σ_{j=1}^{K} (y_j - Q_w(O_j, A_j))^2,  y_j = r_j + γ Q_{w'}(O'_j, A'_j)

where the target network of the global evaluator neural network is used in the computation, γ is the attenuation (discount) factor, and ∇ is the gradient symbol. A'_j is obtained from the current control strategies, the next-step observations, and the messages: each agent's next-step observation o'_{j1}, ..., o'_{jN} is taken from the experience data O'_j, and each agent's next-step message m'_{j1}, ..., m'_{jN} is computed by substituting the next-step observations into the communication strategies:

m'_{ji} = π_{θ_{i-1}}(o'_{j,i-1}),  a'_{ji} = π_{μ_i}(o'_{ji}, m'_{ji}),  A'_j = {a'_{j1}, ..., a'_{jN}}
step S5-2, updating the weight parameters of the global evaluator neural network using the gradient of the loss function:

w ← w - α ∇_w L(w)

where the arrow ← is the update assignment symbol and α is the update rate of the network weight parameters;
step S5-3, updating the weight parameters w' of the global evaluator target network Q_{w'}(O^t, A^t) according to the following formula:
w′←ηw+(1-η)w′
wherein η is the update rate of the target network weight parameters.
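The three sub-steps of step S5 can be collected into equations; this is a reconstruction from the surrounding text, since the patent's original equation images are not reproduced here:

```latex
% Global-evaluator (critic) training, steps S5-1 to S5-3:
L(w) = \frac{1}{K}\sum_{j=1}^{K}\bigl(y_j - Q_w(O_j, A_j)\bigr)^2,
\qquad y_j = r_j + \gamma\, Q_{w'}(O'_j, A'_j),
\\
m'_{ji} = \pi_{\theta_{i-1}}\bigl(o'_{j,i-1}\bigr),
\qquad a'_{ji} = \pi_{\mu_i}\bigl(o'_{ji}, m'_{ji}\bigr),
\\
w \leftarrow w - \alpha \nabla_w L(w),
\qquad w' \leftarrow \eta\, w + (1-\eta)\, w'.
```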
Advantageous effects:
1) each agent can extract effective communication information from the high-dimensional raw input of its sensing devices, so the multi-agent system achieves efficient information sharing while the communication load is reduced;
2) the agents' communication and control strategies are constructed by a reinforcement learning algorithm, without analyzing or mathematically modeling the agents' dynamics, which reduces the design complexity of communication and control strategies for multi-agent systems with complex dynamics and high-dimensional observations.
Drawings
FIG. 1 is a diagram of the composition and information-processing flow of a mobile robot according to the present invention;
FIG. 2 is a diagram of the multi-mobile-robot communication network structure of the present invention.
Detailed Description
The invention is described in further detail below with reference to specific embodiments and the accompanying drawings:
the invention provides a reinforcement learning algorithm for cooperative communication and control of a multi-agent system, which aims at the multi-agent system for information sharing by sending and receiving messages through a communication network with a certain topological structure, and provides the reinforcement learning algorithm, so that the multi-agent system can construct a communication strategy and a control strategy on each agent through training, the whole multi-agent system can realize efficient information sharing, and finally, the task of cooperative control is completed.
The following describes an embodiment of the method disclosed by the invention for a multi-agent system consisting of three mobile robots and performing cooperative control.
Fig. 1 shows a mobile robot composition and an information processing flow.
101 is the robot's sensing device, which observes external environmental obstacles, the robot's own position, and so on, and converts them into measurement data. The sensing device may comprise common mobile-robot sensors such as a camera, an ultrasonic rangefinder, a laser rangefinder, an odometer, a global positioning system receiver, and an ultra-wideband positioning receiver.
102 is a communication device (receive) for receiving communication messages; 106 is a communication device (transmit) for transmitting communication messages. The communication devices may be common mobile-robot communication hardware, including Bluetooth, ZigBee, and WiFi communication modules.
107 is the driving device, which drives the mobile robot's motion according to the input control quantity; it may comprise common mobile-robot drive hardware such as motors, wheels, tracks, and steering servos.
The communication strategy module processes information as follows: the measurement data obtained by the sensing device are fed into the communication strategy neural network, which computes the outgoing message; the message is passed to the communication device (transmit) and then sent to another mobile robot.
The control strategy module processes information as follows: the measurement data obtained by the sensing device and the message received by the communication device (receive) are fed into the control strategy neural network, which computes the control quantity; the control quantity is passed to the driving device to control the mobile robot's motion.
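The on-board information flow through the two modules can be sketched as follows; all names and callables here are illustrative placeholders rather than the patent's own interfaces:

```python
def robot_step(measurement, received_msg, comm_net, ctrl_net, send, drive):
    """On-board information flow of one mobile robot (a hedged sketch of the
    module wiring in FIG. 1): sensing data -> communication strategy ->
    transmitted message, and (sensing data, received message) ->
    control strategy -> drive command."""
    outgoing = comm_net(measurement)                 # communication strategy module
    send(outgoing)                                   # communication device (transmit)
    control = ctrl_net(measurement, received_msg)    # control strategy module
    drive(control)                                   # driving device
    return outgoing, control
```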
Fig. 2 is a diagram of a multi-mobile robot communication network architecture. The network is divided into two parts: a communication network between the mobile robot and the mobile robot (indicated by solid arrows in fig. 2), and a communication network between the mobile robot and the central training computer (indicated by dashed arrows in fig. 2).
The communication network between the mobile robots works as follows: the first mobile robot receives the message sent by the third mobile robot and sends its message to the second mobile robot; the second mobile robot receives the message sent by the first mobile robot and sends its message to the third mobile robot; the third mobile robot receives the message sent by the second mobile robot and sends its message to the first mobile robot.
The communication network between the mobile robots and the central training computer works as follows: each mobile robot transmits its measurement data, control actions, and the weight parameters of its control strategy and communication strategy neural networks to the central training computer, and the central training computer sends the updated weight parameters of the control strategy and communication strategy neural networks back to each mobile robot.
The specific execution mode of the reinforcement learning algorithm for cooperative communication and control of the multi-agent system on the cooperative control task of the three mobile robots is given by the following steps:
step S1, constructing a global evaluator on the central training computer, represented by a neural network and denoted Q_w(O^t, A^t), where w denotes all weight parameters of the neural network, O^t = {o_1^t, o_2^t, o_3^t} is the set of the three mobile robots' observations at time t, and A^t = {a_1^t, a_2^t, a_3^t} is the set of their control actions at time t. A communication strategy neural network π_{θ_i} and a control strategy neural network π_{μ_i} are constructed on the information processing computer of each mobile robot.
Step S2, initializing the global evaluator neural network and the three mobile robots' communication strategy and control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t) with weight parameters w'; its structure is identical to the global evaluator neural network and its initial weight parameters are also identical.
In step S3, the three mobile robots observe the environment, communicate with each other, and communicate with the central training computer; the concrete steps are steps S3-1 to S3-7.
Step S3-1, the three mobile robots observe the environment with their sensing devices; the first obtains observation o_1^t, the second obtains o_2^t, and the third obtains o_3^t.
Step S3-2, the first mobile robot computes the message m_2^t = π_{θ_1}(o_1^t) with its communication strategy module and sends it to the second mobile robot via its communication device; the second computes m_3^t = π_{θ_2}(o_2^t) and sends it to the third; the third computes m_1^t = π_{θ_3}(o_3^t) and sends it to the first. The three mobile robots' computation and transmission in this step may be performed simultaneously.
Step S3-3, the first mobile robot receives the message m_1^t with its communication device and obtains the control action a_1^t = π_{μ_1}(o_1^t, m_1^t) through its control strategy module; the second receives m_2^t and obtains a_2^t = π_{μ_2}(o_2^t, m_2^t); the third receives m_3^t and obtains a_3^t = π_{μ_3}(o_3^t, m_3^t). Each robot inputs its control action to its driving device, which drives its motion; in this step, receiving the message, computing the control action, and inputting it to the driving device may be performed simultaneously by the three mobile robots.
Step S3-4, the three mobile robots observe the environment with their sensing devices to obtain the next-step observations o'_1^t, o'_2^t, o'_3^t, and obtain the scalar feedback reward r^t given by the external environment.
Step S3-5, the three mobile robots transmit the environment observations, control actions, and feedback reward obtained in steps S3-1 to S3-4 to the central training computer through their communication devices.
Step S3-6, the central training computer receives the three mobile robots' observation set O^t = {o_1^t, o_2^t, o_3^t}, next-step observation set O'^t = {o'_1^t, o'_2^t, o'_3^t}, control action set A^t = {a_1^t, a_2^t, a_3^t}, and feedback reward r^t. The received data are merged into an experience tuple e_k = (O^t, A^t, r^t, O'^t)_k, which is saved to the experience data memory of the central training computer.
Step S3-7, repeating steps S3-1 to S3-6 until the central training computer has saved sufficient interaction experience data (more than 10,000 experience tuples).
Step S4, randomly extracting K sets of experience data from the experience data memory on the central training computer: [e_1, e_2, ..., e_K], where any extracted set is denoted e_j = (O_j, A_j, r_j, O'_j), the subscript j being its sequence number among the K extracted sets; O_j can be expanded as O_j = {o_{j1}, o_{j2}, o_{j3}}, and likewise A_j = {a_{j1}, a_{j2}, a_{j3}} and O'_j = {o'_{j1}, o'_{j2}, o'_{j3}}.
Step S5, training the global evaluator Q_w(O^t, A^t) on the central training computer with the extracted experience data; this comprises steps S5-1 to S5-3.
Step S5-1, computing the gradient of the global evaluator loss function L(w):

L(w) = (1/K) Σ_{j=1}^{K} (y_j - Q_w(O_j, A_j))^2,  y_j = r_j + γ Q_{w'}(O'_j, A'_j)

where the target network of the global evaluator neural network is used in the computation, γ is the attenuation (discount) factor, and ∇ is the gradient symbol. A'_j is obtained from the current control strategies, the next-step observations, and the messages: the robots' next-step observations o'_{j1}, o'_{j2}, o'_{j3} are taken from the experience data O'_j, and the next-step messages m'_{j1}, m'_{j2}, m'_{j3} are computed by substituting the next-step observations into each robot's communication strategy:

m'_{j2} = π_{θ_1}(o'_{j1}),  m'_{j3} = π_{θ_2}(o'_{j2}),  m'_{j1} = π_{θ_3}(o'_{j3}),  a'_{ji} = π_{μ_i}(o'_{ji}, m'_{ji})
step S5-2, updating weight parameters of the global evaluator neural network by using the gradient of the global evaluator loss function, wherein the specific updating method is as follows:
wherein arrow ← is the update assignment symbol, and α is the update rate of the network weight parameter.
Step S5-3, updating the weight parameters w' of the global evaluator target network Q_{w'}(O^t, A^t) according to the following formula:
w′←ηw+(1-η)w′
wherein η is the update rate of the target network weight parameters.
Step S6, computing new communication strategy neural network weight parameters for each mobile robot using the extracted experience data. For the first mobile robot, the new communication strategy weight parameters θ'_1 are computed by

θ'_1 = θ_1 + α ∇_{θ_1} Q_w(O_j, {a_{j1}, a_2, a_{j3}})

The message m_2 in the formula is computed by substituting the corresponding observation into the first mobile robot's communication strategy:

m_2 = π_{θ_1}(o_{j1})

Here {a_{j1}, a_2, a_{j3}} is a set of control actions in which all actions other than a_2 are taken from the experience data A_j, and a_2 is computed by

a_2 = π_{μ_2}(o_{j2}, m_2)
for the second and third mobile robots, a similar method can be used to obtain weight parameter θ 'of the new communication policy neural network'2And θ'3。
Step S7, computing new control strategy neural network weight parameters for each mobile robot using the extracted experience data. For the first mobile robot, the new control strategy neural network weight parameters μ'_1 are computed by

μ'_1 = μ_1 + α ∇_{μ_1} Q_w(O_j, {a_1, a_{j2}, a_{j3}})

The message m_1 is computed by substituting the corresponding observation into the third mobile robot's communication strategy:

m_1 = π_{θ_3}(o_{j3})

Here {a_1, a_{j2}, a_{j3}} is a set of control actions in which all actions other than a_1 are taken from the experience data A_j, and a_1 is computed by

a_1 = π_{μ_1}(o_{j1}, m_1)
other unexplained symbols in the formula are the same as the symbols in the formula described in step S6. For the second and third mobile robots, a similar method can be used to obtain a new control strategy neural network weight parameter μ'2And mu'3。
Step S8, the central training computer sends the first mobile robot's new communication strategy and control strategy neural network parameters θ'_1 and μ'_1 to the information processing computer of the first mobile robot through the communication devices, updating the previous communication strategy and control strategy neural networks. The communication strategy and control strategy neural networks of the second and third mobile robots are transmitted and updated in the same way.
Step S9, repeating steps S3 to S8 with the new communication and control strategies, iteratively updating the three mobile robots' communication strategy and control strategy neural networks until the weight parameters of all the robots' communication strategy and control strategy neural networks converge, thereby obtaining the cooperative communication and control strategies for the three mobile robots.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.
Claims (4)
1. A reinforcement learning algorithm for cooperative communication and control of a multi-agent system, characterized in that the multi-agent system comprises N agents, wherein any agent is denoted agent-i, i being the agent's unique serial number in the multi-agent system; the agent observes the external environment, the observation being denoted o_i^t, a real vector, where the superscript t and subscript i indicate that agent-i observes the external environment at time t; the agent executes control actions that affect its own state and the external environment, the control action being denoted a_i^t, a real vector, where the superscript t and subscript i indicate the control action executed by agent-i at time t; the multi-agent system communicates over a specific communication network, each agent being able to send a message to one determined agent and to receive a message from one specific agent; the sending and receiving of messages among the agents obey a specific communication topology, namely a directed ring: agent-i can only receive the message sent by agent-(i-1), while it sends its own message to agent-(i+1);
the message received by agent-i is denoted m_i^t, where the superscript t and subscript i indicate the message received by agent-i at time t; this is also the message sent by agent-(i-1) at time t;
the message sent by agent-i is denoted m_{i+1}^t, where the superscript t and subscript i+1 indicate the message sent by agent-i at time t; this is also the message received by agent-(i+1) at time t;
the communication strategy constructed on agent-i is represented by a neural network, specifically

m_{i+1}^t = π_{θ_i}(o_i^t)

where θ_i denotes all weight parameters of the neural network; the input of the communication strategy is agent-i's observation o_i^t, and its output is the transmitted message m_{i+1}^t. For agent-N in particular, m_1^t = π_{θ_N}(o_N^t);
The control strategy constructed on the agent-i is represented by a neural network, specifically represented as
Wherein muiThe inputs of the control strategy are the observed results of the agent-i for all weight parameters of the neural networkAnd the received messageThe output of the control strategy being a control action
The reinforcement learning algorithm for the communication and control of the multi-agent system comprises the following algorithm steps:
step S1, constructing a global evaluator, represented by a neural network and specifically denoted Q_w(O^t, A^t), where w denotes all weight parameters of the neural network, O^t = {o_1^t, o_2^t, ..., o_N^t} is the set of all agents' observations of the external environment at time t, and A^t = {a_1^t, a_2^t, ..., a_N^t} is the set of all agents' control actions at time t;
step S2, initializing the global evaluator neural network, all agents' communication strategy neural networks and all agents' control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t) with weight parameters w'; its structure is identical to the global evaluator neural network and its initial weight parameters are also identical;
step S3, each agent interacts with the environment over a plurality of time steps while sending and receiving messages, and the interaction experience data are stored;
step S4, randomly extracting K sets of empirical data from the empirical data memory: [e_1, e_2, ..., e_K]; any extracted set of empirical data is denoted as e_j = (O_j, A_j, r_j, O'_j), wherein the subscript j represents the sequence number within the K extracted sets, O_j can be expanded as O_j = {o_{j1}, o_{j2}, ..., o_{jN}}, and likewise A_j = {a_{j1}, a_{j2}, ..., a_{jN}} and O'_j = {o'_{j1}, o'_{j2}, ..., o'_{jN}};
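The empirical data memory and the random extraction of K tuples in step S4 can be sketched with a bounded buffer; the class name and the capacity are illustrative assumptions, not part of the claim:

```python
import random
from collections import deque

class ReplayBuffer:
    """Empirical data memory holding tuples e_k = (O, A, r, O_next)."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)  # oldest tuples are discarded first

    def add(self, O, A, r, O_next):
        self.buf.append((O, A, r, O_next))

    def sample(self, K):
        """Randomly extract K sets of empirical data (step S4)."""
        return random.sample(self.buf, K)
```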
step S5, training the global evaluator Q_w(O_t, A_t) using the extracted empirical data;
step S6, training the communication strategy π_{θ_i} of each agent using the extracted empirical data; the weight parameter θ_i of the communication strategy network of agent-i is updated using the following formula:
θ_i ← θ_i + α · (1/K) · Σ_{j=1}^{K} ∇_{θ_i} π_{θ_i}(o_{ji}) ∇_{m_ρ} π_{μ_ρ}(o_{jρ}, m_ρ) ∇_{a_ρ} Q_w(O_j, {a_{j1}, ..., a_ρ, ..., a_{jN}})
wherein α is the update rate of the network weight parameter, and the subscript ρ indicates the number of the agent that receives the message sent by agent-i; specifically, for agents 1 to (N-1), ρ = i+1, and for agent-N, ρ = 1. The message m_ρ in the formula is calculated by substituting the corresponding observation value into the communication strategy of agent-i:
m_ρ = π_{θ_i}(o_{ji})
In the formula, {a_{j1}, ..., a_ρ, ..., a_{jN}} is a set of control actions; the actions other than a_ρ are extracted from the empirical data A_j, and a_ρ is calculated by the following formula:
a_ρ = π_{μ_ρ}(o_{jρ}, m_ρ)
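The direction of the step-S6 update, ascending the global evaluator's value with respect to θ_i through the receiving agent's action, can be illustrated with a numeric-gradient sketch; `q_value` is an assumed callable standing in for the composition of the communication strategy, the receiving agent's control strategy, and the global evaluator, and is not the claimed analytic formula:

```python
import numpy as np

def comm_policy_grad_ascent(theta_i, alpha, batch, q_value, eps=1e-5):
    """Numeric-gradient sketch of the step-S6 update for theta_i.
    q_value(theta, sample) evaluates the critic after substituting
    m_rho = pi_theta(o_ji) into the receiver's control policy; here it
    is an assumed callable. The average critic value over the K sampled
    tuples is ascended by central finite differences."""
    grad = np.zeros_like(theta_i)
    for k in range(theta_i.size):
        e = np.zeros_like(theta_i)
        e.flat[k] = eps
        grad.flat[k] = np.mean([(q_value(theta_i + e, s) - q_value(theta_i - e, s))
                                / (2 * eps) for s in batch])
    return theta_i + alpha * grad
```

In practice the analytic chain-rule gradient of the claim replaces the finite differences; the sketch only shows that θ_i moves so as to increase the critic's value.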
step S7, training the control strategy of each agent using the extracted empirical data; the weight parameter μ_i of the control strategy network of agent-i is updated using the following formula:
μ_i ← μ_i + α · (1/K) · Σ_{j=1}^{K} ∇_{μ_i} π_{μ_i}(o_{ji}, m_i) ∇_{a_i} Q_w(O_j, {a_{j1}, ..., a_i, ..., a_{jN}})
wherein the message m_i is calculated by substituting the corresponding observation value into the communication strategy of agent-(i-1):
m_i = π_{θ_{i-1}}(o_{j,i-1})
In the formula, {a_{j1}, ..., a_i, ..., a_{jN}} is a set of control actions; the actions other than a_i are extracted from the empirical data A_j, and a_i is calculated by the following formula:
a_i = π_{μ_i}(o_{ji}, m_i)
the other unexplained symbols in the formulas have the same meanings as the corresponding symbols described in step S6;
step S8, repeatedly executing steps S3 to S7 with the new communication strategies and control strategies, iteratively updating the communication-strategy and control-strategy neural networks of each agent until the weight parameters of all agent communication-strategy and control-strategy neural networks converge; the algorithm then ends, yielding the cooperative communication strategies and control strategies of the multi-agent system.
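The outer iteration of step S8, repeating S3 to S7 until the weight parameters converge, can be sketched as a fixed-point loop; `update_step`, the tolerance, and the iteration cap are illustrative assumptions:

```python
import numpy as np

def train(update_step, params, tol=1e-4, max_iters=1000):
    """Iterate the inner steps until all policy weights converge (step S8).
    update_step(params) -> new params; convergence is declared when the
    largest element-wise parameter change falls below tol."""
    for _ in range(max_iters):
        new = update_step(params)
        delta = max(np.max(np.abs(n - p)) for n, p in zip(new, params))
        params = new
        if delta < tol:
            break
    return params
```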
2. A reinforcement learning algorithm for multi-agent system cooperative communication and control as claimed in claim 1 wherein agent-1 can only receive messages sent by agent-N and send messages to agent-2 at the same time; the agent-N can only receive the message sent by the agent- (N-1) and send the message to the agent-1.
3. The reinforcement learning algorithm for cooperative communication and control of multi-agent system as claimed in claim 1, wherein the step S3 is specifically as follows:
including step S3-1 to step S3-6;
step S3-1, each agent-i observes the external environment to obtain the observation value o_{t,i};
step S3-2, agent-i calculates the message m_{t,i+1} = π_{θ_i}(o_{t,i}) using its communication strategy, then sends the message to agent-(i+1);
step S3-3, agent-i receives the message m_{t,i} and obtains the control action a_{t,i} = π_{μ_i}(o_{t,i}, m_{t,i}) through its control strategy, then executes the action;
step S3-4, agent-i observes the environment to obtain the next-step observation value o'_{t,i}, and obtains the scalar feedback reward value r_t given by the external environment;
step S3-5, collecting from steps S3-1 to S3-4 the set of observation values O_t = {o_{t,1}, ..., o_{t,N}}, the set of next-step observation values O'_t = {o'_{t,1}, ..., o'_{t,N}}, and the set of action values A_t = {a_{t,1}, ..., a_{t,N}} of all agents, combining the three sets with the scalar feedback reward value r_t into an experience tuple denoted as e_k = (O_t, A_t, r_t, O'_t), and saving it to the empirical data memory, wherein the subscript k denotes the sequence number of this tuple in the empirical data memory;
step S3-6, repeatedly executing steps S3-1 to S3-5 until sufficient interaction experience data are obtained.
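The interaction of steps S3-1 to S3-3 on the ring can be sketched as follows, with linear maps standing in for the policy networks purely for illustration; the function name `ring_step` and the weight names are assumptions:

```python
import numpy as np

def ring_step(observations, comm_weights, ctrl_weights):
    """One interaction step (S3-1 to S3-3) on a ring of N agents.
    comm_weights[i]: matrix mapping o_i to the outgoing message of agent-i.
    ctrl_weights[i]: matrix mapping [o_i, incoming message] to action a_i.
    Linear policies stand in for the claimed neural networks."""
    N = len(observations)
    # S3-2: every agent computes its outgoing message from its observation
    messages_out = [observations[i] @ comm_weights[i] for i in range(N)]
    # Ring topology: agent i receives what agent i-1 sent (agent-1 hears agent-N)
    messages_in = [messages_out[(i - 1) % N] for i in range(N)]
    # S3-3: control action from local observation and received message
    actions = [np.concatenate([observations[i], messages_in[i]]) @ ctrl_weights[i]
               for i in range(N)]
    return messages_in, actions
```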
4. The reinforcement learning algorithm for multi-agent system cooperative communication and control as claimed in claim 1, wherein the step S5 is specifically as follows:
including step S5-1 to step S5-3;
step S5-1, calculating the gradient of the global evaluator loss function L(w) according to the following formula:
∇_w L(w) = (1/K) · Σ_{j=1}^{K} (Q_w(O_j, A_j) - y_j) ∇_w Q_w(O_j, A_j)
wherein y_j = r_j + γ Q_{w'}(O'_j, A'_j), the calculation here using the target network of the global evaluator neural network; γ is the attenuation factor and ∇ is the gradient symbol. A'_j is obtained through the current control strategies, the next-step observations, and the messages, calculated by the following formula:
A'_j = {π_{μ_1}(o'_{j1}, m'_{j1}), ..., π_{μ_N}(o'_{jN}, m'_{jN})}
wherein the next-step observation values o'_{j1}, ..., o'_{jN} of each agent are obtained from the empirical data O'_j, and the next-step messages m'_{j1}, ..., m'_{jN} of each agent are calculated by substituting the next-step observation values into each agent's communication strategy, according to the following formula:
m'_{j,i+1} = π_{θ_i}(o'_{ji}) for i = 1, ..., N-1, and m'_{j1} = π_{θ_N}(o'_{jN})
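The computation of the TD target y_j in step S5-1, recomputing A'_j by routing the next-step messages around the ring, can be sketched as follows; the function name and the callable policy/critic signatures are assumptions:

```python
def td_target(r_j, gamma, critic_target, O_next, comm_policies, ctrl_policies):
    """TD target y_j = r_j + gamma * Q_{w'}(O'_j, A'_j) (step S5-1).
    A'_j is recomputed with the current policies on the next observations,
    with each message routed to the next agent on the ring."""
    N = len(O_next)
    msgs_out = [comm_policies[i](O_next[i]) for i in range(N)]
    msgs_in = [msgs_out[(i - 1) % N] for i in range(N)]   # agent-1 hears agent-N
    A_next = [ctrl_policies[i](O_next[i], msgs_in[i]) for i in range(N)]
    return r_j + gamma * critic_target(O_next, A_next)
```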
step S5-2, updating the weight parameters of the global evaluator neural network using the gradient of the global evaluator loss function, the specific updating method being as follows:
w ← w - α ∇_w L(w)
wherein the arrow ← is the update assignment symbol, and α is the update rate of the network weight parameter;
step S5-3, updating the weight parameter w' of the target network Q_{w'}(O_t, A_t) of the global evaluator neural network, specifically according to the following formula:
w′←ηw+(1-η)w′
wherein η is the update rate of the network weight parameter.
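The soft target-network update of step S5-3 is a Polyak average over the weight tensors; a minimal sketch over a list of weight arrays (the function name is an assumption):

```python
def soft_update(w_target, w_online, eta):
    """Target update w' <- eta * w + (1 - eta) * w' (step S5-3),
    applied element-wise over corresponding weight tensors."""
    return [eta * w + (1 - eta) * wt for w, wt in zip(w_online, w_target)]
```

A small η makes the target network track the online evaluator slowly, which stabilizes the TD target y_j.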
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011278974.8A CN112434792A (en) | 2020-11-16 | 2020-11-16 | Reinforced learning algorithm for cooperative communication and control of multi-agent system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112434792A true CN112434792A (en) | 2021-03-02 |
Family
ID=74701138
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112966816A (en) * | 2021-03-31 | 2021-06-15 | 东南大学 | Multi-agent reinforcement learning method surrounded by formation |
CN115086374A (en) * | 2022-06-14 | 2022-09-20 | 河南职业技术学院 | Scene complexity self-adaptive multi-agent layered cooperation method |
Non-Patent Citations (1)
Title |
---|
YUANDA WANG等: "Cooperative control for multi-player pursuit-evasion games with reinforcement learning", 《NEUROCOMPUTING》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210302 |