CN112434792A - Reinforcement learning algorithm for cooperative communication and control of multi-agent system - Google Patents

Reinforcement learning algorithm for cooperative communication and control of multi-agent system

Info

Publication number
CN112434792A
CN112434792A (application CN202011278974.8A)
Authority
CN
China
Prior art keywords
agent
communication
control
message
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011278974.8A
Other languages
Chinese (zh)
Inventor
Wang Yuanda (王远大)
Sun Changyin (孙长银)
Sun Jia (孙佳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202011278974.8A priority Critical patent/CN112434792A/en
Publication of CN112434792A publication Critical patent/CN112434792A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 — Machine learning
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04L — TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 — Network arrangements or protocols for supporting network services or applications
    • H04L67/01 — Protocols
    • H04L67/10 — Protocols in which an application is distributed across nodes in the network

Abstract

This patent discloses a reinforcement learning method for communication and control in multi-agent systems. The method provides a reinforcement learning algorithm for a multi-agent system that shares information by sending and receiving messages over a communication network with a given topology. Through training, the system constructs a communication strategy and a control strategy on each agent, the agents extract effective low-dimensional communication information from the high-dimensional raw input of their sensing devices, and the whole multi-agent system achieves efficient information sharing and cooperative control. The method reduces the design complexity of communication and control strategies for multi-agent systems with complex dynamics and high-dimensional observations, while also reducing the communication load between agents.

Description

Reinforcement learning algorithm for cooperative communication and control of multi-agent system
Technical Field
The invention belongs to the technical field of artificial intelligence and machine learning, and relates to a reinforcement learning algorithm for cooperative communication and control of a multi-agent system.
Background
A multi-agent system is made up of multiple interacting agents, each with certain sensing, computing, and actuation capabilities and each able to communicate with other agents over a communication network. The objective of multi-agent cooperative communication and control is to design reasonable communication and control strategies so that the agents cooperate with one another and complete tasks that a single agent could not accomplish alone. In practical applications the agents can be embodied as different entities, such as aircraft, mobile robots, traffic lights, or power-network nodes, and once given reasonable communication and control strategies they play an important role in aircraft formation flight, cooperative transport by mobile robots, intelligent control of urban traffic networks, smart-grid control, and similar applications. Existing multi-agent control algorithms establish a differential-equation model of an agent by analyzing its kinematics, design the corresponding communication protocol and controller based on control theory, and finally solve problems such as consensus, formation, and optimization. However, when an agent's kinematics are complex and its sensing data is high-dimensional, it is difficult to describe the agent with differential equations, so the agent cannot be controlled with existing control-theoretic methods.
Disclosure of Invention
In view of the above problems, the present invention provides a reinforcement learning algorithm for cooperative communication and control of a multi-agent system. For a multi-agent system that shares information by sending and receiving messages over a communication network with a given topology, the algorithm enables the system to construct, through training, a communication strategy and a control strategy on each agent, so that the whole multi-agent system achieves efficient information sharing and finally completes the cooperative control task.
The multi-agent system comprises N agents. An arbitrary agent is denoted agent-i, where i is the unique sequence number of the agent within the multi-agent system. The agent observes the external environment; the observation is denoted o_i^t, a real-valued vector, where the superscript t and the subscript i indicate that agent-i observes the external environment at time t. The agent executes control actions that affect its own state and the external environment; a control action is denoted a_i^t, a real-valued vector, where the superscript t and the subscript i indicate the control action executed by agent-i at time t.
The multi-agent system communicates over a specific communication network: each agent can send a message to one determined agent, and each agent can also receive a message sent by one specific agent. The sending and receiving of messages between agents obey a specific communication topology, namely a directed ring: agent-i can only receive the message sent by agent-(i-1), and simultaneously sends a message to agent-(i+1);
the message received by agent-i is denoted m_i^t, where the superscript t and the subscript i indicate the message received by agent-i at time t; this message is also the message sent by agent-(i-1) at time t;
the message sent by agent-i is denoted m_{i+1}^t, where the superscript t and the subscript i+1 indicate the message sent by agent-i at time t; this message is also the message received by agent-(i+1) at time t;
the communication strategy constructed on agent-i is represented by a neural network, specifically written m_{i+1}^t = f_{θ_i}(o_i^t), where θ_i denotes all weight parameters of the neural network. The input of the communication strategy is the observation o_i^t of agent-i, and its output is the transmitted message m_{i+1}^t. For agent-N in particular, m_1^t = f_{θ_N}(o_N^t);
the control strategy constructed on agent-i is represented by a neural network, specifically written a_i^t = π_{μ_i}(o_i^t, m_i^t), where μ_i denotes all weight parameters of the neural network. The inputs of the control strategy are the observation o_i^t of agent-i and the received message m_i^t, and its output is the control action a_i^t.
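For concreteness, the two per-agent strategies described above can be sketched as small neural networks. The following is a minimal sketch in PyTorch; the class names, layer sizes, and activation functions are illustrative assumptions, since the patent does not fix a particular network architecture.

```python
# Minimal sketch of the per-agent strategy networks; dimensions are placeholders.
import torch
import torch.nn as nn

class CommunicationPolicy(nn.Module):
    """f_theta_i: maps agent-i's observation o_i^t to the message m_{i+1}^t."""
    def __init__(self, obs_dim: int, msg_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, msg_dim), nn.Tanh(),   # low-dimensional message
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class ControlPolicy(nn.Module):
    """pi_mu_i: maps (o_i^t, m_i^t) to the control action a_i^t."""
    def __init__(self, obs_dim: int, msg_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, msg: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, msg], dim=-1))
```

On each agent, the message produced by the communication network is what gets transmitted to the next agent in the ring, while the control network consumes the message received from the previous one.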
The reinforcement learning algorithm for the communication and control of the multi-agent system comprises the following algorithm steps:
step S1, constructing a global evaluator, wherein the global evaluator is represented by a neural network, specifically denoted Q_w(O^t, A^t), where w is all weight parameters of the neural network, O^t = {o_1^t, o_2^t, ..., o_N^t} represents the set of observations of the external environment by all agents at time t, and A^t = {a_1^t, a_2^t, ..., a_N^t} represents the set of control actions of all agents at time t;
step S2, initializing the global evaluator neural network, all agent communication strategy neural networks, and all agent control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t), whose weight parameters are w'; the target network structure is identical to the global evaluator neural network, and its initial weight parameters are also identical;
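A minimal sketch of steps S1 and S2 follows, again in PyTorch. Treating Q_w as a single network over the concatenated observations and actions is an assumption; the patent only states that the evaluator takes the two sets as inputs, and the dimensions used here are placeholders.

```python
# Sketch of the global evaluator Q_w(O^t, A^t) and its target network.
import copy
import torch
import torch.nn as nn

class GlobalEvaluator(nn.Module):
    def __init__(self, total_obs_dim: int, total_act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs: torch.Tensor, all_act: torch.Tensor) -> torch.Tensor:
        # all_obs / all_act: concatenation of every agent's observation / action.
        return self.net(torch.cat([all_obs, all_act], dim=-1))

# Illustrative dimensions: e.g. 3 agents with 16-dim observations and 2-dim actions.
critic = GlobalEvaluator(total_obs_dim=3 * 16, total_act_dim=3 * 2)
target_critic = copy.deepcopy(critic)   # same structure and initial weights (step S2)
```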
step S3, each agent interacts with the environment over a number of time steps while sending and receiving messages, and the interaction experience data is stored;
step S4, randomly extracting K sets of experience data from the experience data memory: [e_1, e_2, ..., e_K]; any extracted set of experience data is denoted e_j = (O_j, A_j, r_j, O'_j), where the subscript j denotes the sequence number within the extracted K sets of data, O_j can be expanded as O_j = {o_j1, o_j2, ..., o_jN}, and likewise A_j = {a_j1, a_j2, ..., a_jN} and O'_j = {o'_j1, o'_j2, ..., o'_jN};
step S5, training the global evaluator Q_w(O^t, A^t) using the extracted experience data;
step S6, training the communication strategy of each agent using the extracted experience data. The weight parameters θ_i of the communication strategy network of agent-i are updated by the following formula:
θ_i ← θ_i + α ∇_{θ_i} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, ..., a_ρ, ..., a_jN})
where α is the update rate of the network weight parameters and the subscript ρ denotes the number of the agent that receives the message sent by agent-i; specifically, ρ = i+1 for agent-1 through agent-(N-1), and ρ = 1 for agent-N. The message m_ρ in the formula is obtained by substituting the corresponding observation into the communication strategy of agent-i:
m_ρ = f_{θ_i}(o_ji)
In the formula, {a_j1, ..., a_ρ, ..., a_jN} is the set of control actions; all actions other than a_ρ are taken from the experience data A_j, and a_ρ is computed by the following formula:
a_ρ = π_{μ_ρ}(o_jρ, m_ρ)
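One way to realize the step-S6 update is to recompute the message m_ρ and the receiving agent's action a_ρ through the current networks, so that the gradient of the global evaluator flows back into θ_i. The sketch below assumes the network classes from the earlier sketches and an already-sampled minibatch; the function and variable names are illustrative.

```python
# Sketch of the communication-strategy update of step S6 for agent-i.
import torch

def update_communication_policy(critic, comm_i, control_rho, comm_opt,
                                obs_batch, act_batch, i, rho):
    """obs_batch / act_batch: lists of per-agent tensors of shape [K, dim]."""
    # Recompute the message of agent-i and the action of the receiving agent rho
    # with the current networks so gradients reach theta_i.
    m_rho = comm_i(obs_batch[i])                   # m_rho = f_theta_i(o_ji)
    a_rho = control_rho(obs_batch[rho], m_rho)     # a_rho = pi_mu_rho(o_jrho, m_rho)
    actions = [a.detach() for a in act_batch]      # other actions come from A_j
    actions[rho] = a_rho
    q = critic(torch.cat(obs_batch, dim=-1), torch.cat(actions, dim=-1))
    loss = -q.mean()                               # ascend the evaluator value
    comm_opt.zero_grad()
    loss.backward()
    comm_opt.step()                                # only theta_i is stepped here
```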
step S7, training the control strategy of each agent using the extracted experience data. The weight parameters μ_i of the control strategy network of agent-i are updated by the following formula:
μ_i ← μ_i + α ∇_{μ_i} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, ..., a_i, ..., a_jN})
where the message m_i is obtained by substituting the corresponding observation into the communication strategy of agent-(i-1):
m_i = f_{θ_{i-1}}(o_{j(i-1)})
In the formula, {a_j1, ..., a_i, ..., a_jN} is the set of control actions; all actions other than a_i are taken from the experience data A_j, and a_i is computed by the following formula:
a_i = π_{μ_i}(o_ji, m_i)
The other symbols in the formula have the same meaning as in the formula described in step S6;
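The step-S7 update can be sketched in the same way; here the message m_i comes from agent-(i-1)'s communication strategy and only the control parameters μ_i receive the update. As before, the helper names and tensor layout are assumptions.

```python
# Sketch of the control-strategy update of step S7 for agent-i.
import torch

def update_control_policy(critic, comm_prev, control_i, control_opt,
                          obs_batch, act_batch, i):
    # Python's negative indexing makes i - 1 wrap around the ring for i == 0.
    m_i = comm_prev(obs_batch[i - 1]).detach()     # m_i = f_theta_{i-1}(o_j(i-1))
    a_i = control_i(obs_batch[i], m_i)             # a_i = pi_mu_i(o_ji, m_i)
    actions = [a.detach() for a in act_batch]      # other actions come from A_j
    actions[i] = a_i
    q = critic(torch.cat(obs_batch, dim=-1), torch.cat(actions, dim=-1))
    loss = -q.mean()
    control_opt.zero_grad()
    loss.backward()
    control_opt.step()                             # only mu_i is stepped here
```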
step S8, repeatedly executing steps S3 to S7 with the new communication strategies and control strategies, iteratively updating the communication strategy and control strategy neural networks of each agent until the weight parameters of all agents' communication strategy and control strategy neural networks converge; the algorithm then ends, and the cooperative communication strategies and control strategies of the multi-agent system are obtained.
As a further improvement of the invention, agent-1 can only receive the message sent by agent-N and simultaneously sends its message to agent-2; agent-N can only receive the message sent by agent-(N-1) and sends its message to agent-1.
As a further improvement of the present invention, step S3 specifically comprises steps S3-1 to S3-6:
step S3-1, agent-i observes the environment and obtains the observation o_i^t;
step S3-2, agent-i uses its communication strategy f_{θ_i} to compute the message m_{i+1}^t = f_{θ_i}(o_i^t), and then sends the message to agent-(i+1);
step S3-3, agent-i receives the message m_i^t, obtains the control action a_i^t = π_{μ_i}(o_i^t, m_i^t) through its control strategy, and then executes the action;
step S3-4, agent-i observes the environment to obtain the next-step observation o'_i^t, and obtains the scalar feedback reward value r^t given by the external environment;
step S3-5, saving the set of observations of all agents from steps S3-1 to S3-4, O^t = {o_1^t, ..., o_N^t}, the set of next-step observations O'^t = {o'_1^t, ..., o'_N^t}, the set of action values A^t = {a_1^t, ..., a_N^t}, and the feedback reward value r^t; the three sets and the scalar are combined into one experience tuple, denoted e_k = (O^t, A^t, r^t, O'^t)_k, and stored in the experience data memory, where the subscript k denotes the sequence number of this set of data in the experience data memory;
step S3-6, repeatedly executing steps S3-1 to S3-5 until sufficient interaction experience data is obtained.
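A compact sketch of this interaction loop (steps S3-1 to S3-6) and of the random extraction of step S4 is given below; the environment interface, message transport, and buffer size are placeholders, not details fixed by the patent.

```python
# Sketch of experience collection over the ring topology and minibatch sampling.
import random
from collections import deque

import torch

buffer = deque(maxlen=100_000)          # experience data memory

def collect(env, comm_policies, control_policies, n_agents, steps):
    obs = env.reset()                                   # list of o_i^t tensors
    for _ in range(steps):
        with torch.no_grad():
            msgs = [comm_policies[i](obs[i]) for i in range(n_agents)]        # S3-2
            # agent-i receives the message sent by agent-(i-1): a ring shift
            recv = [msgs[i - 1] for i in range(n_agents)]
            acts = [control_policies[i](obs[i], recv[i])
                    for i in range(n_agents)]                                 # S3-3
        next_obs, reward = env.step(acts)               # S3-4
        buffer.append((obs, acts, reward, next_obs))    # e_k = (O^t, A^t, r^t, O'^t)
        obs = next_obs                                  # S3-5 / S3-6

def sample(k):
    return random.sample(list(buffer), k)               # step S4
```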
As a further improvement of the invention, step S5 specifically comprises steps S5-1 to S5-3:
step S5-1, computing the gradient of the global evaluator loss function L(w) according to the following formula:
∇_w L(w) = ∇_w (1/K) Σ_{j=1}^{K} (y_j − Q_w(O_j, A_j))²
where y_j = r_j + γ Q_{w'}(O'_j, A'_j), the target network of the global evaluator neural network being used for this computation; γ is the attenuation (discount) factor and ∇ is the gradient operator. A'_j is obtained from the current control strategies, the next-step observations, and the messages, and is computed by the following formula:
A'_j = {π_{μ_1}(o'_j1, m'_j1), π_{μ_2}(o'_j2, m'_j2), ..., π_{μ_N}(o'_jN, m'_jN)}
In this formula, the next-step observations o'_j1, ..., o'_jN of the agents are taken from the experience data O'_j, and the next-step messages m'_j1, ..., m'_jN of the agents are computed by substituting the next-step observations into each agent's communication strategy:
m'_ji = f_{θ_{i-1}}(o'_j(i-1)), with m'_j1 = f_{θ_N}(o'_jN)
step S5-2, updating the weight parameters of the global evaluator neural network using the gradient of the global evaluator loss function, specifically:
w ← w − α ∇_w L(w)
wherein arrow ← is the update assignment symbol, and α is the update rate of the network weight parameter;
step S5-3, updating the weight parameters w' of the target network Q_{w'}(O^t, A^t) of the global evaluator neural network, specifically according to the following formula:
w′←ηw+(1-η)w′
wherein η is the update rate of the network weight parameter.
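Steps S5-1 to S5-3 amount to one temporal-difference update of the global evaluator followed by a soft update of its target network. The sketch below assumes the batch has already been stacked into per-agent tensors and that the next-step actions A'_j are rebuilt from the current strategies as described; the values of γ, α, and η are illustrative.

```python
# Sketch of the global-evaluator training step (steps S5-1 to S5-3).
import torch
import torch.nn.functional as F

def train_critic(critic, target_critic, critic_opt,
                 comm_policies, control_policies,
                 O, A, r, O_next, gamma=0.99, eta=0.01):
    """O, A, O_next: lists of per-agent tensors [K, dim]; r: tensor [K, 1]."""
    n = len(comm_policies)
    with torch.no_grad():
        msgs_next = [comm_policies[i](O_next[i]) for i in range(n)]
        # agent-i receives the message produced by agent-(i-1)
        A_next = [control_policies[i](O_next[i], msgs_next[i - 1]) for i in range(n)]
        y = r + gamma * target_critic(torch.cat(O_next, dim=-1),
                                      torch.cat(A_next, dim=-1))   # y_j
    q = critic(torch.cat(O, dim=-1), torch.cat(A, dim=-1))
    loss = F.mse_loss(q, y)                 # L(w), step S5-1
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()                       # step S5-2
    # step S5-3: soft update of the target network, w' <- eta*w + (1-eta)*w'
    for w_t, w in zip(target_critic.parameters(), critic.parameters()):
        w_t.data.mul_(1.0 - eta).add_(eta * w.data)
```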
Advantageous effects:
1) Each agent can extract effective communication information from the high-dimensional raw input of its sensing devices, so that the multi-agent system achieves efficient information sharing while the communication load is reduced;
2) The communication strategies and control strategies of the agents are constructed with a reinforcement learning algorithm, so no analysis or mathematical modeling of the agents' dynamic characteristics is required, which reduces the design complexity of communication and control strategies for multi-agent systems with complex dynamics and high-dimensional observations.
Drawings
FIG. 1 is a diagram of the composition and information-processing flow of a mobile robot according to the present invention;
fig. 2 is a diagram of a multi-mobile-robot communication network structure of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the invention provides a reinforcement learning algorithm for cooperative communication and control of a multi-agent system, which aims at the multi-agent system for information sharing by sending and receiving messages through a communication network with a certain topological structure, and provides the reinforcement learning algorithm, so that the multi-agent system can construct a communication strategy and a control strategy on each agent through training, the whole multi-agent system can realize efficient information sharing, and finally, the task of cooperative control is completed.
The following describes an embodiment of the method disclosed by the invention, applied to a multi-agent system consisting of three mobile robots performing cooperative control.
Fig. 1 shows a mobile robot composition and an information processing flow.
101 is a sensing device of the robot for observing external environmental obstacles, self-position information, and the like and converting them into measurement data. The sensing equipment can specifically comprise common mobile robot sensing equipment such as a camera, an ultrasonic distance meter, a laser distance meter, a speedometer, a global positioning system receiver, an ultra-wideband positioning receiver and the like.
102 is a communication device (receive) for receiving a communication message. 106 is a communication device (transmit) for transmitting communication messages. The communication equipment can be common communication equipment on the mobile robot and comprises a Bluetooth communication module, a ZigBee communication module, a WiFi communication module and the like.
107 is the driving equipment, used to drive the motion of the mobile robot according to the input control quantity; it can be common mobile-robot driving equipment such as motors, wheels, tracks, and steering gears.
Reference numeral 103 denotes an information processing computer mounted on the mobile robot, which processes information input by the sensor device and the communication device (reception) and transmits the processed information to the driver device and the communication device (transmission). The information processing computer includes two functional modules: 104 is a communication policy module and 105 is a control policy module.
The communication policy module processes information as follows: the measurement data obtained by the sensing device is fed into the communication strategy neural network, the resulting message is passed to the communication device (transmit), and the message is then sent to another mobile robot.
The control policy module processes information as follows: the measurement data obtained by the sensing device and the message received by the communication device (receive) are fed into the control strategy neural network, and the resulting control quantity is passed to the driving device to control the motion of the mobile robot.
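Putting the two modules together, one processing cycle on the information processing computer 103 might look like the following sketch; the sensor, message-transport, and drive functions stand in for the robot's actual devices 101, 102, 106, and 107 and are assumptions of this illustration.

```python
# Sketch of one on-robot cycle: communication policy module (104) then control
# policy module (105); the device callbacks are placeholders.
import torch

def robot_cycle(comm_policy, control_policy, read_sensors, recv_message,
                send_message, drive):
    obs = torch.as_tensor(read_sensors(), dtype=torch.float32)      # device 101
    with torch.no_grad():
        msg_out = comm_policy(obs)              # module 104
        send_message(msg_out.numpy())           # device 106, to the next robot
        msg_in = torch.as_tensor(recv_message(), dtype=torch.float32)  # device 102
        action = control_policy(obs, msg_in)    # module 105
    drive(action.numpy())                       # device 107
```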
Fig. 2 is a diagram of the multi-mobile-robot communication network architecture. The network is divided into two parts: a communication network between the mobile robots (indicated by solid arrows in fig. 2), and a communication network between the mobile robots and the central training computer (indicated by dashed arrows in fig. 2).
The communication network between the mobile robots works as follows: the first mobile robot receives the message sent by the third mobile robot and sends its message to the second mobile robot; the second mobile robot receives the message sent by the first mobile robot and sends its message to the third mobile robot; the third mobile robot receives the message sent by the second mobile robot and sends its message to the first mobile robot.
The communication network between the mobile robots and the central training computer works as follows: each mobile robot transmits its measurement data, control actions, and the weight parameters of its control strategy and communication strategy neural networks to the central training computer, and the central training computer sends the corresponding updated weight parameters of the control strategy and communication strategy neural networks back to each mobile robot.
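A simple way to realize this exchange, assuming the strategies are PyTorch modules, is to serialize their weight dictionaries for transmission; the transport itself (WiFi, ZigBee, etc.) is outside this sketch.

```python
# Sketch of packing and unpacking network weights for the robot <-> central
# training computer exchange.
import io
import torch

def pack_weights(comm_policy, control_policy) -> bytes:
    buf = io.BytesIO()
    torch.save({"comm": comm_policy.state_dict(),
                "ctrl": control_policy.state_dict()}, buf)
    return buf.getvalue()

def unpack_weights(payload: bytes, comm_policy, control_policy) -> None:
    state = torch.load(io.BytesIO(payload))
    comm_policy.load_state_dict(state["comm"])
    control_policy.load_state_dict(state["ctrl"])
```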
The specific execution mode of the reinforcement learning algorithm for cooperative communication and control of the multi-agent system on the cooperative control task of the three mobile robots is given by the following steps:
Step S1, constructing a global evaluator on the central training computer. The global evaluator is represented by a neural network, denoted Q_w(O^t, A^t), where w is all weight parameters of the neural network, O^t = {o_1^t, o_2^t, o_3^t} represents the set of observations of the three mobile robots at time t, and A^t = {a_1^t, a_2^t, a_3^t} represents the set of control actions of the three mobile robots at time t. A communication strategy neural network f_{θ_i} and a control strategy neural network π_{μ_i} (i = 1, 2, 3) are constructed on the information processing computer of each mobile robot.
Step S2, initializing the global evaluator neural network and the three mobile robots' communication strategy and control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t), whose weight parameters are w'; the target network structure is identical to the global evaluator neural network, and its initial weight parameters are also identical.
In step S3, the three mobile robots observe the environment, communicate with each other, and communicate with the central training computer. The concrete steps include steps S3-1 to S3-7.
Step S3-1, the three mobile robots each observe the environment using their sensing devices; the first mobile robot obtains the observation o_1^t, the second mobile robot obtains the observation o_2^t, and the third mobile robot obtains the observation o_3^t.
Step S3-2, the first mobile robot computes the message m_2^t = f_{θ_1}(o_1^t) with its communication policy module and then sends the message to the second mobile robot using its communication device; the second mobile robot computes the message m_3^t = f_{θ_2}(o_2^t) and sends it to the third mobile robot; the third mobile robot computes the message m_1^t = f_{θ_3}(o_3^t) and sends it to the first mobile robot. The computation and transmission by the three mobile robots in this step may be performed simultaneously.
Step S3-3, the first mobile robot receives the message m_1^t with its communication device and obtains the control action a_1^t = π_{μ_1}(o_1^t, m_1^t) through its control policy module; the second mobile robot receives the message m_2^t and obtains the control action a_2^t = π_{μ_2}(o_2^t, m_2^t); the third mobile robot receives the message m_3^t and obtains the control action a_3^t = π_{μ_3}(o_3^t, m_3^t). The three mobile robots then input the control actions a_1^t, a_2^t, a_3^t to their driving devices, which drive the motion of the mobile robots; in this step, receiving the message, computing the control action, and inputting it to the driving device may be performed simultaneously by the three mobile robots.
Step S3-4, the three mobile robots each observe the environment with their sensing devices to obtain the next-step observations o'_1^t, o'_2^t, o'_3^t, while obtaining the scalar feedback reward value r^t given by the external environment.
Step S3-5, the three mobile robots transmit the environment observations, control action values, and feedback reward value obtained in steps S3-1 to S3-4 to the central training computer through their communication devices.
Step S3-6, the central training computer receives the set of environment observations O^t = {o_1^t, o_2^t, o_3^t} of the three mobile robots, the set of next-step observations O'^t = {o'_1^t, o'_2^t, o'_3^t}, the set of control actions A^t = {a_1^t, a_2^t, a_3^t}, and the feedback reward value r^t. The received data is merged into an experience tuple denoted e_k = (O^t, A^t, r^t, O'^t)_k, which is saved to the experience data memory of the central training computer.
Step S3-7, repeating steps S3-1 to S3-6 until the central training computer has saved sufficient interaction experience data (more than 10,000 experience tuples).
Step S4, randomly extracting K sets of experience data from the experience data memory on the central training computer: [e_1, e_2, ..., e_K]; any extracted set of experience data is denoted e_j = (O_j, A_j, r_j, O'_j), where the subscript j denotes the sequence number within the extracted K sets of data, O_j can be expanded as O_j = {o_j1, o_j2, o_j3}, and likewise A_j = {a_j1, a_j2, a_j3} and O'_j = {o'_j1, o'_j2, o'_j3}.
Step S5, training the global evaluator Q_w(O^t, A^t) on the central training computer using the extracted experience data, specifically comprising steps S5-1 to S5-3.
Step S5-1, computing the gradient of the global evaluator loss function L(w) according to the following formula:
∇_w L(w) = ∇_w (1/K) Σ_{j=1}^{K} (y_j − Q_w(O_j, A_j))²
where y_j = r_j + γ Q_{w'}(O'_j, A'_j), the target network of the global evaluator neural network being used for this computation; γ is the attenuation (discount) factor and ∇ is the gradient operator. A'_j is obtained from the current control strategies, the next-step observations, and the messages, and is computed by the following formula:
A'_j = {π_{μ_1}(o'_j1, m'_j1), π_{μ_2}(o'_j2, m'_j2), π_{μ_3}(o'_j3, m'_j3)}
In this formula, the next-step observations o'_j1, o'_j2, o'_j3 of the three mobile robots are taken from the experience data O'_j, and the next-step messages m'_j1, m'_j2, m'_j3 are computed by substituting the next-step observations into each mobile robot's communication strategy:
m'_j1 = f_{θ_3}(o'_j3), m'_j2 = f_{θ_1}(o'_j1), m'_j3 = f_{θ_2}(o'_j2)
step S5-2, updating weight parameters of the global evaluator neural network by using the gradient of the global evaluator loss function, wherein the specific updating method is as follows:
Figure BDA0002780097540000079
wherein arrow ← is the update assignment symbol, and α is the update rate of the network weight parameter.
Step S5-3, updating the weight parameters w' of the target network Q_{w'}(O^t, A^t) of the global evaluator neural network, specifically according to the following formula:
w′←ηw+(1-η)w′
wherein η is the update rate of the network weight parameter.
Step S6, computing the weight parameters of each mobile robot's new communication strategy neural network using the extracted experience data. For the first mobile robot, the new communication strategy weight parameters θ'_1 are computed by the following formula:
θ'_1 = θ_1 + α ∇_{θ_1} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, a_2, a_j3})
The message m_2 in the formula is obtained by substituting the corresponding observation into the communication strategy of the first mobile robot:
m_2 = f_{θ_1}(o_j1)
In the formula, {a_j1, a_2, a_j3} is the set of control actions; the actions other than a_2 are taken from the experience data A_j, and a_2 is computed by the following formula:
a_2 = π_{μ_2}(o_j2, m_2)
for the second and third mobile robots, a similar method can be used to obtain weight parameter θ 'of the new communication policy neural network'2And θ'3
Step S7, computing the weight parameters of each mobile robot's new control strategy neural network using the extracted experience data. For the first mobile robot, the new control strategy neural network weight parameters μ'_1 are computed by the following formula:
μ'_1 = μ_1 + α ∇_{μ_1} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_1, a_j2, a_j3})
where the message m_1 is obtained by substituting the corresponding observation into the communication strategy of the third mobile robot:
m_1 = f_{θ_3}(o_j3)
In the formula, {a_1, a_j2, a_j3} is the set of control actions; the actions other than a_1 are taken from the experience data A_j, and a_1 is computed by the following formula:
a_1 = π_{μ_1}(o_j1, m_1)
The other symbols in the formula have the same meaning as in the formula described in step S6. For the second and third mobile robots, a similar method can be used to obtain the new control strategy neural network weight parameters μ'_2 and μ'_3.
Step S8, the central training computer sends the newly computed communication strategy and control strategy neural network parameters θ'_1 and μ'_1 of the first mobile robot, through the communication devices, to the information processing computer of the first mobile robot, which updates its previous communication strategy and control strategy neural networks. The communication strategy and control strategy neural networks of the second and third mobile robots are transmitted and updated in the same manner.
Step S9, repeatedly executing steps S3 to S8 with the new communication strategies and control strategies, iteratively updating the communication strategy and control strategy neural networks of the three mobile robots until the weight parameters of all mobile robots' communication strategy and control strategy neural networks converge; the cooperative communication and control strategies for the three mobile robots are thus obtained.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.

Claims (4)

1. A reinforcement learning algorithm for cooperative communication and control of a multi-agent system, characterized in that the multi-agent system comprises N agents, wherein any agent is denoted agent-i, i being the unique sequence number of the agent within the multi-agent system; the agent observes the external environment, the observation being denoted o_i^t, a real-valued vector, where the superscript t and the subscript i indicate that agent-i observes the external environment at time t; the agent executes control actions that affect its own state and the external environment, a control action being denoted a_i^t, a real-valued vector, where the superscript t and the subscript i indicate the control action executed by agent-i at time t; the multi-agent system communicates using a specific communication network, each agent being capable of sending a message to another determined agent and of receiving a message sent by a specific agent; the sending and receiving of messages between agents obey a specific communication topology, the specific communication topology being a directed ring, in which agent-i can only receive the message sent by agent-(i-1) and simultaneously sends a message to agent-(i+1);
the message received by agent-i is denoted m_i^t, where the superscript t and the subscript i indicate the message received by agent-i at time t, this message also being the message sent by agent-(i-1) at time t;
the message sent by agent-i is denoted m_{i+1}^t, where the superscript t and the subscript i+1 indicate the message sent by agent-i at time t, this message also being the message received by agent-(i+1) at time t;
the communication strategy constructed on agent-i is represented by a neural network, specifically written m_{i+1}^t = f_{θ_i}(o_i^t), where θ_i denotes all weight parameters of the neural network, the input of the communication strategy being the observation o_i^t of agent-i and its output being the transmitted message m_{i+1}^t; for agent-N in particular, m_1^t = f_{θ_N}(o_N^t);
the control strategy constructed on agent-i is represented by a neural network, specifically written a_i^t = π_{μ_i}(o_i^t, m_i^t), where μ_i denotes all weight parameters of the neural network, the inputs of the control strategy being the observation o_i^t of agent-i and the received message m_i^t, and its output being the control action a_i^t;
the reinforcement learning algorithm for the communication and control of the multi-agent system comprises the following algorithm steps:
step S1, constructing a global evaluator, the global evaluator being represented by a neural network, specifically denoted Q_w(O^t, A^t), where w is all weight parameters of the neural network, O^t = {o_1^t, o_2^t, ..., o_N^t} represents the set of observations of the external environment by all agents at time t, and A^t = {a_1^t, a_2^t, ..., a_N^t} represents the set of control actions of all agents at time t;
step S2, initializing the global evaluator neural network, all agent communication strategy neural networks, and all agent control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t), whose weight parameters are w', the target network structure being identical to the global evaluator neural network and its initial weight parameters also being identical;
step S3, each agent interacting with the environment over a number of time steps while sending and receiving messages, and storing the interaction experience data;
step S4, randomly extracting K sets of experience data from the experience data memory: [e_1, e_2, ..., e_K], any extracted set of experience data being denoted e_j = (O_j, A_j, r_j, O'_j), where the subscript j denotes the sequence number within the extracted K sets of data, O_j can be expanded as O_j = {o_j1, o_j2, ..., o_jN}, and likewise A_j = {a_j1, a_j2, ..., a_jN} and O'_j = {o'_j1, o'_j2, ..., o'_jN};
step S5, training the global evaluator Q_w(O^t, A^t) using the extracted experience data;
step S6, training the communication strategy of each agent using the extracted experience data, the weight parameters θ_i of the communication strategy network of agent-i being updated by the following formula:
θ_i ← θ_i + α ∇_{θ_i} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, ..., a_ρ, ..., a_jN})
where α is the update rate of the network weight parameters and the subscript ρ denotes the number of the agent that receives the message sent by agent-i, specifically ρ = i+1 for agent-1 through agent-(N-1) and ρ = 1 for agent-N; the message m_ρ in the formula is obtained by substituting the corresponding observation into the communication strategy of agent-i:
m_ρ = f_{θ_i}(o_ji)
in the formula, {a_j1, ..., a_ρ, ..., a_jN} is the set of control actions, the actions other than a_ρ being taken from the experience data A_j and a_ρ being computed by the following formula:
a_ρ = π_{μ_ρ}(o_jρ, m_ρ)
step S7, training the control strategy of each agent using the extracted experience data, the weight parameters μ_i of the control strategy network of agent-i being updated by the following formula:
μ_i ← μ_i + α ∇_{μ_i} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, ..., a_i, ..., a_jN})
where the message m_i is obtained by substituting the corresponding observation into the communication strategy of agent-(i-1):
m_i = f_{θ_{i-1}}(o_{j(i-1)})
in the formula, {a_j1, ..., a_i, ..., a_jN} is the set of control actions, the actions other than a_i being taken from the experience data A_j and a_i being computed by the following formula:
a_i = π_{μ_i}(o_ji, m_i)
the other symbols in the formula having the same meaning as in the formula described in step S6;
step S8, repeatedly executing steps S3 to S7 with the new communication strategies and control strategies, iteratively updating the communication strategy and control strategy neural networks of each agent until the weight parameters of all agents' communication strategy and control strategy neural networks converge, whereupon the algorithm ends and the cooperative communication strategies and control strategies of the multi-agent system are obtained.
2. The reinforcement learning algorithm for cooperative communication and control of a multi-agent system according to claim 1, characterized in that agent-1 can only receive the message sent by agent-N and simultaneously sends its message to agent-2; and agent-N can only receive the message sent by agent-(N-1) and sends its message to agent-1.
3. The reinforcement learning algorithm for cooperative communication and control of a multi-agent system according to claim 1, characterized in that step S3 specifically comprises steps S3-1 to S3-6:
step S3-1, agent-i observes the environment and obtains the observation o_i^t;
step S3-2, agent-i uses its communication strategy f_{θ_i} to compute the message m_{i+1}^t = f_{θ_i}(o_i^t), and then sends the message to agent-(i+1);
step S3-3, agent-i receives the message m_i^t, obtains the control action a_i^t = π_{μ_i}(o_i^t, m_i^t) through its control strategy, and then executes the action;
step S3-4, agent-i observes the environment to obtain the next-step observation o'_i^t, and obtains the scalar feedback reward value r^t given by the external environment;
step S3-5, saving the set of observations of all agents from steps S3-1 to S3-4, O^t = {o_1^t, ..., o_N^t}, the set of next-step observations O'^t = {o'_1^t, ..., o'_N^t}, the set of action values A^t = {a_1^t, ..., a_N^t}, and the feedback reward value r^t; the three sets and the scalar are combined into one experience tuple, denoted e_k = (O^t, A^t, r^t, O'^t)_k, and stored in the experience data memory, where the subscript k denotes the sequence number of this set of data in the experience data memory;
step S3-6, repeatedly executing steps S3-1 to S3-5 until sufficient interaction experience data is obtained.
4. The reinforcement learning algorithm for cooperative communication and control of a multi-agent system according to claim 1, characterized in that step S5 specifically comprises steps S5-1 to S5-3:
step S5-1, computing the gradient of the global evaluator loss function L(w) according to the following formula:
∇_w L(w) = ∇_w (1/K) Σ_{j=1}^{K} (y_j − Q_w(O_j, A_j))²
where y_j = r_j + γ Q_{w'}(O'_j, A'_j), the target network of the global evaluator neural network being used for this computation, γ is the attenuation (discount) factor, and ∇ is the gradient operator; A'_j is obtained from the current control strategies, the next-step observations, and the messages, and is computed by the following formula:
A'_j = {π_{μ_1}(o'_j1, m'_j1), π_{μ_2}(o'_j2, m'_j2), ..., π_{μ_N}(o'_jN, m'_jN)}
where the next-step observations o'_j1, ..., o'_jN of the agents are taken from the experience data O'_j, and the next-step messages m'_j1, ..., m'_jN of the agents are computed by substituting the next-step observations into each agent's communication strategy:
m'_ji = f_{θ_{i-1}}(o'_j(i-1)), with m'_j1 = f_{θ_N}(o'_jN)
step S5-2, updating the weight parameters of the global evaluator neural network using the gradient of the global evaluator loss function, specifically:
w ← w − α ∇_w L(w)
where the arrow ← is the update assignment symbol and α is the update rate of the network weight parameters;
step S5-3, updating the weight parameters w' of the target network Q_{w'}(O^t, A^t) of the global evaluator neural network, specifically according to the following formula:
w' ← η w + (1 − η) w'
where η is the update rate of the target network weight parameters.
CN202011278974.8A 2020-11-16 2020-11-16 Reinforced learning algorithm for cooperative communication and control of multi-agent system Pending CN112434792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011278974.8A CN112434792A (en) 2020-11-16 2020-11-16 Reinforced learning algorithm for cooperative communication and control of multi-agent system


Publications (1)

Publication Number Publication Date
CN112434792A true CN112434792A (en) 2021-03-02

Family

ID=74701138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011278974.8A Pending CN112434792A (en) 2020-11-16 2020-11-16 Reinforced learning algorithm for cooperative communication and control of multi-agent system

Country Status (1)

Country Link
CN (1) CN112434792A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966816A (en) * 2021-03-31 2021-06-15 东南大学 Multi-agent reinforcement learning method surrounded by formation
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUANDA WANG et al.: "Cooperative control for multi-player pursuit-evasion games with reinforcement learning", Neurocomputing *


Similar Documents

Publication Publication Date Title
Liu et al. Learning for multi-robot cooperation in partially observable stochastic environments with macro-actions
CN112434792A (en) Reinforced learning algorithm for cooperative communication and control of multi-agent system
Yan et al. Collision-avoiding flocking with multiple fixed-wing uavs in obstacle-cluttered environments: A task-specific curriculum-based madrl approach
CN112183288B (en) Multi-agent reinforcement learning method based on model
Devo et al. Autonomous single-image drone exploration with deep reinforcement learning and mixed reality
Geng et al. Learning to cooperate in decentralized multi-robot exploration of dynamic environments
Luo et al. Self-imitation learning by planning
Wang et al. A novel hybrid map based global path planning method
Zhang et al. Multi-robot cooperative target encirclement through learning distributed transferable policy
Zou et al. Mobile robot path planning using improved mayfly optimization algorithm and dynamic window approach
Chen et al. When shall i be empathetic? the utility of empathetic parameter estimation in multi-agent interactions
Asarkaya et al. Temporal-logic-constrained hybrid reinforcement learning to perform optimal aerial monitoring with delivery drones
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
Nowak et al. Self organized UAV swarm planning optimization for search and destroy using SWARMFARE simulation
Zarrouki Reinforcement learning of model predictive control parameters for autonomous vehicle guidance
Suh et al. Optimal motion planning for multi-modal hybrid locomotion
Kuyucu et al. Incremental evolution of fast moving and sensing simulated snake-like robot with multiobjective GP and strongly-typed crossover
Khalajzadeh et al. A review on applicability of expert system in designing and control of autonomous cars
Zhang et al. Build simulation platform in real logistics scenario and optimization based on reinforcement learning
Wang et al. Path planning for air-ground robot considering modal switching point optimization
Guan Self-inspection method of unmanned aerial vehicles in power plants using deep q-network reinforcement learning
Sidenko et al. Machine Learning for Unmanned Aerial Vehicle Routing on Rough Terrain
CN116644779A (en) Object transportation method and device based on multiple intelligent agents
Shi et al. Leader-follower cooperative movement method for multiple amphibious spherical robots
Feng et al. Overview and Application-Driven Motivations of Evolutionary Multitasking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210302