CN112434792A - Reinforcement learning algorithm for cooperative communication and control of multi-agent system - Google Patents

Reinforcement learning algorithm for cooperative communication and control of multi-agent system

Info

Publication number
CN112434792A
CN112434792A (application CN202011278974.8A)
Authority
CN
China
Prior art keywords
agent
communication
control
message
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011278974.8A
Other languages
Chinese (zh)
Inventor
Wang Yuanda (王远大)
Sun Changyin (孙长银)
Sun Jia (孙佳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202011278974.8A priority Critical patent/CN112434792A/en
Publication of CN112434792A publication Critical patent/CN112434792A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 — Machine learning
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04L — TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 — Network arrangements or protocols for supporting network services or applications
    • H04L67/01 — Protocols
    • H04L67/10 — Protocols in which an application is distributed across nodes in the network

Abstract

This patent discloses a reinforcement learning method for communication and control in multi-agent systems. The method provides a reinforcement learning algorithm for a multi-agent system that shares information by sending and receiving messages over a communication network with a given topology. Through training, the system constructs a communication strategy and a control strategy on each agent, the agents extract effective low-dimensional communication information from the high-dimensional raw input of their sensing devices, and the whole multi-agent system achieves efficient information sharing and cooperative control. The method reduces the design complexity of communication and control strategies for multi-agent systems with complex dynamics and high-dimensional observations, while also reducing the communication load between agents.

Description

Reinforcement learning algorithm for cooperative communication and control of multi-agent system
Technical Field
The invention belongs to the technical field of artificial intelligence and machine learning, and relates to a reinforcement learning algorithm for cooperative communication and control of a multi-agent system.
Background
A multi-agent system is made up of multiple interacting agents, each with certain sensing, computing, and actuation capabilities and each able to communicate with other agents over a communication network. The objective of multi-agent cooperative communication and control is to design reasonable communication and control strategies so that the agents cooperate with one another and complete tasks that a single agent could not accomplish alone. In practical applications the agents can be embodied as different entities, such as aircraft, mobile robots, traffic lights, or power-network nodes, and once given reasonable communication and control strategies they play an important role in aircraft formation flight, cooperative transport by mobile robots, intelligent control of urban traffic networks, smart-grid control, and similar applications. Existing multi-agent control algorithms establish a differential-equation model of an agent by analyzing its kinematics, design the corresponding communication protocol and controller based on control theory, and finally solve problems such as consensus, formation, and optimization. However, when an agent's kinematics are complex and its sensing data is high-dimensional, it is difficult to describe the agent with differential equations, so the agent cannot be controlled with existing control-theoretic methods.
Disclosure of Invention
In view of the above problems, the present invention provides a reinforcement learning algorithm for cooperative communication and control of a multi-agent system. For a multi-agent system that shares information by sending and receiving messages over a communication network with a given topology, the algorithm enables the system to construct, through training, a communication strategy and a control strategy on each agent, so that the whole multi-agent system achieves efficient information sharing and finally completes the cooperative control task.
The multi-agent system comprises N agents. An arbitrary agent is denoted agent-i, where i is the unique sequence number of the agent within the multi-agent system. The agent observes the external environment; the observation is denoted o_i^t, a real-valued vector, where the superscript t and the subscript i indicate that agent-i observes the external environment at time t. The agent executes control actions that affect its own state and the external environment; a control action is denoted a_i^t, a real-valued vector, where the superscript t and the subscript i indicate the control action executed by agent-i at time t.
The multi-agent system communicates over a specific communication network: each agent can send a message to one determined agent, and each agent can also receive a message sent by one specific agent. The sending and receiving of messages between agents obey a specific communication topology, namely a directed ring: agent-i can only receive the message sent by agent-(i-1), and simultaneously sends a message to agent-(i+1);
the message received by agent-i is denoted m_i^t, where the superscript t and the subscript i indicate the message received by agent-i at time t; this message is also the message sent by agent-(i-1) at time t;
the message sent by agent-i is denoted m_{i+1}^t, where the superscript t and the subscript i+1 indicate the message sent by agent-i at time t; this message is also the message received by agent-(i+1) at time t;
the communication strategy constructed on agent-i is represented by a neural network, specifically written m_{i+1}^t = f_{θ_i}(o_i^t), where θ_i denotes all weight parameters of the neural network. The input of the communication strategy is the observation o_i^t of agent-i, and its output is the transmitted message m_{i+1}^t. For agent-N in particular, m_1^t = f_{θ_N}(o_N^t);
the control strategy constructed on agent-i is represented by a neural network, specifically written a_i^t = π_{μ_i}(o_i^t, m_i^t), where μ_i denotes all weight parameters of the neural network. The inputs of the control strategy are the observation o_i^t of agent-i and the received message m_i^t, and its output is the control action a_i^t.
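For concreteness, the two per-agent strategies described above can be sketched as small neural networks. The following is a minimal sketch in PyTorch; the class names, layer sizes, and activation functions are illustrative assumptions, since the patent does not fix a particular network architecture.

```python
# Minimal sketch of the per-agent strategy networks; dimensions are placeholders.
import torch
import torch.nn as nn

class CommunicationPolicy(nn.Module):
    """f_theta_i: maps agent-i's observation o_i^t to the message m_{i+1}^t."""
    def __init__(self, obs_dim: int, msg_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, msg_dim), nn.Tanh(),   # low-dimensional message
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class ControlPolicy(nn.Module):
    """pi_mu_i: maps (o_i^t, m_i^t) to the control action a_i^t."""
    def __init__(self, obs_dim: int, msg_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, msg: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, msg], dim=-1))
```

On each agent, the message produced by the communication network is what gets transmitted to the next agent in the ring, while the control network consumes the message received from the previous one.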
The reinforcement learning algorithm for the communication and control of the multi-agent system comprises the following algorithm steps:
step S1, constructing a global evaluator, wherein the global evaluator is represented by a neural network, specifically denoted Q_w(O^t, A^t), where w is all weight parameters of the neural network, O^t = {o_1^t, o_2^t, ..., o_N^t} represents the set of observations of the external environment by all agents at time t, and A^t = {a_1^t, a_2^t, ..., a_N^t} represents the set of control actions of all agents at time t;
step S2, initializing the global evaluator neural network, all agent communication strategy neural networks, and all agent control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t), whose weight parameters are w'; the target network structure is identical to the global evaluator neural network, and its initial weight parameters are also identical;
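A minimal sketch of steps S1 and S2 follows, again in PyTorch. Treating Q_w as a single network over the concatenated observations and actions is an assumption; the patent only states that the evaluator takes the two sets as inputs, and the dimensions used here are placeholders.

```python
# Sketch of the global evaluator Q_w(O^t, A^t) and its target network.
import copy
import torch
import torch.nn as nn

class GlobalEvaluator(nn.Module):
    def __init__(self, total_obs_dim: int, total_act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs: torch.Tensor, all_act: torch.Tensor) -> torch.Tensor:
        # all_obs / all_act: concatenation of every agent's observation / action.
        return self.net(torch.cat([all_obs, all_act], dim=-1))

# Illustrative dimensions: e.g. 3 agents with 16-dim observations and 2-dim actions.
critic = GlobalEvaluator(total_obs_dim=3 * 16, total_act_dim=3 * 2)
target_critic = copy.deepcopy(critic)   # same structure and initial weights (step S2)
```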
step S3, each agent interacts with the environment over a number of time steps while sending and receiving messages, and the interaction experience data is stored;
step S4, randomly extracting K sets of experience data from the experience data memory: [e_1, e_2, ..., e_K]; any extracted set of experience data is denoted e_j = (O_j, A_j, r_j, O'_j), where the subscript j denotes the sequence number within the extracted K sets of data, O_j can be expanded as O_j = {o_j1, o_j2, ..., o_jN}, and likewise A_j = {a_j1, a_j2, ..., a_jN} and O'_j = {o'_j1, o'_j2, ..., o'_jN};
step S5, training the global evaluator Q_w(O^t, A^t) using the extracted experience data;
step S6, training the communication strategy of each agent using the extracted experience data. The weight parameters θ_i of the communication strategy network of agent-i are updated by the following formula:
θ_i ← θ_i + α ∇_{θ_i} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, ..., a_ρ, ..., a_jN})
where α is the update rate of the network weight parameters and the subscript ρ denotes the number of the agent that receives the message sent by agent-i; specifically, ρ = i+1 for agent-1 through agent-(N-1), and ρ = 1 for agent-N. The message m_ρ in the formula is obtained by substituting the corresponding observation into the communication strategy of agent-i:
m_ρ = f_{θ_i}(o_ji)
In the formula, {a_j1, ..., a_ρ, ..., a_jN} is the set of control actions; all actions other than a_ρ are taken from the experience data A_j, and a_ρ is computed by the following formula:
a_ρ = π_{μ_ρ}(o_jρ, m_ρ)
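One way to realize the step-S6 update is to recompute the message m_ρ and the receiving agent's action a_ρ through the current networks, so that the gradient of the global evaluator flows back into θ_i. The sketch below assumes the network classes from the earlier sketches and an already-sampled minibatch; the function and variable names are illustrative.

```python
# Sketch of the communication-strategy update of step S6 for agent-i.
import torch

def update_communication_policy(critic, comm_i, control_rho, comm_opt,
                                obs_batch, act_batch, i, rho):
    """obs_batch / act_batch: lists of per-agent tensors of shape [K, dim]."""
    # Recompute the message of agent-i and the action of the receiving agent rho
    # with the current networks so gradients reach theta_i.
    m_rho = comm_i(obs_batch[i])                   # m_rho = f_theta_i(o_ji)
    a_rho = control_rho(obs_batch[rho], m_rho)     # a_rho = pi_mu_rho(o_jrho, m_rho)
    actions = [a.detach() for a in act_batch]      # other actions come from A_j
    actions[rho] = a_rho
    q = critic(torch.cat(obs_batch, dim=-1), torch.cat(actions, dim=-1))
    loss = -q.mean()                               # ascend the evaluator value
    comm_opt.zero_grad()
    loss.backward()
    comm_opt.step()                                # only theta_i is stepped here
```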
step S7, training the control strategy of each agent using the extracted experience data. The weight parameters μ_i of the control strategy network of agent-i are updated by the following formula:
μ_i ← μ_i + α ∇_{μ_i} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, ..., a_i, ..., a_jN})
where the message m_i is obtained by substituting the corresponding observation into the communication strategy of agent-(i-1):
m_i = f_{θ_{i-1}}(o_{j(i-1)})
In the formula, {a_j1, ..., a_i, ..., a_jN} is the set of control actions; all actions other than a_i are taken from the experience data A_j, and a_i is computed by the following formula:
a_i = π_{μ_i}(o_ji, m_i)
The other symbols in the formula have the same meaning as in the formula described in step S6;
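The step-S7 update can be sketched in the same way; here the message m_i comes from agent-(i-1)'s communication strategy and only the control parameters μ_i receive the update. As before, the helper names and tensor layout are assumptions.

```python
# Sketch of the control-strategy update of step S7 for agent-i.
import torch

def update_control_policy(critic, comm_prev, control_i, control_opt,
                          obs_batch, act_batch, i):
    # Python's negative indexing makes i - 1 wrap around the ring for i == 0.
    m_i = comm_prev(obs_batch[i - 1]).detach()     # m_i = f_theta_{i-1}(o_j(i-1))
    a_i = control_i(obs_batch[i], m_i)             # a_i = pi_mu_i(o_ji, m_i)
    actions = [a.detach() for a in act_batch]      # other actions come from A_j
    actions[i] = a_i
    q = critic(torch.cat(obs_batch, dim=-1), torch.cat(actions, dim=-1))
    loss = -q.mean()
    control_opt.zero_grad()
    loss.backward()
    control_opt.step()                             # only mu_i is stepped here
```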
step S8, repeatedly executing steps S3 to S7 with the new communication strategies and control strategies, iteratively updating the communication strategy and control strategy neural networks of each agent until the weight parameters of all agents' communication strategy and control strategy neural networks converge; the algorithm then ends, and the cooperative communication strategies and control strategies of the multi-agent system are obtained.
As a further improvement of the invention, agent-1 can only receive the message sent by agent-N and simultaneously sends its message to agent-2; agent-N can only receive the message sent by agent-(N-1) and sends its message to agent-1.
As a further improvement of the present invention, step S3 specifically comprises steps S3-1 to S3-6:
step S3-1, agent-i observes the environment and obtains the observation o_i^t;
step S3-2, agent-i uses its communication strategy f_{θ_i} to compute the message m_{i+1}^t = f_{θ_i}(o_i^t), and then sends the message to agent-(i+1);
step S3-3, agent-i receives the message m_i^t, obtains the control action a_i^t = π_{μ_i}(o_i^t, m_i^t) through its control strategy, and then executes the action;
step S3-4, agent-i observes the environment to obtain the next-step observation o'_i^t, and obtains the scalar feedback reward value r^t given by the external environment;
step S3-5, saving the set of observations of all agents from steps S3-1 to S3-4, O^t = {o_1^t, ..., o_N^t}, the set of next-step observations O'^t = {o'_1^t, ..., o'_N^t}, the set of action values A^t = {a_1^t, ..., a_N^t}, and the feedback reward value r^t; the three sets and the scalar are combined into one experience tuple, denoted e_k = (O^t, A^t, r^t, O'^t)_k, and stored in the experience data memory, where the subscript k denotes the sequence number of this set of data in the experience data memory;
step S3-6, repeatedly executing steps S3-1 to S3-5 until sufficient interaction experience data is obtained.
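A compact sketch of this interaction loop (steps S3-1 to S3-6) and of the random extraction of step S4 is given below; the environment interface, message transport, and buffer size are placeholders, not details fixed by the patent.

```python
# Sketch of experience collection over the ring topology and minibatch sampling.
import random
from collections import deque

import torch

buffer = deque(maxlen=100_000)          # experience data memory

def collect(env, comm_policies, control_policies, n_agents, steps):
    obs = env.reset()                                   # list of o_i^t tensors
    for _ in range(steps):
        with torch.no_grad():
            msgs = [comm_policies[i](obs[i]) for i in range(n_agents)]        # S3-2
            # agent-i receives the message sent by agent-(i-1): a ring shift
            recv = [msgs[i - 1] for i in range(n_agents)]
            acts = [control_policies[i](obs[i], recv[i])
                    for i in range(n_agents)]                                 # S3-3
        next_obs, reward = env.step(acts)               # S3-4
        buffer.append((obs, acts, reward, next_obs))    # e_k = (O^t, A^t, r^t, O'^t)
        obs = next_obs                                  # S3-5 / S3-6

def sample(k):
    return random.sample(list(buffer), k)               # step S4
```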
As a further improvement of the invention, step S5 specifically comprises steps S5-1 to S5-3:
step S5-1, computing the gradient of the global evaluator loss function L(w) according to the following formula:
∇_w L(w) = ∇_w (1/K) Σ_{j=1}^{K} (y_j − Q_w(O_j, A_j))²
where y_j = r_j + γ Q_{w'}(O'_j, A'_j), the target network of the global evaluator neural network being used for this computation; γ is the attenuation (discount) factor and ∇ is the gradient operator. A'_j is obtained from the current control strategies, the next-step observations, and the messages, and is computed by the following formula:
A'_j = {π_{μ_1}(o'_j1, m'_j1), π_{μ_2}(o'_j2, m'_j2), ..., π_{μ_N}(o'_jN, m'_jN)}
In this formula, the next-step observations o'_j1, ..., o'_jN of the agents are taken from the experience data O'_j, and the next-step messages m'_j1, ..., m'_jN of the agents are computed by substituting the next-step observations into each agent's communication strategy:
m'_ji = f_{θ_{i-1}}(o'_j(i-1)), with m'_j1 = f_{θ_N}(o'_jN)
step S5-2, updating the weight parameters of the global evaluator neural network using the gradient of the global evaluator loss function, specifically:
w ← w − α ∇_w L(w)
wherein arrow ← is the update assignment symbol, and α is the update rate of the network weight parameter;
step S5-3, updating the weight parameters w' of the target network Q_{w'}(O^t, A^t) of the global evaluator neural network, specifically according to the following formula:
w′←ηw+(1-η)w′
wherein η is the update rate of the network weight parameter.
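Steps S5-1 to S5-3 amount to one temporal-difference update of the global evaluator followed by a soft update of its target network. The sketch below assumes the batch has already been stacked into per-agent tensors and that the next-step actions A'_j are rebuilt from the current strategies as described; the values of γ, α, and η are illustrative.

```python
# Sketch of the global-evaluator training step (steps S5-1 to S5-3).
import torch
import torch.nn.functional as F

def train_critic(critic, target_critic, critic_opt,
                 comm_policies, control_policies,
                 O, A, r, O_next, gamma=0.99, eta=0.01):
    """O, A, O_next: lists of per-agent tensors [K, dim]; r: tensor [K, 1]."""
    n = len(comm_policies)
    with torch.no_grad():
        msgs_next = [comm_policies[i](O_next[i]) for i in range(n)]
        # agent-i receives the message produced by agent-(i-1)
        A_next = [control_policies[i](O_next[i], msgs_next[i - 1]) for i in range(n)]
        y = r + gamma * target_critic(torch.cat(O_next, dim=-1),
                                      torch.cat(A_next, dim=-1))   # y_j
    q = critic(torch.cat(O, dim=-1), torch.cat(A, dim=-1))
    loss = F.mse_loss(q, y)                 # L(w), step S5-1
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()                       # step S5-2
    # step S5-3: soft update of the target network, w' <- eta*w + (1-eta)*w'
    for w_t, w in zip(target_critic.parameters(), critic.parameters()):
        w_t.data.mul_(1.0 - eta).add_(eta * w.data)
```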
Advantageous effects:
1) Each agent can extract effective communication information from the high-dimensional raw input of its sensing devices, so that the multi-agent system achieves efficient information sharing while the communication load is reduced;
2) The communication strategies and control strategies of the agents are constructed with a reinforcement learning algorithm, so no analysis or mathematical modeling of the agents' dynamic characteristics is required, which reduces the design complexity of communication and control strategies for multi-agent systems with complex dynamics and high-dimensional observations.
Drawings
FIG. 1 is a diagram of the composition and information-processing flow of a mobile robot according to the present invention;
fig. 2 is a diagram of a multi-mobile-robot communication network structure of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the invention provides a reinforcement learning algorithm for cooperative communication and control of a multi-agent system, which aims at the multi-agent system for information sharing by sending and receiving messages through a communication network with a certain topological structure, and provides the reinforcement learning algorithm, so that the multi-agent system can construct a communication strategy and a control strategy on each agent through training, the whole multi-agent system can realize efficient information sharing, and finally, the task of cooperative control is completed.
The following describes an embodiment of the method disclosed by the invention, applied to a multi-agent system consisting of three mobile robots performing cooperative control.
Fig. 1 shows a mobile robot composition and an information processing flow.
101 is a sensing device of the robot for observing external environmental obstacles, self-position information, and the like and converting them into measurement data. The sensing equipment can specifically comprise common mobile robot sensing equipment such as a camera, an ultrasonic distance meter, a laser distance meter, a speedometer, a global positioning system receiver, an ultra-wideband positioning receiver and the like.
102 is a communication device (receive) for receiving a communication message. 106 is a communication device (transmit) for transmitting communication messages. The communication equipment can be common communication equipment on the mobile robot and comprises a Bluetooth communication module, a ZigBee communication module, a WiFi communication module and the like.
107 is the driving equipment, used to drive the motion of the mobile robot according to the input control quantity; it can be common mobile-robot driving equipment such as motors, wheels, tracks, and steering gears.
Reference numeral 103 denotes an information processing computer mounted on the mobile robot, which processes information input by the sensor device and the communication device (reception) and transmits the processed information to the driver device and the communication device (transmission). The information processing computer includes two functional modules: 104 is a communication policy module and 105 is a control policy module.
The communication policy module processes information as follows: the measurement data obtained by the sensing device is fed into the communication strategy neural network, the resulting message is passed to the communication device (transmit), and the message is then sent to another mobile robot.
The control policy module processes information as follows: the measurement data obtained by the sensing device and the message received by the communication device (receive) are fed into the control strategy neural network, and the resulting control quantity is passed to the driving device to control the motion of the mobile robot.
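Putting the two modules together, one processing cycle on the information processing computer 103 might look like the following sketch; the sensor, message-transport, and drive functions stand in for the robot's actual devices 101, 102, 106, and 107 and are assumptions of this illustration.

```python
# Sketch of one on-robot cycle: communication policy module (104) then control
# policy module (105); the device callbacks are placeholders.
import torch

def robot_cycle(comm_policy, control_policy, read_sensors, recv_message,
                send_message, drive):
    obs = torch.as_tensor(read_sensors(), dtype=torch.float32)      # device 101
    with torch.no_grad():
        msg_out = comm_policy(obs)              # module 104
        send_message(msg_out.numpy())           # device 106, to the next robot
        msg_in = torch.as_tensor(recv_message(), dtype=torch.float32)  # device 102
        action = control_policy(obs, msg_in)    # module 105
    drive(action.numpy())                       # device 107
```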
Fig. 2 is a diagram of the multi-mobile-robot communication network architecture. The network is divided into two parts: a communication network between the mobile robots (indicated by solid arrows in fig. 2), and a communication network between the mobile robots and the central training computer (indicated by dashed arrows in fig. 2).
The communication network between the mobile robots works as follows: the first mobile robot receives the message sent by the third mobile robot and sends its message to the second mobile robot; the second mobile robot receives the message sent by the first mobile robot and sends its message to the third mobile robot; the third mobile robot receives the message sent by the second mobile robot and sends its message to the first mobile robot.
The communication network between the mobile robots and the central training computer works as follows: each mobile robot transmits its measurement data, control actions, and the weight parameters of its control strategy and communication strategy neural networks to the central training computer, and the central training computer sends the corresponding updated weight parameters of the control strategy and communication strategy neural networks back to each mobile robot.
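A simple way to realize this exchange, assuming the strategies are PyTorch modules, is to serialize their weight dictionaries for transmission; the transport itself (WiFi, ZigBee, etc.) is outside this sketch.

```python
# Sketch of packing and unpacking network weights for the robot <-> central
# training computer exchange.
import io
import torch

def pack_weights(comm_policy, control_policy) -> bytes:
    buf = io.BytesIO()
    torch.save({"comm": comm_policy.state_dict(),
                "ctrl": control_policy.state_dict()}, buf)
    return buf.getvalue()

def unpack_weights(payload: bytes, comm_policy, control_policy) -> None:
    state = torch.load(io.BytesIO(payload))
    comm_policy.load_state_dict(state["comm"])
    control_policy.load_state_dict(state["ctrl"])
```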
The specific execution mode of the reinforcement learning algorithm for cooperative communication and control of the multi-agent system on the cooperative control task of the three mobile robots is given by the following steps:
Step S1, constructing a global evaluator on the central training computer. The global evaluator is represented by a neural network, denoted Q_w(O^t, A^t), where w is all weight parameters of the neural network, O^t = {o_1^t, o_2^t, o_3^t} represents the set of observations of the three mobile robots at time t, and A^t = {a_1^t, a_2^t, a_3^t} represents the set of control actions of the three mobile robots at time t. A communication strategy neural network f_{θ_i} and a control strategy neural network π_{μ_i} (i = 1, 2, 3) are constructed on the information processing computer of each mobile robot.
Step S2, initializing the global evaluator neural network and the three mobile robots' communication strategy and control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t), whose weight parameters are w'; the target network structure is identical to the global evaluator neural network, and its initial weight parameters are also identical.
In step S3, the three mobile robots observe the environment, communicate with each other, and communicate with the central training computer. The concrete steps include steps S3-1 to S3-7.
Step S3-1, the three mobile robots each observe the environment using their sensing devices; the first mobile robot obtains the observation o_1^t, the second mobile robot obtains the observation o_2^t, and the third mobile robot obtains the observation o_3^t.
Step S3-2, the first mobile robot computes the message m_2^t = f_{θ_1}(o_1^t) with its communication policy module and then sends the message to the second mobile robot using its communication device; the second mobile robot computes the message m_3^t = f_{θ_2}(o_2^t) and sends it to the third mobile robot; the third mobile robot computes the message m_1^t = f_{θ_3}(o_3^t) and sends it to the first mobile robot. The computation and transmission by the three mobile robots in this step may be performed simultaneously.
Step S3-3, the first mobile robot receives the message m_1^t with its communication device and obtains the control action a_1^t = π_{μ_1}(o_1^t, m_1^t) through its control policy module; the second mobile robot receives the message m_2^t and obtains the control action a_2^t = π_{μ_2}(o_2^t, m_2^t); the third mobile robot receives the message m_3^t and obtains the control action a_3^t = π_{μ_3}(o_3^t, m_3^t). The three mobile robots then input the control actions a_1^t, a_2^t, a_3^t to their driving devices, which drive the motion of the mobile robots; in this step, receiving the message, computing the control action, and inputting it to the driving device may be performed simultaneously by the three mobile robots.
Step S3-4, the three mobile robots each observe the environment with their sensing devices to obtain the next-step observations o'_1^t, o'_2^t, o'_3^t, while obtaining the scalar feedback reward value r^t given by the external environment.
Step S3-5, the three mobile robots transmit the environment observations, control action values, and feedback reward value obtained in steps S3-1 to S3-4 to the central training computer through their communication devices.
Step S3-6, the central training computer receives the set of environment observations O^t = {o_1^t, o_2^t, o_3^t} of the three mobile robots, the set of next-step observations O'^t = {o'_1^t, o'_2^t, o'_3^t}, the set of control actions A^t = {a_1^t, a_2^t, a_3^t}, and the feedback reward value r^t. The received data is merged into an experience tuple denoted e_k = (O^t, A^t, r^t, O'^t)_k, which is saved to the experience data memory of the central training computer.
Step S3-7, repeating steps S3-1 to S3-6 until the central training computer has saved sufficient interaction experience data (more than 10,000 experience tuples).
Step S4, randomly extracting K sets of experience data from the experience data memory on the central training computer: [e_1, e_2, ..., e_K]; any extracted set of experience data is denoted e_j = (O_j, A_j, r_j, O'_j), where the subscript j denotes the sequence number within the extracted K sets of data, O_j can be expanded as O_j = {o_j1, o_j2, o_j3}, and likewise A_j = {a_j1, a_j2, a_j3} and O'_j = {o'_j1, o'_j2, o'_j3}.
Step S5, training the global evaluator Q_w(O^t, A^t) on the central training computer using the extracted experience data, specifically comprising steps S5-1 to S5-3.
Step S5-1, computing the gradient of the global evaluator loss function L(w) according to the following formula:
∇_w L(w) = ∇_w (1/K) Σ_{j=1}^{K} (y_j − Q_w(O_j, A_j))²
where y_j = r_j + γ Q_{w'}(O'_j, A'_j), the target network of the global evaluator neural network being used for this computation; γ is the attenuation (discount) factor and ∇ is the gradient operator. A'_j is obtained from the current control strategies, the next-step observations, and the messages, and is computed by the following formula:
A'_j = {π_{μ_1}(o'_j1, m'_j1), π_{μ_2}(o'_j2, m'_j2), π_{μ_3}(o'_j3, m'_j3)}
In this formula, the next-step observations o'_j1, o'_j2, o'_j3 of the three mobile robots are taken from the experience data O'_j, and the next-step messages m'_j1, m'_j2, m'_j3 are computed by substituting the next-step observations into each mobile robot's communication strategy:
m'_j1 = f_{θ_3}(o'_j3), m'_j2 = f_{θ_1}(o'_j1), m'_j3 = f_{θ_2}(o'_j2)
step S5-2, updating weight parameters of the global evaluator neural network by using the gradient of the global evaluator loss function, wherein the specific updating method is as follows:
Figure BDA0002780097540000079
wherein arrow ← is the update assignment symbol, and α is the update rate of the network weight parameter.
Step S5-3, updating the weight parameters w' of the target network Q_{w'}(O^t, A^t) of the global evaluator neural network, specifically according to the following formula:
w′←ηw+(1-η)w′
wherein η is the update rate of the network weight parameter.
Step S6, computing the weight parameters of each mobile robot's new communication strategy neural network using the extracted experience data. For the first mobile robot, the new communication strategy weight parameters θ'_1 are computed by the following formula:
θ'_1 = θ_1 + α ∇_{θ_1} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, a_2, a_j3})
The message m_2 in the formula is obtained by substituting the corresponding observation into the communication strategy of the first mobile robot:
m_2 = f_{θ_1}(o_j1)
In the formula, {a_j1, a_2, a_j3} is the set of control actions; the actions other than a_2 are taken from the experience data A_j, and a_2 is computed by the following formula:
a_2 = π_{μ_2}(o_j2, m_2)
for the second and third mobile robots, a similar method can be used to obtain weight parameter θ 'of the new communication policy neural network'2And θ'3
Step S7, computing the weight parameters of each mobile robot's new control strategy neural network using the extracted experience data. For the first mobile robot, the new control strategy neural network weight parameters μ'_1 are computed by the following formula:
μ'_1 = μ_1 + α ∇_{μ_1} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_1, a_j2, a_j3})
where the message m_1 is obtained by substituting the corresponding observation into the communication strategy of the third mobile robot:
m_1 = f_{θ_3}(o_j3)
In the formula, {a_1, a_j2, a_j3} is the set of control actions; the actions other than a_1 are taken from the experience data A_j, and a_1 is computed by the following formula:
a_1 = π_{μ_1}(o_j1, m_1)
The other symbols in the formula have the same meaning as in the formula described in step S6. For the second and third mobile robots, a similar method can be used to obtain the new control strategy neural network weight parameters μ'_2 and μ'_3.
Step S8, the central training computer sends the newly computed communication strategy and control strategy neural network parameters θ'_1 and μ'_1 of the first mobile robot, through the communication devices, to the information processing computer of the first mobile robot, which updates its previous communication strategy and control strategy neural networks. The communication strategy and control strategy neural networks of the second and third mobile robots are transmitted and updated in the same manner.
Step S9, repeatedly executing steps S3 to S8 with the new communication strategies and control strategies, iteratively updating the communication strategy and control strategy neural networks of the three mobile robots until the weight parameters of all mobile robots' communication strategy and control strategy neural networks converge; the cooperative communication and control strategies for the three mobile robots are thus obtained.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.

Claims (4)

1. A reinforcement learning algorithm for cooperative communication and control of a multi-agent system, characterized in that the multi-agent system comprises N agents, wherein any agent is denoted agent-i, i being the unique sequence number of the agent within the multi-agent system; the agent observes the external environment, the observation being denoted o_i^t, a real-valued vector, where the superscript t and the subscript i indicate that agent-i observes the external environment at time t; the agent executes control actions that affect its own state and the external environment, a control action being denoted a_i^t, a real-valued vector, where the superscript t and the subscript i indicate the control action executed by agent-i at time t; the multi-agent system communicates using a specific communication network, each agent being capable of sending a message to another determined agent and of receiving a message sent by a specific agent; the sending and receiving of messages between agents obey a specific communication topology, the specific communication topology being a directed ring, in which agent-i can only receive the message sent by agent-(i-1) and simultaneously sends a message to agent-(i+1);
the message received by agent-i is denoted m_i^t, where the superscript t and the subscript i indicate the message received by agent-i at time t, this message also being the message sent by agent-(i-1) at time t;
the message sent by agent-i is denoted m_{i+1}^t, where the superscript t and the subscript i+1 indicate the message sent by agent-i at time t, this message also being the message received by agent-(i+1) at time t;
the communication strategy constructed on agent-i is represented by a neural network, specifically written m_{i+1}^t = f_{θ_i}(o_i^t), where θ_i denotes all weight parameters of the neural network, the input of the communication strategy being the observation o_i^t of agent-i and its output being the transmitted message m_{i+1}^t; for agent-N in particular, m_1^t = f_{θ_N}(o_N^t);
the control strategy constructed on agent-i is represented by a neural network, specifically written a_i^t = π_{μ_i}(o_i^t, m_i^t), where μ_i denotes all weight parameters of the neural network, the inputs of the control strategy being the observation o_i^t of agent-i and the received message m_i^t, and its output being the control action a_i^t;
the reinforcement learning algorithm for the communication and control of the multi-agent system comprises the following algorithm steps:
step S1, constructing a global evaluator, the global evaluator being represented by a neural network, specifically denoted Q_w(O^t, A^t), where w is all weight parameters of the neural network, O^t = {o_1^t, o_2^t, ..., o_N^t} represents the set of observations of the external environment by all agents at time t, and A^t = {a_1^t, a_2^t, ..., a_N^t} represents the set of control actions of all agents at time t;
step S2, initializing the global evaluator neural network, all agent communication strategy neural networks, and all agent control strategy neural networks with random numbers, and additionally constructing a target network of the global evaluator neural network, denoted Q_{w'}(O^t, A^t), whose weight parameters are w', the target network structure being identical to the global evaluator neural network and its initial weight parameters also being identical;
step S3, each agent interacting with the environment over a number of time steps while sending and receiving messages, and storing the interaction experience data;
step S4, randomly extracting K sets of experience data from the experience data memory: [e_1, e_2, ..., e_K], any extracted set of experience data being denoted e_j = (O_j, A_j, r_j, O'_j), where the subscript j denotes the sequence number within the extracted K sets of data, O_j can be expanded as O_j = {o_j1, o_j2, ..., o_jN}, and likewise A_j = {a_j1, a_j2, ..., a_jN} and O'_j = {o'_j1, o'_j2, ..., o'_jN};
step S5, training the global evaluator Q_w(O^t, A^t) using the extracted experience data;
step S6, training the communication strategy of each agent using the extracted experience data, the weight parameters θ_i of the communication strategy network of agent-i being updated by the following formula:
θ_i ← θ_i + α ∇_{θ_i} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, ..., a_ρ, ..., a_jN})
where α is the update rate of the network weight parameters and the subscript ρ denotes the number of the agent that receives the message sent by agent-i, specifically ρ = i+1 for agent-1 through agent-(N-1) and ρ = 1 for agent-N; the message m_ρ in the formula is obtained by substituting the corresponding observation into the communication strategy of agent-i:
m_ρ = f_{θ_i}(o_ji)
in the formula, {a_j1, ..., a_ρ, ..., a_jN} is the set of control actions, the actions other than a_ρ being taken from the experience data A_j and a_ρ being computed by the following formula:
a_ρ = π_{μ_ρ}(o_jρ, m_ρ)
step S7, training the control strategy of each agent using the extracted experience data, the weight parameters μ_i of the control strategy network of agent-i being updated by the following formula:
μ_i ← μ_i + α ∇_{μ_i} (1/K) Σ_{j=1}^{K} Q_w(O_j, {a_j1, ..., a_i, ..., a_jN})
where the message m_i is obtained by substituting the corresponding observation into the communication strategy of agent-(i-1):
m_i = f_{θ_{i-1}}(o_{j(i-1)})
in the formula, {a_j1, ..., a_i, ..., a_jN} is the set of control actions, the actions other than a_i being taken from the experience data A_j and a_i being computed by the following formula:
a_i = π_{μ_i}(o_ji, m_i)
the other symbols in the formula having the same meaning as in the formula described in step S6;
step S8, repeatedly executing steps S3 to S7 with the new communication strategies and control strategies, iteratively updating the communication strategy and control strategy neural networks of each agent until the weight parameters of all agents' communication strategy and control strategy neural networks converge, whereupon the algorithm ends and the cooperative communication strategies and control strategies of the multi-agent system are obtained.
2. The reinforcement learning algorithm for cooperative communication and control of a multi-agent system according to claim 1, characterized in that agent-1 can only receive the message sent by agent-N and simultaneously sends its message to agent-2; and agent-N can only receive the message sent by agent-(N-1) and sends its message to agent-1.
3. The reinforcement learning algorithm for cooperative communication and control of a multi-agent system according to claim 1, characterized in that step S3 specifically comprises steps S3-1 to S3-6:
step S3-1, agent-i observes the environment and obtains the observation o_i^t;
step S3-2, agent-i uses its communication strategy f_{θ_i} to compute the message m_{i+1}^t = f_{θ_i}(o_i^t), and then sends the message to agent-(i+1);
step S3-3, agent-i receives the message m_i^t, obtains the control action a_i^t = π_{μ_i}(o_i^t, m_i^t) through its control strategy, and then executes the action;
step S3-4, agent-i observes the environment to obtain the next-step observation o'_i^t, and obtains the scalar feedback reward value r^t given by the external environment;
step S3-5, saving the set of observations of all agents from steps S3-1 to S3-4, O^t = {o_1^t, ..., o_N^t}, the set of next-step observations O'^t = {o'_1^t, ..., o'_N^t}, the set of action values A^t = {a_1^t, ..., a_N^t}, and the feedback reward value r^t; the three sets and the scalar are combined into one experience tuple, denoted e_k = (O^t, A^t, r^t, O'^t)_k, and stored in the experience data memory, where the subscript k denotes the sequence number of this set of data in the experience data memory;
step S3-6, repeatedly executing steps S3-1 to S3-5 until sufficient interaction experience data is obtained.
4. The reinforcement learning algorithm for cooperative communication and control of a multi-agent system according to claim 1, characterized in that step S5 specifically comprises steps S5-1 to S5-3:
step S5-1, computing the gradient of the global evaluator loss function L(w) according to the following formula:
∇_w L(w) = ∇_w (1/K) Σ_{j=1}^{K} (y_j − Q_w(O_j, A_j))²
where y_j = r_j + γ Q_{w'}(O'_j, A'_j), the target network of the global evaluator neural network being used for this computation, γ is the attenuation (discount) factor, and ∇ is the gradient operator; A'_j is obtained from the current control strategies, the next-step observations, and the messages, and is computed by the following formula:
A'_j = {π_{μ_1}(o'_j1, m'_j1), π_{μ_2}(o'_j2, m'_j2), ..., π_{μ_N}(o'_jN, m'_jN)}
where the next-step observations o'_j1, ..., o'_jN of the agents are taken from the experience data O'_j, and the next-step messages m'_j1, ..., m'_jN of the agents are computed by substituting the next-step observations into each agent's communication strategy:
m'_ji = f_{θ_{i-1}}(o'_j(i-1)), with m'_j1 = f_{θ_N}(o'_jN)
step S5-2, updating the weight parameters of the global evaluator neural network using the gradient of the global evaluator loss function, specifically:
w ← w − α ∇_w L(w)
where the arrow ← is the update assignment symbol and α is the update rate of the network weight parameters;
step S5-3, updating the weight parameters w' of the target network Q_{w'}(O^t, A^t) of the global evaluator neural network, specifically according to the following formula:
w' ← η w + (1 − η) w'
where η is the update rate of the target network weight parameters.
CN202011278974.8A 2020-11-16 2020-11-16 Reinforced learning algorithm for cooperative communication and control of multi-agent system Pending CN112434792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011278974.8A CN112434792A (en) 2020-11-16 2020-11-16 Reinforced learning algorithm for cooperative communication and control of multi-agent system


Publications (1)

Publication Number Publication Date
CN112434792A true CN112434792A (en) 2021-03-02

Family

ID=74701138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011278974.8A Pending CN112434792A (en) 2020-11-16 2020-11-16 Reinforced learning algorithm for cooperative communication and control of multi-agent system

Country Status (1)

Country Link
CN (1) CN112434792A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966816A (en) * 2021-03-31 2021-06-15 东南大学 Multi-agent reinforcement learning method surrounded by formation
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUANDA WANG et al.: "Cooperative control for multi-player pursuit-evasion games with reinforcement learning", Neurocomputing *


Similar Documents

Publication Publication Date Title
Liu et al. Learning for multi-robot cooperation in partially observable stochastic environments with macro-actions
CN112434792A (en) Reinforced learning algorithm for cooperative communication and control of multi-agent system
Yan et al. Collision-avoiding flocking with multiple fixed-wing uavs in obstacle-cluttered environments: A task-specific curriculum-based madrl approach
CN112183288B (en) Multi-agent reinforcement learning method based on model
Devo et al. Autonomous single-image drone exploration with deep reinforcement learning and mixed reality
Geng et al. Learning to cooperate in decentralized multi-robot exploration of dynamic environments
Luo et al. Self-imitation learning by planning
Wang et al. A novel hybrid map based global path planning method
Zhang et al. Multi-robot cooperative target encirclement through learning distributed transferable policy
Zou et al. Mobile robot path planning using improved mayfly optimization algorithm and dynamic window approach
Chen et al. When shall i be empathetic? the utility of empathetic parameter estimation in multi-agent interactions
Asarkaya et al. Temporal-logic-constrained hybrid reinforcement learning to perform optimal aerial monitoring with delivery drones
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
Nowak et al. Self organized UAV swarm planning optimization for search and destroy using SWARMFARE simulation
Zarrouki Reinforcement learning of model predictive control parameters for autonomous vehicle guidance
Suh et al. Optimal motion planning for multi-modal hybrid locomotion
Kuyucu et al. Incremental evolution of fast moving and sensing simulated snake-like robot with multiobjective GP and strongly-typed crossover
Khalajzadeh et al. A review on applicability of expert system in designing and control of autonomous cars
Zhang et al. Build simulation platform in real logistics scenario and optimization based on reinforcement learning
Wang et al. Path planning for air-ground robot considering modal switching point optimization
Guan Self-inspection method of unmanned aerial vehicles in power plants using deep q-network reinforcement learning
Sidenko et al. Machine Learning for Unmanned Aerial Vehicle Routing on Rough Terrain
CN116644779A (en) Object transportation method and device based on multiple intelligent agents
Shi et al. Leader-follower cooperative movement method for multiple amphibious spherical robots
Feng et al. Overview and Application-Driven Motivations of Evolutionary Multitasking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210302