CN113286275A - Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning - Google Patents

Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning

Info

Publication number
CN113286275A
Authority
CN
China
Prior art keywords
unmanned aerial vehicle
captain
reinforcement learning
communication method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110441049.0A
Other languages
Chinese (zh)
Inventor
俞扬
詹德川
周志华
练娅莉
袁雷
秦熔均
庞竟成
管聪
罗凡明
张云天
陈雄辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110441049.0A priority Critical patent/CN113286275A/en
Publication of CN113286275A publication Critical patent/CN113286275A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W 4/46 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P], for vehicle-to-vehicle communication [V2V]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 Supervisory, monitoring or testing arrangements
    • H04W 24/02 Arrangements for optimising operational condition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 Supervisory, monitoring or testing arrangements
    • H04W 24/06 Testing, supervising or monitoring using simulated traffic

Abstract

The invention discloses an unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning, which comprises the following steps: constructing an unmanned aerial vehicle flight environment simulator; randomly selecting one unmanned aerial vehicle as the captain and marking it; each unmanned aerial vehicle acquires and maintains its own local observation, encodes it, and sends the encoded observation to the captain; the captain applies attention-mechanism processing to the global observation according to each unmanned aerial vehicle's own observation, determines the weight of each piece of information according to its importance, and then sends the computed observation to each teammate as that teammate's global observation; in the training phase the global observation is used as training data until the policy network converges; the execution phase is carried out in a distributed manner; the captain is given an additional reward for survival. The invention solves the problem of centralized information interaction in an unmanned aerial vehicle cluster under low communication overhead and grants each unmanned aerial vehicle autonomous decision-making authority.

Description

Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning
Technical Field
The invention relates to an unmanned aerial vehicle cluster learning method and an unmanned aerial vehicle cluster information interaction method based on a multi-agent reinforcement learning algorithm, and belongs to the technical field of unmanned aerial vehicle cluster communication cooperation.
Background
With the rapid development of science and technology, unmanned aerial vehicles have become increasingly miniaturized and intelligent, and are therefore widely applied to missions such as battlefield reconnaissance, joint strikes and emergency rescue. Compared with a single unmanned aerial vehicle, an unmanned aerial vehicle cluster has obvious advantages in scale and cooperation, effectively overcoming the limitations of a single aircraft in efficiency and endurance. At present, most unmanned aerial vehicle cluster solutions rely on deep reinforcement learning. Deep reinforcement learning is a branch of machine learning in which an agent learns through interactive trial and error with the environment. Reinforcement learning is highly robust; its core is iterative learning through interaction with the environment, and its two basic characteristics are continuous trial-and-error exploration through interaction with the environment and delayed returns obtained after that interaction. The goal of reinforcement learning is to acquire new knowledge in an iterative process, thereby improving the state-action value function so as to adapt to the environment.
Whether an unmanned aerial vehicle cluster is trained cooperatively or deployed cooperatively, the communication scheme among the unmanned aerial vehicles is an important component of the cluster solution. Existing control strategies for unmanned aerial vehicle cluster information interaction generally fall into three modes, namely centralized control, decentralized control and distributed control, each with its own advantages and disadvantages. Centralized control gives the best effect, but it requires a large amount of information interaction, incurs heavy computation and low communication efficiency, and under this method the unmanned aerial vehicle cluster often lacks flexibility and autonomy.
In order to improve communication efficiency within unmanned aerial vehicle clusters, the academic community has proposed a number of methods, such as the centralized-training, decentralized-execution framework, but such methods ignore physical-layer security. The centralized-training, decentralized-execution framework comprises an Actor network and a Critic network; the reinforcement learning of each agent takes the action policies of the other agents into account, training is centralized and execution is distributed, the Critic network can obtain global information during training, and the input of the Actor during execution contains only the local information of a single agent. Applying this framework to unmanned aerial vehicle cluster control prevents the cluster from losing autonomy, since each unmanned aerial vehicle can make decisions based on the local information acquired by its own sensors and thus has a certain autonomous capability. However, in an unmanned aerial vehicle cluster based on this framework, if each unmanned aerial vehicle has to exchange information such as position, speed, attitude and moving targets with all the other unmanned aerial vehicles in the formation, bandwidth is wasted and the information is easily intercepted by an enemy.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects of low communication efficiency and insecure communication in an unmanned aerial vehicle cluster based on the centralized-training, decentralized-execution framework, the invention provides an efficient unmanned aerial vehicle cluster communication method based on a multi-agent deep reinforcement learning framework, set against the background of multi-agent reinforcement learning algorithms for unmanned aerial vehicle clusters. The cluster adopts a 'captain-teammate' mode: a specific reward mechanism is designed for the captain to ensure that the captain survives as long as possible; the unmanned aerial vehicles in the cluster are then numbered, one captain is selected, and each unmanned aerial vehicle acquires and maintains its own local observation through its own sensors. To reduce the information dimensionality, each unmanned aerial vehicle applies an embedding encoding to its local observation, and the captain collects the teammates' local observations and maintains them as a global observation. To reduce the search space and improve the compactness of the information in the global observation, the captain applies attention-mechanism processing to the global observation according to each unmanned aerial vehicle's own observation and sends the computed observation to each teammate as that teammate's global observation. The scheme of the invention reduces the number of communications per exchange from N(N-1) to 2(N-1), where N is the number of unmanned aerial vehicles (for example, from 90 to 18 for a cluster of 10 aircraft), which greatly reduces communication links and traffic and better addresses the problems of communication efficiency and security.
The technical scheme is as follows: an unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning comprises the following steps: (1) constructing an unmanned aerial vehicle flight environment simulator; (2) in the unmanned aerial vehicle cluster, randomly selecting one unmanned aerial vehicle as the captain and marking it, the remaining unmanned aerial vehicles serving as teammates; (3) the captain acting as an observation relay station, collecting the teammates' local observations, maintaining them as a global observation and sending the global observation to the teammates for information interaction; (4) performing centralized training and decentralized execution based on the centralized-training, decentralized-execution framework, for example carrying out unmanned aerial vehicle action training and execution with the MADDPG algorithm or the QMIX algorithm, where in the training phase the global observation is used as training data until the policy network converges, and the execution phase is carried out in a distributed manner, i.e. each unmanned aerial vehicle feeds its own local observation into the policy execution network to obtain the corresponding action; (5) to keep the captain free from attack, giving the captain an additional reward for survival through the reward function.
In the step (1), an unmanned aerial vehicle flight environment simulator based on aerodynamics is constructed based on a simulation environment engine.
In step (3), each unmanned aerial vehicle acquires and maintains its own local observation, encodes it and sends the encoded local observation to the captain. Because the pieces of information in the global observation formed from the unmanned aerial vehicles' own observations differ in importance, the captain applies attention-mechanism processing to the global observation separately according to each unmanned aerial vehicle's own local observation, determines the weight of each piece of information according to its importance, and then sends the computed observation to each teammate as that teammate's global observation, thereby reducing the exploration space, improving learning efficiency and constructing a more tightly related global observation.
Meanwhile, in order to further reduce bandwidth consumption, each unmanned aerial vehicle applies an embedding encoding to its local observation o_ω. All unmanned aerial vehicles share the same encoding mechanism; after encoding, each teammate obtains its encoded observation containing position, speed, attitude and state information and sends it to the captain, and the captain collects the teammates' local observations and maintains them as a global observation.
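As an illustration of this encoding step, the following minimal sketch (PyTorch is assumed; the 12-dimensional raw observation and 16-dimensional embedding are illustrative choices, not values taken from the patent) shows a shared encoder that every drone could use to compress its local observation before sending it to the captain.

import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    # Embedding encoder shared by every drone: compresses the raw local
    # observation (position, speed, attitude, state) into a short message.
    def __init__(self, obs_dim: int = 12, embed_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Every drone shares the same encoder (the same coding mechanism),
# so the captain can interpret each teammate's compressed message.
shared_encoder = ObservationEncoder()
local_obs = torch.randn(12)              # assumed layout: position, speed, attitude, state
encoded_obs = shared_encoder(local_obs)  # low-dimensional message sent to the captain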
In the initial stage, when information is scarce, each unmanned aerial vehicle can generate a corresponding action a_ω from its own local observation o_ω using its policy π_ω.
In the training phase, in order to guarantee the safety of unmanned aerial vehicle cluster communication, the captain needs to survive to the end of the whole cooperative process, so a specific reward function is designed. The reward function comprises three terms: a course reward function, an outcome reward function and a captain reward function (each defined by a formula given in the original figures).
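Since the reward formulas are given only as figures, the following hypothetical sketch merely illustrates the structure described above: a course term, an outcome term and an extra survival bonus paid only to the captain. The concrete terms and coefficients are assumptions for illustration, not values from the patent.

def total_reward(is_captain: bool,
                 progress: float,          # course term: progress toward the mission goal this step
                 mission_succeeded: bool,  # outcome term: paid at the end of the episode
                 captain_alive: bool) -> float:
    r_course = 0.1 * progress
    r_outcome = 10.0 if mission_succeeded else 0.0
    # the captain receives an additional reward for staying alive
    r_captain = 1.0 if (is_captain and captain_alive) else 0.0
    return r_course + r_outcome + r_captain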
The attention mechanism function contains three basic elements: query, key and value, where the query represents a given element and each <key, value> pair gives the value corresponding to a key. Here the query represents the action performed by an unmanned aerial vehicle, each key represents the current state of an unmanned aerial vehicle, and each value is the value corresponding to that key. First, a similarity function s_i = Q^T·K_i computes the similarity between the given query and each key, where Q stands for the query and the superscript T denotes its transpose. The similarities are then normalized with the softmax function α_i = exp(s_i)/∑_j exp(s_j) to obtain the normalized attention weights. Finally the values are combined by the weighted sum Attention(Q, K_i) = ∑_i α_i·Value_i, where α_i is the weight reflecting the similarity between key i and the given query; summing the weighted values over all keys yields the final attention value.
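A minimal sketch of the captain's attention step is given below, assuming the dot-product similarity and softmax normalization described above; PyTorch, a single attention head and the tensor shapes are assumptions.

import torch

def captain_attention(query: torch.Tensor,   # query, e.g. one drone's encoded observation, shape (d,)
                      keys: torch.Tensor,    # current states of all drones, shape (N, d)
                      values: torch.Tensor   # encoded observations of all drones, shape (N, d)
                      ) -> torch.Tensor:
    scores = keys @ query                    # similarity s_i = Q^T K_i for each key
    weights = torch.softmax(scores, dim=0)   # normalized attention weights alpha_i
    return weights @ values                  # Attention(Q, K) = sum_i alpha_i * Value_i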
In the unmanned aerial vehicle flight environment simulator, the unmanned aerial vehicle cluster interacts with the environment to acquire training data. Each unmanned aerial vehicle acquires its local observation, takes an action according to its own action policy and obtains a reward value. The resulting tuples, composed of the global observation, the actions and the reward, are stored in an experience replay pool D.
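A simple sketch of such an experience replay pool is shown below; the fixed capacity and uniform random sampling are assumptions.

import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, global_obs, actions, reward, next_global_obs):
        # one tuple per environment step: (o, a, r, o')
        self.buffer.append((global_obs, actions, reward, next_global_obs))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)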
In the training stage, the evaluation network (the Critic network) is trained in a centralized manner. Its joint Q-value function is defined as Q(o, a_1, …, a_N), where the actions are produced by the action policy functions with parameters θ. The optimization goal is to minimize the loss L = E[(Q(o, a_1, …, a_N) - y)²] over tuples (o, a, r, o′) sampled from the training data D, where the target value is y = r + γ·Q′(o′, a_1′, …, a_N′) and a_i′ is the target action of the next moment. Partial samples (mini-batches) are drawn from the training data and the loss is optimized until the model (the Critic network) converges. Here r represents the reward, specifically the global reward shared by the cluster, and γ is the discount factor, describing the influence of future reward on the current action.
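The following sketch shows one centralized Critic update consistent with the loss reconstructed above (MADDPG-style temporal-difference learning); the network sizes, the single shared critic and the helper names are assumptions.

import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    # Q(o, a_1..a_N): evaluates the global observation together with the joint action.
    def __init__(self, global_obs_dim: int, joint_act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_obs_dim + joint_act_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, global_obs, joint_action):
        return self.net(torch.cat([global_obs, joint_action], dim=-1)).squeeze(-1)

def critic_update(critic, target_critic, optimizer, batch, target_actions, gamma=0.99):
    obs, actions, rewards, next_obs = batch      # tensors sampled from the replay pool D
    with torch.no_grad():
        # y = r + gamma * Q'(o', a'), a' being the target actions of the next moment
        y = rewards + gamma * target_critic(next_obs, target_actions)
    loss = nn.functional.mse_loss(critic(obs, actions), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()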
In the training stage, the policy is trained by a gradient descent method so as to maximize the cumulative reward R = Σ_{t=0..T} γ^t·r_t, where T is the time horizon, γ^t describes the influence of the reward at step t on the current action (the smaller γ is, the less future rewards matter), and r_t is the reward at time t.
The optimization target is:
max J(θ_ω) = E[Σ_{t=0..T} γ^t·r_t],
where π_x, π_y and π_z represent the policies under the different roles, x, y and z index the unmanned aerial vehicles (and can be extended to more aircraft), and θ_ω represents the policy parameters of drone ω.
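A corresponding sketch of one policy (Actor) update is given below: the drone's policy parameters θ_ω are adjusted in the direction that increases the centralized Q value, i.e. gradient ascent on Q implemented as descent on -Q. The deterministic policy output and the helper names are assumptions.

import torch

def actor_update(actor, critic, actor_optimizer, global_obs, local_obs, other_actions):
    own_action = actor(local_obs)                        # a_omega = pi_omega(o_omega)
    joint_action = torch.cat([own_action, other_actions], dim=-1)
    loss = -critic(global_obs, joint_action).mean()      # maximize Q  <=>  minimize -Q
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()
    return loss.item()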
In the execution stage, each unmanned aerial vehicle sends its own local observation to the policy execution network to obtain the corresponding action.
Beneficial effects: compared with the prior art, in the unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning provided by the invention, each unmanned aerial vehicle obtains rewards from the environment through a deep reinforcement learning algorithm and has autonomous decision-making capability, achieving better cluster flight control than the traditional rule-based control mode;
according to the invention, the cluster autonomous control of the unmanned aerial vehicle is realized through a multi-agent deep reinforcement learning algorithm, so that the problem caused by a centralized information interaction mode can be effectively solved, and the unmanned aerial vehicle has autonomous capability;
the invention strips the decision-making power of the control center in centralized information interaction, hands each unmanned aerial vehicle to make a decision autonomously, adopts a frame based on centralized training and decentralized execution, and effectively solves the problems of unmanned aerial vehicle cluster communication calculated amount, autonomy and the like by centralized training and distributed execution.
The invention encodes the local observations, effectively reducing the dimensionality of the information and therefore the bandwidth consumption; it also applies attention-mechanism processing to the global observation, reflecting the importance of different pieces of information and reducing the exploration space, thereby improving learning efficiency.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a schematic diagram of interaction of a cluster of drones with an environment.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
First, the main flow of the method of the present invention is briefly described with reference to FIG. 1. The experiment begins by constructing the unmanned aerial vehicle flight environment simulator, initializing the intelligent unmanned aerial vehicles, and then selecting and marking the captain. Each unmanned aerial vehicle starts acquiring environmental information and then shares it with the unmanned aerial vehicle cluster, after which it is judged whether the training information is sufficient. If the information is not sufficient, each unmanned aerial vehicle continues to acquire environmental information; if it is sufficient, it is judged whether the information center contains the captain's information. If there is no captain information, the captain is reselected; if the captain's information is present, the model is trained. It is then judged whether the model has converged: if not, the captain is reselected; if so, training is finished.
The process by which the drone cluster interacts with the environment is described with reference to FIG. 2. Each unmanned aerial vehicle interacts with the environment: it obtains environmental information and applies actions in the environment, and then applies an embedding encoding to the environmental information it has obtained. Each team member sends the encoded information to the captain; the captain feeds the training information into the attention mechanism, which assigns weights to the information, and then puts it into the model for training. The model deploys the trained parameters to the agents (i.e., the drones), the agents interact with the environment, and the next round of training is carried out, and so on in a loop.
In a specific embodiment, the method of the present invention is further detailed by means of pseudo code.
Algorithm 1 unmanned aerial vehicle cluster cooperation algorithm based on multi-agent reinforcement learning
Inputting: agent structure, attention network, environmental information, other model training parameters
And (3) outputting: structure and weight of each agent policy network
1: simulator constructed according to flight physical environment of unmanned aerial vehicle
2: initializing Agents
3: while policy model does not converge yet
4: selecting the queue length from each agent according to a certain strategy
5: while training information accumulation failing to reach threshold do
6: the intelligent agent interacts with the environment simulator to accumulate interaction information
7: each team member embedding self-accumulated interactive information
8: each team member sends the embedding data to the captain
9: the captain collects and processes the data, and gives different weights to the information through an attention mechanism
10: end while
11: checking whether the information related to the captain is legal
12: if captain information is illegal the
13: abandoning the training data accumulated by the iteration
14: continue
15: end if
16: training strategy model by using newly added data
17: according to the optimization objective
Figure BDA0003035015690000051
Training a Critic network in a centralized manner
18: end while
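For readability, a rough Python rendering of Algorithm 1 follows. Every name used here (build_simulator, select_captain, drone.embed, captain.attend, captain_info_valid, train_step, policy_converged) is a hypothetical placeholder standing in for the corresponding step of the pseudocode, not an API defined by the patent.

def train_cluster(drones, threshold):
    env = build_simulator()                         # line 1: flight-physics simulator
    while not policy_converged():                   # line 3
        captain = select_captain(drones)            # line 4: pick the captain
        batch = []
        while len(batch) < threshold:               # line 5: accumulate training information
            transitions = env.step(drones)          # line 6: interact with the simulator
            messages = [d.embed(t) for d, t in zip(drones, transitions)]  # lines 7-8: embed and send
            batch.append(captain.attend(messages))  # line 9: captain weights the information by attention
        if not captain_info_valid(batch):           # lines 11-15: drop data lacking valid captain info
            continue
        train_step(batch)                           # lines 16-17: centralized Critic/policy training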
First, the environment simulator is constructed: an aerodynamics-based unmanned aerial vehicle flight environment simulator is built on a simulation environment engine. This embodiment contains three cooperative unmanned aerial vehicles and can be conveniently extended to larger unmanned aerial vehicle clusters. A reward function is then designed. The roles among the three cooperative unmanned aerial vehicles comprise one captain and two team members; in order to guarantee the safety of unmanned aerial vehicle cluster communication, the captain needs to survive to the end of the whole cooperation, so a specific reward function is designed consisting of a course reward function, an outcome reward function and a captain reward function (each given as a formula in the original figures).
Next, the three cooperative unmanned aerial vehicles are initialized and numbered <x, y, z>, and one unmanned aerial vehicle is selected as the captain and marked as x_c. The following steps are then carried out:
step 1:
in the training phase, the unmanned aerial vehicle x, y and z interact with the environment and transmit through the unmanned aerial vehicle x, y and zThe sensor (sensor combination) acquires a local observed value, then takes the observed value as the input of the embedding layer, and outputs the encoded observed value
Figure BDA0003035015690000062
Step 2:
The captain collects the teammates' encoded observations and integrates its own local observation with the local observations from the teammates into a global observation; it applies attention-mechanism processing to this global observation and sends the resulting computed global observation to the teammates.
and step 3:
designing a neural network structure, selecting a neural network hyper-parameter, and building a neural network.
For example, a policy network may include 5-layer fully-connected neural networks, each layer using a relu function as an activation function.
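A sketch of such a 5-layer fully connected policy network is shown below (PyTorch assumed; the hidden width of 64 and the input/output dimensions are illustrative, and the final layer is left linear here as an assumption so that it can output arbitrary action values).

import torch.nn as nn

def build_policy_network(obs_dim: int = 16, action_dim: int = 4, hidden: int = 64) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, action_dim),   # fifth fully connected layer produces the action
    )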
Step 4:
Train the flight control policy in the simulator according to the training procedure of the centralized-training, decentralized-execution framework until the model converges.
Step 5:
According to the execution flow of the centralized-training, decentralized-execution framework, each unmanned aerial vehicle sends its obtained local observation to the policy execution network (the model obtained after convergence) to obtain the corresponding action.
In the unmanned aerial vehicle flight environment simulator, the unmanned aerial vehicle cluster interacts with the environment to acquire training data. Each unmanned aerial vehicle acquires its local observation, takes an action a_ω according to its own action policy π_ω and obtains a reward value r_ω. The global observation, actions and reward obtained above are combined into a tuple (o, a, r, o′) and stored in the experience replay pool D, where o′ denotes the environment information of the next moment.
In step 4, the Critic network is trained in a centralized manner, and the joint Q-value function is defined as Q(o, a_x, a_y, a_z), where the actions are produced by the action policy functions with parameters θ_ω. The optimization goal is to minimize L = E[(Q(o, a_x, a_y, a_z) - y)²] over tuples sampled from D, with the target value y = r + γ·Q′(o′, a_x′, a_y′, a_z′), where a_i′ is the target action of the next moment. Partial samples are drawn from the training data and the loss is optimized until the model converges.
The policy is trained by a gradient descent method to maximize the cumulative reward R = Σ_{t=0..T} γ^t·r_t.
The optimization target is:
max J(θ_ω) = E[Σ_{t=0..T} γ^t·r_t],
where π_x, π_y and π_z represent the policies under the different roles.

Claims (10)

1. An unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning, characterized by comprising the following steps: (1) constructing an unmanned aerial vehicle flight environment simulator; (2) in the unmanned aerial vehicle cluster, randomly selecting one unmanned aerial vehicle as the captain and marking it, the remaining unmanned aerial vehicles serving as teammates; (3) the captain acting as an observation relay station, collecting the teammates' local observations, maintaining them as a global observation and sending the global observation to the teammates for information interaction; (4) based on the centralized-training, decentralized-execution framework, using the global observation as training data in the training phase until the policy network converges, the execution phase being carried out in a distributed manner, i.e. each unmanned aerial vehicle sending its own local observation to the policy execution network to obtain the corresponding action; (5) to keep the captain free from attack, giving the captain an additional reward for survival through the reward function.
2. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster high-efficiency communication method according to claim 1, wherein in the step (1), an aerodynamic-based unmanned aerial vehicle flight environment simulator is constructed based on a simulation environment engine.
3. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster efficient communication method according to claim 1, wherein in step (3), each unmanned aerial vehicle acquires and maintains its own local observation, encodes it and transmits the encoded local observation to the captain; and the captain applies attention-mechanism processing to the global observation separately according to each unmanned aerial vehicle's own local observation, determines the weight of each piece of information according to its importance, and then sends the computed observation to each teammate as that teammate's global observation.
4. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster efficient communication method according to claim 1, wherein each unmanned aerial vehicle applies an embedding encoding to its own local observation o_ω, all unmanned aerial vehicles sharing the same encoding mechanism; after encoding, each teammate obtains its encoded observation containing position, speed, attitude and state information and sends it to the captain, and the captain collects the teammates' local observations and maintains them as a global observation.
5. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster efficient communication method according to claim 1, wherein in the initial stage each unmanned aerial vehicle can generate a corresponding action a_ω from its own local observation o_ω using its policy π_ω.
6. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster efficient communication method according to claim 1, wherein the captain needs to survive to the end of the whole unmanned aerial vehicle cooperation process, for which a reward function is designed comprising a course reward function, an outcome reward function and a captain reward function (each given as a formula in the original figures).
7. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster efficient communication method according to claim 3, wherein the attention mechanism function comprises three basic elements, query, key and value; first a similarity function s_i = Q^T·K_i computes the similarity between the given query and each key, then the softmax function α_i = exp(s_i)/∑_j exp(s_j) yields the normalized attention weights, and finally the values are combined by the weighted sum Attention(Q, K_i) = ∑_i α_i·Value_i.
8. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster efficient communication method according to claim 1, wherein in the unmanned aerial vehicle flight environment simulator the unmanned aerial vehicle cluster interacts with the environment to obtain training data; each unmanned aerial vehicle acquires its local observation, takes an action according to its own action policy and obtains a reward value; and the resulting tuples composed of the global observation, the actions and the reward are stored in an experience replay pool D.
9. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster efficient communication method according to claim 1, wherein the Critic network is trained in a centralized manner, its joint Q-value function being defined as Q(o, a_1, …, a_N), where the actions are produced by the action policy functions; the optimization goal is to minimize L = E[(Q(o, a_1, …, a_N) - y)²] with the target value y = r + γ·Q′(o′, a_1′, …, a_N′), where a_i′ is the target action of the next moment; and partial samples are drawn from the training data to optimize the function until the model converges.
10. The multi-agent reinforcement learning-based unmanned aerial vehicle cluster efficient communication method according to claim 1, wherein a gradient descent method is adopted to train the policy so as to maximize the accumulated reward R = Σ_{t=0..T} γ^t·r_t; the optimization target is max J(θ_ω) = E[Σ_{t=0..T} γ^t·r_t], where π_ω represents the policy under a given role and ω denotes the number of the drone.
CN202110441049.0A 2021-04-23 2021-04-23 Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning Pending CN113286275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110441049.0A CN113286275A (en) 2021-04-23 2021-04-23 Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110441049.0A CN113286275A (en) 2021-04-23 2021-04-23 Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN113286275A true CN113286275A (en) 2021-08-20

Family

ID=77277177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110441049.0A Pending CN113286275A (en) 2021-04-23 2021-04-23 Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113286275A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991545A (en) * 2019-12-10 2020-04-10 中国人民解放军军事科学院国防科技创新研究院 Multi-agent confrontation oriented reinforcement learning training optimization method and device
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112131660A (en) * 2020-09-10 2020-12-25 南京大学 Unmanned aerial vehicle cluster collaborative learning method based on multi-agent reinforcement learning
CN112269876A (en) * 2020-10-26 2021-01-26 南京邮电大学 Text classification method based on deep learning
CN112561395A (en) * 2020-12-25 2021-03-26 桂林电子科技大学 Unmanned aerial vehicle cooperation method, system, device, electronic equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113703482A (en) * 2021-08-30 2021-11-26 西安电子科技大学 Task planning method based on simplified attention network in large-scale unmanned aerial vehicle cluster
CN113703482B (en) * 2021-08-30 2022-08-12 西安电子科技大学 Task planning method based on simplified attention network in large-scale unmanned aerial vehicle cluster
WO2023109699A1 (en) * 2021-12-17 2023-06-22 深圳先进技术研究院 Multi-agent communication learning method
CN114980020A (en) * 2022-05-17 2022-08-30 重庆邮电大学 Unmanned aerial vehicle data collection method based on MADDPG algorithm
CN115150408A (en) * 2022-06-21 2022-10-04 中国电子科技集团公司第五十四研究所 Unmanned cluster distributed situation maintenance method based on information extraction
CN115857556A (en) * 2023-01-30 2023-03-28 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning
CN117572893A (en) * 2024-01-15 2024-02-20 白杨时代(北京)科技有限公司 Unmanned plane cluster countermeasure strategy acquisition method based on reinforcement learning and related equipment
CN117572893B (en) * 2024-01-15 2024-03-19 白杨时代(北京)科技有限公司 Unmanned plane cluster countermeasure strategy acquisition method based on reinforcement learning and related equipment
CN117707219A (en) * 2024-02-05 2024-03-15 西安羚控电子科技有限公司 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN113286275A (en) Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning
CN109635917B (en) Multi-agent cooperation decision and training method
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN112131660A (en) Unmanned aerial vehicle cluster collaborative learning method based on multi-agent reinforcement learning
CN113176776B (en) Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning
CN111160525A (en) Task unloading intelligent decision method based on unmanned aerial vehicle group in edge computing environment
CN108573303A (en) It is a kind of that recovery policy is improved based on the complex network local failure for improving intensified learning certainly
CN113784410B (en) Heterogeneous wireless network vertical switching method based on reinforcement learning TD3 algorithm
Andersen et al. Towards safe reinforcement-learning in industrial grid-warehousing
CN114757351A (en) Defense method for resisting attack by deep reinforcement learning model
CN116136945A (en) Unmanned aerial vehicle cluster countermeasure game simulation method based on anti-facts base line
CN114861747A (en) Method, device, equipment and storage medium for identifying key nodes of multilayer network
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
CN115268493A (en) Large-scale multi-unmanned-aerial-vehicle task scheduling method based on double-layer reinforcement learning
CN113276852B (en) Unmanned lane keeping method based on maximum entropy reinforcement learning framework
Gisslén et al. Sequential constant size compressors for reinforcement learning
CN116757249A (en) Unmanned aerial vehicle cluster strategy intention recognition method based on distributed reinforcement learning
CN115759199A (en) Multi-robot environment exploration method and system based on hierarchical graph neural network
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN114879738A (en) Model-enhanced unmanned aerial vehicle flight trajectory reinforcement learning optimization method
CN115938104A (en) Dynamic short-time road network traffic state prediction model and prediction method
CN114004282A (en) Method for extracting deep reinforcement learning emergency control strategy of power system
Faber The sensor management prisoners dilemma: a deep reinforcement learning approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210820