CN113900445A - Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning - Google Patents

Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning

Info

Publication number
CN113900445A
CN113900445A (application CN202111193986.5A)
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
network
target
drone
Prior art date
Legal status: Pending
Application number
CN202111193986.5A
Other languages
Chinese (zh)
Inventor
洪万福
王旺
Current Assignee
Xiamen Yuanting Information Technology Co ltd
Original Assignee
Xiamen Yuanting Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Yuanting Information Technology Co ltd
Priority to CN202111193986.5A
Publication of CN113900445A

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Abstract

The invention discloses an unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning. The method comprises the following steps: establishing a large-scale unmanned aerial vehicle cluster task model; establishing a Markov game model according to the task model; constructing a MADDPG algorithm neural network; and adjusting the hyper-parameters of the neural network. During training with the MADDPG algorithm, samples are drawn with a set probability from both exploration experience and high-quality experience; the own state information and environment information of each unmanned aerial vehicle serve as the input of the neural network and the velocities of the multiple unmanned aerial vehicles serve as the output, so that a motion planning strategy is trained and the multiple unmanned aerial vehicles can autonomously avoid obstacles and reach their target positions safely and quickly in a complex environment. The method improves the robustness of the strategy, can train an excellent strategy with stronger adaptability and higher flexibility, and has good application prospects in multi-unmanned-aerial-vehicle collaborative motion planning scenarios.

Description

Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of artificial intelligence and unmanned aerial vehicles, and particularly relates to an unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning.
Background
In recent years, autonomous clustered unmanned combat has become an exploration trend in intelligent military applications. Each unmanned platform can be regarded as an agent; various unstable factors exist in the unmanned combat process and the battlefield situation changes continuously, so the dynamic responses to the combat situation differ. Supervised learning requires a large number of training samples and the resulting models generalize weakly, whereas deep reinforcement learning only needs to react to evaluation information about the effect of the current system's actions. Deep reinforcement learning therefore offers higher real-time performance and robustness and is better suited to modeling the behavior of intelligent game confrontation.
The cooperative control of an unmanned cluster system comprises two aspects: coordination and cooperation. The purpose of coordination is to ensure that multiple unmanned platforms do not conflict while executing tasks; it studies the motion-control problem among multiple unmanned platforms. The purpose of cooperation is to organize multiple unmanned platforms to complete tasks jointly; it studies high-level organization and decision-making mechanisms. Unmanned cluster cooperative control involves the structural design of the unmanned cluster system, the distributed control of the cluster, and so on. By introducing reinforcement learning, the autonomous distributed control of an unmanned platform gains stronger adaptability and flexibility, the capability of a single unmanned platform to complete combat missions is improved, the coordination and cooperation of the unmanned cluster can be enhanced, and the overall performance of the cluster system is improved. In an unmanned cluster system, the environmental information perceived by a single unmanned platform is local, so a strategy obtained by a traditional single-agent reinforcement learning algorithm is not universal. To solve this problem, the number of agents is increased on the basis of single-agent reinforcement learning; by introducing a distributed cooperative strategy mechanism, each agent gains autonomy, purposefulness and coordination, together with learning, reasoning and self-organizing capabilities.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning, so as to achieve cooperative decision making by large-scale unmanned aerial vehicle fleets when executing various complex tasks in complex environments. While each unmanned aerial vehicle in the fleet trains and learns its own action strategy, it also learns the strategies of the other agents, which improves the robustness of the strategies and trains excellent strategies with higher adaptability and flexibility, giving the method good application prospects in multi-unmanned-aerial-vehicle cooperative control scenarios.
In order to achieve the above object, a first aspect of the present invention provides an unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning, including:
step S1: establishing a task model of a large-scale unmanned aerial vehicle cluster;
step S2: establishing a Markov game model according to the task model;
step S3: constructing a MADDPG algorithm neural network;
step S4: training a MADDPG algorithm neural network;
step S5: loading the MADDPG algorithm neural network into an unmanned aerial vehicle cluster, executing unmanned aerial vehicle cluster cooperative control, and mapping actions output by the neural network into corresponding control instructions of the unmanned aerial vehicle.
Further, the step S1 specifically includes:
(1) task description: describing a cooperative task of an unmanned aerial vehicle cluster in a scene, wherein the cooperative task is that the unmanned aerial vehicle cluster needs to reach a designated destination in a certain time, and a building group and an obstacle exist in a certain range; all unmanned aerial vehicles in the unmanned aerial vehicle cluster are isomorphic and have the same performance parameters;
(2) Environmental constraints:
initial coordinate constraint: in the scene, unmanned aerial vehicle i is randomly generated in an initial area, and the target position and the obstacle positions appear randomly within a certain distance of the target area; the distance d_i,g from unmanned aerial vehicle i to the target area g at the initial moment satisfies:
d_i,g ≥ d_init,
where d_init is the effective distance for successful completion of the task;
height and boundary constraints: the flying height h satisfies the following constraint:
h_min ≤ h ≤ h_max,
where h_min is the minimum flying height and h_max is the maximum flying height;
speed and acceleration constraints: in three-dimensional space, the speed and acceleration of the unmanned aerial vehicle need to satisfy the maximum constraints:
|v_x,y,z| ≤ v_max,x,y,z,
|a_x,y,z| ≤ a_max,x,y,z;
maximum yaw angle constraint: let the coordinates of unmanned aerial vehicle track point i be (x_i, y_i, z_i); then the horizontal projection of the track segment from point i-1 to point i is α_i = (x_i - x_{i-1}, y_i - y_{i-1})^T, and the maximum yaw angle φ constrains the turn between adjacent track segments:
(α_i^T · α_(i+1)) / (|α_i| · |α_(i+1)|) ≥ cos φ;
obstacle constraint: the distance l between the unmanned aerial vehicle and an obstacle satisfies:
l ≥ R_saft + l_min + R_UAV,
where R_saft is the prescribed safe distance, l_min is the length of the obstacle in the direction of the unmanned aerial vehicle, and R_UAV is the radius of the unmanned aerial vehicle.
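A minimal sketch of how these environmental constraints might be checked in a simulation step is given below. The function names, the NumPy state layout and all numerical defaults are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

# Illustrative constraint parameters (assumed values, not from the patent).
D_INIT = 50.0            # d_init: effective start distance to the target area
H_MIN, H_MAX = 5.0, 120.0
V_MAX, A_MAX = 15.0, 5.0
R_SAFT, R_UAV = 2.0, 0.5

def satisfies_constraints(pos, vel, acc, obstacles):
    """Check the environmental constraints of step S1 for one drone.

    pos, vel, acc: np.ndarray of shape (3,)
    obstacles: iterable of (center, l_min) pairs, center of shape (3,)
    """
    ok_height = H_MIN <= pos[2] <= H_MAX
    ok_speed = np.all(np.abs(vel) <= V_MAX)
    ok_accel = np.all(np.abs(acc) <= A_MAX)
    # Obstacle constraint: l >= R_saft + l_min + R_UAV for every obstacle.
    ok_obstacles = all(
        np.linalg.norm(pos - center) >= R_SAFT + l_min + R_UAV
        for center, l_min in obstacles
    )
    return ok_height and ok_speed and ok_accel and ok_obstacles

def valid_initial_position(pos, target):
    """Initial coordinate constraint: d_i,g >= d_init at the first time step."""
    return np.linalg.norm(pos - target) >= D_INIT
```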
Further, the step S2 specifically includes:
(1) A quintuple <N, S, A, P, R> is used to represent the Markov game model, wherein: N = {1, 2, …, n} denotes the set of n unmanned aerial vehicles; S is the joint state, S = s_1 × s_2 × … × s_n, the Cartesian product of the states of all unmanned aerial vehicles, where s_i represents the state of unmanned aerial vehicle i; A is the joint action, A = a_1 × a_2 × … × a_n, the Cartesian product of the actions of all unmanned aerial vehicles, where a_i represents the action of unmanned aerial vehicle i; P: S × A × S → [0, 1] is the state transition model, giving the probability that all unmanned aerial vehicles, taking the joint action in the current state, reach the next state; R is the joint reward, i.e. the Cartesian product of all unmanned aerial vehicle reward functions, R = R_1 × R_2 × … × R_n, where R_i denotes the reward value obtained by unmanned aerial vehicle i through interaction with the environment;
(2) the state space of each unmanned aerial vehicle is set in a polar coordinate system: with the center of unmanned aerial vehicle i as the origin and the direction from unmanned aerial vehicle i to its target as the positive direction, a polar coordinate system is established, and the state of unmanned aerial vehicle i is expressed as s_i = (s, s_U, s_E), where s = (P_ix, P_iy, P_igx, P_igy) is the position information of unmanned aerial vehicle i and its target, P_ix, P_iy being the position of unmanned aerial vehicle i and P_igx, P_igy the position of its target; s_U = (P_jx, P_jy) is the position of the obstacle closest to unmanned aerial vehicle i within its communication range, and if no other obstacle exists within the communication range, s_E = (0, 0);
(3) the action space of the unmanned aerial vehicle is set; for unmanned aerial vehicle i the action space is a_i = (ω_it), where ω_it is the angular velocity of unmanned aerial vehicle i at time t;
(4) the reward function of the unmanned aerial vehicle is set; the reward function of unmanned aerial vehicle i is specifically set as follows:
R_1 = 10 + R_it,
R_2 = -20,
R_3 = -2|α| + l - τ,
R_4 = -2|α| + 1,
R_i = ω_1 R_1 + ω_2 R_2 + ω_3 R_3 + ω_4 R_4,
where R_1 represents the reward value when the unmanned aerial vehicle reaches the target, and R_it is a penalty on the time spent reaching the target, determined by the penalty coefficient W_t, the actual time T_i spent reaching the target position, and the shortest time |P_ig - P_io| / u_i needed to reach the target position along a straight line at linear velocity u_i, P_io and P_ig being the initial position and target position of the unmanned aerial vehicle; R_2 is the collision penalty value of the unmanned aerial vehicle; R_3 is the collision early-warning term: the obstacle or other unmanned aerial vehicle closest to the unmanned aerial vehicle within the communication range of communication distance l is selected as the dangerous object, and a corresponding penalty is given when the Euclidean distance between the unmanned aerial vehicle and the dangerous object ahead at the current moment is smaller than that at the previous moment; R_4 is the dense reward function of the unmanned aerial vehicle, whose degree of penalty increases with the target angular velocity α of the unmanned aerial vehicle; the reward R_i of unmanned aerial vehicle i is the weighted sum of R_1, R_2, R_3 and R_4 with the respective weights ω_1, ω_2, ω_3, ω_4.
Further, the step S3 specifically includes:
(1) Construct the policy network in the MADDPG algorithm: the policy network μ_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector s_i of unmanned aerial vehicle i and its output is the action vector a_i = μ_i(s_i) of unmanned aerial vehicle i;
(2) construct the evaluation network of the MADDPG algorithm: the evaluation network Q_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector x = (s_1, …, s_n) of all unmanned aerial vehicles together with the actions a_1, …, a_n obtained by all unmanned aerial vehicles from their respective policy networks, and its output is the action value function of unmanned aerial vehicle i, i.e. the centralized action value function Q_i(x, a_1, …, a_n);
(3) construct the target neural networks: for unmanned aerial vehicle i, the policy network μ_i and the evaluation network Q_i are each copied into a corresponding target network μ'_i and Q'_i, where θ_i^μ and θ_i^Q respectively denote the parameters of the current policy network and evaluation network, and θ_i^μ' and θ_i^Q' respectively denote the parameters of the target policy network and target evaluation network.
Further, the step S4 specifically includes:
(1) Initialize the parameters θ_i^μ, θ_i^Q, θ_i^μ', θ_i^Q' of all networks and empty each experience replay buffer;
(2) set the total number of training episodes and start iterating;
(3) for each unmanned aerial vehicle, obtain the action a_i = μ_i(s_i) from the current policy network based on the state s_i;
(4) each unmanned aerial vehicle executes its action a_i to obtain its new state s'_i and reward R_i, and the transition (s, s', a_1, …, a_n, r_1, …, r_n) is added to the experience replay buffer;
(5) for each unmanned aerial vehicle, sample M transitions from the experience replay buffer and start updating the networks;
(6) compute the best action to take at the next moment through the target policy network: a'_i = μ'_i(s'_i);
(7) compute the approximate true value through the target evaluation network, with the states and actions as input and output y = r_i + γ·Q'_i(x', a'_1, …, a'_n), where γ is the discount factor;
(8) update the current evaluation network with the loss function L(θ_i^Q) = (1/M) Σ_k (y_k − Q_i(x_k, a_1^k, …, a_n^k))²;
(9) update the current policy network through the sampled policy gradient ∇_θ_i^μ J ≈ (1/M) Σ_k ∇_θ_i^μ μ_i(s_i^k) · ∇_a_i Q_i(x_k, a_1^k, …, a_i, …, a_n^k) evaluated at a_i = μ_i(s_i^k);
(10) if the number of iterations reaches the update frequency of the network parameters, update the parameters of the target evaluation network and the target policy network:
θ_i^Q' ← τ·θ_i^Q + (1 − τ)·θ_i^Q',
θ_i^μ' ← τ·θ_i^μ + (1 − τ)·θ_i^μ',
where the update is a soft update and τ is the soft-update proportionality coefficient.
Further, the hyper-parameters of the neural network in step S4 include:
the policy network and the evaluation network are fully-connected neural networks and are trained with the Adam optimizer;
setting basic parameters of a policy network and an evaluation network, wherein the basic parameters comprise: the number of hidden layers, the activation function, the learning rate, the number of samples for batch updating and each reward weight in the reward function.
Further, the policy network is configured as: two hidden layers whose activation functions are relu functions, the first layer with 64 nodes and the second with 32 nodes; the output layer has 1 node, namely the action taken by the unmanned aerial vehicle, and uses a tanh activation function; the learning rate of the policy network is 0.001. The evaluation network is configured as: two hidden layers with relu activation functions and 64 nodes; the output layer has 1 node, namely the Q value function, and uses the linear activation function y = x + b, where b is a bias parameter; the learning rate of the evaluation network is 0.0001. The number of samples drawn for each batch update by random sampling from experience is set to N = 128.
The second aspect of the present invention provides an unmanned aerial vehicle cooperative control training system based on multi-agent reinforcement learning, used for implementing the unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning according to any technical scheme of the first aspect of the present invention, the system comprising:
the task model data acquisition module is used for preprocessing data of the multi-unmanned-aerial-vehicle environment in a task, coding the observation space and the global state space of each unmanned aerial vehicle in the environment and converting the coded observation space and the global state space into vector features which can be identified by a neural network;
the neural network construction module is used for constructing the MADDPG neural network according to the task model and receiving the vector characteristics transmitted by the task model data acquisition module;
the parameter adjusting module is used for setting the hyper-parameters of the neural network, wherein the hyper-parameters comprise the number of hidden layers, an activation function, a learning rate, the number of samples updated in batches and reward weight in a reward function;
and the main control unit is used for loading the MADDPG algorithm neural network into the unmanned aerial vehicle cluster, executing the unmanned aerial vehicle cluster cooperative control, and mapping the action output by the neural network into a corresponding control instruction of the unmanned aerial vehicle.
The unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning improves the robustness of strategies, can train excellent strategies with stronger adaptability and higher flexibility, and has good application prospect in the scene of multi-unmanned aerial vehicle cooperative control.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the system of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
As shown in fig. 1, the present invention provides an unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning, which includes the following steps:
the method comprises the following steps: and establishing a large-scale unmanned aerial vehicle cluster task model.
The method specifically comprises the following steps: the unmanned aerial vehicle is set as a circular agent, and the radius of unmanned aerial vehicle i is r_i; the obstacles are set as circles of radius r_o, and the collision distance between an unmanned aerial vehicle and an obstacle is D_io = r_i + r_o; the target position of unmanned aerial vehicle i is a circular area of radius r_ig, and when unmanned aerial vehicle i touches the target range, i.e. the distance D_ig between the center of unmanned aerial vehicle i and the center of the target range satisfies D_ig ≤ r_i + r_ig, unmanned aerial vehicle i is judged to have successfully reached the target position.
The position of unmanned aerial vehicle i is P_i = [x_i, y_i]^T, and the communication distance of the unmanned aerial vehicle is denoted L_c; the communication range of the unmanned aerial vehicle is a circle centered on the unmanned aerial vehicle with radius L_c, and within its communication range the unmanned aerial vehicle can sense information about other unmanned aerial vehicles or obstacles.
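The following minimal sketch illustrates one way this circular-agent task model could be expressed in simulation code; the class and function names and the 2-D NumPy representation are assumptions made for illustration only.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Drone:
    position: np.ndarray   # P_i = [x_i, y_i]
    radius: float          # r_i
    comm_range: float      # L_c

@dataclass
class Disc:
    center: np.ndarray
    radius: float          # r_o for obstacles, r_ig for the target area

def collided(drone: Drone, obstacle: Disc) -> bool:
    """Collision when the centre distance falls below D_io = r_i + r_o."""
    return np.linalg.norm(drone.position - obstacle.center) <= drone.radius + obstacle.radius

def reached_target(drone: Drone, target: Disc) -> bool:
    """Success when the drone touches the circular target range."""
    return np.linalg.norm(drone.position - target.center) <= drone.radius + target.radius

def sensed(drone: Drone, other_center: np.ndarray) -> bool:
    """An object is observable only inside the communication circle of radius L_c."""
    return np.linalg.norm(drone.position - other_center) <= drone.comm_range
```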
Step two: and establishing a Markov game model according to the task model.
The method specifically comprises the following steps:
(1) A quintuple <N, S, A, P, R> is used to represent the Markov game model, and each component is explained as follows: N = {1, 2, …, n} denotes the set of n unmanned aerial vehicles; S is the joint state, S = s_1 × s_2 × … × s_n, the Cartesian product of the states of all unmanned aerial vehicles, where s_i represents the state of unmanned aerial vehicle i; A is the joint action, A = a_1 × a_2 × … × a_n, the Cartesian product of the actions of all unmanned aerial vehicles, where a_i represents the action of unmanned aerial vehicle i;
P: S × A × S → [0, 1] is the state transition model, giving the probability that all unmanned aerial vehicles, taking the joint action in the current state, reach the next state; R is the joint reward, i.e. the Cartesian product of all unmanned aerial vehicle reward functions, R = R_1 × R_2 × … × R_n, where R_i denotes the reward value obtained by unmanned aerial vehicle i through interaction with the environment;
(2) the state space of each unmanned aerial vehicle is set in a polar coordinate system: with the center of unmanned aerial vehicle i as the origin and the direction from unmanned aerial vehicle i to its target as the positive direction, a polar coordinate system is established, and the state of unmanned aerial vehicle i is expressed as s_i = (s, s_U, s_E), where s = (P_ix, P_iy, P_igx, P_igy) is the position information of unmanned aerial vehicle i and its target, P_ix, P_iy being the position of unmanned aerial vehicle i and P_igx, P_igy the position of its target; s_U = (P_jx, P_jy) is the position of the obstacle closest to unmanned aerial vehicle i within its communication range, and if no other obstacle exists within the communication range, s_E = (0, 0);
(3) the action space of the unmanned aerial vehicle is set; for unmanned aerial vehicle i the action space is a_i = (ω_it), where ω_it is the angular velocity of unmanned aerial vehicle i at time t; because of the flight restrictions of the unmanned aerial vehicle and the restrictions imposed by obstacles, the selectable actions differ at different moments, and the unmanned aerial vehicle can only select actions from its current action space;
(4) the reward function of the unmanned aerial vehicle is set; the reward function of unmanned aerial vehicle i is specifically set as follows (a sketch of this reward computation is given after this step):
R_1 = 10 + R_it,
R_2 = -20,
R_3 = -2|α| + l - τ,
R_4 = -2|α| + 1,
R_i = ω_1 R_1 + ω_2 R_2 + ω_3 R_3 + ω_4 R_4,
where R_1 represents the reward value when the unmanned aerial vehicle reaches the target, and R_it is a penalty on the time spent reaching the target, determined by the penalty coefficient W_t, the actual time T_i spent reaching the target position, and the shortest time |P_ig - P_io| / u_i needed to reach the target position along a straight line at linear velocity u_i, P_io and P_ig being the initial position and target position of the unmanned aerial vehicle; R_2 is the collision penalty value of the unmanned aerial vehicle; R_3 is the collision early-warning term: the obstacle or other unmanned aerial vehicle closest to the unmanned aerial vehicle within the communication range of communication distance l is selected as the dangerous object, and a corresponding penalty is given when the Euclidean distance between the unmanned aerial vehicle and the dangerous object ahead at the current moment is smaller than that at the previous moment; R_4 is the dense reward function of the unmanned aerial vehicle, whose degree of penalty increases with the target angular velocity α of the unmanned aerial vehicle; the reward R_i of unmanned aerial vehicle i is the weighted sum of R_1, R_2, R_3 and R_4 with the respective weights ω_1, ω_2, ω_3, ω_4.
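Purely as an illustration of the weighted-sum reward structure above, here is a sketch in Python; the helper signature, the exact form of the time term R_it and of the danger-warning trigger, and the default weights are assumptions and would have to be matched to the formulas in the original filing.

```python
def drone_reward(reached_target, collided, dist_to_danger, prev_dist_to_danger,
                 alpha, l, tau, time_spent, shortest_time, W_t=0.1,
                 weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted-sum reward R_i = w1*R1 + w2*R2 + w3*R3 + w4*R4 for one drone.

    alpha          : target angular quantity used in R3/R4 (|alpha| is penalised)
    l, tau         : communication distance and the constant used in R3
    time_spent     : T_i, time actually taken to reach the target
    shortest_time  : straight-line time |P_ig - P_io| / u_i
    """
    w1, w2, w3, w4 = weights

    # R1: success reward plus a time-based term R_it (assumed linear penalty).
    R1 = 0.0
    if reached_target:
        R1 = 10.0 - W_t * (time_spent - shortest_time)

    # R2: fixed collision penalty.
    R2 = -20.0 if collided else 0.0

    # R3: collision early warning, applied only when the drone is closing in
    # on the nearest dangerous object within its communication range.
    R3 = 0.0
    if dist_to_danger is not None and dist_to_danger < prev_dist_to_danger:
        R3 = -2.0 * abs(alpha) + l - tau

    # R4: dense shaping term, growing penalty with |alpha|.
    R4 = -2.0 * abs(alpha) + 1.0

    return w1 * R1 + w2 * R2 + w3 * R3 + w4 * R4
```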
Step three: constructing a MADDPG (Multi agent reinforcement learning) algorithm neural network.
The method specifically comprises the following steps:
(1) Construct the policy network (Actor) in the MADDPG algorithm: the policy network μ_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector s_i of unmanned aerial vehicle i and its output is the action vector a_i = μ_i(s_i) of unmanned aerial vehicle i;
(2) construct the evaluation network (Critic) of the MADDPG algorithm: the evaluation network Q_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector x = (s_1, …, s_n) of all unmanned aerial vehicles together with the actions a_1, …, a_n obtained by all unmanned aerial vehicles from their respective policy networks, and its output is the action value function of unmanned aerial vehicle i, i.e. the centralized action value function Q_i(x, a_1, …, a_n);
(3) construct the target neural networks: for unmanned aerial vehicle i, the policy network μ_i and the evaluation network Q_i are each copied into a corresponding target network μ'_i and Q'_i, where θ_i^μ and θ_i^Q respectively denote the parameters of the current policy network and evaluation network, and θ_i^μ' and θ_i^Q' respectively denote the parameters of the target policy network and target evaluation network.
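A minimal PyTorch sketch of this decentralized-actor / centralized-critic layout is given below, assuming PyTorch as the framework; the class names, hidden-layer sizes and the flat concatenation of all states and actions into the critic input are illustrative choices, not the patent's prescribed implementation.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network mu_i: maps the drone's own state s_i to its action a_i."""
    def __init__(self, state_dim, action_dim, hidden=(64, 32)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Centralized evaluation network Q_i(x, a_1, ..., a_n)."""
    def __init__(self, n_drones, state_dim, action_dim, hidden=64):
        super().__init__()
        in_dim = n_drones * (state_dim + action_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_states, all_actions):
        # all_states: (batch, n*state_dim), all_actions: (batch, n*action_dim)
        return self.net(torch.cat([all_states, all_actions], dim=-1))

def make_networks(n_drones, state_dim, action_dim):
    """One actor/critic pair per drone, each copied into a target network."""
    actors = [Actor(state_dim, action_dim) for _ in range(n_drones)]
    critics = [Critic(n_drones, state_dim, action_dim) for _ in range(n_drones)]
    target_actors = [copy.deepcopy(a) for a in actors]
    target_critics = [copy.deepcopy(c) for c in critics]
    return actors, critics, target_actors, target_critics
```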
Step four: training the maddppg algorithm neural network.
The method specifically comprises the following steps:
(1) Initialize the parameters θ_i^μ, θ_i^Q, θ_i^μ', θ_i^Q' of all networks and empty each experience replay buffer;
(2) set the total number of training episodes and start iterating;
(3) for each unmanned aerial vehicle, obtain the action a_i = μ_i(s_i) from the current policy network based on the state s_i;
(4) each unmanned aerial vehicle executes its action a_i to obtain its new state s'_i and reward R_i, and the transition (s, s', a_1, …, a_n, r_1, …, r_n) is added to the experience replay buffer;
(5) for each unmanned aerial vehicle, sample M transitions from the experience replay buffer and start updating the networks;
(6) compute the best action to take at the next moment through the target policy network: a'_i = μ'_i(s'_i);
(7) compute the approximate true value through the target evaluation network, with the states and actions as input and output y = r_i + γ·Q'_i(x', a'_1, …, a'_n), where γ is the discount factor;
(8) update the current evaluation network with the loss function L(θ_i^Q) = (1/M) Σ_k (y_k − Q_i(x_k, a_1^k, …, a_n^k))²;
(9) update the current policy network through the sampled policy gradient ∇_θ_i^μ J ≈ (1/M) Σ_k ∇_θ_i^μ μ_i(s_i^k) · ∇_a_i Q_i(x_k, a_1^k, …, a_i, …, a_n^k) evaluated at a_i = μ_i(s_i^k);
(10) if the number of iterations reaches the update frequency of the network parameters, update the parameters of the target evaluation network and the target policy network:
θ_i^Q' ← τ·θ_i^Q + (1 − τ)·θ_i^Q',
θ_i^μ' ← τ·θ_i^μ + (1 − τ)·θ_i^μ',
where the update is a soft update and τ is the soft-update proportionality coefficient.
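As an illustration of steps (5)-(10), here is a hedged PyTorch sketch of one MADDPG update for drone i; it assumes the `Actor`/`Critic` modules sketched earlier, a replay buffer that yields already-batched tensors, and the standard MADDPG update rules, so the exact bookkeeping may differ from the original implementation.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.95, tau=0.01):
    """One update of drone i's evaluation (critic) and policy (actor) networks.

    batch: dict with per-drone lists of tensors (batch dimension first):
        states[j], next_states[j], actions[j], rewards[j]
    """
    states, actions = batch["states"], batch["actions"]
    next_states, rewards = batch["next_states"], batch["rewards"]
    n = len(actors)

    # (6) best next actions from every drone's target policy network.
    with torch.no_grad():
        next_actions = [target_actors[j](next_states[j]) for j in range(n)]
        # (7) approximate true value y = r_i + gamma * Q'_i(x', a'_1..a'_n).
        y = rewards[i] + gamma * target_critics[i](
            torch.cat(next_states, dim=-1), torch.cat(next_actions, dim=-1))

    # (8) critic update: mean squared error between Q_i and y.
    q = critics[i](torch.cat(states, dim=-1), torch.cat(actions, dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # (9) actor update: ascend Q_i with a_i replaced by mu_i(s_i).
    actions_pred = [a.detach() for a in actions]
    actions_pred[i] = actors[i](states[i])
    actor_loss = -critics[i](torch.cat(states, dim=-1),
                             torch.cat(actions_pred, dim=-1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()

    # (10) soft update of the target networks with coefficient tau.
    for target, source in ((target_critics[i], critics[i]),
                           (target_actors[i], actors[i])):
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)
```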
During the training process, the neural network hyper-parameters need to be set.
The method specifically comprises the following steps:
the policy network and the evaluation network are fully-connected neural networks and are trained with the Adam optimizer;
setting parameters such as the number of hidden layers, an activation function, a learning rate, the number of samples updated in batches and the like of the strategy network and the evaluation network, and adjusting the reward weight in the reward function.
An example set of parameter settings is given below:
the strategy network comprises two hidden layers, the activation functions of the hidden layers are relu functions, the first layer is 64 nodes, the second layer is 32 nodes, the output layer is 1 node, namely the action taken by the unmanned aerial vehicle, the activation function adopted by the output layer is a tanh function, and the learning rate of the strategy network is 0.001.
The evaluation network also comprises two hidden layers with relu activation functions and 64 nodes; the output layer has 1 node, namely the Q value function, and uses the linear activation function y = x + b, where b is a bias parameter; the learning rate of the evaluation network is 0.0001. The number of samples drawn for each batch update by random sampling from experience is set to N = 128.
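The parameter settings above could be realised, for example, as follows; this is a hedged configuration sketch in PyTorch that reuses the layer sizes, activations, learning rates and batch size quoted in the text, while the module layout and the dimension constants are assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_DRONES = 6, 1, 4   # illustrative dimensions
BATCH_SIZE = 128                            # N = 128 samples per batch update

# Policy network (Actor): 64- and 32-node relu hidden layers, tanh output, lr = 0.001.
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, ACTION_DIM), nn.Tanh(),
)
policy_optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

# Evaluation network (Critic): two 64-node relu hidden layers, linear output
# (the final Linear already computes y = Wx + b), lr = 0.0001; its input is the
# concatenation of all drones' states and actions.
critic_in = N_DRONES * (STATE_DIM + ACTION_DIM)
eval_net = nn.Sequential(
    nn.Linear(critic_in, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
eval_optimizer = torch.optim.Adam(eval_net.parameters(), lr=1e-4)
```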
Step six: loading the MADDPG algorithm neural network into an unmanned aerial vehicle cluster, executing unmanned aerial vehicle cluster cooperative control, and mapping actions output by the neural network into corresponding control instructions of the unmanned aerial vehicle. The method specifically comprises the following steps: and loading the stored strategy network and evaluation network parameter data into the unmanned aerial vehicle cluster, so that the multiple unmanned aerial vehicles execute flight actions according to the trained network, and completing a large-scale unmanned aerial vehicle motion planning task.
As shown in fig. 2, the present invention further provides a system for implementing the method according to the foregoing embodiment, including:
the task model data acquisition module 10 is used for preprocessing data of the multi-unmanned-aerial-vehicle environment in a task, coding the observation space and the global state space of each unmanned aerial vehicle in the environment, and converting the coded observation space and the global state space into vector features which can be identified by a neural network;
the neural network building module 20 is used for building the MADDPG neural network according to the task model, setting hidden layer dimensionality and receiving coding information from the environment;
the parameter adjusting module 30 is configured to set the hyper-parameters of the neural network, including: setting different numbers of hidden layers for the network structure, replacing different activation functions, controlling the learning rate of the network, and setting the number of samples per batch update; the reward weights in the reward function can also be adjusted to improve the cooperative control effect;
and the main control unit 40 is used for loading the MADDPG algorithm neural network into the unmanned aerial vehicle cluster, executing the unmanned aerial vehicle cluster cooperative control, and mapping the action output by the neural network into a corresponding control instruction of the unmanned aerial vehicle.
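To make the division of responsibilities concrete, here is a hedged sketch of how these four modules might be wired together; the class and method names are invented for illustration and do not appear in the patent.

```python
class TrainingSystem:
    """Illustrative wiring of the four modules described above."""

    def __init__(self, data_module, network_module, param_module, controller):
        self.data_module = data_module        # encodes observations into vector features
        self.network_module = network_module  # builds the MADDPG networks
        self.param_module = param_module      # holds hyper-parameters and reward weights
        self.controller = controller          # maps actions to drone control commands

    def build(self, task_model):
        hyper = self.param_module.hyperparameters()
        return self.network_module.build_maddpg(task_model, hyper)

    def step(self, raw_observations, networks):
        features = self.data_module.encode(raw_observations)
        actions = [actor(obs) for actor, obs in zip(networks.actors, features)]
        return [self.controller.to_command(a) for a in actions]
```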
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept and the scope of the appended claims is intended to be protected.

Claims (9)

1. An unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning is characterized by comprising the following steps:
step S1: establishing a task model of a large-scale unmanned aerial vehicle cluster;
step S2: establishing a Markov game model according to the task model;
step S3: constructing a MADDPG algorithm neural network;
step S4: training a MADDPG algorithm neural network;
step S5: loading the MADDPG algorithm neural network into an unmanned aerial vehicle cluster, executing unmanned aerial vehicle cluster cooperative control, and mapping actions output by the neural network into corresponding control instructions of the unmanned aerial vehicle.
2. The cooperative drone control training method based on multi-agent reinforcement learning of claim 1, wherein the step S1 specifically includes:
(1) task description: describing a cooperative task of an unmanned aerial vehicle cluster in a scene, wherein the cooperative task is that the unmanned aerial vehicle cluster needs to reach a designated destination in a certain time, and a building group and an obstacle exist in a certain range; all unmanned aerial vehicles in the unmanned aerial vehicle cluster are isomorphic and have the same performance parameters;
(2) Environmental constraints:
initial coordinate constraint: in the scene, unmanned aerial vehicle i is randomly generated in an initial area, and the target position and the obstacle positions appear randomly within a certain distance of the target area; the distance d_i,g from unmanned aerial vehicle i to the target area g at the initial moment satisfies:
d_i,g ≥ d_init,
where d_init is the effective distance for successful completion of the task;
height and boundary constraints: the flying height h satisfies the following constraint:
h_min ≤ h ≤ h_max,
where h_min is the minimum flying height and h_max is the maximum flying height;
speed and acceleration constraints: in three-dimensional space, the speed and acceleration of the unmanned aerial vehicle need to satisfy the maximum constraints:
|v_x,y,z| ≤ v_max,x,y,z,
|a_x,y,z| ≤ a_max,x,y,z;
maximum yaw angle constraint: let the coordinates of unmanned aerial vehicle track point i be (x_i, y_i, z_i); then the horizontal projection of the track segment from point i-1 to point i is α_i = (x_i - x_{i-1}, y_i - y_{i-1})^T, and the maximum yaw angle φ constrains the turn between adjacent track segments:
(α_i^T · α_(i+1)) / (|α_i| · |α_(i+1)|) ≥ cos φ;
obstacle constraint: the distance l between the unmanned aerial vehicle and an obstacle satisfies:
l ≥ R_saft + l_min + R_UAV,
where R_saft is the prescribed safe distance, l_min is the length of the obstacle in the direction of the unmanned aerial vehicle, and R_UAV is the radius of the unmanned aerial vehicle.
3. The cooperative drone control training method based on multi-agent reinforcement learning of claim 1, wherein the step S2 specifically includes:
(1) A quintuple <N, S, A, P, R> is used to represent the Markov game model, wherein: N = {1, 2, …, n} denotes the set of n unmanned aerial vehicles; S is the joint state, S = s_1 × s_2 × … × s_n, the Cartesian product of the states of all unmanned aerial vehicles, where s_i represents the state of unmanned aerial vehicle i; A is the joint action, A = a_1 × a_2 × … × a_n, the Cartesian product of the actions of all unmanned aerial vehicles, where a_i represents the action of unmanned aerial vehicle i; P: S × A × S → [0, 1] is the state transition model, giving the probability that all unmanned aerial vehicles, taking the joint action in the current state, reach the next state; R is the joint reward, i.e. the Cartesian product of all unmanned aerial vehicle reward functions, R = R_1 × R_2 × … × R_n, where R_i denotes the reward value obtained by unmanned aerial vehicle i through interaction with the environment;
(2) the state space of each unmanned aerial vehicle is set in a polar coordinate system: with the center of unmanned aerial vehicle i as the origin and the direction from unmanned aerial vehicle i to its target as the positive direction, a polar coordinate system is established, and the state of unmanned aerial vehicle i is expressed as s_i = (s, s_U, s_E), where s = (P_ix, P_iy, P_igx, P_igy) is the position information of unmanned aerial vehicle i and its target, P_ix, P_iy being the position of unmanned aerial vehicle i and P_igx, P_igy the position of its target; s_U = (P_jx, P_jy) is the position of the obstacle closest to unmanned aerial vehicle i within its communication range, and if no other obstacle exists within the communication range, s_E = (0, 0);
(3) the action space of the unmanned aerial vehicle is set; for unmanned aerial vehicle i the action space is a_i = (ω_it), where ω_it is the angular velocity of unmanned aerial vehicle i at time t;
(4) and setting the reward function of the unmanned aerial vehicle.
4. The cooperative unmanned aerial vehicle control training method based on multi-agent reinforcement learning of claim 3, wherein the reward function of the unmanned aerial vehicle i in the step (4) is specifically set as follows:
R_1 = 10 + R_it,
R_2 = -20,
R_3 = -2|α| + l - τ,
R_4 = -2|α| + 1,
R_i = ω_1 R_1 + ω_2 R_2 + ω_3 R_3 + ω_4 R_4,
where R_1 represents the reward value when the unmanned aerial vehicle reaches the target, and R_it is a penalty on the time spent reaching the target, determined by the penalty coefficient W_t, the actual time T_i spent reaching the target position, and the shortest time |P_ig - P_io| / u_i needed to reach the target position along a straight line at linear velocity u_i, P_io and P_ig being the initial position and target position of the unmanned aerial vehicle; R_2 is the collision penalty value of the unmanned aerial vehicle; R_3 is the collision early-warning term: the obstacle or other unmanned aerial vehicle closest to the unmanned aerial vehicle within the communication range of communication distance l is selected as the dangerous object, and a corresponding penalty is given when the Euclidean distance between the unmanned aerial vehicle and the dangerous object ahead at the current moment is smaller than that at the previous moment; R_4 is the dense reward function of the unmanned aerial vehicle, whose degree of penalty increases with the target angular velocity α of the unmanned aerial vehicle.
5. The cooperative drone control training method based on multi-agent reinforcement learning of claim 1, wherein the step S3 specifically includes:
(1) construct the policy network in the MADDPG algorithm: the policy network μ_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector s_i of unmanned aerial vehicle i and its output is the action vector a_i = μ_i(s_i) of unmanned aerial vehicle i;
(2) construct the evaluation network of the MADDPG algorithm: the evaluation network Q_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector x = (s_1, …, s_n) of all unmanned aerial vehicles together with the actions a_1, …, a_n obtained by all unmanned aerial vehicles from their respective policy networks, and its output is the action value function of unmanned aerial vehicle i, i.e. the centralized action value function Q_i(x, a_1, …, a_n);
(3) construct the target neural networks: for unmanned aerial vehicle i, the policy network μ_i and the evaluation network Q_i are each copied into a corresponding target network μ'_i and Q'_i, where θ_i^μ and θ_i^Q respectively denote the parameters of the current policy network and evaluation network, and θ_i^μ' and θ_i^Q' respectively denote the parameters of the target policy network and target evaluation network.
6. The cooperative drone control training method based on multi-agent reinforcement learning as claimed in claim 5, wherein the step S4 specifically includes:
(1) initialize the parameters θ_i^μ, θ_i^Q, θ_i^μ', θ_i^Q' of all networks and empty each experience replay buffer;
(2) set the total number of training episodes and start iterating;
(3) for each unmanned aerial vehicle, obtain the action a_i = μ_i(s_i) from the current policy network based on the state s_i;
(4) each unmanned aerial vehicle executes its action a_i to obtain its new state s'_i and reward R_i, and the transition (s, s', a_1, …, a_n, r_1, …, r_n) is added to the experience replay buffer;
(5) for each unmanned aerial vehicle, sample M transitions from the experience replay buffer and start updating the networks;
(6) compute the best action to take at the next moment through the target policy network: a'_i = μ'_i(s'_i);
(7) compute the approximate true value through the target evaluation network, with the states and actions as input and output y = r_i + γ·Q'_i(x', a'_1, …, a'_n), where γ is the discount factor;
(8) update the current evaluation network with the loss function L(θ_i^Q) = (1/M) Σ_k (y_k − Q_i(x_k, a_1^k, …, a_n^k))²;
(9) update the current policy network through the sampled policy gradient ∇_θ_i^μ J ≈ (1/M) Σ_k ∇_θ_i^μ μ_i(s_i^k) · ∇_a_i Q_i(x_k, a_1^k, …, a_i, …, a_n^k) evaluated at a_i = μ_i(s_i^k);
(10) if the number of iterations reaches the update frequency of the network parameters, update the parameters of the target evaluation network and the target policy network:
θ_i^Q' ← τ·θ_i^Q + (1 − τ)·θ_i^Q',
θ_i^μ' ← τ·θ_i^μ + (1 − τ)·θ_i^μ',
where the update is a soft update and τ is the soft-update proportionality coefficient.
7. The cooperative drone control training method based on multi-agent reinforcement learning of claim 1, wherein the hyper-parameters of the neural network in the step S4 include:
the policy network and the evaluation network are fully-connected neural networks and are trained with the Adam optimizer;
setting basic parameters of a policy network and an evaluation network, wherein the basic parameters comprise: the number of hidden layers, the activation function, the learning rate, the number of samples for batch updating and each reward weight in the reward function.
8. The multi-agent reinforcement learning-based unmanned aerial vehicle cooperative control training method of claim 7, wherein the policy network is configured as: two hidden layers whose activation functions are relu functions, the first layer with 64 nodes and the second with 32 nodes; the output layer has 1 node, namely the action taken by the unmanned aerial vehicle, and uses a tanh activation function; the learning rate of the policy network is 0.001; the evaluation network is configured as: two hidden layers with relu activation functions and 64 nodes; the output layer has 1 node, namely the Q value function, and uses the linear activation function y = x + b, where b is a bias parameter; the learning rate of the evaluation network is 0.0001; the number of samples drawn for each batch update by random sampling from experience is set to N = 128.
9. An unmanned aerial vehicle cooperative control training system based on multi-agent reinforcement learning, for implementing the unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning of any one of claims 1 to 8, the system comprising:
the task model data acquisition module is used for preprocessing data of the multi-unmanned-aerial-vehicle environment in a task, coding the observation space and the global state space of each unmanned aerial vehicle in the environment and converting the coded observation space and the global state space into vector features which can be identified by a neural network;
the neural network construction module is used for constructing the MADDPG neural network according to the task model and receiving the vector characteristics transmitted by the task model data acquisition module;
the parameter adjusting module is used for setting the hyper-parameters of the neural network, wherein the hyper-parameters comprise the number of hidden layers, an activation function, a learning rate, the number of samples updated in batches and reward weight in a reward function;
and the main control unit is used for loading the MADDPG algorithm neural network into the unmanned aerial vehicle cluster, executing the unmanned aerial vehicle cluster cooperative control, and mapping the action output by the neural network into a corresponding control instruction of the unmanned aerial vehicle.
CN202111193986.5A 2021-10-13 2021-10-13 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning Pending CN113900445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111193986.5A CN113900445A (en) 2021-10-13 2021-10-13 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111193986.5A CN113900445A (en) 2021-10-13 2021-10-13 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN113900445A true CN113900445A (en) 2022-01-07

Family

ID=79191936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111193986.5A Pending CN113900445A (en) 2021-10-13 2021-10-13 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113900445A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415735A (en) * 2022-03-31 2022-04-29 天津大学 Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114721409A (en) * 2022-06-08 2022-07-08 山东大学 Underwater vehicle docking control method based on reinforcement learning
CN114722946A (en) * 2022-04-12 2022-07-08 中国人民解放军国防科技大学 Unmanned aerial vehicle asynchronous action and cooperation strategy synthesis method based on probability model detection
CN114879742A (en) * 2022-06-17 2022-08-09 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN115019185A (en) * 2022-08-03 2022-09-06 华中科技大学 Brain-like continuous learning cooperative trapping method, system and medium
CN115273501A (en) * 2022-07-27 2022-11-01 同济大学 Automatic driving vehicle ramp confluence cooperative control method and system based on MADDPG
CN115334165A (en) * 2022-07-11 2022-11-11 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN115525058A (en) * 2022-10-24 2022-12-27 哈尔滨工程大学 Unmanned underwater vehicle cluster cooperative countermeasure method based on deep reinforcement learning
CN116069023A (en) * 2022-12-20 2023-05-05 南京航空航天大学 Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
CN116736883A (en) * 2023-05-23 2023-09-12 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN117076134A (en) * 2023-10-13 2023-11-17 天之翼(苏州)科技有限公司 Unmanned aerial vehicle state data processing method and system based on artificial intelligence
CN117111620A (en) * 2023-10-23 2023-11-24 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174038A1 (en) * 2016-12-19 2018-06-21 Futurewei Technologies, Inc. Simultaneous localization and mapping with reinforcement learning
US20190266489A1 (en) * 2017-10-12 2019-08-29 Honda Motor Co., Ltd. Interaction-aware decision making
CN110958680A (en) * 2019-12-09 2020-04-03 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111667513A (en) * 2020-06-01 2020-09-15 西北工业大学 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN112256056A (en) * 2020-10-19 2021-01-22 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN112488310A (en) * 2020-11-11 2021-03-12 厦门渊亭信息科技有限公司 Multi-agent group cooperation strategy automatic generation method
CN112947562A (en) * 2021-02-10 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN113190032A (en) * 2021-05-10 2021-07-30 重庆交通大学 Unmanned aerial vehicle perception planning system and method applied to multiple scenes and unmanned aerial vehicle
CN113298368A (en) * 2021-05-14 2021-08-24 南京航空航天大学 Multi-unmanned aerial vehicle task planning method based on deep reinforcement learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174038A1 (en) * 2016-12-19 2018-06-21 Futurewei Technologies, Inc. Simultaneous localization and mapping with reinforcement learning
US20190266489A1 (en) * 2017-10-12 2019-08-29 Honda Motor Co., Ltd. Interaction-aware decision making
CN110958680A (en) * 2019-12-09 2020-04-03 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111667513A (en) * 2020-06-01 2020-09-15 西北工业大学 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN112256056A (en) * 2020-10-19 2021-01-22 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN112488310A (en) * 2020-11-11 2021-03-12 厦门渊亭信息科技有限公司 Multi-agent group cooperation strategy automatic generation method
CN112947562A (en) * 2021-02-10 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN113190032A (en) * 2021-05-10 2021-07-30 重庆交通大学 Unmanned aerial vehicle perception planning system and method applied to multiple scenes and unmanned aerial vehicle
CN113298368A (en) * 2021-05-14 2021-08-24 南京航空航天大学 Multi-unmanned aerial vehicle task planning method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李宝安: "Research on unmanned surface vehicle control based on deep reinforcement learning" *
赵丽华; 万晓冬: "Multi-UAV cooperative path planning based on an improved A algorithm" *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415735A (en) * 2022-03-31 2022-04-29 天津大学 Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114415735B (en) * 2022-03-31 2022-06-14 天津大学 Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114722946A (en) * 2022-04-12 2022-07-08 中国人民解放军国防科技大学 Unmanned aerial vehicle asynchronous action and cooperation strategy synthesis method based on probability model detection
CN114721409A (en) * 2022-06-08 2022-07-08 山东大学 Underwater vehicle docking control method based on reinforcement learning
CN114879742A (en) * 2022-06-17 2022-08-09 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN114879742B (en) * 2022-06-17 2023-07-04 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN115334165B (en) * 2022-07-11 2023-10-17 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN115334165A (en) * 2022-07-11 2022-11-11 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN115273501B (en) * 2022-07-27 2023-08-29 同济大学 MADDPG-based automatic driving vehicle ramp confluence cooperative control method and system
CN115273501A (en) * 2022-07-27 2022-11-01 同济大学 Automatic driving vehicle ramp confluence cooperative control method and system based on MADDPG
CN115019185B (en) * 2022-08-03 2022-10-21 华中科技大学 Brain-like continuous learning cooperative trapping method, system and medium
CN115019185A (en) * 2022-08-03 2022-09-06 华中科技大学 Brain-like continuous learning cooperative trapping method, system and medium
CN115525058A (en) * 2022-10-24 2022-12-27 哈尔滨工程大学 Unmanned underwater vehicle cluster cooperative countermeasure method based on deep reinforcement learning
CN116069023A (en) * 2022-12-20 2023-05-05 南京航空航天大学 Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
CN116069023B (en) * 2022-12-20 2024-02-23 南京航空航天大学 Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
CN116736883A (en) * 2023-05-23 2023-09-12 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN116736883B (en) * 2023-05-23 2024-03-08 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117406706B (en) * 2023-08-11 2024-04-09 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117076134A (en) * 2023-10-13 2023-11-17 天之翼(苏州)科技有限公司 Unmanned aerial vehicle state data processing method and system based on artificial intelligence
CN117076134B (en) * 2023-10-13 2024-04-02 天之翼(苏州)科技有限公司 Unmanned aerial vehicle state data processing method and system based on artificial intelligence
CN117111620A (en) * 2023-10-23 2023-11-24 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system
CN117111620B (en) * 2023-10-23 2024-03-29 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system

Similar Documents

Publication Publication Date Title
CN113900445A (en) Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
CN112947562B (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
De Souza et al. Decentralized multi-agent pursuit using deep reinforcement learning
CN113495578B (en) Digital twin training-based cluster track planning reinforcement learning method
Liu et al. Multi-UAV path planning based on fusion of sparrow search algorithm and improved bioinspired neural network
CN113093802B (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
CN113253733B (en) Navigation obstacle avoidance method, device and system based on learning and fusion
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN111723931B (en) Multi-agent confrontation action prediction method and device
Kimmel et al. Maintaining team coherence under the velocity obstacle framework.
Chen et al. Runtime safety assurance for learning-enabled control of autonomous driving vehicles
Diallo et al. Multi-agent pattern formation: a distributed model-free deep reinforcement learning approach
Xue et al. Multi-agent deep reinforcement learning for uavs navigation in unknown complex environment
Shen Bionic communication network and binary pigeon-inspired optimization for multiagent cooperative task allocation
Al-Sharman et al. Self-learned autonomous driving at unsignalized intersections: A hierarchical reinforced learning approach for feasible decision-making
CN113110101A (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN112651486A (en) Method for improving convergence rate of MADDPG algorithm and application thereof
CN116796843A (en) Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG
Huang et al. A deep reinforcement learning approach to preserve connectivity for multi-robot systems
CN116661503A (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN114326826B (en) Multi-unmanned aerial vehicle formation transformation method and system
CN116243727A (en) Unmanned carrier countermeasure and obstacle avoidance method for progressive deep reinforcement learning
Zhang et al. Multi-UAV cooperative short-range combat via attention-based reinforcement learning using individual reward shaping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination